Simon/Tips, Tricks and Best Practices

From KDE UserBase Wiki

Recordings

Because Simon, when using a user-generated model, creates a speech model specifically for each user, the training corpus is one of the most important factors in achieving good recognition rates.

This section lists mistakes that are frequently made when recording training utterances, along with possible solutions.

Loudness

If you have not used your microphone with Simon before, please double-check that its volume is set to an appropriate level.

Louder is usually better. However, your microphone should never clip. It is therefore best to start with a low level and increase it step by step until the signal reaches the maximum amplitude, without clipping, when you speak loudly (you can check the current amplitude with a tool such as Audacity).
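As a rough way to check a recorded sample from the command line, the peak amplitude can be compared against a threshold. The sketch below assumes that SoX is installed and uses the statistics its "stat" effect prints; SoX is not mentioned above and is only one possible tool. The helper names, the file name sample.wav and the 0.99 threshold are hypothetical choices for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) if the peak is at or above the limit.
# $1 = peak amplitude on a 0.0-1.0 scale
# $2 = limit (defaults to 0.99, an assumed value, not a Simon setting)
is_clipping() {
    awk -v peak="$1" -v limit="${2:-0.99}" 'BEGIN { exit !(peak >= limit) }'
}

# Hypothetical helper: reads the text that "sox FILE -n stat" writes to stderr
# and prints the value of its "Maximum amplitude" line.
max_amplitude() {
    awk '/^Maximum amplitude/ { print $3 }'
}

# Usage (requires SoX; sample.wav is a placeholder file name):
#   peak=$(sox sample.wav -n stat 2>&1 | max_amplitude)
#   if is_clipping "$peak"; then
#       echo "Sample is clipping: lower the microphone level"
#   fi
```

A peak that sits at the limit in many samples suggests the level is too high; a peak far below it suggests there is still room to raise the level.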

Do not, however, "boost" the volume artificially (this option often appears as a volume setting above 100% or as a toggle called "Mic Boost"). Such options only amplify the signal; they do not add any new information. This does not help recognition rates and can even hurt them.

Newer versions of Simon include a level meter that is displayed while you record samples. It tells you whether your volume is set up correctly.

Pauses

Simon tries to learn the pronunciation of its users. However, Simon never hears only what the user is saying: it also picks up all of the environment noise.

That is why Simon must also learn what "silence" sounds like. This depends on your environment as well as on the microphone you are using.

Simon treats everything at the beginning and at the end of the sample as "silence". For that to work, it is best if the user leaves about one or two seconds of silence at the beginning and end of each recording.

Echo Cancellation

If you are on PulseAudio on Linux (if you use a reasonably modern distribution and are not sure, there is a high chance that you are), you can enable echo cancellation to improve recognition accuracy in the face of background noise. Put simply, echo cancellation tries to "subtract" the sound that is played through the computer's speakers from the microphone input, which allows you to use Simon while you are, for example, playing music. In practice this is not a perfect process (it is tricky and does not behave very nicely because of room reflections, for example), but if you are having trouble with background noise it may help.

To enable it, run the following command in a terminal to load PulseAudio's echo cancellation module:

pactl load-module module-echo-cancel

Then open pavucontrol, the PulseAudio mixer, and you will see new output and input devices. Re-route your playback application (for example your music player) to use the new device ("<devicename> (echo cancelled with <device>)") and re-route Simon's recording stream to use the equivalent input device. Echo cancellation is now active.
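If you run the load command from a script, it may be useful to check first whether the module is already loaded. The following is a minimal sketch, assuming the tab-separated output of "pactl list short modules" (index, name, arguments); the function names are hypothetical and not part of Simon or PulseAudio.

```shell
#!/bin/sh
# Hypothetical helper: reads a module list in the format of
# "pactl list short modules" from stdin and succeeds (exit status 0)
# if module-echo-cancel appears in the name column.
has_echo_cancel() {
    awk -F '\t' '$2 == "module-echo-cancel" { found = 1 } END { exit !found }'
}

# Hypothetical wrapper: loads the module only if it is not loaded yet.
enable_echo_cancel() {
    if pactl list short modules | has_echo_cancel; then
        echo "module-echo-cancel is already loaded"
    else
        pactl load-module module-echo-cancel
    fi
}

# Usage: call enable_echo_cancel, then re-route the streams in pavucontrol.
```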

Language Model

This section covers tips and tricks regarding your language model as a whole.

Transcriptions

The next time you want to add a word that has no transcription in your shadow dictionary (like "geöffnet"), switch back to the Simon main window (you can of course keep the add word wizard open) and open the shadow dictionary.

Using the filter edit field, we can quickly look for parts of the word we want to transcribe. In our case, let's look for "öffnet"; we get the transcription "oe f n @ t".

Now look for "ge" and you will see a lot of words containing "ge". The first hit would be "abartige" with the transcription "a p a: r t I g @". Because SAMPA is very easy to read, we can quickly see that the "ge" of "abartige" is transcribed as "g @". It sounds the same in "abartige" and "geöffnet", so the SAMPA is the same as well.

All that is left to do is to put those two parts together, and we get: "g @ oe f n @ t".
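The composition step can be written down as a tiny shell sketch. It only joins fragments that you have already looked up by hand in the shadow dictionary; the function name join_sampa is hypothetical and not part of Simon.

```shell
#!/bin/sh
# Hypothetical helper: joins SAMPA fragments into one transcription,
# with the phonemes separated by single spaces.
join_sampa() {
    printf '%s\n' "$*" | tr -s ' '
}

# "ge" taken from "abartige", "öffnet" taken from its own dictionary entry.
join_sampa "g @" "oe f n @ t"   # prints: g @ oe f n @ t
```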

Use this technique to easily transcribe words that are not in the shadow dictionary.