Speech to Text

Rhasspy's primary function is to convert voice commands to JSON events. The first step of this process is converting speech to text (transcription).

The following table summarizes language support for the various speech to text systems:

System        en  de  es  fr  it  nl  ru  el  hi  zh  vi  pt  ca
pocketsphinx  ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓   -   -   -
kaldi         -   -   -   -   -   -   -   -   -   -   ✓   ✓   -

Pocketsphinx

Does speech recognition with CMU's pocketsphinx. This is done completely offline, on your device. If you experience performance problems (usually on a Raspberry Pi), consider also running Rhasspy on a home server and having your client use a remote HTTP connection (see Remote HTTP Server below).

Add to your profile:

"speech_to_text": {
  "system": "pocketsphinx",
  "pocketsphinx": {
    "acoustic_model": "acoustic_model",
    "base_dictionary": "base_dictionary.txt",
    "custom_words": "custom_words.txt",
    "dictionary": "dictionary.txt",
    "language_model": "language_model.txt",
    "unknown_words": "unknown_words.txt",
    "mllr_matrix": "acoustic_model_mllr"
  }
}

The dictionary, language_model, and unknown_words files are written during training by the default speech to text training system. The acoustic_model and base_dictionary components for each profile were taken from a set of pre-trained models. Anyone can extend Rhasspy to new languages by training a new acoustic model.

When Rhasspy starts, it creates a pocketsphinx decoder with the following attributes (see the Python sketch after this list):

  • hmm - speech_to_text.pocketsphinx.acoustic_model (directory)
  • dict - speech_to_text.pocketsphinx.dictionary (file)
  • lm - speech_to_text.pocketsphinx.language_model (file)
  • mllr - speech_to_text.pocketsphinx.mllr_matrix (file, optional)
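
In Python, this corresponds roughly to the following (a minimal sketch using the pre-5.0 pocketsphinx Python package; /path/to/profile is a placeholder for your profile directory, and the file names are the defaults from the profile above):

import os
import wave

from pocketsphinx import Decoder

profile_dir = "/path/to/profile"  # placeholder: your profile directory

config = Decoder.default_config()
config.set_string("-hmm", os.path.join(profile_dir, "acoustic_model"))
config.set_string("-dict", os.path.join(profile_dir, "dictionary.txt"))
config.set_string("-lm", os.path.join(profile_dir, "language_model.txt"))
# Optional: only set this if you have an MLLR adaptation matrix (see below)
# config.set_string("-mllr", os.path.join(profile_dir, "acoustic_model_mllr"))
decoder = Decoder(config)

# Transcribe a single 16-bit 16 kHz mono WAV file
with wave.open("command.wav", "rb") as wav_file:
    audio = wav_file.readframes(wav_file.getnframes())

decoder.start_utt()
decoder.process_raw(audio, False, True)  # full utterance in one call
decoder.end_utt()

hyp = decoder.hyp()
print(hyp.hypstr if hyp is not None else "")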

The mllr_matrix file is intended for advanced users who want to tune/adapt their acoustic models. This can increase the performance of Rhasspy's speech recognition for a specific user/microphone/acoustic environment.

Pocketsphinx allows Rhasspy to support English (en), German (de), Dutch (nl), Spanish (es), Italian (it), French (fr), Greek (el), Russian (ru), Hindi (hi), and Mandarin (zh).

See rhasspy.stt.PocketsphinxDecoder for details.

Kaldi

Does speech recognition with Kaldi. This is done completely offline, on your device. If you experience performance problems (usually on a Raspberry Pi), consider also running Rhasspy on a home server and having your client use a remote HTTP connection (see Remote HTTP Server below).

Add to your profile:

"speech_to_text": {
  "system": "kaldi",
  "kaldi": {
    "base_dictionary": "base_dictionary.txt",
    "compatible": true,
    "custom_words": "custom_words.txt",
    "dictionary": "dictionary.txt",
    "graph": "graph",
    "kaldi_dir": "/opt/kaldi",
    "language_model": "language_model.txt",
    "model_dir": "model",
    "unknown_words": "unknown_words.txt"
  }
}

Kaldi allows Rhasspy to support Vietnamese (vi) and Portuguese (pt), thanks to pre-trained models from the folks behind the Montreal Forced Aligner.

This requires Kaldi to be installed, which is...challenging. The Docker image of Rhasspy contains a pre-built copy of Kaldi, which might work for you outside of Docker. Make sure to set kaldi_dir to wherever you installed Kaldi.

Rhasspy expects a Kaldi-compatible profile to contain a model directory with a train.sh and decode.sh script. See the Vietnamese (vi) or Portuguese (pt) profile for an example.
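
Putting the settings above together, a Kaldi-compatible profile directory might look roughly like this (illustrative only; the exact layout, including where the graph lives, is defined by each profile, so check the vi or pt profile for the real structure):

profile/
  base_dictionary.txt    (pre-built pronunciation dictionary)
  custom_words.txt       (your own word pronunciations)
  dictionary.txt         (written during training)
  language_model.txt     (written during training)
  unknown_words.txt      (written during training)
  graph/                 (compiled decoding graph)
  model/
    train.sh             (rebuilds the decoding graph after training)
    decode.sh            (transcribes WAV audio)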

Remote HTTP Server

Uses a remote HTTP server to transform speech (WAV) to text. The /api/speech-to-text endpoint from Rhasspy's HTTP API does just this, allowing you to use a remote instance of Rhasspy for speech recognition. This is typically used in a client/server setup, where Rhasspy does speech/intent recognition on a home server with decent CPU/RAM available.

Add to your profile:

"speech_to_text": {
  "system": "remote",
  "remote": {
    "url": "http://my-server:12101/api/speech-to-text"
  }
}

During speech recognition, 16-bit 16 kHz mono WAV data will be POST-ed to the endpoint with the Content-Type set to audio/wav. A text/plain response with the transcription is expected back. An additional profile query argument is sent with the current profile name, so the POST URL is effectively something like http://remote-server:12101/api/speech-to-text?profile=en.
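
For example, the same request can be made with Python's requests library (a sketch; the host name and WAV file are placeholders):

import requests

# POST 16-bit 16 kHz mono WAV data; the transcription comes back as text/plain
with open("command.wav", "rb") as wav_file:
    response = requests.post(
        "http://my-server:12101/api/speech-to-text",
        data=wav_file.read(),
        headers={"Content-Type": "audio/wav"},
        params={"profile": "en"},  # Rhasspy adds this query argument for you
    )

response.raise_for_status()
print(response.text)  # the transcription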

See rhasspy.stt.RemoteDecoder for details.

Command

Calls a custom external program to do speech recognition.

Voice commands are normally transcribed by Rhasspy using pocketsphinx. The acoustic model, dictionary, and language model are available in your profile directory (after training) as acoustic_model/, dictionary.txt, and language_model.txt respectively. You can call a custom program to do speech to text that uses these artifacts or does something totally different!

Add to your profile:

"speech_to_text": {
  "system": "command",
  "command": {
    "program": "/path/to/program",
    "arguments": []
  }
}

When a voice command is received, Rhasspy will call your program and write the recorded WAV data (16-bit 16 kHz mono) to its standard input. Your program should print the text transcription to standard output.

The following environment variables are available to your program:

  • $RHASSPY_BASE_DIR - path to the directory where Rhasspy is running from
  • $RHASSPY_PROFILE - name of the current profile (e.g., "en")
  • $RHASSPY_PROFILE_DIR - directory of the current profile (where profile.json is)

See speech2text.sh for an example program.
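
Along the same lines, here is a minimal Python version (a sketch; my_transcribe is a hypothetical stand-in for whatever recognizer you actually use):

#!/usr/bin/env python3
import os
import sys


def my_transcribe(wav_bytes, profile_dir):
    # Hypothetical stand-in: run your own recognizer here, optionally using
    # the trained artifacts in profile_dir (acoustic_model/, dictionary.txt,
    # language_model.txt)
    return "turn on the living room lamp"


if __name__ == "__main__":
    # Rhasspy pushes 16-bit 16 kHz mono WAV data to standard input
    wav_bytes = sys.stdin.buffer.read()
    profile_dir = os.environ.get("RHASSPY_PROFILE_DIR", ".")

    # Print the transcription to standard output
    print(my_transcribe(wav_bytes, profile_dir))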

If your speech to text system requires some kind of custom training, you should also override Rhasspy's speech to text training system.

See rhasspy.stt.CommandDecoder for details.

Dummy

Disables speech to text decoding.

Add to your profile:

"speech_to_text": {
  "system": "dummy"
}

See rhasspy.stt.DummyDecoder for details.