Reference

Supported Languages

The table below lists Rhasspy's supported components by category. The notes column marks components that require an account or extra software; some components also require additional training or customization for a given language. Rhasspy's supported languages are ca, cs, de, el, en, es, fr, hi, it, nl, pl, pt, ru, sv, vi, and zh.

Category            Name           Notes
Wake Word           raven
                    pocketsphinx
                    precise
                    porcupine
                    snowboy        requires account
Speech to Text      pocketsphinx
                    kaldi
                    deepspeech
Intent Recognition  fsticuffs
                    fuzzywuzzy
                    adapt
                    flair
                    rasaNLU        needs extra software
Text to Speech      espeak
                    flite
                    picotts
                    nanotts
                    marytts        needs extra software
                    opentts        needs extra software
                    larynx
                    wavenet



MQTT API

Rhasspy implements a superset of the Hermes protocol in rhasspy-hermes for the following components:

Audio Server

Messages for audio input and audio output.

  • hermes/audioServer/<siteId>/audioFrame (binary)
    • Chunk of WAV audio data for site
    • wav_bytes: bytes - WAV audio data (message payload)
    • siteId: string - Hermes site ID (part of topic)
  • hermes/audioServer/<siteId>/<sessionId>/audioSessionFrame (binary)
    • Chunk of WAV audio data for session
    • wav_bytes: bytes - WAV audio data (message payload)
    • siteId: string - Hermes site ID (part of topic)
    • sessionId: string - session ID (part of topic)
  • hermes/audioServer/<siteId>/playBytes/<requestId> (binary)
    • Play WAV data
    • wav_bytes: bytes - WAV data to play (message payload)
    • requestId: string - unique ID for request (part of topic)
    • siteId: string - Hermes site ID (part of topic)
    • Response(s): hermes/audioServer/<siteId>/playFinished
  • hermes/audioServer/<siteId>/playFinished (JSON)
    • Indicates that WAV audio has finished playing
    • siteId: string - Hermes site ID (part of topic)
    • id: string = "" - requestId from the playBytes topic
    • Response to hermes/audioServer/<siteId>/playBytes/<requestId>
  • hermes/audioServer/toggleOff (JSON)
    • Disable audio output
    • siteId: string = "default" - Hermes site ID
  • hermes/audioServer/toggleOn (JSON)
    • Enable audio output
    • siteId: string = "default" - Hermes site ID
  • hermes/error/audioServer/play (JSON, Rhasspy only)
    • Sent when an error occurs in the audio output system
    • error: string - description of the error
    • context: string? = null - system-defined context of the error
    • siteId: string = "default" - Hermes site ID
    • sessionId: string? = null - current session ID
  • hermes/error/audioServer/record (JSON, Rhasspy only)
    • Sent when an error occurs in the audio input system
    • error: string - description of the error
    • context: string? = null - system-defined context of the error
    • siteId: string = "default" - Hermes site ID
    • sessionId: string? = null - current session ID
  • rhasspy/audioServer/getDevices (JSON, Rhasspy only)
    • Request available input or output audio devices
    • modes: [string] - list of modes ("input" or "output")
    • id: string? = null - unique ID returned in response
    • siteId: string = "default" - Hermes site ID
    • test: bool = false - if true, test input devices
  • rhasspy/audioServer/devices (JSON, Rhasspy only)
    • Response to rhasspy/audioServer/getDevices
    • devices: [object] - list of available devices
      • mode: string - "input" or "output"
      • id: string - unique device ID
      • name: string? = null - human readable name for device
      • description: string? = null - detailed description of device
      • working: boolean? = null - true if the device test succeeded, false if it failed, null if not tested
    • id: string? = null - unique ID from request
    • siteId: string = "default" - Hermes site ID
  • rhasspy/audioServer/setVolume (JSON, Rhasspy only)
    • Set the volume at one or more sites
    • volume: float - volume level to set (0 = off, 1 = full volume)
    • siteId: string = "default" - Hermes site ID

Automated Speech Recognition

Messages for speech to text.

  • hermes/asr/toggleOn (JSON)
    • Enables ASR system
    • siteId: string = "default" - Hermes site ID
    • reason: string = "" - Reason for toggle on
  • hermes/asr/toggleOff (JSON)
    • Disables ASR system
    • siteId: string = "default" - Hermes site ID
    • reason: string = "" - Reason for toggle off
  • hermes/asr/startListening (JSON)
    • Tell ASR system to start recording/transcribing
    • siteId: string = "default" - Hermes site ID
    • sessionId: string? = null - current session ID
    • stopOnSilence: bool = true - detect silence and automatically end voice command (Rhasspy only)
    • sendAudioCaptured: bool = false - send audioCaptured after stop listening (Rhasspy only)
    • wakewordId: string? = null - id of wake word that triggered session (Rhasspy only)
  • hermes/asr/stopListening (JSON)
    • Tell ASR system to stop recording
    • Emits textCaptured if silence was not detected earlier
    • siteId: string = "default" - Hermes site ID
    • sessionId: string = "" - current session ID
  • hermes/asr/textCaptured (JSON)
    • Successful transcription, sent either when silence is detected or on stopListening
    • text: string - transcription text
    • likelihood: float - confidence from ASR system
    • seconds: float - transcription time in seconds
    • siteId: string = "default" - Hermes site ID
    • sessionId: string? = null - current session ID
    • wakewordId: string? = null - id of wake word that triggered session (Rhasspy only)
    • asrTokens: [[object]]? = null - details of individual tokens (words) in captured text (also see ASR confidence, note that list is two levels deep)
      • value: string - text of the token
      • confidence: float - confidence score of token (0-1, 1 is more confident)
      • rangeStart: int - start index of token in input (0-based)
      • rangeEnd: int - end index of token in input (0-based)
      • time: object - structured time of when token was detected
        • start: float - start time in seconds (relative to start of utterance)
        • end: float - end time in seconds (relative to start of utterance)
  • hermes/error/asr (JSON)
    • Sent when an error occurs in the ASR system
    • error: string - description of the error
    • context: string? = null - system-defined context of the error
    • siteId: string = "default" - Hermes site ID
    • sessionId: string? = null - current session ID
  • rhasspy/asr/<siteId>/train (JSON, Rhasspy only)
    • Instructs the ASR system to re-train
    • id: string? = null - unique ID for request (copied to trainSuccess)
    • siteId: string - Hermes site ID (part of topic)
    • Response(s): rhasspy/asr/<siteId>/trainSuccess
  • rhasspy/asr/<siteId>/trainSuccess (JSON, Rhasspy only)
    • Indicates that training was successful
    • id: string? = null - unique ID from request (copied from train)
    • siteId: string - Hermes site ID (part of topic)
    • Response to rhasspy/asr/<siteId>/train
  • rhasspy/asr/<siteId>/<sessionId>/audioCaptured (binary, Rhasspy only)
    • WAV audio data captured by ASR session
    • siteId: string - Hermes site ID (part of topic)
    • sessionId: string - current session ID (part of topic)
    • Only sent if sendAudioCaptured = true in startListening
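
A transcription can be triggered entirely over MQTT: tell the ASR system to start listening, then wait for textCaptured. A minimal sketch, under the same paho-mqtt and broker assumptions as above:

import json

import paho.mqtt.client as mqtt

def on_message(client, userdata, msg):
    captured = json.loads(msg.payload)
    # textCaptured arrives when silence is detected (stopOnSilence)
    # or when stopListening is published
    print("Heard:", captured["text"], "likelihood:", captured["likelihood"])
    client.disconnect()

client = mqtt.Client()
client.on_message = on_message
client.connect("localhost", 12183)
client.subscribe("hermes/asr/textCaptured")

client.publish("hermes/asr/startListening", json.dumps({
    "siteId": "default",
    "stopOnSilence": True,
}))

client.loop_forever()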

Dialogue Manager

Messages for managing dialogue sessions. These are typically initiated by a hotword detected message (or /api/listen-for-command), but can also be started manually with a startSession message (or /api/start-recording).

  • hermes/dialogueManager/startSession (JSON)
    • Requests that a new dialogue session be started
    • init: object - session initialization (required)
      • type: string - "action" (listen for a command) or "notification" (speak text only) (required)
      • text: string? = null - sentence to speak using text to speech
      • canBeEnqueued: bool = false - true if session can be queued while another is active ("action" only)
      • intentFilter: [string]? = null - valid intent names (null means all) ("action" only)
      • sendIntentNotRecognized: bool = false - send hermes/dialogueManager/intentNotRecognized if intent recognition fails ("action" only)
    • siteId: string = "default" - Hermes site ID
    • customData: string? = null - user-defined data (copied to subsequent session messages)
    • Response(s): hermes/dialogueManager/sessionStarted or hermes/dialogueManager/sessionQueued
  • hermes/dialogueManager/sessionStarted (JSON)
    • Indicates a session has started
    • sessionId: string - current session ID
    • siteId: string = "default" - Hermes site ID
    • customData: string? = null - user-defined data (copied from startSession)
    • Response to hermes/dialogueManager/startSession
  • hermes/dialogueManager/sessionQueued (JSON)
    • Indicates a session has been queued (only when init.canBeEnqueued = true in startSession)
    • sessionId: string - current session ID
    • siteId: string = "default" - Hermes site ID
    • customData: string? = null - user-defined data (copied from startSession)
    • Response to hermes/dialogueManager/startSession
  • hermes/dialogueManager/continueSession (JSON)
    • Requests that a session be continued after an intent has been recognized
    • sessionId: string - current session ID (required)
    • customData: string? = null - user-defined data (overrides session customData if not null)
    • text: string? = null - sentence to speak using text to speech
    • intentFilter: [string]? = null - valid intent names (null means all)
    • sendIntentNotRecognized: bool = false - send hermes/dialogueManager/intentNotRecognized if intent recognition fails
  • hermes/dialogueManager/endSession (JSON)
    • Requests that a session be terminated nominally
    • sessionId: string - current session ID (required)
    • text: string? = null - sentence to speak using text to speech
    • customData: string? = null - user-defined data (overrides session customData if not null)
  • hermes/dialogueManager/sessionEnded (JSON)
    • Indicates a session has terminated
    • termination: string - reason for termination (required), one of:
      • nominal
      • abortedByUser
      • intentNotRecognized
      • timeout
      • error
    • sessionId: string - current session ID
    • siteId: string = "default" - Hermes site ID
    • customData: string? = null - user-defined data (copied from startSession)
    • Response to hermes/dialogueManager/endSession, or sent when a session terminates for any other reason
  • hermes/dialogueManager/intentNotRecognized (JSON)
    • Sent when intent recognition fails during a session (only when init.sendIntentNotRecognized = true in startSession)
    • sessionId: string - current session ID
    • input: string? = null - input to NLU system
    • siteId: string = "default" - Hermes site ID
    • customData: string? = null - user-defined data (copied from startSession)
  • hermes/dialogueManager/configure (JSON)
    • Sets the default intent filter for all subsequent dialogue sessions
    • intents: [object] - Intents to enable/disable (empty for all intents)
      • intentId: string - Name of intent
      • enable: bool - true if intent should be eligible for recognition
    • siteId: string = "default" - Hermes site ID
  • hermes/error/dialogueManager (JSON, Rhasspy only)
    • Sent when an error occurs in the dialogue manager system
    • error: string - description of the error
    • context: string? = null - system-defined context of the error
    • siteId: string = "default" - Hermes site ID
    • sessionId: string? = null - current session ID
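
As a sketch of the session lifecycle, the fragment below starts a notification session and watches it start and end; an "action" session would instead listen for a voice command afterwards. The broker location and site ID are assumptions as above.

import json

import paho.mqtt.client as mqtt

def on_message(client, userdata, msg):
    event = json.loads(msg.payload)
    # sessionEnded carries a termination reason (nominal, timeout, etc.)
    print(msg.topic, "->", event)
    if msg.topic.endswith("sessionEnded"):
        client.disconnect()

client = mqtt.Client()
client.on_message = on_message
client.connect("localhost", 12183)
client.subscribe("hermes/dialogueManager/sessionStarted")
client.subscribe("hermes/dialogueManager/sessionEnded")

client.publish("hermes/dialogueManager/startSession", json.dumps({
    "siteId": "default",
    "init": {"type": "notification", "text": "Dinner is ready"},
}))

client.loop_forever()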

Grapheme to Phoneme

Messages for looking up word pronunciations. See also the /api/lookup HTTP endpoint.

Words are usually looked up from a phonetic dictionary included with the ASR system; the current speech to text services implement these messages.

  • rhasspy/g2p/pronounce (JSON, Rhasspy only)
    • Requests phonetic pronunciations of words
    • words: [string] - words to pronounce (required)
    • id: string? = null - unique ID for request (copied to phonemes)
    • numGuesses: int = 5 - number of guesses if not in dictionary
    • siteId: string = "default" - Hermes site ID
    • sessionId: string? = null - current session ID
    • Response(s): rhasspy/g2p/phonemes
  • rhasspy/g2p/phonemes (JSON, Rhasspy only)
    • Phonetic pronunciations of words, either from a dictionary or grapheme-to-phoneme model
    • wordPhonemes: [object] - phonetic pronunciations (required), keyed by word, values are:
      • phonemes: [string] - phonemes for word (key)
      • guessed: bool? = null - true if pronunciation was guessed with a grapheme-to-phoneme model, false if it came from the dictionary
    • id: string? = null - unique ID for request (copied from pronounce)
    • siteId: string = "default" - Hermes site ID
    • sessionId: string? = null - current session ID
    • Response to rhasspy/g2p/pronounce
  • rhasspy/error/g2p (JSON, Rhasspy only)
    • Sent when an error occurs in the G2P system
    • error: string - description of the error
    • context: string? = null - system-defined context of the error
    • siteId: string = "default" - Hermes site ID
    • sessionId: string? = null - current session ID
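
A pronunciation lookup is a simple request/response pair. The sketch below asks for guesses for one word and prints the result, assuming each entry in wordPhonemes is a list of pronunciation objects as described above:

import json

import paho.mqtt.client as mqtt

def on_message(client, userdata, msg):
    response = json.loads(msg.payload)
    for word, pronunciations in response["wordPhonemes"].items():
        for pron in pronunciations:
            marker = "(guessed)" if pron.get("guessed") else "(dictionary)"
            print(word, " ".join(pron["phonemes"]), marker)
    client.disconnect()

client = mqtt.Client()
client.on_message = on_message
client.connect("localhost", 12183)
client.subscribe("rhasspy/g2p/phonemes")

client.publish("rhasspy/g2p/pronounce", json.dumps({
    "words": ["raspberry"],
    "numGuesses": 3,
    "id": "g2p-example",
}))

client.loop_forever()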

Hotword Detection

Messages for wake word detection. See also the /api/listen-for-wake HTTP endpoint and the /api/events/wake Websocket endpoint.

  • hermes/hotword/toggleOn (JSON)
    • Enables hotword detection
    • siteId: string = "default" - Hermes site ID
    • reason: string = "" - Reason for toggle on
  • hermes/hotword/toggleOff (JSON)
    • Disables hotword detection
    • siteId: string = "default" - Hermes site ID
    • reason: string = "" - Reason for toggle off
  • hermes/hotword/<wakewordId>/detected (JSON)
    • Indicates a hotword was successfully detected
    • wakewordId: string - wake word ID (part of topic)
    • modelId: string - ID of wake word model used (service specific)
    • modelVersion: string = "" - version of wake word model used (service specific)
    • modelType: string = "personal" - type of wake word model used (service specific)
    • currentSensitivity: float = 1.0 - sensitivity of wake word detection (service specific)
    • siteId: string = "default" - Hermes site ID
    • sessionId: string? = null - current session ID (Rhasspy only)
    • sendAudioCaptured: bool? = null - if not null, copied to asr/startListening message in dialogue manager
  • hermes/error/hotword (JSON, Rhasspy only)
    • Sent when an error occurs in the hotword system
    • error: string - description of the error
    • context: string? = null - system-defined context of the error
    • siteId: string = "default" - Hermes site ID
    • sessionId: string? = null - current session ID
  • rhasspy/hotword/getHotwords (JSON, Rhasspy only)
    • Request available hotwords
    • id: string? = null - unique ID for response
    • siteId: string = "default" - Hermes site ID
  • rhasspy/hotword/hotwords (JSON, Rhasspy only)
    • Response to rhasspy/hotword/getHotwords
    • models: [object] - list of available hotwords
      • modelId: string - unique ID of hotword model
      • modelWords: string - words used to activate hotword
      • modelVersion: string = "" - version of hotword model
      • modelType: string = "personal" - "universal" or "personal"
    • id: string? = null - unique ID from request
    • siteId: string = "default" - Hermes site ID

Intent Handling

Messages for intent handling.

Natural Language Understanding

  • hermes/nlu/query (JSON)
    • Request an intent to be recognized from text
    • input: string - text to recognize intent from (required)
    • intentFilter: [string]? = null - valid intent names (null means all)
    • id: string? = null - unique id for request (copied to response messages)
    • siteId: string = "default" - Hermes site ID
    • sessionId: string? = null - current session ID
    • asrConfidence: float? = null - confidence from ASR system for input text
    • Response(s): hermes/intent/<intentName> or hermes/nlu/intentNotRecognized
  • hermes/intent/<intentName> (JSON)
    • Sent when an intent was successfully recognized
    • input: string - text from query (required)
    • intent: object - details of recognized intent (required)
      • intentName: string - name of intent (required)
      • confidenceScore: float - confidence from NLU system for this intent (required)
    • slots: [object] = [] - details of named entities, list of:
      • entity: string - name of entity (required)
      • slotName: string - name of slot (required)
      • confidence: float - confidence from NLU system for this slot (required)
      • rawValue: string - entity value without substitutions (required)
      • value: object - entity value with substitutions (required)
        • value: any - entity value
      • range: object? = null - indexes of entity value in text
        • start: int - start index
        • end: int - end index (exclusive)
    • id: string = "" - unique id for request (copied from query)
    • siteId: string = "default" - Hermes site ID
    • sessionId: string = "" - current session ID
    • customData: string = "" - user-defined data (copied from startSession)
    • asrTokens: [[object]]? = null - tokens from transcription
      • value: string - token value
      • confidence: float - confidence in token
      • range_start: int - start of token in input
      • range_end: int - end of token in input (exclusive)
    • asrConfidence: float? = null - confidence from ASR system for input text
    • Response to hermes/nlu/query
  • hermes/nlu/intentNotRecognized (JSON)
    • Sent when intent recognition fails
    • input: string - text from query (required)
    • id: string? = null - unique id for request (copied from query)
    • siteId: string = "default" - Hermes site ID
    • sessionId: string? = null - current session ID
    • Response to hermes/nlu/query
  • hermes/error/nlu (JSON)
    • Sent when an error occurs in the NLU system
    • error: string - description of the error
    • context: string? = null - system-defined context of the error
    • siteId: string = "default" - Hermes site ID
    • sessionId: string? = null - current session ID
  • rhasspy/nlu/<siteId>/train (JSON, Rhasspy only)
  • rhasspy/nlu/<siteId>/trainSuccess (JSON, Rhasspy only)
    • Indicates that training was successful
    • siteId: string - Hermes site ID (part of topic)
    • id: string? = null - unique ID from request (copied from train)
    • Response to rhasspy/nlu/<siteId>/train
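
Text can be pushed through intent recognition alone with hermes/nlu/query; the recognizer answers on hermes/intent/<intentName> or hermes/nlu/intentNotRecognized. A minimal sketch under the same broker assumptions, with a placeholder query sentence:

import json

import paho.mqtt.client as mqtt

def on_message(client, userdata, msg):
    if msg.topic.startswith("hermes/intent/"):
        result = json.loads(msg.payload)
        print("Intent:", result["intent"]["intentName"])
        for slot in result["slots"]:
            # value.value holds the substituted entity value
            print("  slot:", slot["slotName"], "=", slot["value"]["value"])
    else:
        print("Intent not recognized")
    client.disconnect()

client = mqtt.Client()
client.on_message = on_message
client.connect("localhost", 12183)
client.subscribe("hermes/intent/#")
client.subscribe("hermes/nlu/intentNotRecognized")

client.publish("hermes/nlu/query", json.dumps({
    "input": "turn on the living room lamp",
    "id": "nlu-example",
}))

client.loop_forever()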

Text to Speech

  • hermes/tts/say (JSON)
    • Generate spoken audio for a sentence using the configured text to speech system
    • Automatically sends playBytes
      • playBytes.requestId = say.id
    • text: string - sentence to speak (required)
    • lang: string? = null - override language for TTS system
    • id: string? = null - unique ID for request (copied to sayFinished)
    • volume: float? = null - volume level to speak with (0 = off, 1 = full volume)
    • siteId: string = "default" - Hermes site ID
    • sessionId: string? = null - current session ID
    • Response(s): hermes/tts/sayFinished
  • hermes/tts/sayFinished (JSON)
    • Indicates that the text to speech system has finished speaking
    • id: string? = null - unique ID for request (copied from say)
    • siteId: string = "default" - Hermes site ID
    • Response to hermes/tts/say
  • hermes/error/tts (JSON, Rhasspy only)
    • Sent when an error occurs in the text to speech system
    • error: string - description of the error
    • context: string? = null - system-defined context of the error
    • siteId: string = "default" - Hermes site ID
    • sessionId: string? = null - current session ID
  • rhasspy/tts/getVoices (JSON, Rhasspy only)
    • Request available text to speech voices
    • id: string? = null - unique ID provided in response
    • siteId: string = "default" - Hermes site ID
  • rhasspy/tts/voices (JSON, Rhasspy only)
    • Response to rhasspy/tts/getVoices
    • voices: list[object] - available voices
      • voiceId: string - unique ID for voice
      • description: string? = null - human readable description of voice
    • id: string? = null - unique ID from request
    • siteId: string = "default" - Hermes site ID

HTTP API

Rhasspy's HTTP endpoints are documented below. You can also visit /api/ in your Rhasspy server (note the final slash) to try out each endpoint.

Application authors may want to use the rhasspy-client, which provides a high-level interface to a remote Rhasspy server.

Endpoints

  • /api/custom-words
    • GET custom word dictionary as plain text, or POST to overwrite it
    • See custom_words.txt in your profile directory
  • /api/backup-profile
    • GET a zip file with relevant profile files (sentences, slots, etc.)
  • /api/evaluate
    • POST archive with WAV/JSON files for batch testing
    • Returns JSON report
    • Every file foo.wav should have a foo.json with a recognized intent
    • Archive must be in a format supported by shutil.unpack_archive
  • /api/download-profile
    • POST to have Rhasspy download missing profile artifacts
  • /api/handle-intent
    • POST Hermes intent as JSON to handle
  • /api/listen-for-command
    • POST to wake Rhasspy up and start listening for a voice command
    • Returns intent JSON when command is finished
    • ?nohass=true - stop Rhasspy from handling the intent
    • ?timeout=<seconds> - override default command timeout
    • ?entity=<entity>&value=<value> - set custom entities/values in recognized intent
  • /api/listen-for-wake
    • POST "on" to have Rhasspy listen for a wake word
    • POST "off" to disable wake word
    • ?siteId=site1,site2,... to apply to specific site(s)
  • /api/lookup
    • POST word as plain text to look up or guess pronunciation
    • ?n=<number> - return at most n guessed pronunciations
  • /api/microphones
    • GET list of available microphones
  • /api/mqtt/<TOPIC>
    • POST JSON payload to /api/mqtt/your/full/topic
      • Payload will be published to your/full/topic on MQTT broker
    • GET next MQTT message on TOPIC as JSON
      • A GET request to /api/mqtt/your/full/topic subscribes to your/full/topic
      • Escape wildcard # as %23 and + as %2B
  • /api/phonemes
    • GET example phonemes from speech recognizer for your profile
    • See phoneme_examples.txt in your profile directory
  • /api/play-recording
    • POST to play last recorded voice command
    • GET to download WAV data from last recorded voice command
  • /api/play-wav
    • POST to play WAV data
    • Make sure to set Content-Type to audio/wav
    • ?siteId=site1,site2,... to apply to specific site(s)
  • /api/profile
    • GET the JSON for your profile, or POST to overwrite it
    • ?layers=profile to only see settings different from defaults.json
    • See profile.json in your profile directory
  • /api/restart
    • Restart Rhasspy server
  • /api/sentences
    • GET voice command templates or POST to overwrite
    • Set Accept: application/json to GET JSON with all sentence files
    • Set Content-Type: application/json to POST JSON with sentences for multiple files
    • See sentences.ini and intents directory in your profile
  • /api/set-volume
    • POST to set volume at one or more sites
    • Body text is volume level (0 = off, 1 = full volume)
    • ?siteId=site1,site2,... to apply to specific site(s)
  • /api/slots
    • GET slot values as JSON or POST to add to/overwrite them
    • ?overwrite_all=true to clear slots in JSON before writing
  • /api/speakers
    • GET list of available audio output devices
  • /api/speech-to-intent
    • POST a WAV file and have Rhasspy process it as a voice command
    • Returns intent JSON when command is finished
    • ?nohass=true - stop Rhasspy from handling the intent
    • ?entity=<entity>&value=<value> - set custom entity/value in recognized intent
  • /api/speech-to-text
    • POST a WAV file and have Rhasspy return the text transcription
    • Set Accept: application/json to receive JSON with more details
    • ?noheader=true - send raw 16-bit, 16 kHz mono audio without a WAV header
  • /api/start-recording
    • POST to have Rhasspy start recording a voice command
  • /api/stop-recording
    • POST to have Rhasspy stop recording and process recorded data as a voice command
    • Returns intent JSON when command has been processed
    • ?nohass=true - stop Rhasspy from handling the intent
    • ?entity=<entity>&value=<value> - set custom entity/value in recognized intent
  • /api/test-microphones
    • GET list of available microphones and whether they're working
  • /api/text-to-intent
    • POST text and have Rhasspy process it as command
    • Returns intent JSON when command has been processed
    • ?nohass=true - stop Rhasspy from handling the intent
    • ?entity=<entity>&value=<value> - set custom entity/value in recognized intent
  • /api/text-to-speech
    • POST text and have Rhasspy speak it
    • ?voice=<voice> - override default TTS voice
    • ?language=<language> - override default TTS language or locale
    • ?repeat=true - have Rhasspy repeat the last sentence it spoke
    • ?volume=<volume> - volume level to speak at (0 = off, 1 = full volume)
    • ?siteId=site1,site2,... to apply to specific site(s)
  • /api/train
    • POST to re-train your profile
  • /api/tts-voices
    • GET JSON object with available text to speech voices
  • /api/wake-words
    • GET JSON object with available wake words
  • /api/unknown-words
    • GET words that Rhasspy doesn't know in your sentences
    • See unknown_words.txt in your profile directory
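
Most endpoints accept plain text or WAV data and return text or JSON, so they are easy to call from any HTTP client. A sketch with Python's requests library, assuming Rhasspy's default HTTP port 12101; the sentences and file name are placeholders:

import requests

BASE = "http://localhost:12101"

# Recognize an intent from text; ?nohass=true skips intent handling
intent = requests.post(f"{BASE}/api/text-to-intent",
                       data="turn on the living room lamp",
                       params={"nohass": "true"}).json()
print(intent["intent"]["name"], intent["slots"])

# Speak a sentence on a specific site
requests.post(f"{BASE}/api/text-to-speech",
              data="I turned on the lamp",
              params={"siteId": "default"})

# Transcribe a WAV file (Content-Type must be audio/wav)
with open("command.wav", "rb") as wav_file:
    text = requests.post(f"{BASE}/api/speech-to-text",
                         data=wav_file.read(),
                         headers={"Content-Type": "audio/wav"}).text
print("Transcription:", text)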

Websocket API

Profile Settings

All available profile sections and settings are listed below:

  • home_assistant - how to communicate with Home Assistant/Hass.io
    • url - Base URL of Home Assistant server (no /api)
    • access_token - long-lived access token for Home Assistant (Hass.io token is used automatically)
    • api_password - Password, if you have that enabled (deprecated)
    • pem_file - Full path to your PEM certificate file
    • key_file - Full path to your key file (if separate, optional)
    • event_type_format - Python format string used to create event type from intent type ({0})
  • speech_to_text - transcribing voice commands to text
    • system - name of speech to text system (pocketsphinx, kaldi, remote, command, hermes, or dummy)
    • pocketsphinx - configuration for Pocketsphinx
      • compatible - true if profile can use pocketsphinx for speech recognition
      • acoustic_model - directory with CMU 16 kHz acoustic model
      • base_dictionary - large text file with word pronunciations (read only)
      • custom_words - small text file with words/pronunciations added by user
      • dictionary - text file with all words/pronunciations needed for example sentences
      • unknown_words - small text file with guessed word pronunciations (from phonetisaurus)
      • language_model - text file with trigram ARPA language model built from example sentences
      • open_transcription - true if general language model should be used (custom voice commands ignored)
      • base_language_model - large general language model (read only)
      • mllr_matrix - MLLR matrix from acoustic model tuning
      • mix_weight - how much of the base language model to mix in during training (0-1)
      • phoneme_examples - text file with examples for each acoustic model phoneme
      • phoneme_map - text file mapping ASR phonemes to eSpeak phonemes
    • kaldi - configuration for Kaldi
      • compatible - true if profile can use Kaldi for speech recognition
      • kaldi_dir - absolute path to Kaldi root directory
      • model_dir - directory where Kaldi model is stored (relative to profile directory)
      • graph - directory where HCLG.fst is located (relative to model_dir)
      • base_graph - directory where large general HCLG.fst is located (relative to model_dir)
      • base_dictionary - large text file with word pronunciations (read only)
      • custom_words - small text file with words/pronunciations added by user
      • dictionary - text file with all words/pronunciations needed for example sentences
      • open_transcription - true if general language model should be used (custom voice commands ignored)
      • unknown_words - small text file with guessed word pronunciations (from phonetisaurus)
      • mix_weight - how much of the base language model to mix in during training (0-1)
      • phoneme_examples - text file with examples for each acoustic model phoneme
      • phoneme_map - text file mapping ASR phonemes to eSpeak phonemes
    • remote - configuration for remote Rhasspy server
      • url - URL to POST WAV data for transcription (e.g., http://your-rhasspy-server:12101/api/speech-to-text)
    • command - configuration for external speech-to-text program
      • program - path to executable
      • arguments - list of arguments to pass to program
    • sentences_ini - Ini file with example sentences/JSGF templates grouped by intent
    • sentences_dir - Directory with additional sentence templates (default: intents)
    • g2p_model - finite-state transducer for phonetisaurus to guess word pronunciations
    • g2p_casing - casing to force for g2p model (upper, lower, or blank)
    • dictionary_casing - casing to force for dictionary words (upper, lower, or blank)
    • slots_dir - directory to look for slots lists (default: slots)
    • slot_programs - directory to look for slot programs (default: slot_programs)
  • intent - transforming text commands to intents
    • system - intent recognition system (fsticuffs, fuzzywuzzy, rasa, remote, adapt, command, or dummy)
    • fsticuffs - configuration for OpenFST-based intent recognizer
      • intent_json - path to intent graph JSON file generated by rhasspy-nlu (https://github.com/rhasspy/rhasspy-nlu)
      • converters_dir - directory to look for converter programs (default: converters)
      • ignore_unknown_words - true if words not in the FST symbol table should be ignored
      • fuzzy - true if text is matched in a fuzzy manner, skipping words in stop_words.txt
    • fuzzywuzzy - configuration for simplistic Levenshtein distance based intent recognizer
      • examples_json - JSON file with intents/example sentences
      • min_confidence - minimum confidence required for intent to be converted to a JSON event (0-1)
    • remote - configuration for remote Rhasspy server
      • url - URL to POST text to for intent recognition (e.g., http://your-rhasspy-server:12101/api/text-to-intent)
    • rasa - configuration for Rasa NLU based intent recognizer
      • url - URL of remote Rasa NLU server (e.g., http://localhost:5005/)
      • examples_markdown - Markdown file to generate with intents/example sentences
      • project_name - name of project to generate during training
    • adapt - configuration for Mycroft Adapt based intent recognizer
      • stop_words - text file with words to ignore in training sentences
    • command - configuration for external intent recognition program
      • program - path to executable
      • arguments - list of arguments to pass to program
    • replace_numbers - if true, automatically replace number ranges (N..M) or numbers (N) with words
  • text_to_speech - pronouncing words
    • system - text to speech system (espeak, flite, picotts, marytts, command, remote, hermes, or dummy)
    • espeak - configuration for eSpeak
      • voice - name of voice to use (e.g., en, fr)
    • flite - configuration for flite
      • voice - name of voice to use (e.g., kal16, rms, awb)
    • picotts - configuration for PicoTTS
      • language - language to use (default if not present)
    • marytts - configuration for MaryTTS
      • url - address:port of MaryTTS server (port is usually 59125)
      • voice - name of voice to use (e.g., cmu-slt). Default if not present.
      • locale - name of locale to use (e.g., en-US). Default if not present.
    • wavenet - configuration for Google's WaveNet
      • cache_dir - path to directory in your profile where WAV files are cached
      • credentials_json - path to the JSON credentials file (generated online)
      • gender - gender of speaker (MALE or FEMALE)
      • language_code - language/locale (e.g., en-US)
      • sample_rate - WAV sample rate (default: 22050)
      • url - URL of WaveNet endpoint
      • voice - voice to use (e.g., Wavenet-C)
      • fallback_tts - text to speech system to use when offline or error occurs (e.g., espeak)
    • remote - configuration for remote text to speech server
      • url - URL to POST sentence to and get back WAV data
    • command - configuration for external text-to-speech program
      • say_program - path to executable for text to WAV
      • say_arguments - list of arguments to pass to say program
      • voices_program - path to executable for listing available voices
      • voices_arguments - list of arguments to pass to voices program
  • training - training speech/intent recognizers
    • speech_to_text - training for speech decoder
      • system - speech to text training system (auto or dummy)
      • command - configuration for external speech-to-text training program
        • program - path to executable
        • arguments - list of arguments to pass to program
      • remote - configuration for external HTTP endpoint
        • url - URL of speech to text training endpoint
    • intent - training for intent recognizer
      • system - intent recognizer training system (auto or dummy)
      • command - configuration for external intent recognizer training program
        • program - path to executable
        • arguments - list of arguments to pass to program
      • remote - configuration for external HTTP endpoint
        • url - URL of intent recognizer training endpoint
  • wake - waking Rhasspy up for speech input
    • system - wake word recognition system (raven, pocketsphinx, snowboy, precise, porcupine, command, hermes, or dummy)
    • raven - configuration for Raven wake word recognizer
      • template_dir - directory where WAV templates are stored in profile (default: raven)
      • probability_threshold - list with lower/upper probability range for detection (default: [0.45, 0.55])
      • minimum_matches - number of templates that must match for a detection (default: 1)
    • pocketsphinx - configuration for Pocketsphinx wake word recognizer
      • keyphrase - phrase to wake up on (3-4 syllables recommended)
      • threshold - sensitivity of detection (recommended range 1e-50 to 1e-5)
      • chunk_size - number of bytes per chunk to feed to Pocketsphinx (default 960)
    • snowboy - configuration for snowboy
      • model - path to model file(s), separated by commas (in profile directory)
      • sensitivity - model sensitivity (0-1, default 0.5)
      • audio_gain - audio gain (default 1)
      • apply_frontend - true if ApplyFrontend should be set
      • chunk_size - number of bytes per chunk to feed to snowboy (default 960)
      • model_settings - settings for each snowboy model path (e.g., snowboy/snowboy.umdl)
        • <MODEL_PATH>
          • sensitivity - model sensitivity
          • audio_gain - audio gain
          • apply_frontend - true if ApplyFrontend should be set
    • precise - configuration for Mycroft Precise
      • engine_path - path to the precise-engine binary
      • model - path to model file (in profile directory)
      • sensitivity - model sensitivity (0-1, default 0.5)
      • trigger_level - number of events to trigger activation (default 3)
      • chunk_size - number of bytes per chunk to feed to Precise (default 2048)
    • porcupine - configuration for PicoVoice's Porcupine
      • library_path - path to libpv_porcupine.so for your platform/architecture
      • model_path - path to the porcupine_params.pv (lib/common)
      • keyword_path - path to the .ppn keyword file
      • sensitivity - model sensitivity (0-1, default 0.5)
    • command - configuration for external wake word program
      • program - path to executable
      • arguments - list of arguments to pass to program
  • microphone - configuration for audio recording
    • system - audio recording system (pyaudio, arecord, gstreamer, or dummy)
    • pyaudio - configuration for PyAudio microphone
      • device - index of device to use or empty for default device
      • frames_per_buffer - number of frames to read at a time (default 480)
    • arecord - configuration for ALSA microphone
      • device - name of ALSA device (see arecord -L) to use or empty for default device
      • chunk_size - number of bytes to read at a time (default 960)
    • command - configuration for external audio input program
      • record_program - path to executable for audio input
      • record_arguments - list of arguments to pass to record program
      • list_program - path to executable for listing available output devices
      • list_arguments - list of arguments to pass to list program
      • test_program - path to executable for testing available output devices
      • test_arguments - list of arguments to pass to test program
  • sounds - configuration for audio output from Rhasspy
    • system - which sound output system to use (aplay, command, remote, hermes, or dummy)
    • wake - path to WAV file to play when Rhasspy wakes up
    • recorded - path to WAV file to play when a command finishes recording
    • aplay - configuration for ALSA speakers
      • device - name of ALSA device (see aplay -L) to use or empty for default device
    • command - configuration for external audio output program
      • play_program - path to executable for audio output
      • play_arguments - list of arguments to pass to play program
      • list_program - path to executable for listing available output devices
      • list_arguments - list of arguments to pass to list program
    • remote - configuration for remote audio output server
      • url - URL to POST WAV data to
  • handle
    • system - which intent handling system to use (hass, command, remote, or dummy)
    • remote - configuration for remote HTTP intent handler
      • url - URL to POST intent JSON to and receive response JSON from
    • command - configuration for external intent handling program
      • program - path to executable
      • arguments - list of arguments to pass to program
  • mqtt - configuration for MQTT
    • enabled - true if external broker should be used (false uses internal broker on port 12183)
    • host - external MQTT host
    • port - external MQTT port
    • username - external MQTT username (blank for anonymous)
    • password - external MQTT password
    • site_id - one or more Hermes site IDs (comma separated). First ID is used for new messages
  • dialogue - configuration for Hermes dialogue manager
    • system - which dialogue manager to use (rhasspy, hermes, or dummy)
    • group_separator - separator to use when grouping satellites (e.g., bedroom.front, bedroom.back)
  • download - configuration for profile file downloading
    • url_base - base URL to download profile artifacts (defaults to GitHub)
    • conditions - profile settings that will trigger file downloads
      • keys are profile setting paths (e.g., wake.system)
      • values are dictionaries whose keys are profile settings values (e.g., snowboy)
        • settings may have the form <=N or !X to mean "less than or equal to N" or "not X"
        • leaf nodes are dictionaries whose keys are destination file paths and whose values reference the files dictionary
    • files - locations, etc. of files to download
      • keys are names of files
      • values are dictionaries with:
        • url - URL of file to download (appended to url_base)
        • bytes_expected - number of bytes file should be after decompression
        • unzip - true if file should be decompressed with gunzip
        • parts - list of objects representing parts of a file that should be combined with cat
          • fragment - fragment appended to file URL
          • bytes_expected - number of bytes for this part
  • logging - settings for service loggers
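
Settings live in your profile's profile.json as nested JSON; only values that differ from defaults.json need to appear (compare with ?layers=profile on the /api/profile endpoint). A hypothetical minimal profile, choosing systems from the lists above, might look like:

{
    "speech_to_text": {
        "system": "kaldi"
    },
    "intent": {
        "system": "fsticuffs"
    },
    "text_to_speech": {
        "system": "espeak",
        "espeak": {
            "voice": "en"
        }
    },
    "wake": {
        "system": "porcupine"
    },
    "microphone": {
        "system": "arecord"
    },
    "sounds": {
        "system": "aplay"
    }
}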

Data Formats

In addition to the message formats specified in the Hermes protocol, Rhasspy has its own formats for transcriptions and intents. A Rhasspy profile also contains artifacts in standard formats, such as pronunciation dictionaries, language models, and grapheme to phoneme models.

Transcriptions

The /api/speech-to-text HTTP endpoint and /api/events/text Websocket endpoint produce JSON in the following format:

{
    "text": "transcription text",
    "transcribe_seconds": 0.123,
    "likelihood": 0.321,
    "wav_seconds": 1.456
}

where

  • text is the most likely transcription of the audio data (string)
  • transcribe_seconds is the number of seconds it took to transcribe (number)
  • likelihood is a confidence value returned by the ASR system (number)
  • wav_seconds is the duration of the WAV audio in seconds (number)

Intents

The /api/text-to-intent, /api/speech-to-intent, /api/listen-for-command, and /api/stop-recording HTTP endpoints as well as the /api/events/intent Websocket endpoint produce JSON in the following format:

{
    "intent": {
        "name": "NameOfIntent",
        "confidence": 1.0
    },
    "entities": [
        { "entity": "entity_1", "value": "value_1", "raw_value": "value_1",
          "start": 0, "end": 1, "raw_start": 0, "raw_end": 1 },
        { "entity": "entity_2", "value": "value_2", "raw_value": "value_2",
          "start": 0, "end": 1, "raw_start": 0, "raw_end": 1 }
    ],
    "slots": {
        "entity_1": "value_1",
        "entity_2": "value_2"
    },
    "text": "transcription text with substitutions",
    "raw_text": "transcription text without substitutions",
    "tokens": ["transcription", "text", "with", "substitutions"],
    "raw_tokens": ["transcription", "text", "without", "substitutions"],
    "recognize_seconds": 0.001

}

where

  • intent describes the recognized intent (object)
    • name is the name of the recognized intent (section headers in your sentences.ini) (string)
    • confidence is a value between 0 and 1, with 1 being maximally confident (number)
  • entities is a list of recognized entities (list)
    • entity is the name of the slot (string)
    • value is the (substituted) value (string)
    • raw_value is the (non-substituted) value (string)
    • start is the zero-based start index of the entity in text (number)
    • end is the zero-based end index (exclusive) of the entity in text (number)
    • raw_start is the zero-based start index of the entity in raw_text (number)
    • raw_end is the zero-based end index (exclusive) of the entity in raw_text (number)
  • slots is a dictionary of entities/values (object)
    • Assumes one value per entity. See entities for complete list.
  • text is the input text with substitutions (string)
  • raw_text is the input text without substitutions (string)
  • tokens is the list of words/tokens in text (list)
  • raw_tokens is the list of words/tokens in raw_text (list)
  • recognize_seconds is the number of seconds it took to recognize the intent and slots (number)
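
As a sketch, a handler receiving this JSON (for example from the /api/events/intent Websocket endpoint) might dispatch on the intent name and read one value per entity from slots; the ChangeLightState intent and its slot names below are purely illustrative:

def handle_intent(intent_json: dict) -> None:
    name = intent_json["intent"]["name"]
    confidence = intent_json["intent"]["confidence"]
    if confidence < 0.5:
        return  # ignore low-confidence recognitions

    # slots holds one (substituted) value per entity; fall back to
    # entities if the same entity can occur multiple times
    if name == "ChangeLightState":
        light = intent_json["slots"].get("name")
        state = intent_json["slots"].get("state")
        print(f"Turning {light} {state}")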

Pronunciation Dictionaries

Dictionaries are expected in plaintext, with the following format:

word1 P1 P2 P3
word2 P1 P4 P5
...

Each line starts with a word, followed by its phonemes separated by whitespace. These phonemes must match what the acoustic model was trained to recognize.

Multiple pronunciations for the same word are possible, and may optionally contain an index:

word P1 P2 P3
word(1) P2 P2 P3
word(2) P3 P2 P3
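
A dictionary in this format is easy to parse. The sketch below collects every pronunciation of each word, stripping the optional (N) index; the function name is illustrative:

from collections import defaultdict

def load_dictionary(path: str) -> dict:
    # word -> list of pronunciations, each a list of phonemes
    pronunciations = defaultdict(list)
    with open(path, encoding="utf-8") as dict_file:
        for line in dict_file:
            parts = line.split()
            if not parts:
                continue
            word, phonemes = parts[0], parts[1:]
            if word.endswith(")") and "(" in word:
                word = word[:word.index("(")]  # "word(2)" -> "word"
            pronunciations[word].append(phonemes)
    return dict(pronunciations)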

A Rhasspy profile will typically contain three dictionaries:

  1. base_dictionary.txt
    • A large, pre-built dictionary with most of the words in a given language
  2. custom_words.txt
    • A small, user-defined dictionary with custom words or pronunciations
  3. dictionary.txt
    • Contains exactly the vocabulary needed for a profile
    • Automatically generated during training

Language Models

Language models must be in plaintext ARPA format.

A Rhasspy profile will typically contain two language models:

  1. base_language_model.txt
    • A large, pre-built language model that summarizes a given language
    • Used when the open_transcription setting is true for the ASR system (e.g., speech_to_text.pocketsphinx.open_transcription)
    • Used during language model mixing
  2. language_model.txt
    • Summarizes the valid voice commands for a profile
    • Automatically generated during training

Grapheme To Phoneme Models

A grapheme-to-phoneme (g2p) model helps guess the pronunciations of words outside of the dictionary. These models are trained on each profile's base_dictionary.txt file using phonetisaurus and saved in the OpenFST binary format.

G2P prediction can also be done using transformer models.

Command Line Tools