About

Rhasspy was created and is currently maintained by Michael Hansen.

Special thanks to:

Supporting Tools

The following tools/libraries help to support Rhasspy:

This project is supported by:

History

Rhasspy was originally inspired by Jasper, an "open source platform for developing always-on, voice-controlled applications". Rhasspy's original architecture (v1) was close to Jasper's, though the two systems handled speech/intent recognition in very different ways.

Jasper

Jasper runs on the Raspberry Pi and is extendable through custom Python modules. It's also highly configurable, featuring multiple speech recognition engines, text to speech systems, and integration with online services (Facebook, Spotify, etc.).

Speech recognition in Jasper is done with pocketsphinx, specifically its keyword search mode. Each user module declares a list of WORDS that Jasper should listen for. The union of all modules' WORDS is listened for at runtime, and transcriptions are passed to each module's isValid function in PRIORITY order. When one returns True, Jasper calls that module's handle function to perform its intended action(s).

# ---------------------
# Example Jasper Module
# ---------------------

import re

# Orders modules in case of a conflict
PRIORITY = 1

# Bag of words for keyword search
WORDS = ["MEANING", "OF", "LIFE"]

# Return True if the transcription is valid for this module
def isValid(text):
    return bool(re.search(r"\bmeaning of life\b", text, re.IGNORECASE))

# Handle the transcription
def handle(text, mic, profile):
    mic.say("It's 42")

Rhasspy v1

The first version of Rhasspy (originally named "wraspy") followed Jasper in its use of pocketsphinx for speech recognition, but with an ARPA language model instead of just keywords. Rhasspy user modules (similar in spirit to Jasper's) provided a set of training sentences that were compiled into a statistical language model using cmuclmtk, a language modeling toolkit.

Inspired by the Markdown-like language used in the rasaNLU training data format, sentences were annotated with extra information to aid post-processing. For example, the sentence "turn on the living room lamp" might be annotated as "turn [on](state) the [living room lamp](name)". While pocketsphinx would recognize only the bare sentence, the Rhasspy user module would receive a pre-processed intent with state and name slots. This greatly simplified user module handling code, since the intent's slots could be used directly instead of each module doing its own text processing up front.

Example part of a Rhasspy v1 user module:

def get_training_phrases(self):
    """Returns a list of annotated training phrases."""
    phrases = []

    # Create an open/closed question for each door
    for door_name in self.doors:
        for state in ["open", "closed"]:
            phrases.append(
                'is the [{0}](location) door [{1}](state)?'.format(door_name, state)
            )

    return phrases

Some limitations of this approach became apparent with use, however:

  • Training sentences with optional words or multiple values for a slot must be constructed in code
  • There is no easy way to get an overview of what voice commands are available
  • Intent handling is baked into each individual module, making it difficult to interact with other IoT systems (e.g., Node-RED)
  • Users unfamiliar with Python cannot extend the system

Rhasspy v1 shared many of the same sub-systems with Jasper, such as pocketsphinx for wake word detection, phonetisaurus for guessing unknown word pronunciations, and MaryTTS for text to speech.

Rhasspy v1.5

To address the limitations of v1, a version of Rhasspy was developed as a set of custom components in Home Assistant, an open source IoT framework for home automation. In this version of Rhasspy (dubbed v1.5 here), users ran Rhasspy as part of their Home Assistant processes, controlling lights, etc. with voice commands.

In contrast to v1, there were no user modules in v1.5. Annotated training sentences were provided in a single Markdown file, and intents were handled directly with the built-in scripting capability of Home Assistant automations. This allowed non-programmers to extend Rhasspy, and dramatically increased the reach of intent handling beyond simple Python functions.

Example Rhasspy v1.5 training sentences:

## intent:ChangeLightState
- turn [on](state) the [living room lamp](name)
- turn [off](state) the [living room lamp](name)

Support for new sub-systems was added in v1.5, specifically the snowboy and Mycroft Precise wake word systems. Some additional capabilities were also introduced, such as the ability to "mix" the language model generated from training sentences with a larger, pre-trained language model (usually generated from books, newspapers, etc.).

While it was met with some interest from the Home Assistant community, Rhasspy v1.5 could not be used in Hass.io, a Docker-based Home Assistant virtual appliance. Additionally, Snips.AI had a great deal of momentum already in this space (offline, Raspberry Pi IoT) for English and French users. With this in mind, Rhasspy pivoted to work as a Hass.io add-on with a greater focus on non-English speakers.

Rhasspy v2

Version 2 of Rhasspy was rewritten from scratch using an actor model, in which each actor runs in its own thread and passes messages to other actors. All sub-systems were represented as stateful actors, handling messages differently depending on their current states. A central Dialogue Manager actor was responsible for creating and configuring sub-actors according to a user's profile, and for responding to requests from the user via a web interface.

(Figure: Rhasspy v2 architecture)

Messages between actors included audio data, requests/responses, errors, and internal state information. Every behavior in Rhasspy v2 was accomplished with the coordination of several actors.
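
As a rough illustration (not Rhasspy's actual code), the sketch below mimics this pattern with plain Python threads and queues: each actor owns an inbox, and a hypothetical Dialogue Manager forwards audio to a speech actor and prints the resulting transcription.

# Illustrative sketch only, not Rhasspy's implementation:
# each actor runs in its own thread and reacts to messages from its inbox.
import queue
import threading
import time

class Actor(threading.Thread):
    def __init__(self):
        super().__init__(daemon=True)
        self.inbox = queue.Queue()

    def send(self, message, sender=None):
        self.inbox.put((message, sender))

    def run(self):
        while True:
            message, sender = self.inbox.get()
            self.on_message(message, sender)

class SpeechActor(Actor):
    """Stands in for a speech-to-text sub-system."""
    def on_message(self, message, sender):
        if message.get("type") == "audio":
            # Pretend the audio was transcribed and reply to the sender
            sender.send({"type": "transcription", "text": "turn on the lamp"}, self)

class DialogueManager(Actor):
    """Creates sub-actors and routes messages between them."""
    def __init__(self):
        super().__init__()
        self.speech = SpeechActor()
        self.speech.start()

    def on_message(self, message, sender):
        if message.get("type") == "audio":
            self.speech.send(message, self)
        elif message.get("type") == "transcription":
            print("Recognized:", message["text"])  # next step: intent recognition

manager = DialogueManager()
manager.start()
manager.send({"type": "audio", "data": b"fake audio bytes"})
time.sleep(0.5)  # give the daemon actor threads time to process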

The notion of a profile, borrowed originally from Jasper, was extended in v2 to allow for different languages. As of August 2019, Rhasspy v2 supported 13 languages (with varying degrees of success). Compatible pocketsphinx models were available for many of the desired languages, but it eventually became necessary to add support for Kaldi, a speech recognition toolkit from Johns Hopkins. With the Kaldi acoustic models released for the Montreal Forced Aligner, Rhasspy gained access to many languages that are not commonly supported elsewhere.

A major change from v1.5 was the introduction of sentences.ini, a new format for specifying training sentences. This format uses simplified JSGF grammars to concisely describe sentences with optional words, alternative clauses, and re-usable rules. These sentence templates are grouped into ini-style blocks, each of which represents an intent.

[ChangeLightState]
states = (on | off)
turn (<states>){state} the (living room lamp){name}

During the training process, Rhasspy v2 generated all possible annotated sentences from sentences.ini, and used them to train both a speech and intent recognizer. Transcriptions from the speech recognizer were fed directly into the intent recognizer, which had been trained to receive them!
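
For example, the ChangeLightState block above expands to exactly two annotated sentences (shown here in the same Markdown-style annotation used earlier):

turn [on](state) the [living room lamp](name)
turn [off](state) the [living room lamp](name)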

(Figure: Rhasspy v2 training sentences)

Besides the addition of Kaldi, Rhasspy v2 included support for multiple intent recognizers and basic integration with Snips.ai via the Hermes protocol. This MQTT-based protocol allowed Rhasspy to receive remote microphone input, play sounds and speak text remotely, and be woken up by a Snips.ai server. Because of Rhasspy's extended language support, this made it possible for Snips.ai users to swap out their speech-to-text module for Rhasspy while keeping the rest of their set-up intact.
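
As a rough sketch (the topic layout and payload fields follow the Hermes convention, but exact names may vary by version), an external client could listen for recognized intents over MQTT like this:

# Hedged sketch: subscribe to intents published over the Hermes MQTT protocol.
# Assumes a broker on localhost and the conventional hermes/intent/<intentName> topics;
# callback signatures follow the paho-mqtt 1.x API.
import json
import paho.mqtt.client as mqtt

def on_connect(client, userdata, flags, rc):
    client.subscribe("hermes/intent/#")  # all recognized intents

def on_message(client, userdata, msg):
    payload = json.loads(msg.payload)
    print("Intent:", msg.topic.split("/")[-1])
    print("Input text:", payload.get("input"))

client = mqtt.Client()
client.on_connect = on_connect
client.on_message = on_message
client.connect("localhost", 1883)
client.loop_forever()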

Through its REST API and a websocket connection, Rhasspy was also able to interact directly with Node-RED, allowing users to create custom flows graphically. These flows could respond to recognized intents from Rhasspy, further extending Rhasspy beyond only devices that Home Assistant could control.
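
For example (assuming the usual default web port 12101 and the /api/text-to-intent endpoint, which may differ per installation), a script or Node-RED flow could post text and receive the recognized intent as JSON:

# Hedged sketch: send text to Rhasspy's HTTP API and print the recognized intent.
# The port and endpoint here are assumptions based on common defaults.
import requests

response = requests.post(
    "http://localhost:12101/api/text-to-intent",
    data="turn on the living room lamp",
)
print(response.json())  # intent name and slots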

Rhasspy v2.4 and the Death of Snips

Generating all possible training sentences in Rhasspy v2 was done using a finite state transducer. While fast for small assistants, this approach did not scale. Thanks to the opengrm toolkit, the process of generating sentences could be bypassed entirely by going straight to a language model! Overnight, Rhasspy was suddenly capable of training on millions of possible voice commands in a few seconds.

Rhasspy 2.4 added support for more languages, speech systems, and intent recognizers. It continues to be maintained and can be installed via Docker, Hass.io, or within a virtual environment. While this version of Rhasspy is stable and usable for many users, its architecture is inherently monolithic. As a single Python application, contributors must understand a great deal about Rhasspy's internals before making any modifications. Additionally, installing Rhasspy with only a subset of its components is difficult and fragile (a single global import of an optional dependency can break everything). To broaden the base of potential contributors and make installation more modular, a new architecture was needed.

When Snips.AI announced in 2019 that it had been bought by Sonos, development plans for Rhasspy 2.5 pivoted to fill the gap. By splitting Rhasspy into multiple services that communicate over an MQTT broker using an extended version of the Hermes protocol, Rhasspy 2.5 aims to:

  • Maintain compatibility with Rhasspy 2.4 profiles and APIs
  • Enable contributors to write Rhasspy apps/skills in any language that can speak MQTT and JSON
  • Provide a migration path for Snips skill authors to move their Hermes-based services to Rhasspy
  • Allow for individual Rhasspy components to be developed independently in separate repositories
  • Make modular installations possible, with any combination of individual services