Text to Speech

After your voice command has been handled, it's common to produce speech as a response back to the user. Rhasspy supports several text to speech systems which, importantly, can be played through any of the audio output systems.

The following table summarizes language support for the various text to speech systems:

| System  | en | de | es | fr | it | nl | ru | el | hi | zh | vi | pt | ca |
| ------- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| espeak  | ✓  | ✓  | ✓  | ✓  | ✓  | ✓  | ✓  | ✓  | ✓  | ✓  | ✓  | ✓  | ✓  |
| flite   | ✓  |    |    |    |    |    |    |    |    |    |    |    |    |
| picotts | ✓  | ✓  | ✓  | ✓  | ✓  |    |    |    |    |    |    |    |    |
| marytts | ✓  | ✓  |    | ✓  | ✓  |    | ✓  |    |    |    |    |    |    |
| wavenet | ✓  | ✓  | ✓  | ✓  | ✓  | ✓  | ✓  | ✓  | ✓  | ✓  | ✓  | ✓  | ✓  |

eSpeak

Uses eSpeak to speak sentences. This is the default text to speech system and, while it sounds robotic, has the widest support for different languages.

Add to your profile:

"text_to_speech": {
  "system": "espeak",
  "espeak": {
    "voice": "en"
  }
}

Remove the voice option to have espeak use your profile's language automatically.

You may also pass additional arguments to the espeak command. For example,

"text_to_speech": {
  "system": "espeak",
  "espeak": {
    "arguments": ["-s", "80"]
  }
}

will speak the sentence more slowly.
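Conceptually, this configuration maps straight onto an espeak command line, with `--stdout` used to capture the WAV audio. A minimal sketch of that mapping (the `espeak_command` helper is illustrative, not Rhasspy's actual code):

```python
def espeak_command(profile, sentence):
    """Build an espeak argv from a profile's text_to_speech.espeak section."""
    opts = profile.get("text_to_speech", {}).get("espeak", {})
    cmd = ["espeak", "--stdout"]  # --stdout writes WAV data to standard output
    if "voice" in opts:
        cmd.extend(["-v", opts["voice"]])  # e.g. "en"
    cmd.extend(opts.get("arguments", []))  # extra flags such as ["-s", "80"]
    cmd.append(sentence)
    return cmd

profile = {"text_to_speech": {"espeak": {"voice": "en", "arguments": ["-s", "80"]}}}
print(espeak_command(profile, "hello world"))
# → ['espeak', '--stdout', '-v', 'en', '-s', '80', 'hello world']
```

Omitting the voice, as described above, simply drops the `-v` flag and lets espeak pick a default.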

See rhasspy.tts.EspeakSentenceSpeaker for more details.

Flite

Uses FestVox's flite for speech. Sounds better than espeak in most cases, but only supports English out of the box.

Add to your profile:

"text_to_speech": {
  "system": "flite",
  "flite": {
    "voice": "kal16"
  }
}

Some other included voices are rms, slt, and awb.
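As with espeak, the profile settings translate into a command-line invocation; flite writes its WAV output to the file named by `-o`. A sketch (helper names are illustrative, and running `speak_flite` requires flite to be installed):

```python
import subprocess
import tempfile

def flite_args(sentence, voice="kal16", wav_path="out.wav"):
    """Command line corresponding to the flite profile settings."""
    return ["flite", "-voice", voice, "-t", sentence, "-o", wav_path]

def speak_flite(sentence, voice="kal16"):
    """Synthesize to a temporary file and return the WAV bytes (requires flite)."""
    with tempfile.NamedTemporaryFile(suffix=".wav") as wav:
        subprocess.run(flite_args(sentence, voice, wav.name), check=True)
        return wav.read()
```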

See rhasspy.tts.FliteSentenceSpeaker for details.

PicoTTS

Uses SVOX's picotts for text to speech. Sounds a bit better (to me) than flite or espeak.

Included languages are en-US, en-GB, de-DE, es-ES, fr-FR and it-IT.

Add to your profile:

"text_to_speech": {
  "system": "picotts",
  "picotts": {
    "language": "en-US"
  }
}
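The `pico2wave` command behind this system differs from espeak in one respect: it cannot stream to standard output and only writes to a `.wav` file, selected with `-w`, with the language given by `-l`. A sketch of the invocation (the helper name is illustrative):

```python
def pico2wave_args(sentence, language="en-US", wav_path="out.wav"):
    # pico2wave cannot stream to stdout; it only writes to a .wav file (-w)
    return ["pico2wave", "-l", language, "-w", wav_path, sentence]

print(pico2wave_args("hola", language="es-ES"))
# → ['pico2wave', '-l', 'es-ES', '-w', 'out.wav', 'hola']
```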

See rhasspy.tts.PicoTTSSentenceSpeaker for details.

MaryTTS

Uses a remote MaryTTS web server. Supported languages include German, British and American English, French, Italian, Luxembourgish, Russian, Swedish, Telugu, and Turkish. A MaryTTS Docker image is available, though only the default voice is included.

Add to your profile:

"text_to_speech": {
  "system": "marytts",
  "marytts": {
    "url": "http://localhost:59125",
    "voice": "cmu-slt",
    "locale": "en-US"
  }
}

To run the Docker image, simply execute:

docker run -it -p 59125:59125 synesthesiam/marytts:5.2

and visit http://localhost:59125 after it starts.

If you're using docker compose, add the following to your docker-compose.yml file:

marytts:
  image: synesthesiam/marytts:5.2
  restart: unless-stopped
  ports:
    - "59125:59125"

When using docker-compose, set marytts.url in your profile to be http://marytts:59125. This will allow rhasspy, from within its docker container, to resolve and connect to marytts (its sibling container).
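Under the hood, MaryTTS synthesizes speech over plain HTTP: a GET request to its /process endpoint with the text, locale, and voice as query parameters returns WAV data. A standard-library sketch (the parameter names follow the MaryTTS HTTP interface; the helper names are illustrative):

```python
from urllib.parse import urlencode
from urllib.request import urlopen

def marytts_url(text, base_url="http://localhost:59125", voice="cmu-slt", locale="en-US"):
    """Build a MaryTTS /process request URL that asks for WAV audio."""
    params = {
        "INPUT_TEXT": text,
        "INPUT_TYPE": "TEXT",
        "OUTPUT_TYPE": "AUDIO",
        "AUDIO": "WAVE",
        "LOCALE": locale,
        "VOICE": voice,
    }
    return base_url + "/process?" + urlencode(params)

def speak_marytts(text, **kwargs):
    """Fetch the synthesized WAV bytes (requires a running MaryTTS server)."""
    with urlopen(marytts_url(text, **kwargs)) as response:
        return response.read()
```

With the docker-compose setup above, you would pass `base_url="http://marytts:59125"` instead.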

Adding Voices

For more English voices, run the following commands in a Bash shell:

mkdir -p marytts-5.2/download
for voice in dfki-prudence dfki-poppy dfki-obadiah dfki-spike cmu-bdl cmu-rms; do
  wget -O "marytts-5.2/download/voice-${voice}-hsmm-5.2.zip" \
    "https://github.com/marytts/voice-${voice}-hsmm/releases/download/v5.2/voice-${voice}-hsmm-5.2.zip"
  unzip -d marytts-5.2 "marytts-5.2/download/voice-${voice}-hsmm-5.2.zip"
done

Now run the Docker image again with the following command (in the same directory):

voice=dfki-prudence
docker run -it -p 59125:59125 -v "$(pwd)/marytts-5.2/lib/voice-${voice}-hsmm-5.2.jar:/marytts/lib/voice-${voice}-hsmm-5.2.jar" synesthesiam/marytts:5.2

Change the first line to select the voice you'd like to add. It's not recommended to link in all of the voices at once, since MaryTTS seems to load them all into memory and overwhelm the RAM of a Raspberry Pi.

See rhasspy.tts.MaryTTSSentenceSpeaker for details.

Audio Effects

MaryTTS is capable of applying several audio effects when producing speech. See the web interface at http://localhost:59125 to experiment with this.

To use these effects within Rhasspy, set text_to_speech.marytts.effects within your profile, for example:

"text_to_speech": {
   "system": "marytts",
   "marytts": {
        "url": "http://localhost:59125",
        "effects": {
            "effect_Volume_selected": "on",
            "effect_Volume_parameters": "amount=0.9;",
            "effect_TractScaler_selected": "on",
            "effect_TractScaler_parameters": "amount:1.2;",
            "effect_F0Add_selected": "on",
            "effect_F0Add_parameters": "f0Add:-50.0;",
            "effect_Robot_selected": "on",
            "effect_Robot_parameters": "amount=50.0;"
        }
    }
}

You can determine the parameter names by examining the web interface at http://localhost:59125 with your browser's Developer Tools.
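Each effect setting is sent to MaryTTS as just another query parameter on the /process request. A sketch of how the effects block merges into the request parameters (illustrative; note that urlencode escapes the `=` and `;` inside the parameter values):

```python
from urllib.parse import urlencode

effects = {
    "effect_Volume_selected": "on",
    "effect_Volume_parameters": "amount=0.9;",
    "effect_Robot_selected": "on",
    "effect_Robot_parameters": "amount=50.0;",
}

params = {
    "INPUT_TEXT": "hello world",
    "INPUT_TYPE": "TEXT",
    "OUTPUT_TYPE": "AUDIO",
    "AUDIO": "WAVE",
    "LOCALE": "en-US",
}
params.update(effects)  # each effect is just another form field
query = urlencode(params)
print(query)
```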

Google WaveNet

Uses Google's WaveNet text to speech system. This requires a Google account and an internet connection to function. Rhasspy will cache WAV files for previously spoken sentences, but you will be sending Google information for every new sentence that Rhasspy speaks.

Add to your profile:

"text_to_speech": {
  "system": "wavenet",
  "wavenet": {
    "cache_dir": "tts/googlewavenet/cache",
    "credentials_json": "tts/googlewavenet/credentials.json",
    "gender": "FEMALE",
    "language_code": "en-US",
    "sample_rate": 22050,
    "url": "https://texttospeech.googleapis.com/v1/text:synthesize",
    "voice": "Wavenet-C",
    "fallback_tts": "espeak"
  }
}

Before using WaveNet, you must set up a Google cloud account and generate a JSON credentials file. Save the JSON credentials file to wherever wavenet.credentials_json points to in your profile directory. You may also need to visit your Google cloud account settings and enable the text-to-speech API.

WAV files of each sentence are cached in wavenet.cache_dir in your profile directory. Sentences are cached based on their text and the gender, voice, language_code, and sample_rate of the wavenet system; changing any of these settings will cause Rhasspy to call the Google API again, even for previously spoken sentences.

If there are problems using the Google API (e.g., your internet connection fails), Rhasspy will switch over to the text to speech system given in wavenet.fallback_tts. The settings for the fallback system will be loaded from your profile as expected.
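For reference, here is a sketch of the request body and cache-key logic described above. The JSON shape follows Google's public text:synthesize API; the voice-name construction, helper names, and the exact cache-key hashing are illustrative assumptions, not Rhasspy's actual code:

```python
import hashlib
import json

def wavenet_request(text, cfg):
    """JSON body for the text:synthesize endpoint configured in the profile."""
    return {
        "input": {"text": text},
        "voice": {
            "languageCode": cfg["language_code"],
            # Google voice names look like "en-US-Wavenet-C" (assumption)
            "name": cfg["language_code"] + "-" + cfg["voice"],
            "ssmlGender": cfg["gender"],
        },
        "audioConfig": {
            "audioEncoding": "LINEAR16",  # 16-bit PCM, i.e. WAV payload
            "sampleRateHertz": cfg["sample_rate"],
        },
    }

def cache_key(text, cfg):
    """Cache on text + gender + voice + language_code + sample_rate."""
    raw = "|".join([text, cfg["gender"], cfg["voice"],
                    cfg["language_code"], str(cfg["sample_rate"])])
    return hashlib.md5(raw.encode()).hexdigest() + ".wav"

cfg = {"gender": "FEMALE", "language_code": "en-US",
       "sample_rate": 22050, "voice": "Wavenet-C"}
print(json.dumps(wavenet_request("hello", cfg), indent=2))
print(cache_key("hello", cfg))
```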

Contributed by Romkabouter.

See rhasspy.tts.GoogleWaveNetSentenceSpeaker for details.

Home Assistant TTS Platform

Use a TTS platform on your Home Assistant server.

Add to your profile:

"text_to_speech": {
  "system": "hass_tts",
  "hass_tts": {
      "platform": "..."
  }
}

The settings from your profile's home_assistant section are automatically used (URL, access token, etc.).

See rhasspy.tts.HomeAssistantSentenceSpeaker for details.

Command

You can easily extend Rhasspy with your own external text to speech system. When a sentence needs to be spoken, Rhasspy calls your custom program with the text on standard input. Your program should return the corresponding WAV data on standard output.

Add to your profile:

"text_to_speech": {
  "system": "command",
  "command": {
      "program": "/path/to/program",
      "arguments": []
  }
}

For compatibility with other services and Rhasspy components, it's best to return 16 kHz, 16-bit mono WAV data.
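A toy skeleton of such a program: it reads the sentence from standard input and writes a valid 16 kHz, 16-bit mono WAV to standard output. It emits silence instead of real speech, purely to illustrate the stdin/stdout contract; a real program would synthesize audio in `make_wav`:

```python
#!/usr/bin/env python3
"""Toy 'command' TTS program: text in on stdin, WAV bytes out on stdout."""
import io
import sys
import wave

def make_wav(text, sample_rate=16000):
    """Return 16-bit mono WAV bytes: 50 ms of silence per word (placeholder)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 16-bit samples
        wav.setframerate(sample_rate)  # 16 kHz
        n_frames = int(0.05 * sample_rate) * max(1, len(text.split()))
        wav.writeframes(b"\x00\x00" * n_frames)
    return buf.getvalue()

if __name__ == "__main__" and not sys.stdin.isatty():
    sys.stdout.buffer.write(make_wav(sys.stdin.read()))
```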

See rhasspy.tts.CommandSentenceSpeaker for details.

Dummy

Disables text to speech.

Add to your profile:

"text_to_speech": {
  "system": "dummy"
}

See rhasspy.tts.DummySentenceSpeaker for details.