Text to Speech

After your voice command has been handled, it's common to speak a response back to the user. Rhasspy supports several text to speech systems, all of which can be played through any of the audio output systems.

The following table summarizes language support for the various text to speech systems:

Language	espeak	flite	picotts	nanotts	marytts	opentts	wavenet	larynx
ca	✓					✓
cs	✓					✓	✓
en	✓	✓	✓	✓	✓	✓	✓	✓
de	✓		✓	✓	✓	✓	✓	✓
fr	✓		✓	✓	✓	✓	✓	✓
el	✓					✓
es	✓		✓	✓		✓	✓	✓
hi	✓	✓				✓	✓
it	✓		✓	✓	✓	✓	✓	✓
nl	✓					✓	✓	✓
pt	✓					✓	✓
pl	✓					✓	✓
ru	✓				✓	✓	✓	✓
sv	✓					✓	✓	✓
vi	✓					✓
zh	✓					✓	✓

eSpeak

Uses eSpeak to speak sentences. This is the default text to speech system and, while it sounds robotic, has the widest support for different languages.

Add to your profile:

"text_to_speech": {
  "system": "espeak",
  "espeak": {
    "voice": "en"
  }
}

Remove the voice option to have espeak use your profile's language automatically.

You may also pass additional arguments to the espeak command. For example,

"text_to_speech": {
  "system": "espeak",
  "espeak": {
    "arguments": ["-s", "80"]
  }
}

will speak the sentence more slowly (-s sets the speed in words per minute; eSpeak's default is 175).
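As a rough illustration of how the voice and arguments options could end up on a command line, the sketch below assembles an espeak invocation from profile-style settings. The helper name build_espeak_command is hypothetical, and this is not necessarily how rhasspy-tts-cli-hermes builds its command; only the espeak flags themselves (-v to select a voice, --stdout to write WAV data to standard out) are taken from espeak.

```python
from typing import List, Optional


def build_espeak_command(
    text: str,
    voice: Optional[str] = None,
    arguments: Optional[List[str]] = None,
) -> List[str]:
    """Assemble an espeak command line (hypothetical helper, for illustration)."""
    command = ["espeak", "--stdout"]  # --stdout: emit WAV data instead of playing it
    if voice:
        # Omitted when the profile has no "voice" entry
        command += ["-v", voice]
    # Extra arguments are passed through untouched, e.g. ["-s", "80"]
    command += list(arguments or [])
    command.append(text)
    return command


print(build_espeak_command("Hello world", voice="en", arguments=["-s", "80"]))
```

Passing the command as a list (rather than one shell string) avoids quoting problems when the sentence contains spaces or punctuation.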

Implemented by rhasspy-tts-cli-hermes

Flite

Uses FestVox's flite for speech. Sounds better than espeak in most cases, but supports fewer languages out of the box.

Add to your profile:

"text_to_speech": {
  "system": "flite",
  "flite": {
    "voice": "kal16"
  }
}

Some other included voices are rms, slt, and awb.

Implemented by rhasspy-tts-cli-hermes

PicoTTS

Uses SVOX's picotts for text to speech. Sounds a bit better (to me) than flite or espeak.

Included languages are en-US, en-GB, de-DE, es-ES, fr-FR and it-IT.

Add to your profile:

"text_to_speech": {
  "system": "picotts",
  "picotts": {
    "language": "en-US"
  }
}

Implemented by rhasspy-tts-cli-hermes

NanoTTS

Uses an improved fork of SVOX's picoTTS for text to speech.

Included languages are en-US, en-GB, de-DE, es-ES, fr-FR and it-IT.

Add to your profile:

"text_to_speech": {
  "system": "nanotts",
  "nanotts": {
    "language": "en-GB"
  }
}

Implemented by rhasspy-tts-cli-hermes

MaryTTS

Uses a remote MaryTTS web server. Supported languages include German, British and American English, French, Italian, Luxembourgish, Russian, Swedish, Telugu, and Turkish. A MaryTTS Docker image is available with many voices included.

Add to your profile:

"text_to_speech": {
  "system": "marytts",
  "marytts": {
    "url": "http://localhost:59125/process",
    "voice": "cmu-slt",
    "locale": "en-US"
  }
}

To run the Docker image, simply execute:

$ docker run -it -p 59125:59125 synesthesiam/marytts:5.2

and visit http://localhost:59125 after it starts.

If you're using docker compose, add the following to your docker-compose.yml file:

services:
  marytts:
    image: synesthesiam/marytts:5.2
    ports:
      - 59125:59125

When using docker-compose, set marytts.url in your profile to be http://marytts:59125/process. This will allow Rhasspy to resolve the address of its sibling container.

To save memory when running on a Raspberry Pi, consider restricting the loaded voices by passing one or more --voice <VOICE> command-line arguments to the Docker container.

Audio Effects

MaryTTS is capable of applying several audio effects when producing speech. See the web interface at http://localhost:59125 to experiment with this.

To use these effects within Rhasspy, set text_to_speech.marytts.effects within your profile, for example:

"text_to_speech": {
   "system": "marytts",
   "marytts": {
        "url": "http://localhost:59125/process",
        "effects": {
            "effect_Volume_selected": "on",
            "effect_Volume_parameters": "amount:0.9;",
            "effect_TractScaler_selected": "on",
            "effect_TractScaler_parameters": "amount:1.2;",
            "effect_F0Add_selected": "on",
            "effect_F0Add_parameters": "f0Add:-50.0;",
            "effect_Robot_selected": "on",
            "effect_Robot_parameters": "amount:50.0;"
        }
    }
}

You can determine the names of the parameters by examining the web interface http://localhost:59125 using your browser's Developer Tools.
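To experiment outside of Rhasspy, the effect settings can be sent to MaryTTS as ordinary query parameters. The sketch below builds such a URL. The helper name build_marytts_url is hypothetical, and the fixed parameter names (INPUT_TEXT, INPUT_TYPE, OUTPUT_TYPE, AUDIO, VOICE, LOCALE) reflect my understanding of the MaryTTS HTTP interface rather than anything stated on this page, so check them against what your browser's Developer Tools show before relying on them.

```python
from typing import Dict, Optional
from urllib.parse import urlencode


def build_marytts_url(
    process_url: str,
    text: str,
    voice: str,
    locale: str,
    effects: Optional[Dict[str, str]] = None,
) -> str:
    """Build a MaryTTS /process URL (hypothetical helper; parameter names are assumptions)."""
    params = {
        "INPUT_TEXT": text,
        "INPUT_TYPE": "TEXT",
        "OUTPUT_TYPE": "AUDIO",
        "AUDIO": "WAVE",
        "VOICE": voice,
        "LOCALE": locale,
    }
    # Effect settings are merged in as additional query parameters
    params.update(effects or {})
    return process_url + "?" + urlencode(params)


url = build_marytts_url(
    "http://localhost:59125/process",
    "Hello world",
    voice="cmu-slt",
    locale="en-US",
    effects={
        "effect_Robot_selected": "on",
        "effect_Robot_parameters": "amount:50.0;",
    },
)
print(url)
```

urlencode percent-escapes the colon and semicolon in the effect parameters, so they survive the trip to the server intact.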

Implemented by rhasspy-tts-cli-hermes

OpenTTS

Uses a remote OpenTTS web server. Supports many different text to speech systems, including Mozilla TTS.

Add to your profile:

"text_to_speech": {
  "system": "opentts",
  "opentts": {
    "url": "http://localhost:5500",
    "voice": "nanotts:en-GB"
  }
}

Voices in OpenTTS have the format <system>:<voice> where <system> is the text to speech system (e.g., espeak, flite, festival, nanotts, marytts, mozillatts) and <voice> is the id of the voice within that system. Visit the test page at http://localhost:5500 to see and test available voices.
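The <system>:<voice> format can be checked before it goes into a profile. This is a minimal sketch with a hypothetical helper name (split_opentts_voice); it splits on the first colon only, on the assumption that a voice id may itself contain further colons.

```python
from typing import Tuple


def split_opentts_voice(voice: str) -> Tuple[str, str]:
    """Split "<system>:<voice>" into its two parts (hypothetical helper)."""
    system, separator, voice_id = voice.partition(":")
    if not separator or not system or not voice_id:
        raise ValueError(f"Expected <system>:<voice>, got {voice!r}")
    return system, voice_id


print(split_opentts_voice("nanotts:en-GB"))
```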

To run the Docker image, simply execute:

$ docker run -it -p 5500:5500 synesthesiam/opentts

and visit http://localhost:5500 after it starts.

If you're using docker compose, add the following to your docker-compose.yml file:

services:
  opentts:
    image: synesthesiam/opentts
    ports:
      - 5500:5500

To run the full suite of text to speech systems offered by OpenTTS, add:

services:
  opentts:
    image: synesthesiam/opentts
    ports:
      - 5500:5500
    command: --marytts-url http://marytts:59125 --mozillatts-url http://mozillatts:5002
    tty: true
  marytts:
    image: synesthesiam/marytts:5.2
    tty: true
  mozillatts:
    image: synesthesiam/mozilla-tts
    tty: true

NOTE: Mozilla TTS will not run on a Raspberry Pi.

When using docker-compose, set opentts.url in your profile to be http://opentts:5500. This will allow Rhasspy to resolve the address of its sibling container.

Implemented by rhasspy-tts-cli-hermes

Google WaveNet

Uses Google's WaveNet text to speech system. This requires a Google account and an internet connection to function. Rhasspy will cache WAV files for previously spoken sentences, but you will be sending Google information for every new sentence that Rhasspy speaks.

Add to your profile:

"text_to_speech": {
  "system": "wavenet",
  "wavenet": {
    "cache_dir": "tts/googlewavenet/cache",
    "credentials_json": "tts/googlewavenet/credentials.json",
    "gender": "FEMALE",
    "language_code": "en-US",
    "sample_rate": 22050,
    "url": "https://texttospeech.googleapis.com/v1/text:synthesize",
    "voice": "Wavenet-C"
  }
}

Before using WaveNet, you must set up a Google cloud account and generate a JSON credentials file. Save the JSON credentials file to wherever wavenet.credentials_json points to in your profile directory. You may also need to visit your Google cloud account settings and enable the text-to-speech API.

WAV files of each sentence are cached in wavenet.cache_dir in your profile directory. Sentences are cached based on their text and the gender, voice, language_code, and sample_rate of the wavenet system. Changing any of these produces a cache miss, so Rhasspy will call the Google API again for that sentence.
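The caching behavior can be pictured as hashing the sentence together with the four settings. The sketch below is only an illustration of that idea: the function name cached_wav_path, the use of SHA-256, and the file naming are all assumptions, and the actual key derivation in rhasspy-tts-wavenet-hermes may differ.

```python
import hashlib
import json
from pathlib import Path


def cached_wav_path(
    cache_dir: str,
    text: str,
    gender: str,
    voice: str,
    language_code: str,
    sample_rate: int,
) -> Path:
    """Derive a cache file path from the sentence and voice settings (illustrative only)."""
    key_fields = {
        "text": text,
        "gender": gender,
        "voice": voice,
        "language_code": language_code,
        "sample_rate": sample_rate,
    }
    # sort_keys keeps the serialized form stable regardless of dict ordering
    key_json = json.dumps(key_fields, sort_keys=True, ensure_ascii=False)
    digest = hashlib.sha256(key_json.encode("utf-8")).hexdigest()
    return Path(cache_dir) / f"{digest}.wav"


first = cached_wav_path("tts/googlewavenet/cache", "Hello", "FEMALE", "Wavenet-C", "en-US", 22050)
second = cached_wav_path("tts/googlewavenet/cache", "Hello", "FEMALE", "Wavenet-D", "en-US", 22050)
print(first != second)
```

Because every setting feeds into the hash, switching the voice yields a different file name, which is why a settings change leads to a fresh API request.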

Contributed by Romkabouter.

Implemented by rhasspy-tts-wavenet-hermes

Larynx

A text to speech system that uses pre-trained ONNX models (currently GlowTTS and Hi-Fi GAN). Text to phoneme conversion is done with gruut.

Listen to voice samples here

NOTE: Performance of Larynx on the Raspberry Pi is significantly better on the 64-bit version of Raspberry Pi OS.

Add to your profile:

"text_to_speech": {
  "system": "larynx",
  "larynx": {
    "cache_dir": "tts/larynx/cache",
    "default_voice": "myvoice",
    "vocoder": "universal_large"
  }
}

See below for a list of available voices and vocoders.

Larynx Voices

All voices were trained from public audio datasets. See here for license information.

English (en-us, 20 voices)
- blizzard_fls (F, accent)
- cmu_aew (M)
- cmu_ahw (M)
- cmu_aup (M)
- cmu_bdl (M)
- cmu_clb (F)
- cmu_eey (F)
- cmu_fem (M)
- cmu_jmk (M)
- cmu_ksp (M, accent)
- cmu_ljm (F)
- cmu_lnh (F)
- cmu_rms (M)
- cmu_rxr (M)
- cmu_slp (F, accent)
- cmu_slt (F)
- ek (F, accent)
- harvard (F, accent)
- kathleen (F)
- ljspeech (F)
German (de-de, 1 voice)
- thorsten (M)
French (fr-fr, 3 voices)
- gilles_le_blanc (M)
- siwis (F)
- tom (M)
Spanish (es-es, 2 voices)
- carlfm (M)
- karen_savage (F)
Dutch (nl, 3 voices)
- bart_de_leeuw (M)
- flemishguy (M)
- rdh (M)
Italian (it-it, 2 voices)
- lisa (F)
- riccardo_fasol (M)
Swedish (sv-se, 1 voice)
- talesyntese (M)
Russian (ru-ru, 3 voices)
- hajdurova (F)
- nikolaev (M)
- minaev (M)

Larynx Vocoders

A vocoder transforms the output of each voice (mel spectrograms) into waveform audio.

universal_large
- High quality, but slow on the Raspberry Pi
vctk_medium
- Medium quality
vctk_small
- Lower quality, but fast on the Raspberry Pi

Please contact a Rhasspy developer if you'd like to volunteer your voice!

Implemented by rhasspy-tts-larynx-hermes

Home Assistant TTS Platform

Not supported yet in 2.5!

Use a TTS platform on your Home Assistant server.

Add to your profile:

"text_to_speech": {
  "system": "hass_tts",
  "hass_tts": {
      "platform": "..."
  }
}

The settings from your profile's home_assistant section are automatically used (URL, access token, etc.).

Remote

Simply POSTs the sentence to be spoken to an HTTP endpoint. Expects WAV audio back with a Content-Type: audio/wav header.

Add to your profile:

"text_to_speech": {
  "system": "remote",
  "remote": {
      "url": "http://tts-server/endpoint"
  }
}

If the lang property of hermes/tts/say is not empty, a query parameter language is added to the URL (e.g., http://tts-server/endpoint?language=<lang>).
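For testing the remote system without a real synthesizer, a tiny endpoint can be built with Python's standard library. The sketch below is a stand-in under stated assumptions: it assumes the sentence arrives as UTF-8 text in the POST body, it returns silence instead of speech, and the names silent_wav, language_from_path, TTSHandler, and run_server are all hypothetical.

```python
import io
import wave
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def silent_wav(seconds: float = 0.5, sample_rate: int = 16000) -> bytes:
    """Return 16-bit mono WAV data containing only silence (placeholder for real TTS)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buffer.getvalue()


def language_from_path(path: str) -> str:
    """Extract the optional ?language=<lang> query parameter, or "" if absent."""
    return parse_qs(urlparse(path).query).get("language", [""])[0]


class TTSHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        # Assumption: the request body is the sentence as UTF-8 text
        sentence = self.rfile.read(length).decode("utf-8")
        language = language_from_path(self.path)
        self.log_message("sentence=%r language=%r", sentence, language)

        wav_bytes = silent_wav()  # replace with a real synthesizer
        self.send_response(200)
        self.send_header("Content-Type", "audio/wav")
        self.send_header("Content-Length", str(len(wav_bytes)))
        self.end_headers()
        self.wfile.write(wav_bytes)


def run_server(port: int = 8000) -> None:
    """Serve until interrupted; call this yourself to start the endpoint."""
    HTTPServer(("0.0.0.0", port), TTSHandler).serve_forever()
```

Calling run_server() and setting remote.url to the matching address (for example http://localhost:8000/endpoint) should be enough to see requests arrive in the log.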

Implemented by rhasspy-remote-http-hermes

Command

You can extend Rhasspy easily with your own external text to speech system. When a sentence needs to be spoken, Rhasspy will call your custom program with the text given on standard in. Your program should return the corresponding WAV data on standard out.

Add to your profile:

"text_to_speech": {
  "system": "command",
  "command": {
      "say_program": "/path/to/say/program",
      "say_arguments": [],
      "voices_program": "/path/to/voices/program",
      "voices_arguments": [],
      "language": ""
  }
}

The text_to_speech.command.say_program is executed each time a text to speech request is made with arguments from text_to_speech.command.say_arguments. The command is run through Python's str.format function with a lang keyword argument set to text_to_speech.command.language (so {lang} is replaced). Rhasspy passes the sentence as the first command-line argument to the program and expects WAV audio back on standard out.
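The {lang} substitution can be sketched as follows. The helper name expand_say_command is hypothetical and this is not Rhasspy's actual code; it only shows the effect of str.format with a lang keyword argument on a program path and its arguments.

```python
from typing import List


def expand_say_command(say_program: str, say_arguments: List[str], language: str) -> List[str]:
    """Replace {lang} in the program and its arguments (illustrative helper)."""
    return [part.format(lang=language) for part in [say_program, *say_arguments]]


print(expand_say_command("/path/to/say/program", ["--language", "{lang}"], "en"))
```

One consequence of using str.format is that any other literal braces in the arguments need to be doubled ({{ and }}), otherwise formatting raises an error.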

If provided, the text_to_speech.command.voices_program will be executed when a rhasspy/tts/getVoices message is received. The program should return a listing of available voices, one per line with the first whitespace-separated column being a unique voice ID.
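A say program only has to turn a sentence into WAV bytes on standard out. The skeleton below is a sketch, not a reference implementation: it produces silence rather than speech, and the function names are hypothetical. Since this section mentions both standard in and the first command-line argument as the way the sentence is delivered, the sketch accepts either.

```python
import io
import sys
import wave
from typing import List, TextIO


def read_sentence(argv: List[str], stdin: TextIO) -> str:
    """Take the sentence from the first argument if present, else from standard in."""
    if len(argv) > 1:
        return argv[1]
    return stdin.read().strip()


def synthesize(sentence: str, sample_rate: int = 16000) -> bytes:
    """Placeholder synthesizer: 0.1 s of silence per word, as 16-bit mono WAV."""
    num_frames = int(0.1 * sample_rate) * max(1, len(sentence.split()))
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * num_frames)
    return buffer.getvalue()


def main() -> None:
    sentence = read_sentence(sys.argv, sys.stdin)
    # WAV data is binary, so write to the underlying byte stream
    sys.stdout.buffer.write(synthesize(sentence))
```

To use it as say_program, save it as an executable script and call main() at the bottom. A matching voices program can be as small as a script that prints one voice per line, with the voice ID in the first column followed by any description.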

Implemented by rhasspy-tts-cli-hermes

Dummy

Disables text to speech.

Add to your profile:

"text_to_speech": {
  "system": "dummy"
}