The Master Plan

To keep pace in the short term and grow in the long term, the Rhasspy project needs to train its own models instead of scavenging the Internet for them. With the plethora of free audio corpora available, it's possible to create high quality speech to text models and text to speech voices for most of Rhasspy's supported languages!

An overarching theme in Rhasspy's Master Plan is the use of the International Phonetic Alphabet (IPA). All speech to text models and text to speech voices are trained with IPA phoneme inventories and lexicons.

Languages

  • Arabic (ar)
  • Catalan (ca)
  • Chinese (zh-cn)
  • Czech (cs)
  • English (en)
    • U.S. (en-us)
    • U.K. (en-gb)
  • Dutch (nl)
  • Finnish (fi)
  • French (fr)
  • German (de)
  • Greek (el)
  • Hebrew (he)
  • Italian (it)
  • Hindi (hi)
  • Hungarian (hu)
  • Japanese (ja)
  • Korean (ko)
  • Persian (fa)
  • Polish (pl)
  • Portuguese (pt-br)
  • Russian (ru)
  • Spanish (es-es)
  • Swedish (sv-se)
  • Ukrainian (uk)
  • Vietnamese (vi-n)

Phoneme Inventory and Text Processing

Before a speech to text model or a text to speech voice can be trained with available audio corpora, several pieces are needed for each language: a tokenizer, rules for expanding numbers into words, an IPA phoneme inventory, a pronunciation lexicon, and a grapheme-to-phoneme (g2p) model.

With all of these pieces, we can go from raw text (UTF-8) to phonemes (UTF-8 IPA) through this pipeline:

  1. Raw text is split into tokens
  2. Numbers, etc. are expanded into words
  3. Words in the lexicon are looked up and turned into phonemes
  4. Words not in the lexicon have their pronunciations guessed using a grapheme-to-phoneme (g2p) model
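The pipeline above can be sketched in a few lines of Python. The lexicon, number words, and g2p fallback here are toy stand-ins for illustration, not gruut's actual data or API:

```python
import re

# Toy lexicon mapping words to one or more IPA pronunciations (illustrative only)
LEXICON = {
    "read": [["ɹ", "ɛ", "d"], ["ɹ", "iː", "d"]],
    "the": [["ð", "ə"]],
    "book": [["b", "ʊ", "k"]],
}

# Toy number expansion table (a real system handles arbitrary numbers, dates, etc.)
NUMBER_WORDS = {"1": "one", "2": "two", "3": "three"}

def tokenize(text):
    """Step 1: split raw text into lowercase word tokens."""
    return re.findall(r"[\w']+", text.lower())

def expand(tokens):
    """Step 2: expand numbers into words."""
    return [NUMBER_WORDS.get(t, t) for t in tokens]

def guess_g2p(word):
    """Step 4: naive letter-by-letter stand-in for a trained g2p model."""
    return list(word)

def phonemize(text):
    """Steps 3-4: look up each word in the lexicon, falling back to g2p."""
    phonemes = []
    for token in expand(tokenize(text)):
        if token in LEXICON:
            phonemes.append(LEXICON[token][0])  # first pronunciation by default
        else:
            phonemes.append(guess_g2p(token))
    return phonemes
```

Each stage is deliberately simple here; the point is the shape of the pipeline, where every stage except the lexicon and g2p model is language-specific rule code.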

A currently missing piece involves step 3: some words have multiple pronunciations, and context may be necessary to determine the appropriate pronunciation. For example, the English word "read" may be pronounced like "red" (I read the book) or "reed" (I like to read) depending on context. gruut sidesteps this problem for now by letting you explicitly mark pronunciations as "word(n)" where "n" is the nth pronunciation in the lexicon.
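The "word(n)" marker can be resolved with a small lookup wrapper. The lexicon below is a toy example; the marker syntax follows the description above:

```python
import re

# Toy lexicon with multiple pronunciations per word (illustrative, not gruut's data)
LEXICON = {
    "read": [["ɹ", "ɛ", "d"], ["ɹ", "iː", "d"]],  # "red" vs. "reed"
}

# Matches an explicit pronunciation marker like "read(2)"
WORD_N = re.compile(r"^(.+)\((\d+)\)$")

def lookup(token):
    """Resolve an optional word(n) marker to the nth pronunciation (1-based)."""
    match = WORD_N.match(token)
    if match:
        word, n = match.group(1), int(match.group(2))
        return LEXICON[word][n - 1]
    return LEXICON[token][0]  # default to the first pronunciation
```

Without a marker, the first pronunciation wins; a context-aware classifier would replace that default.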

In the future, we would like to use pre-trained word vectors, along with alignments generated during the Kaldi training process, to train language-specific classifiers that guess which pronunciation is appropriate given a word's context.

Completed

Available here

  • Czech (cs)
  • Dutch (nl)
  • English (en)
    • U.S. (en-us)
    • U.K. (en-gb)
  • French (fr)
  • German (de)
  • Greek (el)
  • Hebrew (he)
    • Right to left writing system
  • Italian (it)
  • Persian (fa)
    • Right to left writing system
    • Using hazm for genitive case (e̞)
  • Portuguese (pt-br)
  • Russian (ru)
  • Spanish (es-es)
  • Swedish (sv-se)
  • Vietnamese (vi-n)

Not Completed

  • Arabic (ar)
    • Right to left writing system
  • Catalan (ca)
  • Chinese (zh-cn)
    • Needs intelligent tokenizer
  • Finnish (fi)
  • Hindi (hi)
  • Hungarian (hu)
  • Japanese (ja)
    • Multiple writing systems
    • Needs intelligent tokenizer
  • Korean (ko)
  • Polish (pl)
  • Ukrainian (uk)

Speech to Text

Transcribing speech to text is a core function of Rhasspy. High quality models are a plus (low word error rate), but even models with moderate quality can be quite useful because Rhasspy is intended to recognize a limited set of pre-scripted voice commands.

Kaldi

The ipa2kaldi project creates a Kaldi nnet3 recipe from one or more audio datasets, using gruut to tokenize and phonemize the dataset transcriptions. The IPA phoneme inventories are used directly as well.

Completed

Not Completed

  • Arabic (ar)
  • Catalan (ca)
  • Chinese (zh-cn)
  • English (en)
    • U.S. (en-us)
    • U.K. (en-gb)
  • Dutch (nl)
  • Finnish (fi)
  • German (de)
  • Greek (el)
  • Hebrew (he)
  • Hindi (hi)
  • Hungarian (hu)
  • Japanese (ja)
  • Korean (ko)
  • Persian (fa)
  • Polish (pl)
  • Portuguese (pt-br)
  • Swedish (sv-se)
  • Ukrainian (uk)
  • Vietnamese (vi-n)

Mozilla DeepSpeech

Besides Kaldi, Mozilla's DeepSpeech is another useful speech to text tool. We would like to produce a fork that uses gruut and IPA phonemes, similar to ipa2kaldi.

Available

Pre-trained models are available directly from Mozilla and DeepSpeech Polyglot.

  • Chinese (zh-cn)
  • English (en)
    • U.S. (en-us)
  • French (fr)
  • German (de)
  • Italian (it)
  • Polish (pl)
  • Spanish (es)

Accented Speech Recognition

A useful side effect of using IPA for speech to text models is the possibility of using an acoustic model trained for one language to recognize accented speech from a different language. Given a phoneme map, where phonemes from a source language are approximated in a target language, accented speech could be recognized as follows:

  1. An acoustic model trained on language A is loaded
  2. The lexicon for language B is transliterated to language A using the phoneme map
  3. The acoustic model (A) and transliterated lexicon (B -> A) are used to transcribe speech
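Step 2 amounts to rewriting every pronunciation in language B's lexicon with language A's phonemes. The phoneme map below is a hypothetical French-to-English fragment, purely for illustration:

```python
# Hypothetical map approximating French phonemes (language B) with
# U.S. English ones (language A); real maps would cover the full inventory.
PHONEME_MAP = {"ʁ": "ɹ", "y": "u", "ɑ̃": "ɑ"}

def transliterate_lexicon(lexicon_b, phoneme_map):
    """Step 2: rewrite language B's lexicon using language A's phonemes.

    Phonemes absent from the map are passed through unchanged.
    """
    return {
        word: [phoneme_map.get(p, p) for p in phonemes]
        for word, phonemes in lexicon_b.items()
    }

# Toy French lexicon entry: "rue" (street)
french = {"rue": ["ʁ", "y"]}
english_approx = transliterate_lexicon(french, PHONEME_MAP)
```

The transliterated lexicon can then be paired with language A's acoustic model for decoding.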

This approach has not yet been tested.


Text to Speech

Voice assistants need to talk back, and Rhasspy is no exception. Projects like opentts have collected a large number of voices and text to speech systems, but the quality varies between languages.

MozillaTTS and Larynx

The MozillaTTS project can produce natural-sounding voices given enough data and GPU computing power. A fork named Larynx has been created to make use of gruut and its IPA phoneme inventories.

Completed

Not Completed

  • Arabic (ar)
  • Catalan (ca)
  • Chinese (zh-cn)
  • Czech (cs)
  • English (en)
    • U.K. (en-gb)
  • Finnish (fi)
  • Greek (el)
  • Hebrew (he)
  • Italian (it)
  • Hindi (hi)
  • Hungarian (hu)
  • Japanese (ja)
  • Korean (ko)
  • Persian (fa)
  • Polish (pl)
  • Portuguese (pt-br)
  • Swedish (sv-se)
  • Ukrainian (uk)
  • Vietnamese (vi-n)

Text Prompts

The audio corpora needed to train a text to speech voice are a bit different from those used for speech to text. Rather than many different speakers in a wide variety of acoustic environments, you (typically) want a single speaker in a quiet environment with a high quality microphone. These datasets are harder to come by for languages other than English, so volunteers are being enlisted to produce them.

In order to reduce the burden on volunteers, sentences must be carefully chosen. Reading 15k sentences may produce an excellent voice, but few volunteers will be on board with spending weeks to complete them all. Our goal is to find fewer than 2k sentences that are phonemically rich -- i.e., that contain many examples of both individual phonemes and phoneme pairs.

gruut can find phonemically rich sentences in a large text corpus, such as Mozilla's Common Voice or Oscar. Importantly, these sentences are vetted by native speakers so that they (1) can be read without ambiguity, (2) are not offensive or nonsensical, and (3) are something a native speaker would actually say.
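One way to select phonemically rich sentences is a greedy cover: repeatedly pick the sentence that adds the most phonemes and phoneme pairs not yet seen. This is a sketch of the idea, not gruut's actual algorithm:

```python
def phoneme_pairs(phonemes):
    """All adjacent phoneme pairs in a sentence's phoneme sequence."""
    return set(zip(phonemes, phonemes[1:]))

def select_rich_sentences(corpus, max_sentences):
    """Greedily pick sentences that add the most unseen phonemes and pairs.

    corpus: list of (sentence, phoneme_sequence) tuples, e.g. produced by
    running a phonemizer such as gruut over a large text corpus.
    """
    covered = set()
    chosen = []
    remaining = list(corpus)
    for _ in range(max_sentences):
        best, best_gain = None, 0
        for sentence, phonemes in remaining:
            units = set(phonemes) | phoneme_pairs(phonemes)
            gain = len(units - covered)
            if gain > best_gain:
                best, best_gain = (sentence, phonemes), gain
        if best is None:
            break  # no sentence adds anything new
        chosen.append(best[0])
        covered |= set(best[1]) | phoneme_pairs(best[1])
        remaining.remove(best)
    return chosen
```

The greedy loop stops early once no candidate sentence contributes new coverage, which keeps the reading list short.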

Completed

Available here

  • Dutch (nl)
  • English (en)
    • U.S. (en-us)
  • French (fr)
  • German (de)
  • Swedish (sv-se)

Not Completed

  • Arabic (ar)
  • Catalan (ca)
  • Chinese (zh-cn)
  • Czech (cs)
  • English (en)
    • U.K. (en-gb)
  • Finnish (fi)
  • Greek (el)
  • Hebrew (he)
  • Italian (it)
  • Hindi (hi)
  • Hungarian (hu)
  • Japanese (ja)
  • Korean (ko)
  • Persian (fa)
  • Polish (pl)
  • Portuguese (pt-br)
  • Russian (ru)
  • Spanish (es-es)
  • Ukrainian (uk)
  • Vietnamese (vi-n)

Accented Text to Speech

Because all Larynx voices share the same underlying phoneme alphabet (IPA), it is possible to have voices speak in a different language than the one they were trained on! All this requires is a phoneme map that tells Larynx how to approximate phonemes outside the voice's native inventory.

As an example, consider a French voice that is being used to speak U.S. English. The English word "that" can be phonemized as /ðæt/ but the French phoneme inventory (in gruut) does not have ð or æ. One approximation could be ð -> z and æ -> a, sounding more like "zat" to an English speaker.

With a complete phoneme map from language B to language A (native to the voice), accented speech can be achieved using the following process:

  1. Text is tokenized and phonemized according to language B's rules and lexicon
  2. Phonemes are mapped back to language A through the phoneme map
  3. The voice is given the mapped phonemes to speak
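Step 2 is the same kind of substitution as the "that" example above. The map below contains only the illustrative pairs from that example, not a complete English-to-French inventory:

```python
# Phoneme map from the example: approximate U.S. English phonemes (language B)
# with French ones (language A, native to the voice). Illustrative pairs only.
EN_TO_FR = {"ð": "z", "æ": "a"}

def map_phonemes(phonemes_b, phoneme_map):
    """Step 2: map language B phonemes back to the voice's native inventory.

    Phonemes the voice already knows (like /t/ here) pass through unchanged.
    """
    return [phoneme_map.get(p, p) for p in phonemes_b]

# "that" is phonemized as /ðæt/ in English; the French voice receives
# something like /zat/, sounding like "zat" to an English speaker.
mapped = map_phonemes(["ð", "æ", "t"], EN_TO_FR)
```

The mapped phoneme sequence is then handed to the voice exactly as native input would be.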

Wake Word

Voice assistants are typically dormant until a special wake word is spoken. Audio after the wake word is then interpreted as a voice command.

Raven

A variety of wake word systems are freely available, but virtually none of them are fully open source. Raven is an exception -- it is free, open source, and can be trained with only 3 examples.

Unfortunately, Raven is not as fast or accurate as other wake word systems. We would like to rewrite Raven in a lower-level language like C++ or Rust. Other methods of wake word recognition should also be explored (like GRUs). Ideally, a GRU-based system could be trained on a fast machine (possibly with a GPU), and then executed directly in native code, as projects like RNNoise or g2pE do.


Audio Corpora

The following list contains links to free audio corpora that are available for download.

Librivox

A potential source of audio data is Librivox, which includes free audiobooks for public domain books.

To make this data useful for model training, it must be split into individual sentences and aligned with its text. The aeneas tool may help with this process.