The Master Plan

To keep pace in the short term and grow in the long term, the Rhasspy project needs to train its own models instead of scavenging the Internet for them. With the plethora of free audio corpora available, it's possible to create high quality speech to text models and text to speech voices for most of Rhasspy's supported languages!

An overarching theme in Rhasspy's Master Plan is the use of the International Phonetic Alphabet (IPA). All speech to text models and text to speech voices are trained with IPA phoneme inventories and lexicons.

Languages

  • Arabic (ar)
  • Catalan (ca)
  • Chinese (zh-cn)
  • Czech (cs)
  • English (en)
    • U.S. (en-us)
    • U.K. (en-gb)
  • Dutch (nl)
  • Finnish (fi)
  • French (fr)
  • German (de)
  • Greek (el)
  • Hebrew (he)
  • Italian (it)
  • Hindi (hi)
  • Hungarian (hu)
  • Japanese (ja)
  • Korean (ko)
  • Persian (fa)
  • Polish (pl)
  • Portuguese (pt-br)
  • Russian (ru)
  • Spanish (es-es)
  • Swedish (sv-se)
  • Ukrainian (uk)
  • Vietnamese (vi-n)

Phoneme Inventory and Text Processing

Before a speech to text model or a text to speech voice can be trained with available audio corpora, several pieces are needed for each language: a tokenizer, rules for expanding numbers into words, an IPA phoneme inventory, a pronunciation lexicon, and a grapheme-to-phoneme (g2p) model.

With all of these pieces, we can go from raw text (UTF-8) to phonemes (UTF-8 IPA) through this pipeline:

  1. Raw text is split into tokens
  2. Numbers, etc. are expanded into words
  3. Words in the lexicon are looked up and turned into phonemes
  4. Words not in the lexicon have their pronunciations guessed using a grapheme-to-phoneme (g2p) model
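The pipeline above can be sketched in a few lines of Python. The lexicon, number words, and g2p fallback here are toy stand-ins for illustration, not gruut's actual data or API:

```python
import re

# Toy lexicon mapping words to one or more IPA pronunciations (illustrative only)
LEXICON = {
    "read": [["ɹ", "ɛ", "d"], ["ɹ", "iː", "d"]],
    "the": [["ð", "ə"]],
    "book": [["b", "ʊ", "k"]],
}

# Toy number expansion table (a real system handles arbitrary numbers, dates, etc.)
NUMBER_WORDS = {"1": "one", "2": "two", "3": "three"}

def tokenize(text):
    """Step 1: split raw text into lowercase word tokens."""
    return re.findall(r"[\w']+", text.lower())

def expand(tokens):
    """Step 2: expand numbers into words."""
    return [NUMBER_WORDS.get(t, t) for t in tokens]

def guess_g2p(word):
    """Step 4: naive letter-by-letter stand-in for a trained g2p model."""
    return list(word)

def phonemize(text):
    """Steps 3-4: look up each word in the lexicon, falling back to g2p."""
    phonemes = []
    for token in expand(tokenize(text)):
        if token in LEXICON:
            phonemes.append(LEXICON[token][0])  # first pronunciation by default
        else:
            phonemes.append(guess_g2p(token))
    return phonemes
```

Each stage is deliberately simple here; the point is the shape of the pipeline, where every stage except the lexicon and g2p model is language-specific rule code.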

A currently missing piece involves step 3: some words have multiple pronunciations, and context may be necessary to determine the appropriate pronunciation. For example, the English word "read" may be pronounced like "red" (I read the book) or "reed" (I like to read) depending on context. gruut sidesteps this problem for now by letting you explicitly mark pronunciations as "word(n)" where "n" is the nth pronunciation in the lexicon.
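The "word(n)" marker can be resolved with a small lookup wrapper. The lexicon below is a toy example; the marker syntax follows the description above:

```python
import re

# Toy lexicon with multiple pronunciations per word (illustrative, not gruut's data)
LEXICON = {
    "read": [["ɹ", "ɛ", "d"], ["ɹ", "iː", "d"]],  # "red" vs. "reed"
}

# Matches an explicit pronunciation marker like "read(2)"
WORD_N = re.compile(r"^(.+)\((\d+)\)$")

def lookup(token):
    """Resolve an optional word(n) marker to the nth pronunciation (1-based)."""
    match = WORD_N.match(token)
    if match:
        word, n = match.group(1), int(match.group(2))
        return LEXICON[word][n - 1]
    return LEXICON[token][0]  # default to the first pronunciation
```

Without a marker, the first pronunciation wins; a context-aware classifier would replace that default.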

In the future, we would like to use pre-trained word vectors, along with alignments generated during the Kaldi training process, to train language-specific classifiers that guess which pronunciation is appropriate given a word's context.

Completed

Available here

  • Czech (cs)
  • Dutch (nl)
  • English (en)
    • U.S. (en-us)
    • U.K. (en-gb)
  • French (fr)
  • German (de)
  • Greek (el)
  • Hebrew (he)
    • Right to left writing system
  • Italian (it)
  • Persian (fa)
    • Right to left writing system
    • Using hazm for genitive case (e̞)
  • Portuguese (pt-br)
  • Russian (ru)
  • Spanish (es-es)
  • Swedish (sv-se)
  • Vietnamese (vi-n)

Not Completed

  • Arabic (ar)
    • Right to left writing system
  • Catalan (ca)
  • Chinese (zh-cn)
    • Needs intelligent tokenizer
  • Finnish (fi)
  • Hindi (hi)
  • Hungarian (hu)
  • Japanese (ja)
    • Multiple writing systems
    • Needs intelligent tokenizer
  • Korean (ko)
  • Polish (pl)
  • Ukrainian (uk)

Speech to Text

Transcribing speech to text is a core function of Rhasspy. High quality models are a plus (low word error rate), but even models with moderate quality can be quite useful because Rhasspy is intended to recognize a limited set of pre-scripted voice commands.

Kaldi

The ipa2kaldi project creates a Kaldi nnet3 recipe from one or more audio datasets, using gruut to tokenize and phonemize the dataset transcriptions. The IPA phoneme inventories are used directly as well.

Completed

Not Completed

  • Arabic (ar)
  • Catalan (ca)
  • Chinese (zh-cn)
  • English (en)
    • U.S. (en-us)
    • U.K. (en-gb)
  • Dutch (nl)
  • Finnish (fi)
  • German (de)
  • Greek (el)
  • Hebrew (he)
  • Hindi (hi)
  • Hungarian (hu)
  • Japanese (ja)
  • Korean (ko)
  • Persian (fa)
  • Polish (pl)
  • Portuguese (pt-br)
  • Swedish (sv-se)
  • Ukrainian (uk)
  • Vietnamese (vi-n)

Mozilla DeepSpeech

Besides Kaldi, Mozilla's DeepSpeech is another useful speech to text tool. We would like to produce a fork that uses gruut and IPA phonemes, similar to ipa2kaldi.

Available

Pre-trained models are available directly from Mozilla and DeepSpeech Polyglot.

  • Chinese (zh-cn)
  • English (en)
    • U.S. (en-us)
  • French (fr)
  • German (de)
  • Italian (it)
  • Polish (pl)
  • Spanish (es)

Accented Speech Recognition

A useful side effect of using IPA for speech to text models is the possibility of using an acoustic model trained for one language to recognize accented speech from a different language. Given a phoneme map, where phonemes from a source language are approximated in a target language, accented speech could be recognized as follows:

  1. An acoustic model trained on language A is loaded
  2. The lexicon for language B is transliterated to language A using the phoneme map
  3. The acoustic model (A) and transliterated lexicon (B -> A) are used to transcribe speech
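Step 2 amounts to rewriting every pronunciation in language B's lexicon with language A's phonemes. The phoneme map below is a hypothetical French-to-English fragment, purely for illustration:

```python
# Hypothetical map approximating French phonemes (language B) with
# U.S. English ones (language A); real maps would cover the full inventory.
PHONEME_MAP = {"ʁ": "ɹ", "y": "u", "ɑ̃": "ɑ"}

def transliterate_lexicon(lexicon_b, phoneme_map):
    """Step 2: rewrite language B's lexicon using language A's phonemes.

    Phonemes absent from the map are passed through unchanged.
    """
    return {
        word: [phoneme_map.get(p, p) for p in phonemes]
        for word, phonemes in lexicon_b.items()
    }

# Toy French lexicon entry: "rue" (street)
french = {"rue": ["ʁ", "y"]}
english_approx = transliterate_lexicon(french, PHONEME_MAP)
```

The transliterated lexicon can then be paired with language A's acoustic model for decoding.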

This approach has not yet been tested.


Text to Speech

Voice assistants need to talk back, and Rhasspy is no exception. Projects like opentts have collected a large number of voices and text to speech systems, but the quality varies between languages.

MozillaTTS and Larynx

The MozillaTTS project can produce natural-sounding voices given enough data and GPU computing power. A fork named Larynx has been created to make use of gruut and its IPA phoneme inventories.

Completed

Not Completed

  • Arabic (ar)
  • Catalan (ca)
  • Chinese (zh-cn)
  • Czech (cs)
  • English (en)
    • U.K. (en-gb)
  • Finnish (fi)
  • Greek (el)
  • Hebrew (he)
  • Italian (it)
  • Hindi (hi)
  • Hungarian (hu)
  • Japanese (ja)
  • Korean (ko)
  • Persian (fa)
  • Polish (pl)
  • Portuguese (pt-br)
  • Swedish (sv-se)
  • Ukrainian (uk)
  • Vietnamese (vi-n)

Text Prompts

The audio corpora needed to train a text to speech voice are a bit different from those used for speech to text. Rather than many different speakers in a wide variety of acoustic environments, you (typically) want a single speaker in a quiet environment with a high quality microphone. These datasets are harder to come by for languages other than English, so volunteers are being enlisted to produce them.

In order to reduce the burden on volunteers, sentences must be carefully chosen. Reading 15k sentences may produce an excellent voice, but few volunteers will be on board with spending weeks to complete them all. Our goal is to find fewer than 2k sentences that are phonemically rich -- i.e., that contain many examples of both individual phonemes and phoneme pairs.

gruut can find phonemically rich sentences in a large text corpus, such as Mozilla's Common Voice or Oscar. Importantly, these sentences are vetted by native speakers so that they (1) can be read without ambiguity, (2) are not offensive or nonsensical, and (3) are something a native speaker would actually say.
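One way to select phonemically rich sentences is a greedy cover: repeatedly pick the sentence that adds the most phonemes and phoneme pairs not yet seen. This is a sketch of the idea, not gruut's actual algorithm:

```python
def phoneme_pairs(phonemes):
    """All adjacent phoneme pairs in a sentence's phoneme sequence."""
    return set(zip(phonemes, phonemes[1:]))

def select_rich_sentences(corpus, max_sentences):
    """Greedily pick sentences that add the most unseen phonemes and pairs.

    corpus: list of (sentence, phoneme_sequence) tuples, e.g. produced by
    running a phonemizer such as gruut over a large text corpus.
    """
    covered = set()
    chosen = []
    remaining = list(corpus)
    for _ in range(max_sentences):
        best, best_gain = None, 0
        for sentence, phonemes in remaining:
            units = set(phonemes) | phoneme_pairs(phonemes)
            gain = len(units - covered)
            if gain > best_gain:
                best, best_gain = (sentence, phonemes), gain
        if best is None:
            break  # no sentence adds anything new
        chosen.append(best[0])
        covered |= set(best[1]) | phoneme_pairs(best[1])
        remaining.remove(best)
    return chosen
```

The greedy loop stops early once no candidate sentence contributes new coverage, which keeps the reading list short.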

Completed

Available here

  • Dutch (nl)
  • English (en)
    • U.S. (en-us)
  • French (fr)
  • German (de)
  • Swedish (sv-se)

Not Completed

  • Arabic (ar)
  • Catalan (ca)
  • Chinese (zh-cn)
  • Czech (cs)
  • English (en)
    • U.K. (en-gb)
  • Finnish (fi)
  • Greek (el)
  • Hebrew (he)
  • Italian (it)
  • Hindi (hi)
  • Hungarian (hu)
  • Japanese (ja)
  • Korean (ko)
  • Persian (fa)
  • Polish (pl)
  • Portuguese (pt-br)
  • Russian (ru)
  • Spanish (es-es)
  • Ukrainian (uk)
  • Vietnamese (vi-n)

Accented Text to Speech

Because all Larynx voices share the same underlying phoneme alphabet (IPA), it is possible to have voices speak in a different language than the one they were trained on! All this requires is a phoneme map that tells Larynx how to approximate phonemes outside the voice's native inventory.

As an example, consider a French voice that is being used to speak U.S. English. The English word "that" can be phonemized as /ðæt/ but the French phoneme inventory (in gruut) does not have ð or æ. One approximation could be ð -> z and æ -> a, sounding more like "zat" to an English speaker.

With a complete phoneme map from language B to language A (native to the voice), accented speech can be achieved using the following process:

  1. Text is tokenized and phonemized according to language B's rules and lexicon
  2. Phonemes are mapped back to language A through the phoneme map
  3. The voice is given the mapped phonemes to speak
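Step 2 is the same kind of substitution as the "that" example above. The map below contains only the illustrative pairs from that example, not a complete English-to-French inventory:

```python
# Phoneme map from the example: approximate U.S. English phonemes (language B)
# with French ones (language A, native to the voice). Illustrative pairs only.
EN_TO_FR = {"ð": "z", "æ": "a"}

def map_phonemes(phonemes_b, phoneme_map):
    """Step 2: map language B phonemes back to the voice's native inventory.

    Phonemes the voice already knows (like /t/ here) pass through unchanged.
    """
    return [phoneme_map.get(p, p) for p in phonemes_b]

# "that" is phonemized as /ðæt/ in English; the French voice receives
# something like /zat/, sounding like "zat" to an English speaker.
mapped = map_phonemes(["ð", "æ", "t"], EN_TO_FR)
```

The mapped phoneme sequence is then handed to the voice exactly as native input would be.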

Wake Word

Voice assistants are typically dormant until a special wake word is spoken. Audio after the wake word is then interpreted as a voice command.

Raven

A variety of wake word systems are freely available, but virtually none of them are fully open source. Raven is an exception -- it is free, open source, and can be trained with only 3 examples.

Unfortunately, Raven is not as fast or accurate as other wake word systems. We would like to rewrite Raven in a lower-level language like C++ or Rust. Other methods of wake word recognition should also be explored (like GRUs). Ideally, a GRU-based system could be trained on a fast machine (possibly with a GPU), and then executed directly in native code, as projects like RNNoise or g2pE do.


Audio Corpora

The following list contains links to free audio corpora that are available for download.

Librivox

A potential source of audio data is Librivox, which includes free audiobooks for public domain books.

To make this data useful for model training, it must be split into individual sentences and aligned with its text. The aeneas tool may help with this process.