Our new public speech synthesis in super-high quality, 10x faster and more stable / Habr

In our last article we made a bunch of promises about our speech synthesis.

After a lot of hard work, we have finally delivered on these promises:

Model size reduced 2x;
New models are 10x faster;
We added flags to control stress;
Now the models can make proper pauses;
High quality voice added (and unlimited "random" voices);
All speakers squeezed into the same model;
Input length limitations lifted, now models can work with paragraphs of text;
Pauses, speed and pitch can be controlled via SSML;
Sampling rates of 8, 24 or 48 kHz are supported;
Models are much more stable: they no longer omit words.

This is a real breakthrough for us, and we are not planning to stop anytime soon. We will be adding as many languages as possible shortly (the CIS languages, English, European languages, Indic languages). We are also still planning to make our models an additional 2-5x faster.

We are also planning to add phonemes and a new model for stress, as well as to reduce the minimum amount of audio required to train a high-quality voice to 5 to 15 minutes.

As usual, you can try our model in our repo or in Colab.

Quickstart

Here are the non cherry-picked model audio samples:

As usual you can find all of the necessary instructions and models:

In our public repo (you need the V3 models);
You can try the models directly in Colab.

And here is the minimalistic model invocation example:

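import torch

device = torch.device('cpu')
torch.set_num_threads(4)  # number of CPU threads used for inference
speaker = 'xenia'  # one of: 'aidar', 'baya', 'kseniya', 'xenia', 'random'
sample_rate = 48000  # one of: 8000, 24000, 48000

# Load the Russian V3 model; a single model contains all of the speakers,
# so 'ru_v3' here is the model name and not a voice
model, example_text = torch.hub.load(repo_or_dir='snakers4/silero-models',
                                     model='silero_tts',
                                     language='ru',
                                     speaker='ru_v3')
model.to(device)
# Synthesize the bundled example text with the chosen voice
audio = model.apply_tts(text=example_text,
                        speaker=speaker,
                        sample_rate=sample_rate)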

You can always list model speakers and symbols via model.speakers and model.symbols.
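
To listen to the result outside of a notebook, the synthesized `audio` can be written to a WAV file. Below is a minimal sketch that uses only the standard library. It assumes the samples are a one-dimensional sequence of floats in the range [-1, 1]; that range is our assumption and is not stated in this article. `save_wav` is a hypothetical helper name and not part of the model API.

```python
import array
import wave


def save_wav(samples, sample_rate, path):
    """Write mono float samples (assumed to lie in [-1, 1]) as 16-bit PCM.

    Returns the number of samples written.
    """
    pcm = array.array('h')  # signed 16-bit integers
    for x in samples:
        x = max(-1.0, min(1.0, float(x)))  # clip so the conversion cannot overflow
        pcm.append(int(round(x * 32767)))
    with wave.open(path, 'wb') as f:
        f.setnchannels(1)            # mono
        f.setsampwidth(2)            # 2 bytes per sample = 16 bit
        f.setframerate(sample_rate)  # e.g. 8000, 24000 or 48000
        f.writeframes(pcm.tobytes())
    return len(pcm)
```

With the variables from the example above, a call could look like `save_wav(audio.tolist(), sample_rate, 'test.wav')`, assuming `audio` is a one-dimensional tensor.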

SSML tags support

We used to limit model input to 140 characters. Now this limitation is lifted.

The following SSML tags are supported:

Tag	Example	Accepted values
Pause	`<break time="2000ms"/>`	`5s`, `500ms`
Speed	`<prosody rate="x-fast"> … </prosody>`	`x-slow`, `slow`, `medium`, `fast`, `x-fast`
Pitch	`<prosody pitch="x-high"> … </prosody>`	`x-low`, `low`, `medium`, `high`, `x-high`, `robot`
Sentence	`<s> … </s>`	-
Paragraph	`<p> … </p>`	-

More detailed documentation about supported SSML tags can be found here.
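
Writing these tags by hand invites typos in the attribute values, so it can be convenient to generate them. The sketch below builds SSML fragments and rejects values that are not listed in the table above. All helper names (`pause`, `with_rate`, `with_pitch`, `speak`) are hypothetical and not part of the model API, and escaping the text assumes that the model's SSML parser follows ordinary XML rules.

```python
from xml.sax.saxutils import escape

# Accepted values copied from the table above
RATES = {'x-slow', 'slow', 'medium', 'fast', 'x-fast'}
PITCHES = {'x-low', 'low', 'medium', 'high', 'x-high', 'robot'}


def pause(ms):
    """Return a <break> tag for a pause of `ms` milliseconds."""
    if ms <= 0:
        raise ValueError('pause length must be positive')
    return f'<break time="{int(ms)}ms"/>'


def with_rate(text, rate):
    """Wrap `text` in a <prosody> tag that changes the speaking rate."""
    if rate not in RATES:
        raise ValueError(f'unsupported rate: {rate!r}')
    return f'<prosody rate="{rate}">{escape(text)}</prosody>'


def with_pitch(text, pitch):
    """Wrap `text` in a <prosody> tag that changes the pitch."""
    if pitch not in PITCHES:
        raise ValueError(f'unsupported pitch: {pitch!r}')
    return f'<prosody pitch="{pitch}">{escape(text)}</prosody>'


def speak(*fragments):
    """Join already-built fragments into a single <speak> document."""
    return '<speak>' + ''.join(fragments) + '</speak>'
```

For example, `speak(with_rate('slowly', 'x-slow'), pause(500), with_pitch('higher', 'high'))` yields a string that can be passed as `ssml_text`.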

The following example uses all of the main SSML tags:

ssml_sample = """
              <speak>
              <p>
                  Когда я просыпаюсь, <prosody rate="x-slow">я говорю довольно медленно</prosody>.
                  Потом я начинаю говорить своим обычным голосом,
                  <prosody pitch="x-high"> а могу говорить тоном выше </prosody>,
                  или <prosody pitch="x-low">наоборот, ниже</prosody>.
                  Потом, если повезет – <prosody rate="fast">я могу говорить и довольно быстро.</prosody>
                  А еще я умею делать паузы любой длины, например две секунды <break time="2000ms"/>.
                  <p>
                    Также я умею делать паузы между параграфами.
                  </p>
                  <p>
                    <s>И также я умею делать паузы между предложениями</s>
                    <s>Вот например как сейчас</s>
                  </p>
              </p>
              </speak>
              """

sample_rate = 48000
speaker = 'xenia'
audio = model.apply_tts(ssml_text=ssml_sample,
                        speaker=speaker,
                        sample_rate=sample_rate)
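
A long SSML string like the one above is easy to break with a missing closing tag or a mistyped attribute, so it may be worth checking it before synthesis. The sketch below is one possible pre-flight check, not something provided by the model. `check_ssml` is a hypothetical helper; the tag list comes from the table above plus the `<speak>` root used in the sample, and the pattern for `time` values (digits followed by `ms` or `s`) is our assumption based on the examples shown.

```python
import re
import xml.etree.ElementTree as ET

SUPPORTED_TAGS = {'speak', 'p', 's', 'break', 'prosody'}
RATES = {'x-slow', 'slow', 'medium', 'fast', 'x-fast'}
PITCHES = {'x-low', 'low', 'medium', 'high', 'x-high', 'robot'}
BREAK_TIME = re.compile(r'^\d+(ms|s)$')  # assumed format, e.g. 500ms or 5s


def check_ssml(ssml):
    """Return a list of human-readable problems; the list is empty if none are found."""
    try:
        root = ET.fromstring(ssml.strip())
    except ET.ParseError as err:
        return [f'not well-formed XML: {err}']
    problems = []
    if root.tag != 'speak':
        problems.append(f'root element is <{root.tag}>, expected <speak>')
    for el in root.iter():
        if el.tag not in SUPPORTED_TAGS:
            problems.append(f'unsupported tag <{el.tag}>')
        elif el.tag == 'prosody':
            rate, pitch = el.get('rate'), el.get('pitch')
            if rate is not None and rate not in RATES:
                problems.append(f'unsupported rate {rate!r}')
            if pitch is not None and pitch not in PITCHES:
                problems.append(f'unsupported pitch {pitch!r}')
        elif el.tag == 'break':
            time = el.get('time', '')
            if not BREAK_TIME.match(time):
                problems.append(f'unexpected break time {time!r}')
    return problems
```

Calling `check_ssml(ssml_sample)` before `apply_tts` and printing any returned problems makes such mistakes visible before the model is invoked.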