snakers4, 30 Mar 2021, 03:33

High-Quality Text-to-Speech Made Accessible, Simple and Fast

There is a lot of commotion in text-to-speech now. There is a great variety of toolkits, a plethora of commercial APIs from GAFA companies (based both on new and older technologies). There are also a lot of Silicon Valley startups trying to ship products akin to "deep fakes" in speech.

But despite all this ruckus we have not yet seen open solutions that would fulfill all of these criteria:

Naturally sounding speech;
A large library of voices in many languages;
Support for 16kHz and 8kHz out of the box;
No GPUs / ML engineering team / training required;
Unique voices not infringing upon third-party licenses;
High throughput on slow hardware. Decent performance on one CPU thread;
Minimalism and lack of dependencies. One-line usage, no builds or coding in C++ required;
Positioned as a solution, not yet another toolkit / compilation of models developed by other people;
Not affiliated by any means with ecosystems of Google / Yandex / Sberbank;

We decided to share with the community our open non-commercial solution that fits all of these criteria. Since we have published the whole pipeline, we do not focus much on cherry-picked examples, and we encourage you to visit our project's GitHub repo to test our TTS for yourself.

Solutions Review

This summary is not meant to be an in-depth technical overview of all available solutions. We just want to give a brief introduction to the available approaches. We skip the numerous toolkits and list only more or less batteries-included solutions with a decent library of voices and some kind of support / community:

Concatenative models. The only project I found that is somewhat maintained and alive, and that can be run "as-is" without archaeological excavations, is rhvoice (there are entire forums dedicated to running TTS voices from Windows, but this can hardly be called a supported solution). When I tested this repo, it was essentially abandoned, but then it got a new "owner". The main advantage here is low compute requirements (excluding the human resources needed to make it work and maintain it). The main disadvantage is that it sounds like a metallic robotic voice. A less obvious disadvantage is that it's quite difficult to estimate the cost of ownership. Sound quality: 3+ on a five-point scale;
Modern DL-based models. The approach is essentially to split the end-to-end TTS task into two subtasks: text -> features and features -> speech (vocoding). Typically Tacotron2 is used for the first subtask. There is a plethora of different models that differ in their compute requirements:
- Tacotron2 + WaveNet (the original WaveNet accepted linguistics features as input, but for tacotron it was changed to more convenient melspectrograms). The main problem is a very low inference speed due to the autoregressiveness of the model and its computational complexity. It is also prohibitively expensive to train this one. Sound quality: 4+;
- Tacotron2 + WaveRNN (also modified to accept spectrograms). The vocoder is noticeably faster than the previous one: using all the hacks you can even get real-time synthesis without a GPU, although the naturalness of the sound will decrease. Sound quality: 3.5-4;
- Tacotron2 + Parallel WaveNet. The slow vocoder mentioned above was used as a teacher model to train a new accelerated parallel vocoder model, capable of synthesizing audio faster than real-time, but still demanding powerful GPUs. Besides, the distillation process itself has disadvantages: it requires a high-fidelity teacher model and an appropriate training scheme. Sound quality: 4+;
- Tacotron2 + multi-band WaveRNN. It's also a development of the previous ideas, and parallelization in a sense: here synthesis is faster than real-time on the CPU. The aforementioned paper is not so popular, so there are not many implementations, although some approaches were clever and have been successfully applied in later models. Sound quality: 3.5-4+;
- Tacotron2 + LPCNet. An interesting combination of DL and classical algorithms, which can speed up inference enough for production on CPU, but requires a lot of work to essentially decrypt the authors' code for high-quality results. Sound quality: 3.5-4+;
- Numerous solutions based on Nvidia's Tacotron2 + Waveglow, the current standard for speech synthesis tasks. No one discloses their "secret sauce" (for example, how 15.ai creates a voice based on 15 minutes of ground truth, or how many models there are in their pipeline). Synthesis may sound indistinguishable from real people's voices on cherry-picked examples, but when you look at the real models from the community, the quality varies markedly, and the details of the improved solutions are not disclosed. Architecturally, there are no complaints about Tacotron and its analogs in terms of speed and cost of ownership, but Waveglow is very compute-intensive both in training and in production, which makes its use essentially impractical. Sound quality: 3.5-4+;
- Replacing Tacotron2 => FastSpeech / FastSpeech 2 / FastPitch, that is, choosing a simpler feed-forward architecture instead of a recurrent one (based on forced alignment from Tacotron and a million more tricky and complex options). It gives control of the speech tempo and voice pitch, which is quite practical, and it generally simplifies the final architecture and makes it more modular. Sound quality: 3.5-4+;

Quality Assessment and Audio Examples

We decided to keep the quality assessment really simple: we generated audio from the validation subsets of our data (~200 files per speaker), shuffled it with the original recorded audios of the same speakers, and gave the mix to a group of 24 assessors to evaluate the sound quality on a five-point scale. For 8kHz and 16kHz the scores were collected separately (both for synthesized and original speech). For simplicity we used the following grades: [1, 2, 3, 4-, 4, 4+, 5-, 5]. The higher the quality, the more detailed our scale is. Then, for each speaker, we simply calculated the mean.

In total people scored audios 37,403 times. 12 people annotated the whole dataset. 12 other people managed to annotate from 10% to 75% of audios. For each speaker we calculated mean (standard deviation is shown in brackets). We also tried first calculating median scores for each audio and then averaging them. But this just increases the mean values without affecting the ratios, so we just used plain averages in the end. The key metric here of course is the ratio between the mean score for synthesis vs the original audio. Some users had much lower scores overall (hence high dispersion), but we decided to keep all scores as is without cleaning outliers.

Speaker	Original	Synthesis	Ratio	Examples
aidar_8khz	4.67 (.45)	4.52 (.55)	96.8%	link
baya_8khz	4.52 (.57)	4.25 (.76)	94.0%	link
kseniya_8khz	4.80 (.40)	4.54 (.60)	94.5%	link
aidar_16khz	4.72 (.43)	4.53 (.55)	95.9%	link
baya_16khz	4.59 (.55)	4.18 (.76)	91.1%	link
kseniya_16khz	4.84 (.37)	4.54 (.59)	93.9%	link
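The ratio column can be roughly reproduced from the rounded means in the table. The sketch below is only an illustration: the helper name `synthesis_ratio` is ours, and because it starts from rounded means, the result may differ from the published ratio by about 0.1 percentage points (the published ratios were presumably computed from unrounded means).

```python
# Mean naturalness scores copied from the table above: (original, synthesis).
scores = {
    "aidar_8khz": (4.67, 4.52),
    "baya_8khz": (4.52, 4.25),
    "kseniya_8khz": (4.80, 4.54),
    "aidar_16khz": (4.72, 4.53),
    "baya_16khz": (4.59, 4.18),
    "kseniya_16khz": (4.84, 4.54),
}


def synthesis_ratio(original_mean, synthesis_mean):
    """Return the synthesis-to-original score ratio as a percentage."""
    return 100.0 * synthesis_mean / original_mean


# Ratio per speaker, computed from the rounded means above.
ratios = {name: synthesis_ratio(orig, synth)
          for name, (orig, synth) in scores.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}%")
```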

We asked our assessors to rate the "naturalness of the speech" (not the audio quality). Nevertheless, we were surprised that, anecdotally, people cannot tell 8 kHz from 16 kHz on their everyday devices (which the metrics also confirm). Baya has the lowest absolute and relative scores. Kseniya has the highest absolute scores, Aidar has the highest relative scores. Baya also has higher score dispersion.

Manually inspecting audios with high score dispersion reveals several patterns. Speaker errors, tacotron errors (pauses), proper names and hard-to-read words are the most common causes. Of course 75% of such differences are in synthesized audios and sampling rate does not seem to affect it.

We tried to rate "naturalness". But it is only natural to try estimating "unnaturalness" or "robotness" as well. It can be measured by asking people to choose between two audios. But we went one step further and essentially applied a double-blind test. We asked our assessors to rate the same audio 4 times in random order: original and synthesis, each at both sampling rates. For assessors who annotated the whole dataset we calculated the following table:

Comparison	Worse	Same	Better
16k vs 8k, original	957	4811	1512
16k vs 8k, synthesis	1668	4061	1551
Original vs synthesis, 8k	816	3697	2767
Original vs synthesis, 16k	674	3462	3144

Several conclusions can be drawn:

In 66% of cases people cannot hear the difference between 8k and 16k;
In synthesis 8k helps to hide some errors;
In about 60% of cases synthesis is same or better than the original;
Two last conclusions hold regardless of the sampling rate, 8k having a slight advantage;
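The percentages in these conclusions can be checked against the comparison table. The sketch below assumes that "Worse" means the first item in a row was rated worse than the second; the source does not spell this out, but it is the reading that agrees with the "about 60%" figure. The helper name `shares` is ours.

```python
# Counts copied from the comparison table above: (worse, same, better).
comparisons = {
    "16k vs 8k, original": (957, 4811, 1512),
    "16k vs 8k, synthesis": (1668, 4061, 1551),
    "Original vs synthesis, 8k": (816, 3697, 2767),
    "Original vs synthesis, 16k": (674, 3462, 3144),
}


def shares(worse, same, better):
    """Convert raw counts into percentages of all comparisons in the row."""
    total = worse + same + better
    return tuple(100.0 * count / total for count in (worse, same, better))


table_shares = {name: shares(*counts) for name, counts in comparisons.items()}

# Share of cases where listeners heard no difference between 16k and 8k originals.
same_original = table_shares["16k vs 8k, original"][1]

# Assumed reading: original "worse" or "same" means synthesis is same or better.
synthesis_ok_8k = sum(table_shares["Original vs synthesis, 8k"][:2])
synthesis_ok_16k = sum(table_shares["Original vs synthesis, 16k"][:2])

print(f"no audible difference, original: {same_original:.1f}%")
print(f"synthesis same or better, 8k: {synthesis_ok_8k:.1f}%")
print(f"synthesis same or better, 16k: {synthesis_ok_16k:.1f}%")
```

Under this reading the 8k figure comes out slightly higher than the 16k one, in line with the last conclusion.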

You can hear for yourself how it sounds, both for our unique voices and for speakers from external sources (more audio for each speaker can be synthesized in the colab notebook in our repo).

If you're unfamiliar with colab notebooks or you just want a quick listen, here are some random audios for our voices:

Aidar:

Baya:

Kseniya:

Once again, please note that these are not cherry-picked examples, but how the synthesis actually sounds.

Speed Benchmarks

Speed is the next important defining property of the model, and to measure the speed of synthesis we use the following simple metrics:

RTF (Real Time Factor) — time the synthesis takes divided by audio duration;
RTS = 1 / RTF (Real Time Speed) — how much the synthesis is "faster" than real-time;
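As a minimal sketch of how these two metrics can be computed, the snippet below times an arbitrary synthesis callable. The names `measure_rtf` and `synthesize` are hypothetical and only for illustration. This is not the benchmarking code used for the tables in this section.

```python
import time


def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF: time spent on synthesis divided by the duration of the audio."""
    return synthesis_seconds / audio_seconds


def real_time_speed(rtf):
    """RTS: how many times faster than real-time the synthesis runs."""
    return 1.0 / rtf


def measure_rtf(synthesize, text, sample_rate):
    """Time one call of a hypothetical `synthesize(text)` returning samples."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(samples) / sample_rate
    return real_time_factor(elapsed, audio_seconds)


# Worked example: 3.5 s of compute for 5 s of generated audio.
rtf = real_time_factor(3.5, 5.0)
rts = real_time_speed(rtf)
print(f"RTF = {rtf:.1f}, RTS = {rts:.1f}")
```

An RTF below 1 (equivalently, an RTS above 1) means the synthesis is faster than real-time.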

We benchmarked the models on two devices using PyTorch 1.8 utils:

CPU — Intel i7-6800K CPU @ 3.40GHz;
GPU — 1080 Ti;
When measuring CPU performance, we also limited the number of threads used;

For the 16 kHz models we got the following metrics:

BatchSize	Device	RTF	RTS
1	CPU 1 thread	0.7	1.4
1	CPU 2 threads	0.4	2.3
1	CPU 4 threads	0.3	3.1
4	CPU 1 thread	0.5	2.0
4	CPU 2 threads	0.3	3.2
4	CPU 4 threads	0.2	4.9
---	-----------	---	---
1	GPU	0.06	16.9
4	GPU	0.02	51.7
8	GPU	0.01	79.4
16	GPU	0.008	122.9
32	GPU	0.006	161.2
---	-----------	---	---

For the 8 kHz models we got the following metrics:

BatchSize	Device	RTF	RTS
1	CPU 1 thread	0.5	1.9
1	CPU 2 threads	0.3	3.0
1	CPU 4 threads	0.2	4.2
4	CPU 1 thread	0.4	2.8
4	CPU 2 threads	0.2	4.4
4	CPU 4 threads	0.1	6.6
---	-----------	---	---
1	GPU	0.06	17.5
4	GPU	0.02	55.0
8	GPU	0.01	92.1
16	GPU	0.007	147.7
32	GPU	0.004	227.5
---	-----------	---	---

A number of things surprised us during benchmarking:

AMD processors performed much worse;
The bottleneck in our case was the tacotron, not the vocoder (there is still a lot of potential to make the whole model 3-4x faster, maybe even 10x);
More than 4 CPU threads or batch size larger than 4 do not help;

Available Speakers

For simplicity we decided to publish all our models as part of silero-models. The full list of current models can always be found in this yaml file.

At the time of this writing, the following voices are supported (for each speaker _16khz and _8khz versions of voices are available):

Speaker	Gender	Language	Source	Dataset License	Examples
aidar	m	ru	`Silero`	Private	8000 / 16000
baya	f	ru	`Silero`	Private	8000 / 16000
kseniya	f	ru	`Silero`	Private	8000 / 16000
irina	f	ru	Private contribution	TBD	8000 / 16000
natasha	f	ru	source	CC BY 4.0	8000 / 16000
ruslan	m	ru	source	CC BY-NC-SA 4.0	8000 / 16000
lj	f	en	source	Public Domain	8000 / 16000
thorsten	m	de	source	Creative Commons Zero v1.0 Universal	8000 / 16000
gilles	m	fr	source	Public Domain	8000 / 16000
tux	m	es	source	Public Domain	8000 / 16000

How To Try It

All models are published in the silero-models repository, which also contains examples of launching the synthesis in colab. For completeness, here is a minimalistic example:

import torch

language = 'ru'
speaker = 'kseniya_16khz'
device = torch.device('cpu')

# torch.hub fetches the repo and returns the model together with its helpers
(model,
 symbols,
 sample_rate,
 example_text,
 apply_tts) = torch.hub.load(repo_or_dir='snakers4/silero-models',
                             model='silero_tts',
                             language=language,
                             speaker=speaker)

model = model.to(device)  # gpu or cpu
# apply_tts takes a list of texts to synthesize
audio = apply_tts(texts=[example_text],
                  model=model,
                  sample_rate=sample_rate,
                  symbols=symbols,
                  device=device)
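To listen to the result outside of a notebook, the waveform can be written to a WAV file. The helper below is a sketch that uses only the Python standard library. It assumes that each element of `audio` is a one-dimensional mono waveform with float samples in the [-1, 1] range, which this article does not state explicitly, and the name `save_wav` is ours.

```python
import struct
import wave


def save_wav(path, samples, sample_rate):
    """Write mono float samples (assumed to be in [-1, 1]) as 16-bit PCM WAV."""
    frames = bytearray()
    for sample in samples:
        # Clip to the valid range before scaling to 16-bit integers.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)      # mono
        wav_file.setsampwidth(2)      # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
```

Under the assumption above, something like `save_wav('kseniya.wav', audio[0].tolist(), sample_rate)` would store the first synthesized phrase.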

The following special characters are currently supported: !'(),.:;?¡¿. In addition, for most Russian speakers, stress marks were used in the text for voicing (the + symbol before the stressed vowel), so when testing such models you still need to mark the stress manually:

Speaker	With stress
aidar	yes
baya	yes
kseniya	yes
irina	yes
natasha	yes
ruslan	yes
lj	no
thorsten	no
gilles	no
tux	no
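To illustrate the stress format for the Russian voices, here is a small hypothetical helper that puts the + symbol before a chosen vowel. It is not part of silero-models, and you still have to know which vowel is stressed. It only shows what the marked-up input looks like.

```python
RUSSIAN_VOWELS = "аеёиоуыэюя"


def mark_stress(word, stressed_vowel):
    """Insert '+' before the n-th vowel (1-based) of a Russian word."""
    seen = 0
    for index, char in enumerate(word):
        if char.lower() in RUSSIAN_VOWELS:
            seen += 1
            if seen == stressed_vowel:
                return word[:index] + "+" + word[index:]
    raise ValueError(f"{word!r} has fewer than {stressed_vowel} vowels")


# The stress in "привет" falls on the second vowel.
print(mark_stress("привет", 2))
```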

In the future we plan to convert all models to a simpler and more unified input format that does not require stress marks. To avoid confusion, the yml file that describes all our models explicitly specifies a set of tokens for each model and an example phrase to generate.

Philosophy, License, and Motivation

As model authors, we consider the following rules for using models to be fair:

None of the models described above may be used in commercial products;
Voices from external sources are provided for demonstration purposes only;
The silero-models repository is published under the GNU AGPL 3.0 license. Legally speaking, this does not prohibit commercial use, but the license requires such solutions to be fully open under the same license, and commercial solutions of that kind are rare;
If your goal is to make non-commercial use of our models for the benefit of society, we will be glad to help you with the integration of our models;
If you are planning to try the models for personal use, you are encouraged to share the results of your experiments in the repository;
If you are willing to incorporate our models into non-commercial products for people with speech or sight impairments, you are welcome to reach out, and we will be happy to do what we can;

The main goal of this project was to build a modern TTS system that meets the criteria described above.

Further Work

We plan to develop and improve our models, in particular:

Continue to work on the quality and naturalness of the sound and expand the library of voices;
Sooner or later, add support for voice speed and pitch;
Make our models 3-4 times faster, which is still possible;
It is unlikely, but still possible, that sooner or later we will be able to add a multi-speaker model or voice transfer;

Tongue Twisters

And as a bonus, here are some tongue twisters.

Russian:

Other languages:
