High-Quality Text-to-Speech Made Accessible, Simple and Fast / Comments / Habr

UFO landed and left these words here

snakers4 Mar 30 2021 at 05:43

Yeah, it is hilarious that we did not catch this during proofreading

robert_ayrapetyan Mar 31 2021 at 02:19

But on GitHub it seems you provide both?

snakers4 Mar 31 2021 at 03:10

Yes, kind of

snakers4 Apr 2 2021 at 12:01

For people from the future, who will be reading this:

No handling of numbers, those are just omitted

There is no text normalization middleware packaged with the models.
The model just produces audio from text.
Normalization was left out by design.
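
Since digits are silently dropped, it may help to check the input before synthesis. The sketch below is not part of silero-models; `find_digit_runs` is a hypothetical helper that only reports digit sequences, so the caller can spell them out by hand or with a normalizer of their choice.

```python
import re


def find_digit_runs(text):
    """Return digit sequences (e.g. '2021', '3,5') that need spelling out."""
    # Match runs of digits, optionally joined by '.' or ',' (decimals, dates).
    return re.findall(r'\d+(?:[.,]\d+)*', text)


if __name__ == '__main__':
    sample = 'в 2021 году было 3,5 процента'
    leftovers = find_digit_runs(sample)
    if leftovers:
        print('Spell these out before calling TTS:', leftovers)
```

An empty result means the text contains no digits and can be passed to the model as is.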

In future releases, stress will be handled automatically.

Issues with longer sentences: inference just stops (might be related to the warning that a sentence has more than 140 chars) or gets worse toward the end of longer sentences

This is also by design.
The model accepts sentences, and it can work with batches.
See these examples:

import torch
import torchaudio

language = 'ru'
speaker = 'kseniya_16khz'
device = torch.device('cpu')
# Load the TTS model and its helpers from torch.hub
model, symbols, sample_rate, example_text, apply_tts = torch.hub.load(
    repo_or_dir='snakers4/silero-models',
    model='silero_tts',
    language=language,
    speaker=speaker)
model = model.to(device)  # gpu or cpu

example_text="нав+ерное, существ+уют друг+ие рец+епты, но я их не зн+аю. +или он+и мне не помог+ают. х+очешь моег+о сов+ета - пож+алуйста: сад+ись раб+отать. сл+ава б+огу, так+им л+юдям, как мы с тоб+ой, для раб+оты ничег+о не н+ужно кр+оме бум+аги и карандаш+а."

# Synthesize one sentence at a time and save each result to its own file
for i, text in enumerate(example_text.split('. ')):
    audio = apply_tts(texts=[text],
                      model=model,
                      sample_rate=sample_rate,
                      symbols=symbols,
                      device=device)
    torchaudio.save(f'test_{str(i).zfill(2)}.wav',
                    audio[0].unsqueeze(0),
                    sample_rate=sample_rate,  # 16000 for the kseniya_16khz speaker
                    bits_per_sample=16)

import torch
import torchaudio

language = 'ru'
speaker = 'kseniya_16khz'
device = torch.device('cpu')
# Load the TTS model; force_reload=True re-downloads the repo instead of using the cache
model, symbols, sample_rate, example_text, apply_tts = torch.hub.load(
    repo_or_dir='snakers4/silero-models',
    model='silero_tts',
    language=language,
    speaker=speaker,
    force_reload=True)
model = model.to(device)  # gpu or cpu

example_text="нав+ерное, существ+уют друг+ие рец+епты, но я их не зн+аю. +или он+и мне не помог+ают. х+очешь моег+о сов+ета - пож+алуйста: сад+ись раб+отать. сл+ава б+огу, так+им л+юдям, как мы с тоб+ой, для раб+оты ничег+о не н+ужно кр+оме бум+аги и карандаш+а."
# Split into sentences and synthesize all of them in a single batch
example_text = example_text.split('. ')

print(example_text)
audio = apply_tts(texts=example_text,
                  model=model,
                  sample_rate=sample_rate,
                  symbols=symbols,
                  device=device)
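
For texts that are not already split into short sentences, something like the following could prepare the batch. This is only a sketch and not part of silero-models: `split_for_tts` is a hypothetical helper, and the 140-character limit is an assumption taken from the warning mentioned above.

```python
import re

MAX_CHARS = 140  # assumed limit, based on the warning about long sentences


def split_for_tts(text, max_chars=MAX_CHARS):
    """Split text into sentence-like chunks no longer than max_chars."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    chunks = []
    for sentence in sentences:
        if len(sentence) <= max_chars:
            chunks.append(sentence)
            continue
        # Sentence is too long: pack whole words greedily under the limit.
        # A single word longer than max_chars is kept intact.
        current = ''
        for word in sentence.split():
            candidate = f'{current} {word}'.strip()
            if len(candidate) > max_chars and current:
                chunks.append(current)
                current = word
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks
```

The resulting list can be passed as `texts=` to `apply_tts`, in the same way as the batch example above. Splitting on word boundaries in the middle of a sentence may affect intonation, so splitting at commas by hand is likely to sound better where possible.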

d1gital_love Nov 17 2021 at 05:35

Regarding "In synthesis 8k helps to hide some errors":

90g_best_0_20210306-003223aidar_16000.wav - the first 4 seconds are clean, but at second 5 the voice glitches somehow.

90g_best_0_20210306-003223aidar_8000.wav - there is no such jump at second 5. The overall quality is poor, 8000 Hz.

I listened without looking at spectrograms.

d1gital_love Nov 17 2021 at 06:02

0000.40_16000g_best_0_20210315-154007irina_16000.wav - no words after the 3-second mark

0008.56_16000g_best_0_20210315-154007irina_16000.wav, 0003.68_16000g_best_0_20210315-154007irina_16000.wav - unnatural sounds, hard to describe in words

0002.72_16000g_best_0_20210315-154007irina_16000.wav - some kind of metallic echo with a short delay

d1gital_love Nov 17 2021 at 06:02

kseniya 16000 is almost exactly what is needed. Very minor inaccuracies.
