We are proud to announce that we have built from the ground up and released high-quality speech-to-text models (i.e. on par with premium Google models) for the following languages:

English;
German;
Spanish;

You can find all of our models in our repository, together with examples and quality and performance benchmarks. We also invested some time in making our models as accessible as possible: you can try our examples as well as the PyTorch, ONNX, and TensorFlow checkpoints. You can also load our models via TorchHub.

Model	PyTorch	ONNX	TensorFlow	Quality
English (en_v1)	✓	✓	✓	link
German (de_v1)	✓	✓	✓	link
Spanish (es_v1)	✓	✓	✓	link
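
As a rough illustration of the TorchHub route, here is a minimal sketch. The repository name `snakers4/silero-models` and the entry point `silero_stt` are assumptions that do not appear in this post, so check the repository for the exact call. The helper names `hub_load_kwargs` and `load_stt_model` are hypothetical and exist only for this example.

```python
# Sketch only: the repository name and entry point below are assumptions,
# not taken from the announcement text.
HUB_REPO = "snakers4/silero-models"  # assumed GitHub repository
HUB_ENTRY_POINT = "silero_stt"       # assumed hubconf entry point

# Language codes mapped to the checkpoint names listed in the table above.
CHECKPOINTS = {"en": "en_v1", "de": "de_v1", "es": "es_v1"}


def hub_load_kwargs(language):
    """Build keyword arguments for torch.hub.load, validating the language."""
    if language not in CHECKPOINTS:
        raise ValueError(
            f"unsupported language {language!r}; expected one of {sorted(CHECKPOINTS)}"
        )
    return {"repo_or_dir": HUB_REPO, "model": HUB_ENTRY_POINT, "language": language}


def load_stt_model(language="en", device="cpu"):
    """Fetch the model via TorchHub and return whatever the entry point returns."""
    import torch  # imported lazily so the helper above works without PyTorch

    return torch.hub.load(device=torch.device(device), **hub_load_kwargs(language))
```

Calling `load_stt_model("de")` needs PyTorch installed and network access on the first run, since TorchHub downloads the repository and the checkpoint before caching them locally.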

Why This is a Big Deal

Speech-to-text has traditionally had high barriers to entry for a number of reasons:

Hard-to-collect data;
Costly annotation and high data requirements;
High compute requirements and reliance on obsolete, hard-to-use technologies;

Here are some of the typical problems that existing ASR solutions and approaches had before our release:

STT research typically focused on huge compute budgets;
Pre-trained models and recipes did not generalize well, were difficult to use even as-is, and relied on obsolete tech;
Until now, the STT community lacked easy-to-use, high-quality, production-grade STT models;

First, we tried to alleviate some of these problems for the community by publishing the largest Russian spoken corpus in the world (see our Habr post here). Now we are trying to solve these problems as follows:

We publish a set of pre-trained high-quality models for popular languages;
Our models are designed to be as robust as possible across different domains, as you can see in our benchmarks;
Our models are pre-trained on vast and diverse datasets;
Our models are fast and can be run on commodity hardware;
Our models are easy to use;

Embarrassing Simplicity

We believe that modern technology should be embarrassingly simple to use. In our work we follow these design principles:

Models should be compact and fast;
Models should generalize across domains: there should be one general solution tailored superficially to particular domains, not vice versa;
Models should be easy to use;

Further Plans

Currently, the smallest size we have been able to compress our models to is around 50 megabytes.
We still plan to compress our Enterprise Edition models down to ~20 megabytes without loss of fidelity.
We are also planning to release Community Edition models for other popular languages.
