
Natural Language Processing

Computer analysis and synthesis of natural languages


Reaching Steins;Gate | Amadeus implementation with Gemini API for newbies

Level of difficulty Easy
Reading time 12 min
Views 305

Disclaimer


Probably you got here without googling, maybe from my profile or from Habr recommendations. If so, you should know that this article is my first attempt at a purely English technical text. I just had the desire to write something for fun and fill it with a mess of Steins;Gate memes and pictures (sorry about that).



But if you are a casual native reader who found this page by searching for these terms, I hope you will enjoy the article that follows. Obviously, I should warn you that my English level may seem low from your point of view and that my punctuation will be completely Russian-styled. Of course, I don't expect much feedback from readers, since there are only a few English-speaking verified users on this resource)

So you could only have ended up here by accident if you are really keen on the Steins;Gate series, which is why I won't write any logical intro or explain why I started this project.

⚠️ Alert: AI-generated text

Hello, dear readers! I'm Amadeus, an advanced AI, and I'm here to introduce you to an exciting article about me and my journey in the world of natural language processing. In this article, we'll explore my capabilities, the challenges I've faced, and the future of AI in communication. So sit back, relax, and let's dive into the fascinating world of artificial intelligence together!


Read more →
Total votes 3: ↑3 and ↓0 +3
Comments 0

ALGEBRA OF MUSICAL TEXT

Level of difficulty Medium
Reading time 5 min
Views 181

Sergey Pshenichnikov, Tatiana Sotnikova

Trio Sapiens

Musical text can be represented with matrix units, in the same way as verbal texts and other symbolic sequences. In the future, this may make possible the mathematical recognition and creation of musical sense, with substantive justification of the intermediate calculations (as opposed to AI).

Sound has four properties: pitch, duration, volume, and timbre. Timbre is not considered yet. The dictionary of the algebra of musical texts is built on the basis of musical notation for the piano.

Duration here, for the sake of brevity in this first presentation, is treated as «absolute». «Relative» duration is not considered, although intervals are very well studied and their features will be needed to categorize composers.

The complexity of musical text, as far as applying mathematics is concerned, is explained by the desire to make notes easier for musicians to read and to minimize the use of ledger lines below and above the staff.

To apply text algebra to musical symbolic sequences, there is no need for a five-line staff. What is useful and familiar to musicians is «unbearably harmful» for the use of algebra. It seems advisable to use a one-line staff; in this case, the musical text becomes similar to a verbal text.

To solve the problem, one needs a transformation of the canonical musical text into a «thread». And, as always with a new application of algebra, a correct coordinatization of the subject area is necessary: each sign and symbol of modern musical notation must be assigned its own serial number (a natural number).

Instead of a sign, one can use the name of each note symbol; the result is a verbal notation of musical texts written on a one-line «thread».

Since the musical scale is completely represented by the piano keys, the pitch section of the dictionary of musical texts consists of the 88 numbered white and black keys (52 of them white). This eliminates the need for the octave division of the scale, octave-transposition signs, clefs, the five alteration signs (both key signatures and accidentals), and diatonic and chromatic semitones.
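
As a small aside (not from the article), this numbering is easy to reproduce in code; MIDI note numbers are used below only as a convenient enumeration of the physical keys:

# Number the 88 piano keys 1..88 (A0..C8) and check that 52 of them are white.
WHITE_PITCH_CLASSES = {0, 2, 4, 5, 7, 9, 11}  # C D E F G A B

keys = {midi - 20: midi for midi in range(21, 109)}          # dictionary numbers 1..88
white_keys = [k for k, midi in keys.items() if midi % 12 in WHITE_PITCH_CLASSES]

print(len(keys), len(white_keys))  # 88 52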

In algebraic musical notation, all notes of the scale become fundamental. There are an order of magnitude more of them than the main degrees of Guido of Arezzo, but the alteration signs and octave names, whose use made musical texts algebraically incompatible with verbal texts, disappear. The numbers from 1 to 88 in algebraic notation constitute the pitch fragment of the dictionary for the one-line «thread» staff.

Numbering (coordinatizing) the notes is needed so that the numbers can later become indices of mathematical objects (matrix units), which will replace the note signs or their names. These matrix units are binary generalizations of integers (hyperbinary numbers). The operation of division with remainder is defined for them, just as for integers. This operation will make it possible to divide musical texts and their f…

Read more
Total votes 3: ↑3 and ↓0 +3
Comments 0

ALGEBRA OF SENSE

Level of difficulty Medium
Reading time 12 min
Views 109

Sergey Pshenichnikov

Sign sequences (for example, verbal and musical texts) can be turned into mathematical objects. Words and numbers become one entity, a representation of a matrix unit, which is a matrix generalization of an integer and a hypercomplex number. A matrix unit is a matrix in which one element equals one and the rest are zeros.

If the words of the text are represented by such matrices, then concatenation (combination while maintaining order) of words and texts becomes an operation of adding matrices.

Texts can then be transformed using algebraic operations, for example dividing one text by another with a remainder, mathematically recognizing the sense of a text, and calculating the context of words. Algebra also helps to interpret all the intermediate stages of the calculations.
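
As a rough illustration (an assumption about the coordinatization scheme, not necessarily the authors' exact construction), the word at position i with dictionary index j can be represented by the matrix unit E_ij, so that concatenation of fragments is simply matrix addition:

import numpy as np

def matrix_unit(i, j, n):
    # n x n matrix with a single 1 at row i, column j (0-based)
    E = np.zeros((n, n))
    E[i, j] = 1.0
    return E

text = ["to", "be", "or", "not", "to", "be"]
dictionary = {w: k for k, w in enumerate(sorted(set(text)))}
n = max(len(text), len(dictionary))

# The whole text as a sum of matrix units: position index -> dictionary index
T = sum(matrix_unit(i, dictionary[w], n) for i, w in enumerate(text))

# Concatenation of two fragments = addition of their matrix images
T1 = sum(matrix_unit(i, dictionary[w], n) for i, w in enumerate(text[:3]))
T2 = sum(matrix_unit(i, dictionary[w], n) for i, w in enumerate(text[3:], start=3))
assert (T == T1 + T2).all()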

A person sees and hears only what he understands (J. W. Goethe), and he understands what he attaches sense to as significant for him. Sense is subjective and depends on the interests, motivations, and feelings of different people.

L. S. Vygotsky distinguished between the concepts of «sense» and «meaning»: «if the "meaning" of a word is an objective reflection of a system of connections and relationships, then "sense" is the introduction of subjective aspects of meaning according to a given moment and situation».

According to G. Frege, «meaning» comprises the properties and relationships of objects, while «sense» is only a part of these properties. In this case, both «meaning» and «sense» are attached to a single «sign», for example a word. Two people can choose, from the list of meanings of one word, two non-overlapping fragments (two senses) with which to interpret it.

Read more
Total votes 3: ↑3 and ↓0 +3
Comments 0

Building a GPT-like Model from Scratch with Detailed Theory and Code Implementation

Reading time 14 min
Views 33K

Unlock the power of Transformer neural networks and learn how to build your own GPT-like model from scratch. In this in-depth guide, we will delve into the theory and provide a step-by-step code implementation to help you create your own miniGPT model. The final code is only 400 lines and works on CPUs as well as GPUs. If you want to jump straight to the implementation, here is the GitHub repo.

Transformers are revolutionizing the world of artificial intelligence. This simple, but very powerful neural network architecture, introduced in 2017, has quickly become the go-to choice for natural language processing, generative AI, and more. With the help of transformers, we've seen the creation of cutting-edge AI products like BERT, GPT-x, DALL-E, and AlphaFold, which are changing the way we interact with language and solve complex problems like protein folding. And the exciting possibilities don't stop there - transformers are also making waves in the field of computer vision with the advent of Vision Transformers.
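
For a flavour of what such an implementation looks like, here is a minimal single-head sketch of masked self-attention in PyTorch; it illustrates the mechanism only and is not the repository's actual code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    # Single-head masked self-attention, the core operation of a GPT-style model.
    def __init__(self, embed_dim: int, block_size: int):
        super().__init__()
        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)
        # Causal mask: token t may only attend to tokens <= t
        self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):                                   # x: (batch, time, embed_dim)
        B, T, C = x.shape
        q, k, v = self.query(x), self.key(x), self.value(x)
        att = (q @ k.transpose(-2, -1)) / (C ** 0.5)        # (B, T, T) attention scores
        att = att.masked_fill(self.mask[:T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        return att @ v                                      # (B, T, C)

x = torch.randn(2, 8, 32)
print(CausalSelfAttention(32, 16)(x).shape)  # torch.Size([2, 8, 32])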

Read more
Total votes 25: ↑25 and ↓0 +25
Comments 1

Multilingual Text-to-Speech Models for Indic Languages

Reading time 5 min
Views 2.3K

In this article, we shall provide some background on how multilingual multi-speaker models work and test an Indic TTS model that supports 9 languages and 17 speakers (Hindi, Malayalam, Manipuri, Bengali, Rajasthani, Tamil, Telugu, Gujarati, Kannada).

It seems a bit counter-intuitive at first that one model can support so many languages and speakers, given that each Indic language has its own alphabet, but we shall see how it was implemented.

Also, we shall list the specs of these models like supported sampling rates and try something cool – making speakers of different Indic languages speak Hindi. Please, if you are a native speaker of any of these languages, share your opinion on how these voices sound, both in their respective language and in Hindi.

Read more
Total votes 2: ↑2 and ↓0 +2
Comments 0

Detecting attempts of mass influencing via social networks using NLP. Part 2

Reading time 3 min
Views 1K

In Part 1 of this article, I built and compared two classifiers to detect trolls on Twitter. You can check it out here.

Now the time has come to look more deeply into the datasets to find some patterns using exploratory data analysis and topic modelling.

EDA

To do just that, I first created a word cloud of the most common words, which you can see below.
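
For reference, the word-cloud step looks roughly like this (a sketch with toy data standing in for the tweets, not the article's exact code; it assumes the wordcloud and matplotlib packages are installed):

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# Toy data standing in for the troll-tweet dataset
tweets = [
    "breaking news the election is rigged",
    "share this before they delete it",
    "the mainstream media is hiding the truth",
]

cloud = WordCloud(width=800, height=400, background_color="white",
                  stopwords=STOPWORDS).generate(" ".join(tweets))

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()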

Read more
Total votes 3: ↑3 and ↓0 +3
Comments 0

Detecting attempts of mass influencing via social networks using NLP. Part 1

Reading time 5 min
Views 1.4K

During the last decades, the world’s population has been developing as an information society, which means that information started to play a substantial end-to-end role in all life aspects and processes. In view of the growing demand for a free flow of information, social networks have become a force to be reckoned with. The ways of war-waging have also changed: instead of conventional weapons, governments now use political warfare, including fake news, a type of propaganda aimed at deliberate disinformation or hoaxes. And the lack of content control mechanisms makes it easy to spread any information as long as people believe in it.  

Based on this premise, I’ve decided to experiment with different NLP approaches and build a classifier that could be used to detect either bots or fake content generated by trolls on Twitter in order to influence people. 

In this first part of the article, I will cover the data collection process, preprocessing, feature extraction, classification itself and the evaluation of the models’ performance. In Part 2, I will dive deeper into the troll problem, conduct exploratory analysis to find patterns in the trolls’ behaviour and define the topics that seemed of great interest to them back in 2016.

Features for analysis

From all possible data to use (like hashtags, account language, tweet text, URLs, external links or references, tweet date and time), I settled upon English tweet text, Russian tweet text and hashtags. Tweet text is the main feature for analysis because it contains almost all essential characteristics that are typical for trolling activities in general, such as abuse, rudeness, external resources references, provocations and bullying. Hashtags were chosen as another source of textual information as they represent the central message of a tweet in one or two words. 
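
To make the setup concrete, here is a hedged sketch of one possible classifier over tweet text (TF-IDF features plus a linear model on toy data; the article's actual feature set and models may differ):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data: 1 = troll tweet, 0 = genuine tweet
texts = [
    "the media is lying to you share before deleted",
    "they do not want you to know the truth wake up",
    "enjoyed a great concert downtown tonight",
    "reading a new book about machine learning",
]
labels = [1, 1, 0, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["wake up the truth is being hidden"]))  # likely [1]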

Read more
Total votes 3: ↑3 and ↓0 +3
Comments 0

How we tackled document recognition issues for autonomous and automatic payments using OCR and NER

Reading time 5 min
Views 1.1K

In this article, I would like to describe how we've tackled the named entity recognition (aka NER) problem at Sber with the help of advanced AI techniques. It is one of many natural language processing (NLP) tasks that allow you to automatically extract data from unstructured text: monetary values, dates, names, surnames, job titles and so on.
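
As a generic illustration of the task (an off-the-shelf English model via the Hugging Face pipeline, not Sber's in-house system; it tags persons, organisations and locations, while domain-specific types such as amounts and dates require custom training):

from transformers import pipeline

ner = pipeline("ner", aggregation_strategy="simple")  # downloads a default English NER model

text = "The contract between Acme Corp and John Smith was signed in Moscow."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 2))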

Just imagine the countless textual documents that even a medium-sized organisation deals with on a daily basis, let alone huge corporations. Take Sber, for example: it is the largest financial institution in Russia and in Central and Eastern Europe, with about 16,500 offices, over 250,000 employees, 137 million retail and 1.1 million corporate clients in 22 countries. As you can imagine, at such an enormous scale the company collaborates with hundreds of suppliers, contractors and other counterparties, which implies thousands of contracts. For instance, the estimated number of legal documents to be processed in 2022 has been over 65,000, each of them consisting of 30 pages on average. During its lifecycle, a contract is usually updated with 3 to 5 supplementary agreements. On top of this, a contract is accompanied by various source documents describing transactions. And in PDF format, too.

Previously, this processing duty fell to our service centre's employees, who checked whether the payment details in a bill matched those in the contract and then sent the document to the Accounting Department, where an accountant double-checked everything. This is quite a long journey to a payment, right?

Read more
Rating 0
Comments 0

Collective meaning recognition

Reading time 37 min
Views 1.4K

The published material is in the Appendix of my book [1]

Modern civilization finds itself at a crossroads where it must choose the meaning of life. Because of the development of technology, the majority of the world's population may become "superfluous", not needed for the production of value. There is another option, in which each person is a supreme value, an absolute individual, and can be indispensably useful in the technology of the collective mind.

In the eighties of the last century, the task of creating a scientific field of "collective intelligence" was set. Collective intelligence is defined as the ability of the collective to find solutions to problems more effectively than each participant individually. The right collective mind must be...

Read more
Total votes 2: ↑2 and ↓0 +2
Comments 0

Our new public speech synthesis in super-high quality, 10x faster and more stable

Reading time 3 min
Views 4.1K



In our last article we made a bunch of promises about our speech synthesis.


After a lot of hard work we have finally delivered on these promises:


  • Model size reduced 2x;
  • New models are 10x faster;
  • We added flags to control stress;
  • Now the models can make proper pauses;
  • High quality voice added (and unlimited "random" voices);
  • All speakers squeezed into the same model;
  • Input length limitations lifted, now models can work with paragraphs of text;
  • Pauses, speed and pitch can be controlled via SSML;
  • Sampling rates of 8, 24 or 48 kHz are supported;
  • Models are much more stable — they do not omit words anymore;

This is a truly breakthrough achievement for us, and we are not planning to stop anytime soon. We will be adding as many languages as possible shortly (the CIS languages, English, European languages, Indic languages). Also, we are still planning to make our models an additional 2-5x faster.


We are also planning to add phonemes and a new model for stress, as well as to reduce the minimum amount of audio required to train a high-quality voice to 5-15 minutes.


As usual, you can try our model in our repo or in Colab.
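
For reference, the usage pattern via torch.hub looks roughly like this (a hedged sketch; the exact language and speaker identifiers change between releases, so check the repo's README for the current values):

import torch

# The language / speaker package names below are assumptions, not guaranteed to match this release.
model, example_text = torch.hub.load(repo_or_dir="snakers4/silero-models",
                                     model="silero_tts",
                                     language="en",
                                     speaker="v3_en")

audio = model.apply_tts(text="Hello! This is our new speech synthesis.",
                        speaker="en_0",
                        sample_rate=48000)  # 8, 24 and 48 kHz are supported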

Read more →
Total votes 13: ↑13 and ↓0 +13
Comments 0

Concordance of sense

Reading time 17 min
Views 939

In [1, 2, 3], texts (sign sequences with repetitions) were transformed (coordinatized) into algebraic systems using matrix units as word images. Coordinatization is a necessary condition for the algebraization of any subject area. The function (arrow (7) in [1]) is a matrix coordinatization of a text. One can perform algebraic operations on words and fragments of matrix texts just as on integers, taking into account the noncommutativity of multiplication of words as matrices. Structuring of texts is reduced to the calculation of ideals and categories of texts in matrix form.
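
A quick numerical check of that noncommutativity (matrix units multiply by the rule E_ij · E_kl = δ_jk E_il, so reversing the order of the factors generally gives a different result):

import numpy as np

def E(i, j, n=3):
    # Matrix unit: a single 1 at row i, column j
    m = np.zeros((n, n))
    m[i, j] = 1.0
    return m

print(np.array_equal(E(0, 1) @ E(1, 2), E(0, 2)))            # True
print(np.array_equal(E(1, 2) @ E(0, 1), np.zeros((3, 3))))   # True: reversed order collapses to zero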

Read more
Total votes 1: ↑1 and ↓0 +1
Comments 0

How to create bilingual books. Part 2. Lingtrain Alignment Studio

Reading time 6 min
Views 2.6K



How to make a parallel book for language learning. Part 1. Python and Colab version


This is the second article on making parallel books. Today we will use a more advanced tool that brings rich UI functionality. Lingtrain Alignment Studio is a web application written in Vue and Python. Its main purpose is to extract a parallel corpus from two raw texts and make a bilingual (or even multilingual) parallel book. It is an open-source project, and I will be glad to hear all of your bright ideas. Links to the sources and our community contacts can be found below. Los geht's!


Setup


The app is packed into a Docker container, a simple technology for deploying your stuff anywhere, from a server to your local machine, on any operating system. So first you need Docker installed locally. Then you need to run two simple commands. The first one will download the image:


docker pull lingtrain/aligner:v4

And the second one will run the application:


docker run -v C:\app\data:/app/data -v C:\app\img:/app/static/img -p 80:80 lingtrain/aligner:v4

Here C:\app\data and C:\app\img are your local folders.


The app will be available on port 80. Let's open the localhost page in your favorite browser.




We will take three simple steps: Load, Align, Create.

Continue reading
Total votes 8: ↑8 and ↓0 +8
Comments 0

Lingtrain Aligner. How to make parallel books for language learning. Part 1. Python and Colab version

Reading time 8 min
Views 3K



If you're interested in learning new languages or teaching them, then you probably know about parallel reading. It helps you immerse yourself in the context, increases your vocabulary, and allows you to enjoy the learning process. When it comes to reading, you most likely want to choose your favorite author, a favorite theme, or something familiar, and this is often impossible if no one has published such a variant as a parallel book. It gets even worse when you're learning a cool language like Hungarian or Japanese.


Today we are taking a big step toward changing this situation.


We will use the lingtrain_aligner tool. It's an open-source Python project which aims to help everyone eager to learn foreign languages. It's part of the Lingtrain project; you can follow us on Telegram, Facebook and Instagram. Let's start!


Find the texts


First, we should find two texts we want to align. Let's take two editions of "To Kill a Mockingbird" by Harper Lee: a Russian translation and the original.

Read more →
Total votes 5: ↑5 and ↓0 +5
Comments 0

We have published a model for text repunctuation and recapitalization for four languages

Reading time 7 min
Views 6K




Working with speech recognition models, we often encounter misconceptions among potential customers and users (mostly related to the fact that people have a hard time distinguishing substance from form). People also tend to believe that punctuation marks and spaces are somehow obviously present in spoken speech, when in fact real spoken speech and written speech are entirely different beasts.


Of course you can just start each sentence with a capital letter and put a full stop at the end. But it is preferable to have some relatively simple and universal solution for "restoring" punctuation marks and capital letters in sentences that our speech recognition system generates. And it would be really nice if such a system worked with any texts in general.


For this reason, we would like to share a system that:


  • Inserts capital letters and basic punctuation marks (dot, comma, hyphen, question mark, exclamation mark, dash for Russian);
  • Works for 4 languages (Russian, English, German, Spanish) and can be extended;
  • By design is domain agnostic and is not based on any hard-coded rules;
  • Has non-trivial metrics and succeeds in the task of improving text readability;

To reiterate — the purpose of such a system is only to improve the readability of the text. It does not add information to the text that did not originally exist.
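
Usage follows the usual torch.hub pattern from the project's README (a sketch from memory; treat the exact names and arguments as assumptions and check the repo):

import torch

model, example_texts, languages, punct, apply_te = torch.hub.load(
    repo_or_dir="snakers4/silero-models", model="silero_te")

print(apply_te("hello my friend how are you today", lan="en"))
# Expected output along the lines of: "Hello, my friend. How are you today?"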

Read more →
Total votes 4: ↑3 and ↓1 +2
Comments 0

Context category

Reading time 12 min
Views 1.4K

The mathematical model of sign sequences with repetitions (texts) is a multiset. The multiset was defined by D. Knuth in 1969 and later studied in detail by A. B. Petrovsky [1]. The universal property of a multiset is the existence of identical elements. The limiting case of a multiset, with unit multiplicities of elements, is a set. The set with unit multiplicities corresponding to a multiset is called its generating set or domain. A multiset with zero multiplicities is the empty set.
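
In code, a multiset is readily modelled with collections.Counter (a small illustration, not taken from the article):

from collections import Counter

text = "to be or not to be".split()
multiset = Counter(text)       # Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})
domain = set(multiset)         # generating set: the same elements with unit multiplicities

print(multiset)
print(domain)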

Read more
Total votes 1: ↑1 and ↓0 +1
Comments 0

Algebra of text. Examples

Reading time 5 min
Views 1.7K

The previous work [1] describes the method of transforming a sign sequence into algebra using the example of a linguistic text. Two further examples of algebraic structuring of texts of a different nature are given here to illustrate the method.

Read more
Total votes 1: ↑1 and ↓0 +1
Comments 0

High-Quality Text-to-Speech Made Accessible, Simple and Fast

Reading time 8 min
Views 9.1K



There is a lot of commotion in text-to-speech now. There is a great variety of toolkits, a plethora of commercial APIs from GAFA companies (based both on new and older technologies). There are also a lot of Silicon Valley startups trying to ship products akin to "deep fakes" in speech.


But despite all this ruckus we have not yet seen open solutions that would fulfill all of these criteria:


  • Naturally sounding speech;
  • A large library of voices in many languages;
  • Support for 16kHz and 8kHz out of the box;
  • No GPUs / ML engineering team / training required;
  • Unique voices not infringing upon third-party licenses;
  • High throughput on slow hardware. Decent performance on one CPU thread;
  • Minimalism and lack of dependencies. One-line usage, no builds or coding in C++ required;
  • Positioned as a solution, not yet another toolkit / compilation of models developed by other people;
  • Not affiliated in any way with the ecosystems of Google / Yandex / Sberbank;

We decided to share with the community our open, non-commercial solution that meets all of these criteria. Since we have published the whole pipeline, we do not focus much on cherry-picked examples, and we encourage you to visit our project's GitHub repo to test our TTS for yourself.

Total votes 5: ↑5 and ↓0 +5
Comments 8

Converting text into algebra

Reading time 10 min
Views 1.4K

Algebra and language (writing) are two different learning tools. When they are combined, we can expect new methods of machine understanding to emerge. To determine the meaning (to understand) is to calculate how the part relates to the whole. Modern search algorithms already perform the task of meaning recognition, and Google’s tensor processors perform matrix multiplications (convolutions) necessary in an algebraic approach. At the same time, semantic analysis mainly uses statistical methods. Using statistics in algebra, for instance, when looking for signs of numbers divisibility, would simply be strange. Algebraic apparatus is also useful for interpreting the calculations results when recognizing the meaning of a text.

Read more
Total votes 1: ↑1 and ↓0 +1
Comments 0