Original author: S.B. Pshenichnikov


The article outlines a new mathematical apparatus for verbal calculations in NLP (natural language processing). Words are embedded not in a real vector space, but in an algebra of extremely sparse matrix units. Calculations become evidence-based and transparent. The example shows forks in calculations that go unnoticed when using traditional approaches, and the result may be unexpected.


The use of IT in Natural Language Processing (NLP) requires standardization of texts, for example, tokenization or lemmatization.

After this, you can try to use mathematics, since it is the highest form of standardization and turns the objects under study into ideal ones, for example, data tables into matrices of elements. Only in the language of matrices can one search for general patterns in data (numbers and texts).

If text is turned into numbers, then in NLP these are first natural numbers used to number the words, which are then embedded in a real vector space.

Perhaps we should not rush to do this but come up with a new type of numbers that is more suitable for NLP than numbers for studying physical phenomena. These are matrix hyperbinary numbers. Hyperbinary numbers are one of the types of hypercomplex numbers.

Hyperbinary numbers have their own arithmetic, and if you get used to it, it will seem more familiar and simpler than Pythagorean arithmetic.

In Decision Support Systems (DSS), the texts are value judgments and a numbered verbal rating scale. Next (as in NLP), the numbers are turned into vectors of real numbers and used as sets of weighted arithmetic average coefficients.

The numbers are mixed, and it is impossible to recover the terms and factors of the calculation cycles from the final value of the result. Embedding into real vectors is irreversible. The results cannot be explained, the methods are not evidence-based.

Lack of evidence leads to the impossibility of researching a solution. If the solution were obtained in analytical form, then it would be possible to observe all the forks of the calculations, and the text of the solution could be found from the compiled (for example, verbal) equations. Using a popular example in DSS, we will further show how to find the place in the calculations that is responsible for the lack of a unique solution.

It is possible to represent words and topics of texts not as real vectors, but as matrix units (extremely sparse matrices), and calculations with them can and should be done in symbolic form (Computer Algebra System, CAS).

This is possible because for matrix units, due to their extreme sparsity, there are relations that allow algebraic operations with them without using the explicit form of matrices. Matrix units as representations of words are not mixed during symbolic verbal calculations (VC), all intermediate results can be decoded back into words of natural language, and the result can be explained and proven to the user and the decision maker.

1. What's on offer

It is advisable to replace the words of the text with extremely sparse square binary matrices. These are matrix units. All their elements except one are zero. The unit is at the intersection of row i and column j. Index i denotes the number of the designated word in the text, j – the number of the word in the dictionary.

A dictionary is the original text with repetitions of words removed. Then the text is the sum of matrix units. Words and fragments of such matrix text (like matrices) can be added, multiplied, and divided with a remainder, like natural numbers.
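A minimal sketch of this encoding, assuming whitespace tokenization, 1-based numbering, and a dictionary ordered by first appearance (none of which the text prescribes). The function names are illustrative only. Each word occurrence becomes an index pair (i, j), and the text is kept as the list of such pairs rather than as explicit matrices.

```python
def build_matrix_text(words):
    """Return (dictionary, pairs).

    dictionary: distinct words in order of first appearance (an assumption).
    pairs: one (i, j) pair per occurrence, i = position in the text,
           j = number of the word in the dictionary (both 1-based).
    """
    dictionary = []
    number_in_dictionary = {}
    pairs = []
    for i, word in enumerate(words, start=1):
        if word not in number_in_dictionary:
            dictionary.append(word)
            number_in_dictionary[word] = len(dictionary)
        pairs.append((i, number_in_dictionary[word]))
    return dictionary, pairs


def decode(pairs, dictionary):
    """Turn index pairs back into words, ordered by position in the text."""
    return [dictionary[j - 1] for _, j in sorted(pairs)]


words = "the cat saw the dog".split()
dictionary, pairs = build_matrix_text(words)
print(dictionary)  # ['the', 'cat', 'saw', 'dog']
print(pairs)       # [(1, 1), (2, 2), (3, 3), (4, 1), (5, 4)]
print(decode(pairs, dictionary) == words)  # True
```

The repeated word "the" keeps its dictionary number 1 but gets a new position 4, so the encoding can be decoded back into the original word order.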

Addition can take the result outside the set of hyperbinary numbers: the elements of the sum can be natural numbers greater than one, and for texts this means that several words occupy one place (word number). But the addition of hyperbinary numbers is redefined in such a way [1, p. 114] that the problem disappears. This technique is common in mathematics: division of integers is defined as division with a remainder, so that the result of dividing integers is always an integer.

Subtraction likewise takes matrix units outside the set of hyperbinary numbers, just as subtraction takes natural numbers into the integers. However, subtraction can be defined in such a way that this problem also disappears, as with the addition of hyperbinary numbers.

Matrix generalizations of the binary numbers 0 and 1 are called hyperbinary numbers.

Matrix units have a unique property that is a consequence of their extreme sparsity. To perform arithmetic operations on matrix units, it is not necessary to write them out explicitly as matrices; it is enough to know the indices i, j. There is a defining relation (general formula) for products of matrix units, and the result of a product depends only on the indices of the factors. Therefore, sparsity here is not a burden for computing, writing, and storing hyperbinary numbers.
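The text does not write the defining relation out. For ordinary matrix units it is the standard rule E(i,j)·E(k,l) = E(i,l) when j = k, and zero otherwise. The sketch below assumes that this is the relation meant and checks the index-only product against explicit matrices. All function names are illustrative.

```python
def multiply_units(a, b):
    """Index-only product of matrix units: E(i,j)E(k,l) = E(i,l) if j == k.

    Returns the index pair of the product, or None for the zero matrix.
    """
    (i, j), (k, l) = a, b
    return (i, l) if j == k else None


def unit(i, j, n):
    """Explicit n x n matrix unit with a single 1 at row i, column j (1-based)."""
    return [[1 if (r, c) == (i, j) else 0 for c in range(1, n + 1)]
            for r in range(1, n + 1)]


def matmul(a, b):
    """Plain matrix product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[r][k] * b[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]


n = 3
zero = [[0] * n for _ in range(n)]
for a, b in [((1, 2), (2, 3)), ((1, 2), (3, 1))]:
    symbolic = multiply_units(a, b)
    explicit = matmul(unit(*a, n), unit(*b, n))
    expected = unit(*symbolic, n) if symbolic else zero
    print(a, b, symbolic, explicit == expected)
# (1, 2) (2, 3) (1, 3) True
# (1, 2) (3, 1) None True
```

The index-only version does constant work per product, while the explicit version multiplies two n x n matrices that are almost entirely zero.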

You can perform algebraic operations with hyperbinary numbers. But if you do not explicitly use a matrix representation but carry out symbolic calculations using CAS (computer algebra system) methods, then all intermediate calculation results are transparent and verifiable.

Matrix generalizations of complex numbers are called hypercomplex numbers. Their founding father is W. Hamilton, president of the Royal Irish Academy and corresponding member of the St. Petersburg Academy. At that time, corresponding members differed from academicians only in the form of participation - they worked remotely, by correspondence.

In 1843 Hamilton came up with quaternions - the first hypercomplex numbers. Henri Poincaré compared this discovery in arithmetic with Lobachevsky's revolution in geometry.

Here we use one of the many types of hypercomplex numbers - hyperbinary numbers. They can be depicted on a plane if the pair of indices of a matrix unit (natural numbers) is taken as its coordinates. This method of graphical representation goes back to Isaac Newton, who represented on the plane the degrees of monomial terms in polynomials of two variables (Newton's polygons). For matrix polynomial texts, one index is the number of the word in the text, and the other is its number in the dictionary.

Dictionaries of matrix texts are sums of matrix units (monomials) with the same indices. The units are on the diagonal. Each such matrix unit is a projector in its algebraic properties. The sum of all projectors (words of the text dictionary) is the identity matrix.
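The projector properties can be written out directly, assuming the standard product rule for matrix units, E_{ij}E_{kl} = \delta_{jk}E_{il} (the text refers to a defining relation without stating it, so this form is an assumption):

```latex
E_{ii}E_{ii} = E_{ii}, \qquad
E_{ii}E_{jj} = 0 \quad (i \neq j), \qquad
\sum_{i=1}^{n} E_{ii} = I_n .
```

The first identity says that each diagonal unit is idempotent (a projector), the second that different dictionary words are mutually orthogonal, and the third that together they make up the identity matrix.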

In what follows, word, text and dictionary are understood as sets of matrix units.

Multiplication of hyperbinary numbers (words and text fragments) occurs on the left and on the right, since they are full-fledged matrices. The results of multiplication are different. Multiplication is non-commutative, unlike addition.

The text has two dictionaries. Left and right. The left dictionary is the sum of matrix units with the same indices i. The right one – with the same indices j.

The left dictionary is the sum of all diagonal matrix units - their unit is on the main diagonal. Their indices i,i are the numbers of all words of the text, including repetitions. The right dictionary is the sum of all diagonal matrix units with indices j, j - these are the numbers of all words in the dictionary.

The left and right dictionaries are identity matrices of the same dimension. When multiplying from the left and right by these dictionaries, the text does not change.

But if the text is multiplied by fragments of dictionaries (this is the sum of the projectors) on the left and right, then the text is converted. The right dictionary fragment removes words from the text that are not in this dictionary fragment. Left fragment - reduces the text in volume, creating text-forming fragments.
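A minimal sketch of these two effects, assuming the standard product rule E(i,j)·E(k,l) = E(i,l) for j = k and a text stored as a set of (i, j) index pairs. The sample text and all names are hypothetical. Storing a sum as a set is enough here because multiplying a text by diagonal units never produces the same pair twice.

```python
def multiply(left, right):
    """Product of two sums of matrix units given as sets of (row, column) pairs.

    Only pairs whose inner indices match survive: E(i,j)E(k,l) = E(i,l) if j == k.
    """
    return {(i, l) for (i, j) in left for (k, l) in right if j == k}


# "the cat saw the dog" with dictionary the=1, cat=2, saw=3, dog=4
text = {(1, 1), (2, 2), (3, 3), (4, 1), (5, 4)}

left_dictionary = {(i, i) for i, _ in text}    # all word positions
right_dictionary = {(j, j) for _, j in text}   # all dictionary numbers

# Full dictionaries act as identities: the text does not change.
print(multiply(left_dictionary, text) == text)    # True
print(multiply(text, right_dictionary) == text)   # True

# A right fragment keeps only the words it contains ('the' and 'dog').
right_fragment = {(1, 1), (4, 4)}
print(sorted(multiply(text, right_fragment)))     # [(1, 1), (4, 1), (5, 4)]

# A left fragment keeps only the chosen positions (2 and 3: 'cat saw').
left_fragment = {(2, 2), (3, 3)}
print(sorted(multiply(left_fragment, text)))      # [(2, 2), (3, 3)]
```

The right fragment filters by word identity and the left fragment by position, which matches the roles the text assigns to the two dictionaries.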

Fragments of the left and right dictionary are responsible for calculating stable n-grams of the text and determining the key concept of VC - verbal average.

If you simultaneously multiply the text on the left and right by the corresponding fragments of the left and right dictionaries, then repetitions of n-grams will remain from the text.

The fragment of the left dictionary forms the order of words in the n-gram and their location in the text. The fragment of the right dictionary is responsible for the composition of words in the n-gram. By composing an n-gram, a query is formulated to search for it in the text. The search algorithm consists of matching an n-gram of two (left and right) dictionaries and multiplying them by a set of texts. Explicit representation of words by matrices is not required. Only the defining relation of the product of matrix units is sufficient.

When dictionaries are multiplied, only products of matrix units with the same indices are non-zero. A non-zero product means that the dictionaries have a common fragment (common subdictionary) or, what is the same, that texts with a common subdictionary share the same words.

If each of two texts is multiplied on the right by the product of their right dictionaries and the results are added, the sum is the text of the verbal average of the two texts. Indeed, when the right dictionaries are multiplied, only their common subdictionary remains (a projector in its algebraic properties). When each text is multiplied on the right by this subdictionary, only the common words remain. Their sum belongs to each source text and is their average (common) text.

If texts do not have common words, then their verbal average is zero.
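The construction can be sketched as follows, assuming that both texts are numbered against one shared dictionary (an assumption made here so that equal j means the same word) and using the standard product rule for matrix units. The sample texts and names are hypothetical. The redefined addition from [1] is not reproduced, so the two filtered texts are returned side by side instead of being summed.

```python
def multiply(left, right):
    """Product of sums of matrix units stored as sets of (row, column) pairs."""
    return {(i, l) for (i, j) in left for (k, l) in right if j == k}


def right_dictionary(text):
    """Sum of diagonal units E(j,j) over the dictionary numbers used in text."""
    return {(j, j) for _, j in text}


def verbal_average(text_a, text_b):
    """Return the parts of each text kept by the common subdictionary."""
    common = multiply(right_dictionary(text_a), right_dictionary(text_b))
    return multiply(text_a, common), multiply(text_b, common)


# Shared dictionary (assumed): red=1, car=2, fast=3, blue=4, road=5
text_a = {(1, 1), (2, 2), (3, 3)}             # "red car fast"
text_b = {(1, 4), (2, 2), (3, 3), (4, 5)}     # "blue car fast road"

part_a, part_b = verbal_average(text_a, text_b)
print(sorted(part_a))  # [(2, 2), (3, 3)]
print(sorted(part_b))  # [(2, 2), (3, 3)]

# Texts without common words give the zero (empty) verbal average.
print(verbal_average({(1, 1)}, {(1, 4)}))  # (set(), set())
```

Only "car" and "fast" survive, and every surviving pair already belongs to its source text, which is the sense in which the average lies inside each text.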

The concept of verbal average is applicable to any set of texts and their fragments.

As mentioned above, Isaac Newton depicted polynomials in two variables on a plane in the seventeenth century. The degrees of the variables x, y in each monomial were represented by a point on the x, y plane. The polynomials themselves turned out to be broken lines on this plane with coordinates that are natural numbers.

It turned out that these broken lines can be turned into convex polygons and with their help we can find approximate solutions to systems of polynomial equations, even without taking into account the coefficients of the monomials.

There is a developed theory for Newton's polygons (N.G. Chebotarev, 1943). With convex polygons, you can visually perform all algebraic operations. It would be tempting to geometrically add, multiply and divide texts, solving problems of their classification and categorization.

Newton's polygons are ideal for matrix texts. If the indices of the matrix word are represented by the coordinates of a point (natural numbers) on the plane, then the matrix texts, like Newton's polynomials, will be broken lines on this plane i, j.
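As an illustration, the index pairs of a matrix text can be treated as points and wrapped in a convex polygon. The source does not say how the polygon is built; the sketch below substitutes Andrew's monotone chain, a standard convex hull algorithm, and the sample text is hypothetical.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Convex hull by Andrew's monotone chain, counter-clockwise, no collinear points."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


# "the cat saw the dog": (position in text, number in dictionary)
text = [(1, 1), (2, 2), (3, 3), (4, 1), (5, 4)]

broken_line = sorted(text)        # the text as a broken line, ordered by position
polygon = convex_hull(text)       # its convex polygon on the i, j plane
print(polygon)  # [(1, 1), (4, 1), (5, 4), (3, 3)]
```

The point (2, 2) lies on the segment from (1, 1) to (3, 3), so it is part of the broken line but not a vertex of the polygon.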

Verbal average lives up to its name. On a plane with natural coordinates, it is located like a broken line between the texts, relative to which it is calculated using the above method. This follows from the fact that the verbal average is obtained by multiplying the text by the projector. Then the coordinates (indices) of the word are located inside the coordinate area where the words of the entire text are located.

Algebraically, the verbal average is a common divisor of texts or, in NLP (ChatGPT) terminology, a topic. On the Newton plane, matrix words, texts, and their general themes (catalogues) are clearly depicted geometrically.

Representing texts only as sums of matrix words is not enough. You need to pair the word with its context.

According to the distributional hypothesis, linguistic units found in similar contexts have similar meanings. Consequently, the image (representation) of a linguistic unit (words and their combinations) is a pair (“word”, “context”) or in the usual form – (context)word.

Text is an ordered combination of words. If a word is understood as a “(context)word” pair, then a text is ordered pairs of words and their contexts. With Newton’s geometric representation, this means that the plane of texts from words corresponds to the dual (conjugate) plane of the contexts of these words.

A generalization of the distributional hypothesis is the hypothesis about the ideal text for “(context)word” pairs:

The concatenation of contexts of words of an ideal text is this ideal text, and the contexts are such that their concatenation constitutes the ideal text itself.

The ideal text hypothesis may be a sketch of a technical requirement for probative machine text generation.

In algebra of text, a word is supplemented by its phantom factor, which is the context of the word. When words are added in a matrix text, their phantom factors are also added.

When adding, the resulting phantom multiplier can be either the verbal average (the intersection of contexts) or the complement of the context. When the verbal average (intersection) is calculated for two texts, there is also a remainder-complement (as when dividing integers).

Remainders from the division of texts have the meaning of deviations of a set of texts from their verbal average and are similar to residues in congruences of integers.
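The analogy with integer division can be sketched in a few lines, assuming that both texts share one dictionary numbering and that the remainder is simply what the common subdictionary does not keep. This is one reading of the passage, not the definition from [1], and the names and sample texts are hypothetical.

```python
def divide_with_remainder(text, other):
    """Split text into (average, remainder) relative to another text.

    average: pairs of text whose word also occurs in other (common subdictionary).
    remainder: the rest of text, its deviation from the verbal average.
    """
    common_words = {j for _, j in text} & {j for _, j in other}
    average = {(i, j) for (i, j) in text if j in common_words}
    return average, text - average


# Shared dictionary (assumed): red=1, car=2, fast=3, blue=4, road=5
text_a = {(1, 1), (2, 2), (3, 3)}             # "red car fast"
text_b = {(1, 4), (2, 2), (3, 3), (4, 5)}     # "blue car fast road"

average_a, remainder_a = divide_with_remainder(text_a, text_b)
average_b, remainder_b = divide_with_remainder(text_b, text_a)
print(sorted(remainder_a))  # [(1, 1)]           -> "red"
print(sorted(remainder_b))  # [(1, 4), (4, 5)]   -> "blue", "road"

# As with integers, the parts reassemble the original without loss.
print(average_a | remainder_a == text_a)  # True
```

Nothing is mixed or lost: each text is exactly its average part plus its remainder, and both parts can still be decoded into words.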


When generating text, the result can be either an intersection of contexts or a complement - this depends on the given summary (title, abstract; a set of keywords and their contexts, ordered by importance).

Left dictionaries form a standard (habitual) presentation of the text in a form that makes it easier to understand the content of the text.

Subdictionaries of the right dictionary consisting of function words allow you to choose ways to combine, intersect and complement contexts (phantom multipliers). Examples are copulative, adversative, and disjunctive conjunctions.

The resulting context of the next word must be consistent with the given text summary. The tool for checking this agreement is division with a remainder applied to the summary and the context of the next word.


The Newton plane dual to words in algebra of text is the plane of phantom factors to these words.

The dataset for algebraic text generation is (context)word pairs. Perhaps ChatGPT can be useful for creating pairs as a preliminary markup of the language corpus as an explanatory dictionary.

The concept of word importance as n-gram frequency requires clarification. It may need to be supplemented by consistency of word contexts in n-grams.

Contexts of words, in turn, consist of words that, according to the distributional hypothesis, depend on their contexts - phantom factors of the second level and so on to any depth of fine-tuning of the ideal text of the corresponding level. In algebra of text, refined contexts of the corresponding levels are an analogue of deep learning in NLP.

The two directions DSS (Decision Support System) and NLP (Natural Language Processing) are similar in the computational methods used and have common problems.

In both cases, the basis is a dictionary: of assessments, goals, criteria and alternatives in DSS, and of texts in NLP. Words are “embedded” into natural numbers (numbering), and then into real numbers. Multiple sets of weighting coefficients are computed for the layer-by-layer calculation of weighted average estimates. Backward error calculations deal with inconsistencies of pairwise comparison matrices in DSS or with training errors of a neural network in NLP.

There is also a common problem in DSS and NLP - the inability to explain how the result was obtained, to justify it, and to check it for optimality or globality when the solution is not unique, or is senseless or paradoxical, as often happens in practice for the customer (decision maker) or the owner of the chatbot. The result cannot be proven (explained).

Successful examples in terms of efficiency are the Analytic Hierarchy Process (AHP) by T. Saaty in DSS and ChatGPT in NLP.

The DSS input is text datasets of value judgments in the form of matrices of paired comparisons of alternatives (solution options) and evaluation criteria (properties). The output of the DSS is decision texts ordered by importance (preference) in accordance with the specified evaluation criteria.

Weighting coefficients are calculated for sets of weighted average assessments of alternatives based on sets of criteria of different importance. In the DSS computational block, there is a layer-by-layer calculation of sets of weighting coefficients for average ratings. They are determined by comparing alternatives with alternatives in the sense of an individual criterion and criteria with each other. At the final stage, a synthesis (compilation) of average assessments of alternatives and criteria is carried out in accordance with the purpose of the problem.

In AHP, however, the verbal rating scale is designated by natural numbers used as ordinals (ordinal numbers), and then the ordinals suddenly turn into cardinal numbers on which arithmetic operations are performed. The resulting problems of inconsistent datasets (pairwise comparison matrices) are resolved using a variety of heuristic techniques.
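To make the criticism concrete, here is a small sketch of the kind of real-valued synthesis AHP performs. It uses the normalized geometric mean of the rows, a common approximation to Saaty's principal eigenvector, rather than the eigenvector itself. The judgments, criteria names, and alternatives are invented for illustration and are not taken from [2].

```python
import math


def priority_weights(matrix):
    """Weights from a pairwise comparison matrix via normalized row geometric means."""
    means = [math.prod(row) ** (1.0 / len(row)) for row in matrix]
    total = sum(means)
    return [m / total for m in means]


# Hypothetical ordinal judgments, treated as real numbers from here on.
criteria = [[1, 3], [1 / 3, 1]]        # price is "moderately more important" than comfort
by_price = [[1, 5], [1 / 5, 1]]        # alternative A vs B in the sense of price
by_comfort = [[1, 1 / 3], [3, 1]]      # alternative A vs B in the sense of comfort

criteria_w = priority_weights(criteria)
alternative_w = [priority_weights(by_price), priority_weights(by_comfort)]

# Synthesis: weighted arithmetic average of the alternatives over the criteria.
scores = [sum(criteria_w[c] * alternative_w[c][a] for c in range(2))
          for a in range(2)]
print([round(s, 4) for s in scores])  # [0.6875, 0.3125]
```

The final numbers rank A above B, but the verbal judgments that produced them can no longer be read off the result, which is the irreversibility the article objects to.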

The ChatGPT language model (LM) (layered sets of weights for averages) cannot be retrained without calculating the LM again from scratch.

The proposed approach uses only natural numbers. And only as ordinal numbers (ordinals). Arithmetic operations are not performed with them. Cardinal numbers here are matrix generalizations of the binary numbers 0 and 1, which represent words. Such matrix words can be added to texts and divided with a remainder by each other (like integers). In this case, the words remain hyperbinary (matrix binary) numbers.

An example of verbal calculation of a solution to a multicriteria problem is given. The input is a dataset from a set of texts of value judgments, the output is the text of the decision. At the same time, two solutions are outlined - quick and evidence-based. The evidential one is distinguished by additional index marking of the dataset. In this case, each cycle of calculations can be decoded into words, the result can be explained and proven.

Another innovation to the contextual dataset could be the ordering of words in contexts by importance. (Context)word pairs can be ordered by importance using the evidential DSS method outlined below.

This ordering is based on matrices of pairwise comparisons of words in the sense of their contexts and contexts in the sense of words. In these two types of comparisons, matrices of paired comparisons have names of meaning - context or word.

Filling out matrices of paired comparisons, as a semantic marking of importance, can be automatic or expert. Automatic marking of importance is possible if the expert formulates an evaluation rule.

Examples of importance marking tasks are the following:

Fragments of text related by importance are specified. It is required to calculate a composite text that takes into account the mutual order of fragments.

A text is specified, for fragments of which their importance is known. It is required to calculate from the source text a short text that has the same meaning of importance as the source text.

The adjective “evidence-based” used in the title of the article is inspired by the term “evidence-based medicine.”

The main principle of evidence-based medicine is the transparency of the rationale for clinical decisions (at least ideally).

It is understood that the main distinguishing feature of the proposed research tool is the observability and interpretability of all computational cycles. All intermediate results of algebraic calculations can be decoded into words for the necessary interpretation of the impact on the result. This is the meaning of verbal calculations.

Evidentiary means the generation of a detailed report on the verbal calculations performed with the interpretation of the calculation cycles and an indication of the mutual influences on the result.

An example of verbal calculation of a solution to a multicriteria problem is given. The input is a dataset from a set of texts of value judgments, the output is the text of the decision.

The calculations reveal the place where the solution branches. Since all computation cycles are transparent, it can be seen that two estimates in the pairwise comparison tables of the input dataset are responsible for the fork. Surprisingly, in the course of the calculations they appear secondary and insignificant. But their influence turns out to be overwhelming, and an outsider solution may end up on the pedestal.

If we neglect these secondary estimates, then the result of verbal calculations coincides with the result obtained in [2, p. 105] using Saaty’s method:

solaris ≻ rio ≻ logan ≻ vesta ≻ cruze

If the expert or decision maker insists on his initial assessments, then according to verbal calculations, vesta suddenly becomes the winner.

This is a qualitatively new result. It could not have been obtained before because Saaty's embedding of the original expert assessments in real numbers smoothed the weighted arithmetic averages unacceptably, so that imperceptible turns and discoveries were missed.

An important task of NLP can be retelling an ideal text, for example, an author’s text, “in your own words” - this is the main technique for understanding the text. The author's text is compiled as universal, in the opinion of the writer, understandable and interesting to as many readers as possible. This is usually an impossible task. This phenomenon is perfectly conveyed in Jack London's novel Martin Eden.

The author’s text is either deliberately redundant so that every reader would have something understandable and interesting (it’s different for everyone), or severely brief (the author has neglected universal understanding), or a mixed version (as here).

An assistant reteller (autoReteller) is needed, whose purpose is to transform the author's text into the reader's personal text (in his personal contextual language). This is not a brief universal summary of the author’s text, but a specialized “reprint” in a single copy.

In algebra of text, this means transforming one ideal text (in the above sense) into another ideal text in the Newton plane and the conjugate contextual (phantom) one. Each text (the author’s and the reader’s) must have the property of being an ideal text, yet they must be different texts. At the same time, a necessary condition for the success of retelling is the presence of a reader’s dataset as a personal contextual dictionary.

The next publication will present an example of such a reader dataset for the said novel by J. London.

The article is based on the research presented in [1], [2] and is described in detail here [3].


1. Sergey Pshenichnikov, Algebra of text, Ridero, Ekaterinburg, 2022, 236 pp. ISBN: 978-5-0056-9708-0

2. Sergey Pshenichnikov, Algebra of text for judgments. Self-tutorial understanding, Ridero, Ekaterinburg, 2024, 159 pp. ISBN: 978-5-0062-3246-4

3. S.B. Pshenichnikov, Verbal calculation (VC) in evidence-based DSS and NLP, 48 pp. https://www.researchgate.net/publication/380268011_VERBAL_CALCULATION_VC_IN_EVIDENCE-BASED_DSS_AND_NLP