
Algebra of text. Examples

Translation
Original author: Sergey Pshenichnikov

The previous work [1] describes a method for transforming a sign sequence into an algebra, using a linguistic text as the example. Two further examples of algebraic structuring, for texts of a different nature, are given here to illustrate the method.

1. Morse-Vail-Gerke code as an algebra of matrix units

The symbol sequences (texts) of 26 Latin letters in the Morse code consist only of dots and dashes. This particular example was chosen because of its extremely concise dictionary (“dot” and “dash”).

Dots and dashes here play the role of words, and the texts made up of such words represent the 26 letters of the alphabet. Each word has two coordinates. The first coordinate is the position of the word (dot or dash) within the letter (from one to four). The second coordinate is the number of the word in the dictionary (1 or 2). The dictionary consists of E1,1 ("dot") and E2,2 ("dash"):

D_R=E_{1,1}+E_{2,2}
Table 1: Morse code: Latin letters as sign sequences (texts)

Each numbered letter (sign sequence) in Table 1 can be associated with a matrix polynomial P of 4×4 matrix units according to equation (8) from the previous work [1].

Table 2: Morse code: letters as matrix polynomials

For instance, the letter Q (№17) is associated with the matrix polynomial:

E_{12}+E_{22}+E_{31}+E_{42}= \begin{Vmatrix} 0 & 1 & 0 & 0\\ 0 & 1 & 0 & 0\\ 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 \end{Vmatrix}.
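As a minimal sketch of this mapping, the Python snippet below builds the 4×4 matrix of a letter from its dot-dash string. The Morse table used here is the standard International Morse code and is only assumed to match Table 1, which is not reproduced in the text; the helper names `letter_units` and `letter_matrix` are hypothetical.

```python
# Standard International Morse code for A..Z (assumed to match Table 1).
MORSE = dict(zip(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    (".- -... -.-. -.. . ..-. --. .... .. .--- -.- .-.. -- -. --- .--. "
     "--.- .-. ... - ..- ...- .-- -..- -.-- --..").split(),
))

# Dictionary numbers: "dot" is word 1, "dash" is word 2.
WORD_NUMBER = {".": 1, "-": 2}


def letter_units(code):
    """Return the matrix units of a letter as (position, word number) pairs."""
    return [(pos, WORD_NUMBER[sign]) for pos, sign in enumerate(code, start=1)]


def letter_matrix(code, size=4):
    """Sum the matrix units E_{i,j} of a letter into a size x size 0/1 matrix."""
    matrix = [[0] * size for _ in range(size)]
    for i, j in letter_units(code):
        matrix[i - 1][j - 1] = 1
    return matrix


q_units = letter_units(MORSE["Q"])    # Q is "--.-"
q_matrix = letter_matrix(MORSE["Q"])
for row in q_matrix:
    print(row)
```

For Q the pairs are (1,2), (2,2), (3,1), (4,2), which are the indices of the four matrix units in the polynomial for Q.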

The letter polynomials in Table 2 share a common feature: only three particular matrix units (E12, E21, E32) occur as their rightmost factors.

If all 26 polynomials from Table 2 are written as a column ||P||, and we use the following identity for the product of a matrix and a column:

 \begin{Vmatrix}         a_{11} & \ldots & a_{1n}\\         \ldots & \ldots & \ldots\\         a_{m1} & \ldots & a_{mn}     \end{Vmatrix}      \begin{Vmatrix}         b_{1} \\         \ldots \\         b_{n}      \end{Vmatrix}=     \begin{Vmatrix}         a_{11} \\         \ldots \\         a_{m1}      \end{Vmatrix}b_1+\ldots +     \begin{Vmatrix}         a_{1n} \\         \ldots \\         a_{mn}      \end{Vmatrix}b_n,

then the Morse code can be structured into three left ideals of the set of matrix polynomials from Table 2, with bases ||P||1, ||P||2, ||P||3:

\left\|P\right\|=\left\|P\right\|_1\left\|P\right\|_1=\left\|P\right\|_2\left\|P\right\|_2=\left\|P\right\|_3\left\|P\right\|_3,

where:

 \left\|P\right\|_1=\begin{Vmatrix}         E_{12} \\         E_{21} \\         E_{32}     \end{Vmatrix},     \left\|P\right\|_2=\begin{Vmatrix}         E_{12} \\         E_{21}E_{12} \\         E_{12}+E_{21}E_{12} \\         E_{12}E_{21} \\         E_{21} \\         E_{21}+E_{12}E_{21} \\         E_{32} E_{21} + E_{43}E_{32} E_{21} \\         E_{43}E_{32} E_{21} \\         E_{32} E_{21} \\         E_{32} \\         E_{32} + E_{43}E_{32} \\         E_{43}E_{32}     \end{Vmatrix}, \left\|P\right\|_3=\begin{Vmatrix}         E_{12}E_{21} \\         E_{12} \\         E_{21} \\         E_{21}E_{12} \\         E_{32}E_{21} \\         E_{32} \\         E_{43}E_{32} E_{21} \\         E_{43}E_{32}     \end{Vmatrix}, \ \ \ \ \ \ \ \ \ (1.1)

In the symmetric matrix ||P||2(||P||2)T, each diagonal element is the number of basic elements (simple and composite matrix units) belonging to a letter, and each off-diagonal element is the number of basic elements shared by the corresponding pair of sign sequences (letters). After normalization, this matrix determines the importance of each letter in the alphabet.

In the symmetric matrix (||P||2)T||P||2, each diagonal element is the number of letters belonging to a basic element, and each off-diagonal element is the number of letters shared by the corresponding pair of basic elements. After normalization, this matrix determines the importance of each basic element (heading) in the alphabet.
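The counting behind these two symmetric matrices can be sketched in Python. This is a simplified illustration, not the author's computation: it uses only the three generators E12, E21, E32 as basic elements instead of the twelve elements of ||P||2, and it assumes the standard International Morse code in place of Table 1. All names are hypothetical.

```python
LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# Standard International Morse code (assumed to match Table 1).
CODES = (".- -... -.-. -.. . ..-. --. .... .. .--- -.- .-.. -- -. --- .--. "
         "--.- .-. ... - ..- ...- .-- -..- -.-- --..").split()
MORSE = dict(zip(LETTERS, CODES))
WORD_NUMBER = {".": 1, "-": 2}

# Simplifying assumption: the basic elements are just E12, E21, E32.
HEADINGS = [(1, 2), (2, 1), (3, 2)]


def units(code):
    """Matrix units of a letter as a set of (position, word number) pairs."""
    return {(pos, WORD_NUMBER[sign]) for pos, sign in enumerate(code, start=1)}


def gram(rows):
    """Return rows * rows^T, i.e. pairwise dot products of the given rows."""
    return [[sum(x * y for x, y in zip(r1, r2)) for r2 in rows] for r1 in rows]


# incidence[l][h] is 1 when the polynomial of letter l contains heading h.
incidence = [[int(h in units(MORSE[l])) for h in HEADINGS] for l in LETTERS]

# 26 x 26: diagonal = headings per letter, off-diagonal = shared headings.
letter_gram = gram(incidence)

# 3 x 3: diagonal = letters per heading, off-diagonal = shared letters.
heading_gram = gram([list(column) for column in zip(*incidence)])

for row in heading_gram:
    print(row)
```

In this reduced setting the diagonal of `heading_gram` holds the number of letters under each heading, and each off-diagonal entry counts the letters two headings have in common.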

The Morse code is thus algebraically structured into three ideals (classes) with bases (1.1). The representation of the alphabet in terms of ideals describes all similar codes with bases (1.1). The representation is given in Tables 3 and 4.

Table 3: Forward indexing
Table 4: Reverse indexing

Due to the properties of matrix polynomials (only three matrix units E12, E21, E32 can be the rightmost factors), the Morse code alphabet:

ABCDEFGHIJKLMNOPQRSTUVWXYZ

is divided into three classes (three ideals) by the three generators E12, E21, E32:

E12 is the heading of the letters whose sign sequences (of up to four signs) start with a dash:

_BCD__G___K_MNO_Q__T___XYZ (13 letters)

E21 is the heading of the letters whose sign sequences have a dot in second position:

_BCD_F_HI_K__N____S_UV_XY_ (13 letters)

E32 is the heading of the letters whose sign sequences have a dash in third position:

__C__F___JK___OP____U_W_Y_ (9 letters)
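The three classes can be recomputed directly from the dot-dash strings. The sketch below assumes the standard International Morse code in place of Table 1 and reads each generator as a (position, sign) test, as described above; the function name `class_mask` is hypothetical.

```python
LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# Standard International Morse code (assumed to match Table 1).
CODES = (".- -... -.-. -.. . ..-. --. .... .. .--- -.- .-.. -- -. --- .--. "
         "--.- .-. ... - ..- ...- .-- -..- -.-- --..").split()
MORSE = dict(zip(LETTERS, CODES))

# Each generator E_{i,j} is read as: sign j (1 = dot, 2 = dash) at position i.
GENERATORS = {"E12": (1, "-"), "E21": (2, "."), "E32": (3, "-")}


def class_mask(position, sign):
    """Keep the letters that have `sign` at `position`; blank out the rest."""
    return "".join(
        letter
        if len(MORSE[letter]) >= position and MORSE[letter][position - 1] == sign
        else "_"
        for letter in LETTERS
    )


masks = {name: class_mask(*test) for name, test in GENERATORS.items()}
sizes = {name: sum(ch != "_" for ch in mask) for name, mask in masks.items()}

for name, mask in masks.items():
    print(name, mask, sizes[name])
```

A letter may appear under more than one heading (for example C, K and Y appear under all three), so the classes overlap rather than partition the alphabet.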

2. Algebra of mathematical text

In the example from [1], a linguistic text is transformed into a mathematical object (a matrix polynomial) on which algebraic operations can be performed to analyze and synthesize texts. The following example illustrates the reverse transformation: mathematical objects (formulas) are first treated as texts (sign sequences), which are then converted back into mathematical objects that differ from the original ones. This new form allows the properties of mathematical objects to be discovered more consistently for comparison and classification.

The formulas for the volume of a cone (Vcone), a cylinder (Vcylinder) and a torus (VT):

 V_{cone}=\frac{1}{3}\pi R_1^2H_1, V_{cylinder}=\pi R_2^2H_2, V_T=\pi^2\left(R_3+R_4\right)r  \ \ \ \ \ \ \ \ \ \ (2.1)

are first treated as texts. This means that the signs comprising the texts are not mathematical objects, and no algebraic operations can be performed on them. For example, R1² is the character sequence R1R1, and πR1 is not a product of two numbers but just a sequence of two characters. In (2.1), the signs R1 and H1 are the radius of the cone base and the height of the cone; R2 and H2 are the radius of the cylinder base and the height of the cylinder; R3 and R4 are the inner and outer radii of the torus, respectively; r is the radius of the generating circle of the torus; and π is the number pi.

Semiotic analysis of formulas as texts requires repeated signs, because repetitions determine the patterns. The formulas (2.1) actually contain more repetitions than the visible repetitions of the π sign. The signs R1, R2, R3, R4, H1, H2 and r are segment lengths. One of them (for instance, r) can be taken as simple (a standard of length), while the rest are composite: R1=ar, R2=br, R3=cr, R4=dr, H1=er, H2=fr. The right-hand sides of formulas (2.1) then become:

\begin{gathered}      \frac{1}{3}\pi ararer \\      \pi brbrfr \\      \pi \pi \left(c+d \right)rr \end{gathered} \ \ \ \ \ \ \ \ \ \ (2.2)

Index form:

\begin{gathered}         \left(\frac{1}{3}\right)_{1,1}(\pi)_{2,2}(a)_{3,3} (r)_{4,4} (a)_{5,3} (r)_{6,4} (e)_{7,7} (r)_{8,4} \\         (\pi)_{9,2} (b)_{10,10} (r)_{11,4} (b)_{12,10} (r)_{13,4} (f)_{14,14} (r)_{15,4} \\          (\pi)_{16,2} (\pi)_{17,2} \left(c+d \right)_{18,18} (r)_{19,4}(r)_{20,4}      \end{gathered} \ \ \ \ \ \ \ \ \ \ (2.3)
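The index pairs in (2.3) follow a simple rule: the first index is the position of the sign in the concatenated text, and the second is the position at which that sign first occurred. A minimal Python sketch of this rule is given below; the token spellings (`"1/3"`, `"pi"`, `"(c+d)"`) and the function name `index_form` are illustrative assumptions.

```python
def index_form(signs):
    """Pair each sign with (position, position of its first occurrence)."""
    first_seen = {}
    pairs = []
    for position, sign in enumerate(signs, start=1):
        # setdefault stores the position only the first time a sign is met.
        number = first_seen.setdefault(sign, position)
        pairs.append((sign, position, number))
    return pairs


# Right-hand sides of (2.2), one list of signs per formula.
formulas = [
    ["1/3", "pi", "a", "r", "a", "r", "e", "r"],   # cone
    ["pi", "b", "r", "b", "r", "f", "r"],          # cylinder
    ["pi", "pi", "(c+d)", "r", "r"],               # torus
]

signs = [sign for formula in formulas for sign in formula]
pairs = index_form(signs)

# Second indices that coincide with the first index mark new dictionary signs.
dictionary = sorted({number for _, _, number in pairs})

for sign, position, number in pairs:
    print(f"({sign})_{{{position},{number}}}")
print(dictionary)
```

The distinct second indices collected in `dictionary` are the positions of the first occurrences, that is, the numbers of the dictionary signs.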

Formulas (2.2) as a three-fragment polynomial of matrix units, P:

 P=F_1(P)+F_2(P)+F_3(P),     \ \ \ \ \ \ \ \ (2.4)

where:

\begin{gathered}          F_1(P) = D_L\left(E_{1,1}+E_{2,2}+E_{3,3}+E_{4,4}+E_{5,3}+E_{6,4}+E_{7,7}+E_{8,4}\right)D_R \\ F_2(P) = D_L\left(E_{9,2}+E_{10,10}+E_{11,4}+E_{12,10}+E_{13,4}+E_{14,14}+E_{15,4}\right) D_R \\ F_3(P) = D_L\left(E_{16,2}+E_{17,2}+E_{18,18}+E_{19,4}+E_{20,4}\right) D_R \\ D_R = E_{1,1}+E_{2,2}+E_{3,3}+E_{4,4}+E_{7,7}+E_{10,10}+E_{14,14}+E_{18,18} \\ D_L = E_{1,1}+E_{2,2}+E_{3,3}+E_{4,4}+E_{5,5}+E_{6,6}+E_{7,7}+ \ldots + E_{20,20} = E \\ D_L=D_R+E_{5,5}+E_{6,6}+E_{8,8}+E_{9,9}+E_{11,11}+E_{12,12}+E_{13,13}+E_{15,15}+E_{16,16}+E_{17,17}+E_{19,19}+E_{20,20}       \end{gathered}

Block-matrix form:

  P=D_LPD_R \ \ \ \ \ \ \ \ \ \ \ \ \ (2.5)

where:

The columns of P contain the signs from the three formulas (2.1). Two zeroes in a column indicate that the corresponding sign is present in only one formula. For example, the sign “1/3” (or E1,1), the two “a” signs (or E3,3+E5,3) and the one “e” sign (or E7,7) are present only in the first formula, for the cone (the first row of (2.5)). Only the cylinder (the second row of (2.5)) has the two “b” signs (or E11,11+E13,11) and the one “f” sign (or E15,15). Only the torus (the third row of (2.5)) contains the (c+d) sign (or E20,20). The signs common to the cone, cylinder and torus are found in the second and fourth columns of (2.5). Then:

     \begin{gathered}      P = P_{quotient_1}P_{divisor_1}+P_{remainder} \\      P = P_{quotient_2}P_{divisor_2}+P_{remainder}       \end{gathered} \ \ \ \ \ \ \ \ \ \ (2.7)

where:

 \begin{gathered} P_{quotient_1} = \left(E_{2,18}+E_{4,12}+E_{6,14}+E_{8,16}\right) +\left(E_{10,18}+E_{12,12}+E_{14,14}+E_{16,16}\right)+\\ +\left(E_{18,18}+E_{19,19}+E_{21,12}+E_{22,14}\right), \\ P_{quotient_2} = (E_{2,2}+E_{4,4}+E_{6,4}+E_{8,4})+(E_{10,2}+E_{12,4}+E_{14,4}+E_{16,4})+ \\ +(E_{18,2}+E_{19,2}+E_{21,4}+E_{22,4}), \\ P_{divisor_1} = E_{18,2} + E_{19,2}+E_{12,4} + E_{14,4} + E_{16,4}, \\ P_{divisor_2} = E_{2,2} + E_{4,4}, \\ P_{remainder} = E_{1,1}+E_{3,3} + E_{5,3}+E_{7,7}+E_{11,11} + E_{13,11}+E_{15,15}+E_{20,20}.\\     \end{gathered}
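The two products in (2.7) can be checked mechanically with the multiplication rule for matrix units, E_{i,j}E_{k,l} = E_{i,l} if j = k and 0 otherwise. The Python sketch below stores a polynomial as a mapping from index pairs to coefficients; this representation and the helper names are assumptions made for illustration. The row-14 term of Pquotient1 is entered as E_{14,14}, the unit whose product with E_{14,4} gives E_{14,4}.

```python
from collections import Counter


def poly(*units):
    """A polynomial of matrix units as {(i, j): coefficient}."""
    return Counter(units)


def multiply(left, right):
    """Product of two polynomials using E_ij * E_kl = E_il when j == k."""
    product = Counter()
    for (i, j), a in left.items():
        for (k, l), b in right.items():
            if j == k:
                product[(i, l)] += a * b
    return product


quotient_1 = poly((2, 18), (4, 12), (6, 14), (8, 16),
                  (10, 18), (12, 12), (14, 14), (16, 16),
                  (18, 18), (19, 19), (21, 12), (22, 14))
quotient_2 = poly((2, 2), (4, 4), (6, 4), (8, 4),
                  (10, 2), (12, 4), (14, 4), (16, 4),
                  (18, 2), (19, 2), (21, 4), (22, 4))
divisor_1 = poly((18, 2), (19, 2), (12, 4), (14, 4), (16, 4))
divisor_2 = poly((2, 2), (4, 4))
remainder = poly((1, 1), (3, 3), (5, 3), (7, 7),
                 (11, 11), (13, 11), (15, 15), (20, 20))

# Both decompositions: quotient * divisor + remainder.
p1 = multiply(quotient_1, divisor_1) + remainder
p2 = multiply(quotient_2, divisor_2) + remainder

print(sorted(p1) == sorted(p2))
print(sorted(unit for unit in p2 if unit[1] in (2, 4)))
```

Under these assumptions both decompositions produce the same set of matrix units: the units in columns 2 and 4 (the repeated π and r signs) come from the product, and the remaining units come from the remainder.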

In (2.7), the matrix text is decomposed over two different bases, Pdivisor1 and Pdivisor2. The Pdivisor1 basis relies on the positions of the repeating signs relative to the torus in formulas (2.1). The Pdivisor2 basis relies on the positions of the repeating signs relative to the signs of the DR dictionary in formulas (2.1). In the general case, relying on the positions of signs in formulas is essential if the signs are non-commutative (for example, matrices, vectors, tensors or hypercomplex numbers). It is useful even in scalar cases: for instance, the formula for the area of a circle is canonically written πr², not r²π.

Gröbner-Shirshov basis for (2.7):

     \begin{gathered} P_{divisor_1}+P_{remainder} \\ P_{divisor_2}+P_{remainder}     \end{gathered}

Then:

     \begin{gathered} P= P_{quotient_1} \left( P_{divisor_1}+P_{remainder} \right) \\ P= P_{quotient_2} \left( P_{divisor_2}+P_{remainder} \right)      \end{gathered}

Pquotient1 and Pquotient2 contain repetitions (matrix units linked by their second index) and are subject to further reduction. All the links are solvable. The additive form of Pquotient1 and Pquotient2 then becomes multiplicative (as in the language example).

The method of algebraic structuring of texts makes it possible to find suitable classifiers and dictionaries for texts of different nature, that is, to classify texts without assigning classification features and class names a priori. This kind of classification is called categorization, or a posteriori classification. For instance, the classification features for (2.4) are:

  • Pdivisor1 and Pdivisor2 (common π and r in different places in the formulas)

  • the total number of terms in the parentheses of Pquotient1 and Pquotient2 (four)

  • the numbers of π and r signs in the parentheses of Pquotient1 and Pquotient2 (1, 1, 2 for π and 3, 3, 2 for r)

  • factors of the multiplicative form of Pquotient1 and Pquotient2

  • various fragments of the Premainder (deductions as a class of formulas with a remainder-fragment).

Names of classes coincide with the names of the classification features and their combinations.

References

[1] Pshenichnikov S. B. Algebra of text. ResearchGate preprint, 2021.
