
Mathematics of Machine Learning based on Lattice Theory


This is the third article in a series (see also the first one and the second one) describing a Machine Learning system based on Lattice Theory, named the 'VKF-system'. It uses a structural (lattice-theoretic) approach to representing training objects and their fragments, the latter being considered as causes of the target property. The system computes these fragments as similarities between certain subsets of training objects. There is an algebraic theory for such representations, called Formal Concept Analysis (FCA). However, the system uses randomized algorithms to remove the drawbacks of the unrestricted approach. The details follow…
(Figure: Areas of Formal Concept Analysis)


Introduction


We begin by demonstrating our approach on a school problem:
to find sufficient conditions for a convex quadrangle possessing symmetries to be cyclic (i.e. to admit a circumscribed circle), and to predict this property for the rectangle.


Hence there are two target classes: positive (there exists a circumscribed circle around the quadrangle) and negative.


The training sample contains a square, an isosceles trapezoid, a diamond (rhombus), and a deltoid (kite); see the row labels in the table below.


The single test example is the rectangle.


We represent each quadrangle by a subset of attributes related to its possible symmetries:


"There exists a central symmetry point" (A),
"The group of rotations is trivial" (B),
"The group of rotations contains at least two elements" (C),
"There is a diagonal symmetry axis" (D),
"There is a non-diagonal symmetry axis" (E).
They correspond to the column labels in the table below.


quadrangle   target   A   B   C   D   E
square         1      1   0   1   1   1
trapezoid      1      0   1   0   0   1
diamond        0      1   0   1   1   0
deltoid        0      0   1   0   1   0
rectangle      ?      1   0   1   0   1

To discover possible causes (in terms of symmetries) the system computes similarities (sets of common attributes) between training examples of the same sign. Hence we have


$\langle\{square,trapezoid\}, \{E\}\rangle,$


where the first component collects the parents (all training objects whose similarity is computed), and the second one is the common fragment of these examples.


Since the common fragment $\{E\}$ is a part of the rectangle's description $\{A,C,E\}$, the system predicts the target property positively, i.e. the rectangle is cyclic. This corresponds to the Analogy cognitive procedure of the JSM-method. The analogues of the rectangle are its parents (the square and the trapezoid), which share this fragment as their common part.


However, we can exchange the signs: the similarity between the negative examples is


$\langle\{diamond,deltoid\}, \{D\}\rangle.$


This observation leads to Argumentation Logic, but we prefer to omit the details here. The interested reader is referred to the author's papers in Finn V.K., Anshakov O.M., Vinogradov D.V. (Eds.), Many-valued Logics and Their Applications. Volume 2: Logics in Artificial Intelligence Systems, Moscow: URSS, 2020, 238 pp., ISBN 978-5-382-01977-2 (in Russian; original title: Финн В.К., Аншаков О.М., Виноградов Д.В. (Ред.). Многозначные логики и их применения. Том 2: Логики в системах искусственного интеллекта).


English translations may be requested from the author (Allerton Press is now a part of Springer, but the original translations are not available through any site).


However, the similarity between the negative examples illustrates the 'counter-example forbidden' condition: its fragment $\{D\}$ is a part of the description $\{A,C,D,E\}$ of the opposite-sign example 'square', so this similarity is rejected as a hypothesis.
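
For readers who prefer code, here is a minimal Python sketch of this toy computation (the dictionaries and variable names below are mine, not part of the VKF-system):

# Toy sample from the table above: each quadrangle is a set of attributes A..E.
positive = {"square": {"A", "C", "D", "E"}, "trapezoid": {"B", "E"}}
negative = {"diamond": {"A", "C", "D"}, "deltoid": {"B", "D"}}
rectangle = {"A", "C", "E"}

# Similarity (common fragment) of the positive examples.
pos_fragment = set.intersection(*positive.values())
print(pos_fragment)                                        # {'E'}

# Analogy step: the fragment occurs in the rectangle's description.
print(pos_fragment <= rectangle)                           # True

# The negative similarity {D} is blocked by the positive example 'square'.
neg_fragment = set.intersection(*negative.values())
print(any(neg_fragment <= d for d in positive.values()))   # True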


1. Formal Concept Analysis


Initially the author planned to present the theory in terms of the so-called JSM-method of automatic hypothesis generation. But its creator expressed doubts about the possibility of expressing the 'rich ideas of the JSM-method' in a popular fashion. Hence the author decided to use the FCA language for this article. However, he will use some terms of his own, paired with the original ones (in brackets), where he prefers to change the terminology.


A sample (=formal context) is a triple $(G,M,I)$ where $G$ and $M$ are finite sets and $I \subseteq G \times M$. The elements of $G$ and $M$ are called objects and attributes, respectively. As usual, we write $gIm$ instead of $\langle g,m\rangle \in I$ to denote that object $g$ has attribute $m$.


For $A\subseteq G$ and $B\subseteq M$, define


$A' = \{ m \in M | \forall g \in A (gIm) \}, \\ B' = \{ g \in G | \forall m \in B (gIm) \};$


so $A'$ is the set of attributes common to all the objects in $A$ and $B'$ is the set of objects possessing all the attributes in $B$. The maps $(\cdot)': 2^G \rightarrow 2^M$ and $(\cdot)':2^M\rightarrow 2^G$ are called polars (=derivation operators) of sample $(G,M,I)$.
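
As an illustration, here is a small Python sketch of the two polars applied to the toy sample from the introduction (the encoding of the sample as a set of pairs and the function names are my own choices, not prescribed by FCA):

def polar_objects(A, ctx):
    """A' : the attributes common to all objects in A."""
    G, M, I = ctx
    return {m for m in M if all((g, m) in I for g in A)}

def polar_attributes(B, ctx):
    """B' : the objects possessing all attributes in B."""
    G, M, I = ctx
    return {g for g in G if all((g, m) in I for m in B)}

# The toy sample from the introduction as a triple (G, M, I).
G = {"square", "trapezoid", "diamond", "deltoid"}
M = {"A", "B", "C", "D", "E"}
I = ({("square", a) for a in "ACDE"} | {("trapezoid", a) for a in "BE"}
     | {("diamond", a) for a in "ACD"} | {("deltoid", a) for a in "BD"})
ctx = (G, M, I)

print(polar_objects({"square", "trapezoid"}, ctx))   # {'E'}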


A candidate (=formal concept) of sample $(G,M,I)$ is defined to be a pair $\langle A,B \rangle$, where $A\subseteq G$, $B\subseteq M$, $A'=B$, and $B'=A$. The first component $A$ of candidate $\langle A,B \rangle$ is called the parents list (=extent) of the candidate, and the second component $B$ is called its fragment (=intent). The set of all candidates of sample $(G,M,I)$ is denoted by $L(G,M,I)$.


It is an easy exercise to check that $L(G,M,I)$ is a lattice with operations


$\langle A_{1},B_{1}\rangle\vee\langle A_{2},B_{2}\rangle= \langle(A_{1}\cup A_{2})'',B_{1}\cap B_{2}\rangle, \\ \langle A_{1},B_{1}\rangle\wedge\langle A_{2},B_{2}\rangle= \langle A_{1}\cap A_{2},(B_{1}\cup B_{2})''\rangle.$
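
Using the polars sketched above, the closure operators and these two lattice operations can be written down directly (again a sketch; the helper names are mine):

def closure_objects(A, ctx):
    """A'' : double polar of an object set."""
    return polar_attributes(polar_objects(A, ctx), ctx)

def closure_attributes(B, ctx):
    """B'' : double polar of an attribute set."""
    return polar_objects(polar_attributes(B, ctx), ctx)

def join(c1, c2, ctx):
    (A1, B1), (A2, B2) = c1, c2
    return (closure_objects(A1 | A2, ctx), B1 & B2)

def meet(c1, c2, ctx):
    (A1, B1), (A2, B2) = c1, c2
    return (A1 & A2, closure_attributes(B1 | B2, ctx))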


We use a special case: for $\langle A,B \rangle\in L(G,M,I)$, $g\in G$, and $m\in M$ define


$CbO(\langle A,B\rangle,g) = \langle(A\cup\{g\})'',B\cap\{g\}'\rangle,\\ CbO(\langle A,B\rangle,m) = \langle A\cap\{m\}',(B\cup\{m\})''\rangle.$


We call these operations CbO because the first one is used in the well-known Close-by-One (CbO) algorithm to generate all elements of $L(G,M,I)$.
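
In the same sketchy Python as before, the two CbO operations read as follows:

def cbo_object(candidate, g, ctx):
    """CbO(<A,B>, g) = <(A U {g})'', B ∩ {g}'>."""
    A, B = candidate
    return (closure_objects(A | {g}, ctx), B & polar_objects({g}, ctx))

def cbo_attribute(candidate, m, ctx):
    """CbO(<A,B>, m) = <A ∩ {m}', (B U {m})''>."""
    A, B = candidate
    return (A & polar_attributes({m}, ctx), closure_attributes(B | {m}, ctx))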


The most important property of the CbO operations, monotonicity, is stated in the following Lemma.


Let $(G,M,I)$ be a sample, $\langle A_{1},B_{1}\rangle, \langle A_{2},B_{2}\rangle\in L(G,M,I)$, $g\in G$, and $m\in M$. Then


$\langle A_{1},B_{1}\rangle\leq \langle A_{2},B_{2}\rangle\Rightarrow CbO(\langle A_{1},B_{1}\rangle,g)\leq CbO(\langle A_{2},B_{2}\rangle,g), \\ \langle A_{1},B_{1}\rangle\leq\langle A_{2},B_{2}\rangle\Rightarrow CbO(\langle A_{1},B_{1}\rangle,m)\leq CbO(\langle A_{2},B_{2}\rangle,m).$


2. Problems with FCA


Unfortunately, the author and his colleagues have discovered and investigated several theoretical shortcomings of the FCA-based approach to Machine Learning:


  1. The number of hypotheses can be exponentially large with respect to the size of the input data (training sample) in the worst case.


  2. The problem of detecting a large hypothesis is computationally (NP-)hard.


  3. Overfitting (overtraining) is unavoidable and occurs in practice.


  4. There are 'phantom' similarities between training examples, where each such parent has an alternative hypothesis about the cause of the target property.



To demonstrate drawback 1 we consider the Boolean algebra case, corresponding to a sample whose positive examples are the coatoms:


$\begin{array}{c|cccc} & m_{1} & m_{2} & \ldots & m_{n} \\ \hline g_{1} & 0 & 1 & \ldots & 1 \\ g_{2} & 1 & 0 & \ldots & 1 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ g_{n} & 1 & 1 & \ldots & 0 \end{array}$

Then it is easy to check that any pair $\langle G\setminus \{g_{i_{1}},\ldots,g_{i_{k}}\},\{m_{i_{1}},\ldots,m_{i_{k}}\}\rangle$ is a candidate. Hence there are $2^n$ candidates.


To appreciate the exponential growth of the output with respect to the input, note that for $n=32$ the sample itself occupies only $32\times 32$ bits, i.e. 128 bytes, while storing the $2^{n}$ candidates requires about $2^{37}$ bits, i.e. 16 gigabytes!
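
The exponential blow-up is easy to observe for small $n$ by brute force; the following sketch (re-using the polar helpers from Section 1) enumerates all candidates of the coatom sample:

from itertools import combinations

def coatom_context(n):
    """Sample where object g_i lacks exactly the attribute m_i."""
    G = {f"g{i}" for i in range(n)}
    M = {f"m{j}" for j in range(n)}
    I = {(f"g{i}", f"m{j}") for i in range(n) for j in range(n) if i != j}
    return G, M, I

def all_candidates(ctx):
    """Naive enumeration: every candidate has the form <B', B''> for some B ⊆ M."""
    G, M, I = ctx
    found = set()
    for k in range(len(M) + 1):
        for B in combinations(sorted(M), k):
            A = polar_attributes(set(B), ctx)
            found.add((frozenset(A), frozenset(polar_objects(A, ctx))))
    return found

for n in range(1, 8):
    print(n, len(all_candidates(coatom_context(n))))   # prints n, 2**n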


Drawback 2 was discovered by Prof. Sergei O. Kuznetsov (HSE Moscow).


Shortcomings 3 and 4 were discovered by the author during his D.Sc. research. He introduced several probabilistic models for generating 'phantom' similarities together with corresponding counter-examples that refute them. The clearest result is an asymptotic theorem asserting that the probability of generating a 'phantom' similarity between two parents without counter-examples tends to


$1-e^{-a}-a\cdot{e^{-a}}\cdot\left[1-e^{-c\cdot\sqrt{a}}\right], $


when the probability of appearance of each attribute (considered as an i.i.d. Bernoulli variable) is $p=\sqrt{a/n}\to 0$, the number of counter-examples is $m=c\cdot\sqrt{n}\to\infty$, and the number of attributes $n\to\infty$ as well.


Note that even the smaller number $1-e^{-a}-a\cdot{e^{-a}}$ is positive, since it coincides with the probability that a Poisson random variable with mean $a$ takes a value $>1$.


Consult the author's D.Sc. thesis for more details and results.


3. Randomized Algorithms


The key idea of the VKF-method is to randomly generate a small subset of the lattice of candidates and to use its elements as hypothetical causes of the target property. This trick avoids the exponentially high computational complexity of the usual algorithms of FCA (and of the JSM-method too).


So we need algorithms that behave like random walks on a huge lattice and generate a candidate only when we need it.


The author invented and investigated the mathematical properties of several such algorithms (non-monotonic, monotonic, coupled, lazy coupled, and stopped coupled Markov chains). Details may be found in the author's D.Sc. thesis.


Now we present the coupled Markov chain algorithm that forms the core of the probabilistic approach to machine learning based on FCA (the VKF-method).


input: sample (G,M,I), external functions CbO(.,.)
result: random candidate <A,B>

X = G U M;                       // mixed set of objects and attributes
A = M'; B = M;                   // bottom candidate <M',M>
C = G;  D = G';                  // top candidate <G,G'>
while (A != C || B != D) {       // run until the two chains coalesce
        select random element x from X;
        <A,B> = CbO(<A,B>, x);
        <C,D> = CbO(<C,D>, x);
}
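
A direct Python transcription of this pseudocode, built on the helper functions sketched earlier, might look as follows (it assumes that object and attribute names do not clash):

import random

def coupled_markov_chain(ctx, rng=None):
    """Run two coupled chains from the bottom and the top candidate
    until they coalesce; return the common random candidate."""
    rng = rng or random.Random()
    G, M, I = ctx
    X = list(G) + list(M)
    lower = (polar_attributes(M, ctx), set(M))   # bottom candidate <M', M>
    upper = (set(G), polar_objects(G, ctx))      # top candidate <G, G'>
    while lower != upper:
        x = rng.choice(X)
        if x in G:
            lower, upper = cbo_object(lower, x, ctx), cbo_object(upper, x, ctx)
        else:
            lower, upper = cbo_attribute(lower, x, ctx), cbo_attribute(upper, x, ctx)
    return lower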

There exists a lazy variant of the coupled Markov chain. The author proved that lazy computations lead to an acceleration (with respect to the classical scheme above) of up to


$\frac{(n+k)^2}{2k\cdot n}=2+\frac{(n-k)^2}{2k\cdot n} \geq 2 $


times, where $n$ is the total number of attributes and $k$ is the number of training examples.
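
For illustration (the numbers are mine), with $n=100$ attributes and $k=10$ training examples the bound gives $\frac{(100+10)^2}{2\cdot 10\cdot 100}=6.05$, i.e. roughly a six-fold acceleration.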


This result matches well with the experimental estimates obtained by the former RSUH student Lyudmila A. Yakimova.


4. General Structure of VKF-method


In supervised Machine Learning there are two sets of objects called the training and test samples, respectively.


From the positive examples of the training sample the program generates a sample $(G,M,I)$. The negative examples form the set $O$ of counter-examples (obstacles for a candidate to become a VKF-hypothesis).


The set $T$ of tests contains all test objects whose target class is to be predicted.


The program invokes the lazy coupled Markov chain algorithm to generate a random candidate $\langle A,B\rangle\in L(G,M,I)$. The program saves the VKF-hypothesis $\langle A,B\rangle$ if there is no obstacle $o\in O$ such that $B\subseteq \{o\}'$.


The main Inductive Generalization Algorithm is the following


input: number N of VKF-hypotheses to generate
result: random sample S of requested VKF-hypotheses

S = {}; i = 0;                           // initialize the result set and the counter
while (i < N) {
        generate random candidate <A,B> for (G,M,I);
        hasObstacle = false;
        for (o in O) {                   // O is the set of counter-examples
            if (B is a part of {o}') hasObstacle = true;
        }
        if (hasObstacle == false) {
                S = S U {<A,B>};
                i = i + 1;
        }
}
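
A hedged Python sketch of this loop, using the random candidate generator above (here obstacles is assumed to be the list of attribute sets $\{o\}'$ of the counter-examples):

def inductive_generalization(ctx, obstacles, N, rng=None):
    """Collect N random VKF-hypotheses: candidates whose fragment
    is not contained in the description of any counter-example."""
    rng = rng or random.Random()
    S = []
    while len(S) < N:
        A, B = coupled_markov_chain(ctx, rng)
        # 'counter-example forbidden' test
        if not any(B <= o for o in obstacles):
            S.append((A, B))
    return S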

The condition $B\subseteq\{o\}'$ means the inclusion of the fragment $B$ of the candidate $\langle{A,B}\rangle$ into the fragment (attribute subset) of the counter-example $o$.


If a candidate avoids all such obstacles it is added to the result set of generated VKF-hypotheses.


We replace a time-consuming deterministic algorithm for generating all candidates (for instance, the well-known 'Close-by-One' algorithm) by a probabilistic one that randomly generates the prescribed number of VKF-hypotheses.


After that the Machine Learning system predicts the target class of the tests and compares the results of prediction with the original target values. This is the Prediction Algorithm:


input: list T of test examples to predict the target property 
input: random sample S of candidates without counter-examples
for (x in T) {
        target(x) = false;
        for (<A,B> in S) {
            if (B is a part of {x}') target(x) = true;
        }
}
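
The prediction step admits an equally short Python sketch; here tests is assumed to map each test object to its attribute set $\{x\}'$:

def predict(tests, hypotheses):
    """A test is predicted positive iff it contains the fragment
    of at least one VKF-hypothesis."""
    return {x: any(B <= descr for (_, B) in hypotheses)
            for x, descr in tests.items()}

# Toy usage: the rectangle is predicted positive by the hypothesis <{square, trapezoid}, {E}>.
print(predict({"rectangle": {"A", "C", "E"}},
              [({"square", "trapezoid"}, {"E"})]))   # {'rectangle': True}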

The worst situation occurs when some important positive test is missed by all generated VKF-hypotheses and therefore obtains the negative sign.


A test object $x$ is $\varepsilon$-important if the total probability of the VKF-hypotheses $\langle A,B\rangle$ with $B\subseteq\{x\}'$ exceeds $\varepsilon$.


The author proved a theorem that estimates the parameter $N$ of the Inductive Generalization Algorithm needed to avoid the worst case.


For $n=\left|{M}\right|$, any $\varepsilon>0$, and any $1>\delta>0$, a random sample $S$ of VKF-hypotheses of cardinality


$N\geq{\frac{2\cdot(n+1)-2\cdot\log_{2}{\delta} }{\varepsilon}} $


has, with probability $>{1-\delta}$, the property that every $\varepsilon$-important object $x$ contains a fragment of some VKF-hypothesis $\langle A,B\rangle\in{S}$, i.e. $B\subseteq\{x\}'$.
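
For example (illustrative numbers only), with $n=100$ attributes, $\varepsilon=0.05$, and $\delta=0.01$ the bound requires $N\geq\frac{2\cdot 101-2\cdot\log_{2}{0.01}}{0.05}\approx 4306$ random VKF-hypotheses.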


This theorem is an analogue of famous results of Prof. Vladimir N. Vapnik and Prof. Alexey Y. Chervonenkis from Computational Learning Theory.


Conclusion


The article describes the main mathematical aspects of a Machine Learning system based on Lattice Theory. The author calls it the 'VKF-system' in honour of his teacher, Prof. Victor K. Finn.


The last article of the series will be devoted to representations of objects with attributes of different types for application of the Learning Machine described here.


Discrete attributes again require some techniques from FCA. Continuous attributes call for logistic regression, entropy-based separation of their ranges into subintervals, and a representation corresponding to the convex envelope of the subintervals whose similarity is computed.


The author would like to thank his colleagues and students for their support and stimulus.
