Nowadays most people identify Machine Learning with the training of various kinds of neural networks. At the beginning there were fully connected networks, then convolutional and recurrent networks replaced them; now there exist quite exotic variants, such as GAN and LSTM networks.
Their training requires constantly increasing volumes of samples, and they are also unable to explain why a particular decision was made. Structural approaches to Machine Learning that avoid these drawbacks do exist; this article describes the software implementation of one of them. This is an English translation of the original post by the author.
We describe one of the national approaches to Machine Learning, called the «VKF-method of Machine Learning based on Lattice Theory». The origin and choice of the name are explained at the end of this article.
1. Method description
The initial system was created by the author as a console C++ application; it then gained support for MariaDB databases (through the mariadb++ library), and was finally converted into a CPython library (using the pybind11 package).
Several datasets from the UCI machine learning repository were selected to test the concept. The Mushrooms dataset contains descriptions of 8,124 North American mushrooms; on it the system achieves 100% accuracy. More precisely, the initial data was randomly divided into a training sample (2,088 edible and 1,944 poisonous mushrooms) and a test sample (2,120 edible and 1,972 poisonous mushrooms). After computing about 100 hypotheses about the causes of edibility, all test cases were predicted correctly. Since the algorithm uses a coupled Markov chain, the sufficient number of hypotheses may vary; it was often enough to generate 50 random hypotheses. Note that when generating the causes of poisonousness, the number of required hypotheses clusters around 120; however, all test cases are predicted correctly in this case too.
Kaggle.com has a Mushroom Classification competition where quite a few authors have achieved 100% accuracy. However, most of the solutions are neural networks. Our approach allows a mushroom picker to memorize only about 50 rules. Moreover, most features turn out to be insignificant, hence each hypothesis is a conjunction of a small number of values of essential features, which makes them easy to remember. After that, a person can go mushroom picking without being afraid of taking a toadstool or skipping an edible mushroom.
Here is a positive hypothesis that leads to the assumption that a mushroom is edible:
[('gill_attachment', 'free'), ('gill_spacing', 'close'), ('gill_size', 'broad'), ('stalk_shape', 'enlarging'), ('stalk_surface_below_ring', 'scaly'), ('veil_type', 'partial'), ('veil_color', 'white'), ('ring_number', 'one'), ('ring_type', 'pendant')]
Please note that only 9 of the 22 features are listed, since the similarity between the edible mushrooms that generate this cause is empty on the remaining 13 attributes.
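To make the use of such a hypothesis concrete, here is a purely illustrative sketch (not part of the author's library) of the embedding check that underlies prediction by analogy: a positive hypothesis, viewed as a set of feature-value pairs, votes for edibility when it is contained in the description of a test mushroom, and negative hypotheses vote against in the same way.

hypothesis = dict([('gill_attachment', 'free'), ('gill_spacing', 'close'),
                   ('gill_size', 'broad'), ('stalk_shape', 'enlarging'),
                   ('stalk_surface_below_ring', 'scaly'), ('veil_type', 'partial'),
                   ('veil_color', 'white'), ('ring_number', 'one'),
                   ('ring_type', 'pendant')])

def embedded(hypothesis, example):
    # True when every (feature, value) pair of the hypothesis
    # also occurs in the description of the test example
    return all(example.get(feature) == value
               for feature, value in hypothesis.items())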
The second dataset was SPECT Hearts. There, the accuracy of predicting test examples reached 86.1%, which turned out to be slightly higher than the result (84%) of the CLIP3 Machine Learning system (Cover Learning with Integer Programming, version 3) used by the authors of the data. I believe that, due to the structure of the description of heart tomograms, which are already pre-encoded by binary attributes, it is impossible to significantly improve the quality of the forecast.
Recently the author discovered (and implemented) an extension of his approach to processing data described by continuous (numeric) features. In some ways, this approach is similar to the C4.5 system of decision tree learning. The variant was tested on the Wine Quality dataset, which describes the quality of Portuguese wines. The results are encouraging: if you take high-quality red wines, the hypotheses fully explain their high ratings.
2. Framework choice
Now students at the Intelligent Systems Department of RSUH are developing a series of web servers for different research areas (using Nginx + Gunicorn + Django).
Here I'll describe a different variant (based on aiohttp, aiojobs, and aiomysql). The aiomcache module was rejected due to well-known security problems.
The proposed variant has several advantages:
- it uses the asynchronous framework aiohttp;
- it supports Jinja2 templates;
- it works with a pool of database connections through aiomysql;
- it spawns background jobs via aiojobs.aiohttp.spawn.
It also has obvious disadvantages (with respect to Django):
- no Object Relational Mapping (ORM);
- more difficult integration with Nginx as a proxy;
- no Django Template Language (DTL).
Each of the two options targets a different strategy of working with the web server. The synchronous strategy (in Django) is aimed at a single-user mode, in which the expert works with a single database at any given time. Although the probabilistic procedures of the VKF method are well parallelized, it is nevertheless theoretically possible that Machine Learning procedures will take a significant amount of time. Therefore, the second option is aimed at several experts, each of whom can simultaneously work (in different browser tabs) with different databases that differ not only in data, but also in the way the data are represented (different lattices on the values of discrete features, different significant regressions and numbers of thresholds for continuous ones). In this case, after starting a VKF computation in one tab, the expert can switch to another, where she will prepare or analyze an experiment with other data and/or parameters.
There is an auxiliary (service) database 'vkf' with two tables, 'users' and 'experiments', to keep track of multiple users, their experiments, and the stages those experiments are at. The table 'users' stores the login and password of all registered users. The table 'experiments' saves, in addition to the names of the auxiliary and main tables of each experiment, the status flags of these tables. We rejected the aiohttp_session module, since we still need to use the Nginx proxy server to protect critical data.
The structure of the table 'experiments' is the following (a possible CREATE TABLE sketch is given after the list):
- id int(11) NOT NULL PRIMARY KEY
- expName varchar(255) NOT NULL
- encoder varchar(255)
- goodEncoder tinyint(1)
- lattices varchar(255)
- goodLattices tinyint(1)
- complex varchar(255)
- goodComplex tinyint(1)
- verges varchar(255)
- goodVerges tinyint(1)
- vergesTotal int(11)
- trains varchar(255) NOT NULL
- goodTrains tinyint(1)
- tests varchar(255)
- goodTests tinyint(1)
- hypotheses varchar(255) NOT NULL
- goodHypotheses tinyint(1)
- type varchar(255) NOT NULL
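For reference, here is a minimal sketch of how this service table could be created with aiomysql. It is not part of the author's package; in particular, the AUTO_INCREMENT on id is an assumption, made because the INSERT in 'models.py' below passes NULL for that column.

import asyncio
import aiomysql

CREATE_EXPERIMENTS = """
    CREATE TABLE IF NOT EXISTS vkf.experiments (
        id int(11) NOT NULL AUTO_INCREMENT PRIMARY KEY,
        expName varchar(255) NOT NULL,
        encoder varchar(255), goodEncoder tinyint(1),
        lattices varchar(255), goodLattices tinyint(1),
        complex varchar(255), goodComplex tinyint(1),
        verges varchar(255), goodVerges tinyint(1), vergesTotal int(11),
        trains varchar(255) NOT NULL, goodTrains tinyint(1),
        tests varchar(255), goodTests tinyint(1),
        hypotheses varchar(255) NOT NULL, goodHypotheses tinyint(1),
        type varchar(255) NOT NULL
    )
"""

async def create_service_tables(host='127.0.0.1', password='toor'):
    # connect as an administrative user and create the service objects
    conn = await aiomysql.connect(host=host, user='root', password=password)
    async with conn.cursor() as cur:
        await cur.execute("CREATE DATABASE IF NOT EXISTS vkf")
        await cur.execute(CREATE_EXPERIMENTS)
        await conn.commit()
    conn.close()

# asyncio.get_event_loop().run_until_complete(create_service_tables())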
It should be noted that there are certain sequences of data preparation steps for ML experiments which, unfortunately, differ radically between the discrete and continuous cases. The case of mixed attributes combines both sets of requirements; one possible reading of the discrete chain is sketched right after the list.
discrete: => goodLattices (semi-automatic)
discrete: goodLattices => goodEncoder (automatic)
discrete: goodEncoder => goodTrains (semi-automatic)
discrete: goodEncoder, goodTrains => goodHypotheses (automatic)
discrete: goodEncoder => goodTests (semi-automatic)
discrete: goodTests, goodEncoder, goodHypotheses => (automatic)
continuous: => goodVerges (manual)
continuous: goodVerges => goodTrains (manual)
continuous: goodTrains => goodComplex (automatic)
continuous: goodComplex, goodTrains => goodHypotheses (automatic)
continuous: goodVerges => goodTests (manual)
continuous: goodTests, goodComplex, goodHypotheses => (automatic)
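The good* flags of the 'experiments' record encode this readiness. As a purely illustrative sketch (this is not the server code), the discrete chain can be read as a function that returns the next step an expert is allowed to perform:

def next_discrete_step(experiment):
    # experiment is a dict with the good* flags of an 'experiments' record
    if not experiment['goodLattices']:
        return 'define lattices on feature values (semi-automatic)'
    if not experiment['goodEncoder']:
        return 'compute the encoder table (automatic)'
    if not experiment['goodTrains']:
        return 'load the training sample (semi-automatic)'
    if not experiment['goodHypotheses']:
        return 'induce hypotheses (automatic)'
    if not experiment['goodTests']:
        return 'load the test sample (semi-automatic)'
    return 'predict the test cases (automatic)'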
The Machine Learning library is named vkf.cpython-36m-x86_64-linux-gnu.so under Linux and vkf.cp36-win32.pyd under Windows (36 indicates the Python version the library was built for).
The term «automatic» means the use of this library, «semi-automatic» means the use of the auxiliary library 'vkfencoder.cpython-36m-x86_64-linux-gnu.so'. Finally, the «manual» mode corresponds to external programs that process data with continuous features; these programs are now being moved into the vkfencoder library.
3. Implementation details
We follow the «View/Model/Control» paradigm when building the web server.
The Python code is distributed among 5 files:
- app.py — initialization
- control.py — coroutines of Machine Learning procedures
- models.py — data manipulation and DB connections
- settings.py — application settings
- views.py — visualizations and routes.
File 'app.py' has a standard form:
#!/usr/bin/env python
import asyncio
import jinja2
import aiohttp_jinja2
from settings import SITE_HOST as siteHost
from settings import SITE_PORT as sitePort
from aiohttp import web
from aiojobs.aiohttp import setup
from views import routes

async def init(loop):
    app = web.Application(loop=loop)
    # install aiojobs.aiohttp
    setup(app)
    # install jinja2 templates
    aiohttp_jinja2.setup(app,
        loader=jinja2.FileSystemLoader('./template'))
    # add routes from api/views.py
    app.router.add_routes(routes)
    return app

loop = asyncio.get_event_loop()
try:
    app = loop.run_until_complete(init(loop))
    web.run_app(app, host=siteHost, port=sitePort)
except:
    loop.stop()
I don't think anything needs to be explained here. The next file is 'views.py':
import aiohttp_jinja2
from aiohttp import web  # , WSMsgType
from aiojobs.aiohttp import spawn  # , get_scheduler
from models import User
from models import Expert
from models import Experiment
from models import Solver
from models import Predictor

routes = web.RouteTableDef()

@routes.view(r'/tests/{name}', name='test-name')
class Predict(web.View):
    @aiohttp_jinja2.template('tests.html')
    async def get(self):
        return {'explanation': 'Please, confirm prediction!'}

    async def post(self):
        data = await self.request.post()
        db_name = self.request.match_info['name']
        analogy = Predictor(db_name, data)
        await analogy.load_data()
        job = await spawn(self.request, analogy.make_prediction())
        return await job.wait()

@routes.view(r'/vkf/{name}', name='vkf-name')
class Generate(web.View):
    # @aiohttp_jinja2.template('vkf.html')
    async def get(self):
        db_name = self.request.match_info['name']
        solver = Solver(db_name)
        await solver.load_data()
        context = {'dbname': str(solver.dbname),
                   'encoder': str(solver.encoder),
                   'lattices': str(solver.lattices),
                   'good_lattices': bool(solver.good_lattices),
                   'verges': str(solver.verges),
                   'good_verges': bool(solver.good_verges),
                   'complex': str(solver.complex),
                   'good_complex': bool(solver.good_complex),
                   'trains': str(solver.trains),
                   'good_trains': bool(solver.good_trains),
                   'hypotheses': str(solver.hypotheses),
                   'type': str(solver.type)
                   }
        response = aiohttp_jinja2.render_template('vkf.html',
            self.request, context)
        return response

    async def post(self):
        data = await self.request.post()
        step = data.get('value')
        db_name = self.request.match_info['name']
        if step == 'init':
            location = self.request.app.router['experiment-name'].url_for(
                name=db_name)
            raise web.HTTPFound(location=location)
        solver = Solver(db_name)
        await solver.load_data()
        if step == 'populate':
            job = await spawn(self.request, solver.create_tables())
            return await job.wait()
        if step == 'compute':
            job = await spawn(self.request, solver.compute_tables())
            return await job.wait()
        if step == 'generate':
            hypotheses_total = int(data.get('hypotheses_total'))
            threads_total = int(data.get('threads_total'))
            job = await spawn(self.request, solver.make_induction(
                hypotheses_total, threads_total))
            return await job.wait()

@routes.view(r'/experiment/{name}', name='experiment-name')
class Prepare(web.View):
    @aiohttp_jinja2.template('expert.html')
    async def get(self):
        return {'explanation': 'Please, enter your data'}

    async def post(self):
        data = await self.request.post()
        db_name = self.request.match_info['name']
        experiment = Experiment(db_name, data)
        job = await spawn(self.request, experiment.create_experiment())
        return await job.wait()
I have shortened this file by dropping classes that serve auxiliary routes:
- The 'Auth' class corresponds to the root route '/' and outputs a request form for user identification. If the user is not registered, there is a SignIn button that redirects the user to the '/signin' route. If a user with the entered username and password is found, he or she is redirected to the route '/user/{name}'.
- The 'SignIn' class processes the '/signin' route and returns the user to the root route after successful registration.
- The 'Select' class processes the '/user/{name}' routes and asks which experiment and stage the user wants to work on. After checking whether such an experiment database exists, the user is redirected to the route '/vkf/{name}' or '/experiment/{name}'.
The remaining classes correspond to routes that are responsible for the Machine Learning procedures:
- The 'Prepare' class processes the '/experiment/{name}' routes and collects the names of service tables and numeric parameters necessary to run the VKF-method procedures. After saving this information in the database, the user is redirected to the route '/vkf/{name}'.
- The 'Generate' class processes routes '/vkf/{name}' and starts various stages of the VKF method induction procedure, depending on the data preparation by an expert.
- The 'Predict' class processes the routes '/tests/{name}' and starts the procedure of the VKF prediction by analogy.
To pass a large number of parameters to the 'vkf.html' template, the system uses the aiohttp_jinja2 construction
response = aiohttp_jinja2.render_template('vkf.html', self.request, context)
return response
Note the usage of spawn from aiojobs.aiohttp:
job = await spawn(self.request,
solver.make_induction(hypotheses_total, threads_total))
return await job.wait()
This is necessary to safely call coroutines defined in the file 'models.py', processing user and experiment data stored in a database managed by the MariaDB DBMS:
import asyncio
import aiomysql
from aiohttp import web

from settings import AUX_NAME as auxName
from settings import AUTH_TABLE as authTable
from settings import AUX_TABLE as auxTable
from settings import SECRET_KEY as secretKey
from settings import DB_HOST as dbHost

from control import createAuxTables
from control import createMainTables
from control import computeAuxTables
from control import induction
from control import prediction
class Experiment():
    def __init__(self, dbName, data, **kw):
        self.encoder = data.get('encoder_table')
        self.lattices = data.get('lattices_table')
        self.complex = data.get('complex_table')
        self.verges = data.get('verges_table')
        self.verges_total = data.get('verges_total')
        self.trains = data.get('training_table')
        self.tests = data.get('tests_table')
        self.hypotheses = data.get('hypotheses_table')
        self.type = data.get('type')
        self.auxname = auxName
        self.auxtable = auxTable
        self.dbhost = dbHost
        self.secret = secretKey
        self.dbname = dbName

    async def create_db(self, pool):
        async with pool.acquire() as conn:
            async with conn.cursor() as cur:
                await cur.execute("CREATE DATABASE IF NOT EXISTS " +
                    str(self.dbname))
                await conn.commit()
        await createAuxTables(self)

    async def register_experiment(self, pool):
        async with pool.acquire() as conn:
            async with conn.cursor() as cur:
                sql = "INSERT INTO " + str(self.auxname) + "." + str(self.auxtable)
                sql += " VALUES(NULL, '"
                sql += str(self.dbname)
                sql += "', '"
                sql += str(self.encoder)
                sql += "', 0, '"  # goodEncoder
                sql += str(self.lattices)
                sql += "', 0, '"  # goodLattices
                sql += str(self.complex)
                sql += "', 0, '"  # goodComplex
                sql += str(self.verges)
                sql += "', 0, "  # goodVerges
                sql += str(self.verges_total)
                sql += ", '"
                sql += str(self.trains)
                sql += "', 0, '"  # goodTrains
                sql += str(self.tests)
                sql += "', 0, '"  # goodTests
                sql += str(self.hypotheses)
                sql += "', 0, '"  # goodHypotheses
                sql += str(self.type)
                sql += "')"
                await cur.execute(sql)
                await conn.commit()

    async def create_experiment(self, **kw):
        pool = await aiomysql.create_pool(host=self.dbhost,
            user='root', password=self.secret)
        task1 = self.create_db(pool=pool)
        task2 = self.register_experiment(pool=pool)
        tasks = [asyncio.ensure_future(task1),
                 asyncio.ensure_future(task2)]
        await asyncio.gather(*tasks)
        pool.close()
        await pool.wait_closed()
        raise web.HTTPFound(location='/vkf/' + self.dbname)
class Solver():
    def __init__(self, dbName, **kw):
        self.auxname = auxName
        self.auxtable = auxTable
        self.dbhost = dbHost
        self.dbname = dbName
        self.secret = secretKey

    async def load_data(self, **kw):
        pool = await aiomysql.create_pool(host=dbHost,
            user='root', password=secretKey, db=auxName)
        async with pool.acquire() as conn:
            async with conn.cursor() as cur:
                sql = "SELECT * FROM "
                sql += str(auxTable)
                sql += " WHERE expName='"
                sql += str(self.dbname)
                sql += "'"
                await cur.execute(sql)
                row = cur.fetchone()
                await cur.close()
        pool.close()
        await pool.wait_closed()
        self.encoder = str(row.result()[2])
        self.good_encoder = bool(row.result()[3])
        self.lattices = str(row.result()[4])
        self.good_lattices = bool(row.result()[5])
        self.complex = str(row.result()[6])
        self.good_complex = bool(row.result()[7])
        self.verges = str(row.result()[8])
        self.good_verges = bool(row.result()[9])
        self.verges_total = int(row.result()[10])
        self.trains = str(row.result()[11])
        self.good_trains = bool(row.result()[12])
        self.hypotheses = str(row.result()[15])
        self.good_hypotheses = bool(row.result()[16])
        self.type = str(row.result()[17])
    async def create_tables(self, **kw):
        await createMainTables(self)
        pool = await aiomysql.create_pool(host=self.dbhost, user='root',
            password=self.secret, db=self.auxname)
        async with pool.acquire() as conn:
            async with conn.cursor() as cur:
                sql = "UPDATE "
                sql += str(self.auxtable)
                sql += " SET goodEncoder=1 WHERE expName='"
                sql += str(self.dbname)
                sql += "'"
                await cur.execute(sql)
                await conn.commit()
                await cur.close()
        pool.close()
        await pool.wait_closed()
        raise web.HTTPFound(location='/vkf/' + self.dbname)

    async def compute_tables(self, **kw):
        await computeAuxTables(self)
        pool = await aiomysql.create_pool(host=self.dbhost, user='root',
            password=self.secret, db=self.auxname)
        async with pool.acquire() as conn:
            async with conn.cursor() as cur:
                sql = "UPDATE "
                sql += str(self.auxtable)
                sql += " SET goodComplex=1 WHERE expName='"
                sql += str(self.dbname)
                sql += "'"
                await cur.execute(sql)
                await conn.commit()
                await cur.close()
        pool.close()
        await pool.wait_closed()
        raise web.HTTPFound(location='/vkf/' + self.dbname)

    async def make_induction(self, hypotheses_total, threads_total, **kw):
        await induction(self, hypotheses_total, threads_total)
        pool = await aiomysql.create_pool(host=self.dbhost, user='root',
            password=self.secret, db=self.auxname)
        async with pool.acquire() as conn:
            async with conn.cursor() as cur:
                sql = "UPDATE "
                sql += str(self.auxtable)
                sql += " SET goodHypotheses=1 WHERE expName='"
                sql += str(self.dbname)
                sql += "'"
                await cur.execute(sql)
                await conn.commit()
                await cur.close()
        pool.close()
        await pool.wait_closed()
        raise web.HTTPFound(location='/tests/' + self.dbname)
class Predictor():
    def __init__(self, dbName, data, **kw):
        self.auxname = auxName
        self.auxtable = auxTable
        self.dbhost = dbHost
        self.dbname = dbName
        self.secret = secretKey
        self.plus = 0
        self.minus = 0

    async def load_data(self, **kw):
        pool = await aiomysql.create_pool(host=dbHost, user='root',
            password=secretKey, db=auxName)
        async with pool.acquire() as conn:
            async with conn.cursor() as cur:
                sql = "SELECT * FROM "
                sql += str(auxTable)
                sql += " WHERE expName='"
                sql += str(self.dbname)
                sql += "'"
                await cur.execute(sql)
                row = cur.fetchone()
                await cur.close()
        pool.close()
        await pool.wait_closed()
        self.encoder = str(row.result()[2])
        self.good_encoder = bool(row.result()[3])
        self.complex = str(row.result()[6])
        self.good_complex = bool(row.result()[7])
        self.verges = str(row.result()[8])
        self.trains = str(row.result()[11])
        self.tests = str(row.result()[13])
        self.good_tests = bool(row.result()[14])
        self.hypotheses = str(row.result()[15])
        self.good_hypotheses = bool(row.result()[16])
        self.type = str(row.result()[17])

    async def make_prediction(self, **kw):
        if self.good_tests and self.good_hypotheses:
            await induction(self, 0, 1)
            await prediction(self)
            message_body = str(self.plus)
            message_body += " correct positive cases. "
            message_body += str(self.minus)
            message_body += " correct negative cases."
            raise web.HTTPException(body=message_body)
        else:
            raise web.HTTPFound(location='/vkf/' + self.dbname)
Again, some auxiliary classes are omitted:
- The 'User' class corresponds to a site visitor. It allows the user to register and log in as an expert.
- The 'Expert' class allows an expert to select one of the experiments.
The remaining classes correspond to the main procedures:
- The 'Experiment' class allows an expert to set the names of the key and auxiliary tables and the parameters needed to run VKF experiments.
- The 'Solver' class is responsible for inductive generalization in the VKF method.
- The 'Predictor' class is responsible for predictions by analogy in the VKF method.
It is important to use the create_pool() procedure from aiomysql. It creates multiple connections to a database simultaneously. To terminate database communication safely, the system uses the ensure_future() and gather() procedures from the asyncio module.
pool = await aiomysql.create_pool(host=self.dbhost,
    user='root', password=self.secret)
task1 = self.create_db(pool=pool)
task2 = self.register_experiment(pool=pool)
tasks = [asyncio.ensure_future(task1),
         asyncio.ensure_future(task2)]
await asyncio.gather(*tasks)
pool.close()
await pool.wait_closed()
The construction row = cur.fetchone() returns a Future; hence row.result() corresponds to the record of the table, from which individual field values can be extracted (for example, str(row.result()[2]) extracts the name of the table with the encodings of discrete feature values).
pool = await aiomysql.create_pool(host=dbHost, user='root',
    password=secretKey, db=auxName)
async with pool.acquire() as conn:
    async with conn.cursor() as cur:
        await cur.execute(sql)
        row = cur.fetchone()
        await cur.close()
pool.close()
await pool.wait_closed()
self.encoder = str(row.result()[2])
Key system parameters are imported from the file '.env' or (if it is absent) directly from the file 'settings.py'; a sample '.env' is shown after the listing.
from os.path import isfile
from envparse import env

if isfile('.env'):
    env.read_envfile('.env')

AUX_NAME = env.str('AUX_NAME', default='vkf')
AUTH_TABLE = env.str('AUTH_TABLE', default='users')
AUX_TABLE = env.str('AUX_TABLE', default='experiments')
DB_HOST = env.str('DB_HOST', default='127.0.0.1')
DB_PORT = env.int('DB_PORT', default=3306)
DEBUG = env.bool('DEBUG', default=False)
SECRET_KEY = env.str('SECRET_KEY', default='toor')
SITE_HOST = env.str('HOST', default='127.0.0.1')
SITE_PORT = env.int('PORT', default=8080)
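For local development, a '.env' file matching these settings might look as follows (the values here are illustrative placeholders, not the author's actual configuration):

AUX_NAME=vkf
AUTH_TABLE=users
AUX_TABLE=experiments
DB_HOST=127.0.0.1
DB_PORT=3306
DEBUG=False
SECRET_KEY=toor
HOST=127.0.0.1
PORT=8080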
It is important to note that localhost must be specified by IP address; otherwise aiomysql will try to connect to the database via a Unix socket, which may not work under Windows.
Finally, the file 'control.py' has the following form:
import os
import asyncio
import vkf

async def createAuxTables(db_data):
    if db_data.type != "discrete":
        await vkf.CAttributes(db_data.verges, db_data.dbname,
            '127.0.0.1', 'root', db_data.secret)
    if db_data.type != "continuous":
        await vkf.DAttributes(db_data.encoder, db_data.dbname,
            '127.0.0.1', 'root', db_data.secret)
        await vkf.Lattices(db_data.lattices, db_data.dbname,
            '127.0.0.1', 'root', db_data.secret)

async def createMainTables(db_data):
    if db_data.type == "continuous":
        await vkf.CData(db_data.trains, db_data.verges,
            db_data.dbname, '127.0.0.1', 'root', db_data.secret)
        await vkf.CData(db_data.tests, db_data.verges,
            db_data.dbname, '127.0.0.1', 'root', db_data.secret)
    if db_data.type == "discrete":
        await vkf.FCA(db_data.lattices, db_data.encoder,
            db_data.dbname, '127.0.0.1', 'root', db_data.secret)
        await vkf.DData(db_data.trains, db_data.encoder,
            db_data.dbname, '127.0.0.1', 'root', db_data.secret)
        await vkf.DData(db_data.tests, db_data.encoder,
            db_data.dbname, '127.0.0.1', 'root', db_data.secret)
    if db_data.type == "full":
        await vkf.FCA(db_data.lattices, db_data.encoder,
            db_data.dbname, '127.0.0.1', 'root', db_data.secret)
        await vkf.FData(db_data.trains, db_data.encoder, db_data.verges,
            db_data.dbname, '127.0.0.1', 'root', db_data.secret)
        await vkf.FData(db_data.tests, db_data.encoder, db_data.verges,
            db_data.dbname, '127.0.0.1', 'root', db_data.secret)

async def computeAuxTables(db_data):
    if db_data.type != "discrete":
        async with vkf.Join(db_data.trains, db_data.dbname, '127.0.0.1',
                'root', db_data.secret) as join:
            await join.compute_save(db_data.complex, db_data.dbname,
                '127.0.0.1', 'root', db_data.secret)
        await vkf.Generator(db_data.complex, db_data.trains, db_data.verges,
            db_data.dbname, db_data.dbname, db_data.verges_total, 1,
            '127.0.0.1', 'root', db_data.secret)

async def induction(db_data, hypothesesNumber, threadsNumber):
    if db_data.type != "discrete":
        qualifier = await vkf.Qualifier(db_data.verges,
            db_data.dbname, '127.0.0.1', 'root', db_data.secret)
        beget = await vkf.Beget(db_data.complex, db_data.dbname,
            '127.0.0.1', 'root', db_data.secret)
    if db_data.type != "continuous":
        encoder = await vkf.Encoder(db_data.encoder, db_data.dbname,
            '127.0.0.1', 'root', db_data.secret)
    async with vkf.Induction() as induction:
        if db_data.type == "continuous":
            await induction.load_continuous_hypotheses(qualifier, beget,
                db_data.trains, db_data.hypotheses, db_data.dbname,
                '127.0.0.1', 'root', db_data.secret)
        if db_data.type == "discrete":
            await induction.load_discrete_hypotheses(encoder,
                db_data.trains, db_data.hypotheses, db_data.dbname,
                '127.0.0.1', 'root', db_data.secret)
        if db_data.type == "full":
            await induction.load_full_hypotheses(encoder, qualifier, beget,
                db_data.trains, db_data.hypotheses, db_data.dbname,
                '127.0.0.1', 'root', db_data.secret)
        if hypothesesNumber > 0:
            await induction.add_hypotheses(hypothesesNumber, threadsNumber)
        if db_data.type == "continuous":
            await induction.save_continuous_hypotheses(qualifier,
                db_data.hypotheses, db_data.dbname, '127.0.0.1', 'root',
                db_data.secret)
        if db_data.type == "discrete":
            await induction.save_discrete_hypotheses(encoder,
                db_data.hypotheses, db_data.dbname, '127.0.0.1', 'root',
                db_data.secret)
        if db_data.type == "full":
            await induction.save_full_hypotheses(encoder, qualifier,
                db_data.hypotheses, db_data.dbname, '127.0.0.1', 'root',
                db_data.secret)

async def prediction(db_data):
    if db_data.type != "discrete":
        qualifier = await vkf.Qualifier(db_data.verges,
            db_data.dbname, '127.0.0.1', 'root', db_data.secret)
        beget = await vkf.Beget(db_data.complex, db_data.dbname,
            '127.0.0.1', 'root', db_data.secret)
    if db_data.type != "continuous":
        encoder = await vkf.Encoder(db_data.encoder,
            db_data.dbname, '127.0.0.1', 'root', db_data.secret)
    async with vkf.Induction() as induction:
        if db_data.type == "continuous":
            await induction.load_continuous_hypotheses(qualifier, beget,
                db_data.trains, db_data.hypotheses, db_data.dbname,
                '127.0.0.1', 'root', db_data.secret)
        if db_data.type == "discrete":
            await induction.load_discrete_hypotheses(encoder,
                db_data.trains, db_data.hypotheses, db_data.dbname,
                '127.0.0.1', 'root', db_data.secret)
        if db_data.type == "full":
            await induction.load_full_hypotheses(encoder, qualifier, beget,
                db_data.trains, db_data.hypotheses, db_data.dbname,
                '127.0.0.1', 'root', db_data.secret)
        if db_data.type == "continuous":
            async with vkf.TestSample(qualifier, induction, beget,
                    db_data.tests, db_data.dbname, '127.0.0.1', 'root',
                    db_data.secret) as tests:
                #plus = await tests.correct_positive_cases()
                db_data.plus = await tests.correct_positive_cases()
                #minus = await tests.correct_negative_cases()
                db_data.minus = await tests.correct_negative_cases()
        if db_data.type == "discrete":
            async with vkf.TestSample(encoder, induction,
                    db_data.tests, db_data.dbname, '127.0.0.1', 'root',
                    db_data.secret) as tests:
                #plus = await tests.correct_positive_cases()
                db_data.plus = await tests.correct_positive_cases()
                #minus = await tests.correct_negative_cases()
                db_data.minus = await tests.correct_negative_cases()
        if db_data.type == "full":
            async with vkf.TestSample(encoder, qualifier, induction,
                    beget, db_data.tests, db_data.dbname, '127.0.0.1',
                    'root', db_data.secret) as tests:
                #plus = await tests.correct_positive_cases()
                db_data.plus = await tests.correct_positive_cases()
                #minus = await tests.correct_negative_cases()
                db_data.minus = await tests.correct_negative_cases()
I retain this file in full, since here you can see the names and the calling order of the arguments of the VKF-method procedures from the library 'vkf.cpython-36m-x86_64-linux-gnu.so'. All arguments after dbname can be omitted, since the default values in the CPython library are set to standard values.
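For example, given the note above, a shortened call like the following (a hypothetical illustration that relies on the library defaults for host, user, and password) should behave like the full form used in 'control.py':

encoder = await vkf.Encoder(db_data.encoder, db_data.dbname)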
4. Some comments
Anticipating the question from professional programmers of why the logic that controls the VKF experiment is written out explicitly (through numerous if statements) rather than hidden behind polymorphism over types, the answer is this: unfortunately, the dynamic typing of Python does not allow the decision about the type of the object in use to be moved elsewhere, so this sequence of nested if statements would appear in any case. Therefore, the author preferred explicit (C-like) syntax to make the logic as transparent (and efficient) as possible.
Let me comment on the missing components:
- At present, databases for experiments with discrete attributes only are prepared with the library 'vkfencoder.cpython-36m-x86_64-linux-gnu.so' (students are building a web interface for it, while the author calls the corresponding methods directly, since he is still working on localhost). For continuous features, work is underway to incorporate the corresponding methods into 'vkfencoder.cpython-36m-x86_64-linux-gnu.so' as well.
- Hypotheses are currently displayed by third-party MariaDB client programs (the author uses DBeaver 7.1.1 Community, but there are a large number of analogues). Students are developing a prototype system using the Django framework, where the ORM will allow experts to view hypotheses in a convenient way.
5. History of the method
The author has been engaged in data mining for more than 30 years. After graduating from the Mathematics Department of Lomonosov Moscow State University, he was invited to a group of researchers under the leadership of Professor Victor K. Finn (VINITI, USSR Academy of Sciences). Victor K. Finn has been researching plausible reasoning and its formalization by means of multi-valued logics since the early 1980s.
The key ideas proposed by V. K. Finn are the following:
- Using the binary similarity operation (originally, the intersection operation in Boolean algebra);
- The idea of rejecting the generated similarity of a group of training examples if it is embedded in an example of the opposite sign (counter-example);
- The idea of predicting the target property of test examples by taking into account the arguments for and against;
- The idea of checking the completeness of a set of hypotheses by finding the reasons (among the generated similarities) for the presence or absence of a target property for every training example.
It should be noted that V.K. Finn attributes some of his ideas to foreign authors. Perhaps only the logic of argumentation is rightfully considered his own invention. The idea of taking counter-examples into account he borrowed, according to him, from K.R. Popper. The origins of checking the completeness of inductive generalization go back to the (completely obscure, in my opinion) works of the American mathematician and logician C.S. Peirce. He considers the generation of hypotheses about causes by means of the similarity operation to be borrowed from the ideas of the British economist, philosopher and logician J.S. Mill. Therefore, he called this set of ideas the «JSM-method» in honor of J.S. Mill.
Strangely, the much more useful ideas of Professor Rudolf Wille (Germany), which appeared in the late 1970s and form a modern branch of algebraic Lattice Theory (the so-called Formal Concept Analysis, FCA), are not respected by Professor V.K. Finn. In my opinion, the reason for this is the unfortunate name, which makes it suspect to a person who graduated first from a faculty of philosophy and then from the engineering retraining program at the Mathematics Department of Lomonosov Moscow State University.
As a continuation of the work of his teacher, the author named his approach the «VKF-method» in his honor. However, in Russian there is another reading: a probabilistic-combinatorial formal ('veroyatnostno kombinatornyi formalnyi') method of Machine Learning based on Lattice Theory.
Today V.K. Finn's group works at the Dorodnitsyn Computing Center of the Russian Academy of Sciences and at the Intelligent Systems Department of the Russian State University for the Humanities (RSUH).
For more information about the mathematics of the VKF-solver, see the author's dissertations or his video lectures at Ulyanovsk State University (the author is grateful to A.B. Verevkin and N.G. Baranets for organizing the lectures and processing their recordings).
The full package of source files is stored on Bitbucket.
Source files (in C++) for the vkf library are in the process of being approved for placement on savannah.nongnu.org. If the decision is positive, the download link will be added here.
One final note: the author started learning Python on April 6, 2020. Prior to that, the only language he had programmed in was C++. However, this fact does not absolve him of possible inaccuracies in the code.
The author expresses his heartfelt gratitude to Tatyana A. Volkova robofreak for her support, constructive suggestions, and critical comments that made it possible to significantly improve the presentation (and even significantly simplify the code). However, the author is solely responsible for the remaining errors and decisions made (even against her advice).