Data Mining *

Deep data analysis

kentavr009 Jul 29 2024 at 06:41

Data labeling – training on cats

Easy

8 min

578

Data Mining*, Big Data*

Tutorial

Translation

At some point, while diving deeper into process automation, you run into the need for data labeling, even though just a couple of weeks ago you and data labeling were standing in different rooms at a party called "Earnings on the Internet". Or rather, you were standing by the pool, while data labeling was on the third floor, smoking on the balcony with machine learning experts. How did we meet? Probably someone pushed it off the balcony into the pool, and I helped it out, soaking my clothes along the way.

SergeyBPshenichnikov Feb 6 2024 at 14:12

ALGEBRA OF MUSICAL TEXT

Medium

5 min

375

Natural Language Processing*, Data Mining*, Entertaining tasks, Mathematics*

FAQ

Translation

Sergey Pshenichnikov, Tatiana Sotnikova

Trio Sapiens

Musical text can be represented using matrix units, in the same way as verbal texts and other symbolic sequences. In the future, mathematical recognition and creation of musical sense, with a substantive justification for the intermediate calculations (as opposed to AI), may become possible.

Sound has four properties: pitch, duration, volume, and timbre. Timbre is not considered yet. The dictionary of the algebra of musical texts is built on the basis of musical notation for the piano.

The duration here, for the sake of brevity of the first presentation, is considered as «absolute». «Relative» is not considered, although intervals are very well studied, and their features will be needed to categorize composers.

Musical text is hard to treat mathematically because its notation was designed to make notes easier for musicians to read and to minimize the use of additional (ledger) lines below and above the staff.

To apply text algebra to musical symbolic sequences there is no need to use a five-line staff. What is useful and familiar to musicians is «unbearably harmful» for the use of algebra. It seems advisable to use a one-line staff. In this case, the musical text becomes like the verbal text.

To solve the problem, you need to find a transformation of the canonical musical text into a «thread». And, as always with a new application of algebra, a correct coordination of the subject area is necessary. In this case, every note sign and symbol of modern musical notation that is used must be assigned its own serial number (a natural number).

Instead of a sign, you can use the name of each note symbol; the result is a verbal notation of musical texts written in one line (a «thread»).

Since the musical scale is completely represented by the piano keyboard, the first (pitch) section of the dictionary of musical texts consists of the 88 numbered white and black keys (of which 52 are white). This eliminates the need for dividing the scale into octaves, for octave shift signs, clefs, the five accidental signs (both key-signature and incidental), and for diatonic and chromatic semitones.

All notes of the scale become fundamental in algebraic musical notation. There are an order of magnitude more of them than the basic scale degrees of Guido of Arezzo, but the accidentals and octave names, whose use made musical texts algebraically incompatible with verbal texts, disappear. The numbers from 1 to 88 in algebraic notation constitute a fragment of the pitch dictionary for the one-line «thread» staff.
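The numbering of the 88 keys can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes the common convention that key 1 is the lowest A (A0 in scientific pitch notation) and key 88 is the highest C (C8). The note names and the helper functions `key_number` and `is_white` are hypothetical and do not come from the article.

```python
# Semitone offset of each note name inside an octave that starts at C.
# Scientific pitch notation (A0 ... C8) is an assumption of this sketch.
NOTE_OFFSETS = {
    "C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
    "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11,
}
WHITE_PITCH_CLASSES = {0, 2, 4, 5, 7, 9, 11}  # C D E F G A B


def key_number(name, octave):
    """Return the serial number (1..88) of a piano key, with A0 as key 1."""
    number = 12 * octave + NOTE_OFFSETS[name] - 8
    if not 1 <= number <= 88:
        raise ValueError(f"{name}{octave} is outside the 88-key range")
    return number


def is_white(number):
    """Return True if the key with this serial number is a white key."""
    return (number + 8) % 12 in WHITE_PITCH_CLASSES


if __name__ == "__main__":
    print(key_number("A", 0), key_number("C", 4), key_number("C", 8))
    print(sum(is_white(n) for n in range(1, 89)), "white keys")
```

With this numbering a melody becomes a plain sequence of natural numbers, which is the «thread» form the text describes; accidentals and octave names are no longer needed.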

The numbering (coordination) of notes is needed because the numbers will later serve as indices of mathematical objects (matrix units) that replace the note signs or their names. These matrix units are binary generalizations of integers (hyperbinary numbers). The operation of division with remainder is defined for them, as for integers. The operation will allow you to divide musical texts and their f

SergeyBPshenichnikov Feb 6 2024 at 13:48

ALGEBRA OF SENSE

Medium

12 min

235

Natural Language Processing*, Data Mining*

FAQ

Translation

Sergey Pshenichnikov

Sign sequences (for example, verbal and musical texts) can be turned into mathematical objects. Words and numbers become one entity, a representation of a matrix unit, which is a matrix generalization of integers and a hypercomplex number. A matrix unit is a matrix in which one element is equal to one and the rest are zeros.

If the words of the text are represented by such matrices, then concatenation (combination while maintaining order) of words and texts becomes an operation of adding matrices.
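A minimal sketch of this idea in plain Python follows. It is an illustration, not the author's construction: it assumes that a word at position `pos` with dictionary index `idx` is represented by the matrix unit with a one in row `pos` and column `idx`, and all function names (`matrix_unit`, `add`, `encode`, `decode`) are hypothetical.

```python
def matrix_unit(row, col, size):
    """Square matrix with a single one at (row, col) and zeros elsewhere."""
    m = [[0] * size for _ in range(size)]
    m[row][col] = 1
    return m


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def encode(words):
    """Turn a word sequence into a sum of matrix units.

    Assumption of this sketch: row = position in the text,
    column = index of the word in a dictionary built on first occurrence.
    """
    vocab = {}
    for w in words:
        vocab.setdefault(w, len(vocab))
    size = len(words)  # never smaller than the dictionary size
    total = [[0] * size for _ in range(size)]
    for pos, w in enumerate(words):
        total = add(total, matrix_unit(pos, vocab[w], size))
    return total, vocab


def decode(matrix, vocab):
    """Read the word sequence back, one word per non-empty row."""
    by_index = {i: w for w, i in vocab.items()}
    return [by_index[row.index(1)] for row in matrix if 1 in row]


if __name__ == "__main__":
    text = "to be or not to be".split()
    m, vocab = encode(text)
    for row in m:
        print(row)
    print(decode(m, vocab))
```

Because each word occupies its own row, adding the matrix units keeps the word order, so the sum can be read back into the original sequence.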

You can perform transformations with texts using algebraic operations, for example, dividing one text by another with a remainder. You can mathematically recognize the sense of a text and calculate the context of words. In this case, algebra helps to interpret all the intermediate stages of the calculations.

A person sees and hears only what he understands (J.W. Goethe). He understands what he attaches sense to as significant for himself. Sense is subjective and depends on the interests, motivations, and feelings of different people.

L. S. Vygotsky distinguished between the concepts of «sense» and «meaning»: «if the «meaning» of a word is an objective reflection of a system of connections and relationships, then «sense» is the introduction of subjective aspects of meaning according to a given moment and situation».

According to G. Frege, «meaning» is the properties and relationships of objects, while «sense» is only a part of these properties. In this case, both «meaning» and «sense» are denoted by one «sign», for example a word. Two people can choose two non-overlapping fragments (two senses) from the list of meanings of one word to interpret it.

IgKend Oct 21 2022 at 18:42

How Yandex Made Their Biggest Improvement in the Search Engine with the Help of Toloka

5 min

2.3K

Search engines*, Data Mining*, Machine learning*, Artificial Intelligence, Data Engineering*

Tutorial

Toloka is a crowdsourcing platform and microtasking project launched by Yandex to quickly mark up large amounts of data. But how can such a simple concept play a crucial role in improving the work of neural networks?

vldmrvslv Jun 29 2022 at 14:24

Detecting attempts of mass influencing via social networks using NLP. Part 2

3 min

1.1K

Data Mining*, Twitter API*, Natural Language Processing*, Python*, Big Data*

Tutorial

In Part 1 of this article, I built and compared two classifiers to detect trolls on Twitter. You can check it out here.

Now the time has come to look more deeply into the datasets and find some patterns using exploratory data analysis and topic modelling.

EDA

To do just that, I first created a word cloud of the most common words, which you can see below.

vldmrvslv Jun 29 2022 at 14:20

Detecting attempts of mass influencing via social networks using NLP. Part 1

5 min

1.6K

Twitter API*, Natural Language Processing*, Data Mining*, Python*, Big Data*

Tutorial

Over the last few decades, the world’s population has been developing into an information society, which means that information has come to play a substantial end-to-end role in all aspects and processes of life. In view of the growing demand for a free flow of information, social networks have become a force to be reckoned with. The ways of waging war have also changed: instead of conventional weapons, governments now use political warfare, including fake news, a type of propaganda aimed at deliberate disinformation or hoaxes. And the lack of content control mechanisms makes it easy to spread any information as long as people believe in it.

Based on this premise, I’ve decided to experiment with different NLP approaches and build a classifier that could be used to detect either bots or fake content generated by trolls on Twitter in order to influence people.

In this first part of the article, I will cover the data collection process, preprocessing, feature extraction, classification itself and the evaluation of the models’ performance. In Part 2, I will dive deeper into the troll problem, conduct exploratory analysis to find patterns in the trolls’ behaviour and define the topics that seemed of great interest to them back in 2016.

Features for analysis

Of all the data available (hashtags, account language, tweet text, URLs, external links or references, tweet date and time), I settled on English tweet text, Russian tweet text and hashtags. Tweet text is the main feature for analysis because it carries almost all the characteristics typical of trolling in general, such as abuse, rudeness, references to external resources, provocations and bullying. Hashtags were chosen as another source of textual information because they express the central message of a tweet in one or two words.
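As a rough illustration of separating these two features, the sketch below splits a raw tweet into its plain text and its list of hashtags. It is not the author's preprocessing code, which is not shown in this excerpt; the function name `split_features` and the regular expression are assumptions made for the example.

```python
import re

# "#" followed by word characters; \w is Unicode-aware for str patterns,
# so the same pattern matches English and Russian hashtags.
HASHTAG_RE = re.compile(r"#(\w+)")


def split_features(tweet):
    """Return (text_without_hashtags, lowercased_hashtags) for one tweet."""
    hashtags = [tag.lower() for tag in HASHTAG_RE.findall(tweet)]
    text = HASHTAG_RE.sub(" ", tweet)
    text = " ".join(text.split())  # collapse the leftover whitespace
    return text, hashtags


if __name__ == "__main__":
    print(split_features("Vote now! #Election2016 #news"))
    print(split_features("Привет #Москва"))
```

Keeping the hashtags as a separate list lets them be vectorized independently of the tweet text, which matches the idea of treating them as a second textual feature.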

AlexZus Oct 1 2021 at 16:27

Millions of orders per second matching engine testing

4 min

10K

Big Data*, C++*, Data Engineering*, Data Mining*

From sandbox

Some time ago I worked on the development of a matching engine for a cryptocurrency exchange. That was an interesting and challenging experience. I developed it in pure C++ from scratch. Testing it is also quite a challenging task: you need to get data for testing, perform the tests, collect statistics, and finally analyze the collected data to find weak points and bottlenecks. I want to focus on testing the C++ matching engine and show how testing can give insights for optimization even without changing the code. The matching engine I developed can do more than 1,000,000 TPS (transactions per second) and is 10 times faster than the matching engine of the Binance cryptocurrency exchange (see one post on Binance Blog).

NIX_Solutions May 6 2021 at 13:43

Benefits of Hybrid Data Lake: How to combine Data Warehouse with Data Lake

4 min

2.5K

NIX corporate blog, Data Mining*, Data Engineering*

Hey, hey! I am Ilya Kalchenko, a Data Engineer at NIX, a fan of big and small data processing, and Python. In this article, I want to discuss the benefits of hybrid data lakes for efficient and secure data organization.

To begin with, let’s sort out the concepts of the Data Warehouse and the Data Lake. Let’s delve into the use cases and delimit their areas of responsibility.

FizpokPak Feb 1 2021 at 10:51

Coins classifier Neural Network: Head or Tail?

14 min

1.6K

Big Data*, Data Engineering*, Data Mining*, Python*, TensorFlow*

Home of this article: https://robotics.snowcron.com/coins/02_head_or_tail.htm

The global objective of these articles is to build a coin classifier capable of scanning your pocket change and finding rare or valuable coins. This is the second article in the series, so let me remind you what happened earlier (https://habr.com/ru/post/538958/).

During the previous step we got a rather large dataset composed of pairs of images, loaded from the online coin site meshok.ru. Those images were uploaded to the Internet by people we do not know, and though they are supposed to contain a coin's head in one image and its tail in the other, we cannot rule out a situation where we have two heads and no tail, or vice versa. Also, at the moment we have no idea which image contains the head and which contains the tail: this might be important when we feed data to our final classifier.

So let's write a program to distinguish heads from tails. It is a rather simple task, involving a convolutional neural network that uses transfer learning.

As before, we are going to use the Google Colab environment, taking advantage of the free video card it gives us access to. We will store the data on Google Drive, so the first thing we need is to allow Colab to access the Drive:
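The article's own code is not included in this excerpt. A minimal sketch of the usual way to do this is shown below, using the `google.colab.drive.mount` call that is available inside Colab notebooks. The wrapper function `mount_drive` and the mount point `/content/drive` are assumptions made for the example; outside Colab the function simply returns `None`.

```python
def mount_drive(mount_point="/content/drive"):
    """Mount Google Drive in a Colab notebook.

    Returns the mount point on success, or None when the code is not
    running inside Google Colab (the google.colab package is missing).
    """
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return None
    drive.mount(mount_point)  # opens the authorization prompt in Colab
    return mount_point


if __name__ == "__main__":
    path = mount_drive()
    print("Drive mounted at" if path else "Not running in Colab", path or "")
```

After mounting, files stored on the Drive are typically reachable under the mount point as ordinary paths.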

ipolynkina Jan 22 2021 at 09:17

How PVS-Studio Checked ELKI in January

9 min

760

PVS-Studio corporate blog, Open source*, Java*, Data Mining*

If you feel like the New Year just came, and you missed the first half of January, then all this time you've been busy looking for tricky bugs in the code you maintain. It also means that our article is what you need. PVS-Studio has checked the ELKI open source project to show you errors that may occur in the code, how cunningly they can hide there, and how you can deal with them.

S0mbre Sep 21 2020 at 06:34

Crime, Race and Lethal Force in the USA — Part 3

24 min

1.7K

Big Data*, Data Mining*, Open source*, Python*

Translation

This is the concluding part of my article devoted to a statistical analysis of police shootings and criminality among the white and the black population of the United States. In the first part, we talked about the research background, goals, assumptions, and source data; in the second part, we investigated the national use-of-force and crime data and tracked their connection with race.

S0mbre Sep 18 2020 at 02:00

Crime, Race and Lethal Force in the USA — Part 2

14 min

2.3K

Big Data*, Data Mining*, Open source*, Python*

Translation

In the previous part of this article, I talked about the research background, goals, assumptions, source data, and the tools used. Today, without further ado, let's say together…

S0mbre Sep 18 2020 at 02:00

Crime, Race and Lethal Force in the USA — Part 1

8 min

2.6K

Python*, Open source*, Data Mining*

Translation

Do the police in the US really shoot black people more often than white people? Is the use of lethal force connected with race? How is crime related to race? What are the odds of getting shot by the police if you are white and if you are black? We're taking public data and Python with pandas to shed some light on these questions, with propaganda and politics set far aside.

Octoparsehola Jul 7 2020 at 14:23

10 Best Email Scraping Tools for Sales Prospecting in 2020

3 min

2.2K

Data Mining*

From sandbox

We all know how hard it is to build an email sales list from scratch, especially for small companies. Limited resources leave few options. In fact, many companies even buy preset profiled lists from third parties and send identical mass emails. This can put your business in a vulnerable position because of the low quality of such email lists. However, there is a better way to build a highly targeted email list: email scraping tools.

Email scraping helps you collect publicly shown email addresses using a bot. What makes this great is that you have control over where the email lists come from and who can opt in. Moreover, you don’t have to rely on a second-hand source. I have compiled a list of the 10 best email scraping tools for sales prospecting. Let’s take a look.

1. Zoominfo

A full-featured email scraping platform with a comprehensive database. You can search for titles and companies directly within the platform. It is more like a directory system that covers professionals in all industries, with their contact information. Email lists are the assets. That said, it comes with a price tag. It is worth the investment if you are looking for accurate sales leads. Zoominfo is an excellent option for enterprise-level sales prospecting.

veesot Jul 3 2020 at 12:33

How to find an English teacher. Part 1

5 min

1.6K

Data Mining*, Data visualization*, Natural Language Processing*, Programming*, Python*

In the modern world, ideas about using data science for extra benefit keep arising here and there. For instance, Google can use the history of watched videos to recommend new ones. Online shops use recommendation systems to increase the size of your purchase. However… if companies use data for their benefit, could we do the same for our own needs, such as looking for an online English teacher?

Disclaimer

This approach is based on my own experience and may not suit your point of view, ideas, or principles.

mal_mal_shay May 10 2020 at 20:25

Approach to calculating individual risk in COVID-19

3 min

1.2K

C*, Data Mining*, Python*

In February 2020, when the disease came to Europe, it became apparent to me that our timid hopes that the epidemic would subside and finally be buried in China's soil were ruined. It was already evident from the Chinese statistics that the virus was lethal enough to scare and mild enough to pass unnoticed in many cases, which guaranteed its effective spread. The question was when it would reach each successive country.

Another question was the individual risk, especially the risk of a lethal outcome if one contracts the virus. An average figure of around 5% was circulating in late January and early February. It was known that males were more susceptible to fatal outcomes. By February, it was also evident that the virus doesn't lead to death only in the elderly: the middle-aged were significantly affected as well.

codezombie Apr 15 2020 at 14:55

COVID YAAA! or Yet Another Analyze Attempt

11 min

1.3K

Data Mining*, R*, Data visualization*, Health, Machine learning*

Hello, Habr!

About a month ago, I had a feeling of constant anxiety. I began to eat poorly, sleep even worse, and constantly read a ton of news about the pandemic. According to that news, the coronavirus had either captured or liberated our planet, was either a conspiracy of world governments or the vengeance of the pangolin, and threatened either everyone at once or personally me and my sleeping cat…

Hundreds of articles, social media posts, and youtube-telegram-instagram-tik-tok (yes, I sin) content of varying quality led me to nothing but an even greater sense of anxiety.

But one day I ~~bought buckwheat~~ decided to end it all. As soon as possible!

kentavr009 Apr 12 2020 at 15:49

«Build it & Break it»: How some algorithms generate captcha, while others crack it

12 min

3.9K

Data Mining*

From sandbox

Hello, Habr! Let me present a translation of the article "«Ломай меня полностью!» Как одни алгоритмы генерируют капчу, а другие её взламывают" by miroslavmirm.

It doesn't matter what kind of intelligence you have, be it artificial or natural: after this detailed analysis, no captcha will be an obstacle. At the end of the article, you can find the simplest and most effective workaround.

CAPTCHA is a completely automated public Turing test to tell computers and humans apart. It works by automatically setting specific tasks that are difficult for computers but simple for humans. This technology has become the security standard used to prevent automatic voting, registration, spam, brute-force attacks on websites, etc.

ilmarin77 Mar 1 2020 at 07:28

Using Data Science for house hunting in Montreal

7 min

4.8K

DIY, Data Mining*, R*

Introduction

I happen to live in Montreal, in my condo on the edge of the McGill Ghetto, close to Saint Laurent Boulevard, or the Main as locals call it, with all its attractions: bars, restaurants, night clubs, drunken students. And once upon a time, on a particularly lively night, listening to the sounds of McGill frosh students drunkenly heading home after a hard night of studying, I thought that it might be a good idea to move into my own house, a little bit further away from the action.

empenoso Feb 7 2020 at 13:53

Free API Moscow Stock Exchange (MOEX) in Google Sheets

2 min

10K

API*, Data Mining*, Google API*, Algorithms*, Finance in IT

Last year the number of private investors on the Moscow Stock Exchange (MOEX) doubled and reached 3.86 million: about 1.9 million people opened accounts at MOEX in 2019. The Saint Petersburg Stock Exchange, which specializes in trading shares of foreign companies, saw its number of accounts more than triple over the past year, from 910,000 to 3.06 million.

This means that almost 2 million newbies without any actual trading experience and lacking any specialized software for trading/position analysis have entered the market.