
Toloka is a crowdsourcing platform and microtasking project launched by Yandex to quickly mark up large amounts of data. But how can such a simple concept play a crucial role in improving the performance of neural networks?
Deep data analysis
In Part 1 of this article, I built and compared two classifiers to detect trolls on Twitter. You can check it out here.
Now the time has come to look more deeply into the datasets and find patterns using exploratory data analysis and topic modelling.
EDA
To do just that, I first created a word cloud of the most common words, which you can see below.
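As a minimal sketch of how such a cloud can be generated with the wordcloud package (the file name, column name, and styling here are assumptions for illustration, not the exact code behind the figure):

```python
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# Hypothetical input: a CSV with one preprocessed tweet per row.
tweets = pd.read_csv("troll_tweets.csv")["text"].dropna()

cloud = WordCloud(
    width=1200,
    height=600,
    background_color="white",
    stopwords=STOPWORDS,  # drop common English filler words
).generate(" ".join(tweets))

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```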
Over the last few decades, the world's population has been developing into an information society, meaning that information has come to play a substantial, end-to-end role in all aspects of life and its processes. In view of the growing demand for a free flow of information, social networks have become a force to be reckoned with. The ways of waging war have also changed: instead of conventional weapons, governments now use political warfare, including fake news, a type of propaganda aimed at deliberate disinformation or hoaxes. The lack of content-control mechanisms makes it easy to spread any information as long as people believe it.
Based on this premise, I decided to experiment with different NLP approaches and build a classifier that could detect either bots or fake content generated by trolls on Twitter to influence people.
In this first part of the article, I will cover the data collection process, preprocessing, feature extraction, classification itself and the evaluation of the models’ performance. In Part 2, I will dive deeper into the troll problem, conduct exploratory analysis to find patterns in the trolls’ behaviour and define the topics that seemed of great interest to them back in 2016.
Features for analysis
Of all the data available (hashtags, account language, tweet text, URLs, external links or references, and tweet date and time), I settled on English tweet text, Russian tweet text, and hashtags. Tweet text is the main feature for analysis because it contains almost all the characteristics typical of trolling activity in general, such as abuse, rudeness, references to external resources, provocation, and bullying. Hashtags were chosen as another source of textual information, as they condense the central message of a tweet into one or two words.
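As a sketch, narrowing a raw tweet dump down to these features might look like this in pandas; the file name and column names (language, content, hashtags) are assumptions about the dataset layout, not the article's actual schema:

```python
import pandas as pd

raw = pd.read_csv("tweets_raw.csv")

# Keep only the text-bearing features chosen above.
features = raw[["language", "content", "hashtags"]]

# Split into English and Russian subsets for separate analysis.
english = features[features["language"] == "English"]
russian = features[features["language"] == "Russian"]
```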
Some time ago I worked on developing a matching engine for a cryptocurrency exchange. That was an interesting and challenging experience. I developed it from scratch in pure C++. Testing it is also quite a challenging task: you need to get data for testing, run the tests, collect statistics, and finally analyse the collected data to find weak points and bottlenecks. I want to focus on testing the C++ matching engine and show how testing can yield insights for optimization even without changing the code. The matching engine I developed can do more than 1,000,000 TPS (transactions per second) and is 10x faster than the matching engine of the Binance cryptocurrency exchange (see one post on the Binance Blog).
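As a toy illustration of the measurement itself (not the real benchmark, and in Python rather than the engine's C++), here is a sketch that replays a synthetic order stream and computes TPS; MatchingEngine and its submit method are placeholders for whatever interface the real engine exposes:

```python
import time

class MatchingEngine:
    """Placeholder for a binding to the real C++ engine."""
    def submit(self, order):
        pass  # match the order against the book

# Synthetic order stream: (symbol, side, price, quantity).
orders = [("BTC/USDT", "buy", 100 + i % 50, 1) for i in range(1_000_000)]

engine = MatchingEngine()
start = time.perf_counter()
for order in orders:
    engine.submit(order)
elapsed = time.perf_counter() - start

print(f"Throughput: {len(orders) / elapsed:,.0f} TPS")
```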
Hey, hey! I am Ilya Kalchenko, a Data Engineer at NIX, a fan of big and small data processing, and Python. In this article, I want to discuss the benefits of hybrid data lakes for efficient and secure data organization.
To begin with, I invite you to figure out the concepts of Data Warehouses and Data Lakes. Let's delve into the use cases and delimit their areas of responsibility.
Home of this article: https://robotics.snowcron.com/coins/02_head_or_tail.htm
The overall objective of this series is to build a coin classifier capable of scanning your pocket change and finding rare or valuable coins. This is the second article in the series, so let me remind you what happened earlier (https://habr.com/ru/post/538958/).
During the previous step we collected a rather large dataset composed of pairs of images loaded from meshok.ru, an online coin site. Those images were uploaded to the Internet by people we do not know, and though each pair is supposed to contain a coin's head in one image and its tail in the other, we cannot rule out a situation where we have two heads and no tail, or vice versa. Also, at the moment we have no idea which image contains the head and which the tail: this may be important when we feed the data to our final classifier.
So let's write a program to distinguish heads from tails. It is a rather simple task involving a convolutional neural network that uses transfer learning.
As before, we are going to use the Google Colab environment, taking advantage of the free video card they grant us access to. We will store data on Google Drive, so the first thing we need to do is allow Colab to access the Drive:
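```python
from google.colab import drive

# Mount Google Drive so the dataset is visible under /content/drive.
drive.mount('/content/drive')
```

And here is a minimal sketch of the transfer-learning classifier itself, using Keras with a pretrained MobileNetV2 base; the directory layout, image size, and training settings are assumptions for illustration, not the article's exact setup:

```python
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # keep the pretrained features frozen

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1. / 127.5, offset=-1),  # MobileNetV2 expects [-1, 1]
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # head vs. tail
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Hypothetical dataset: images sorted into heads/ and tails/ folders.
train = tf.keras.utils.image_dataset_from_directory(
    "/content/drive/MyDrive/coins", image_size=(224, 224),
    batch_size=32, label_mode="binary")
model.fit(train, epochs=5)
```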
If you feel like the New Year has only just arrived and you somehow missed the first half of January, then all this time you've been busy looking for tricky bugs in the code you maintain. It also means that our article is what you need. PVS-Studio has checked the ELKI open-source project to show you the errors that may occur in code, how cunningly they can hide there, and how you can deal with them.
In the modern world, ideas keep arising here and there about using data science for extra benefit. For instance, Google can use your history of watched videos to recommend new ones. Online shops use recommendation systems to increase your receipt. However, if companies use data for their benefit, could we do the same for our own needs, such as looking for an online English teacher?
This approach is based on my own experience and may not suit your point of view, ideas, or principles.
In February 2020, when the disease came to Europe, it became apparent to me that our timid hopes that the epidemic would subside and be finally buried in Chinese soil were ruined. It was already evident from the Chinese statistics that the virus was lethal enough to scare and mild enough to pass unnoticed in many cases and thus guarantee its effective dissemination. The question was when it would reach each next country.
Another question was that of individual risk, especially the risk of a lethal outcome if one contracts the virus. An average figure of around 5% was circulating by late January and early February. It was known that males were more susceptible to fatal outcomes. By February, it was also evident that the virus was deadly not only for the elderly: the middle-aged were significantly affected as well.
Hello, Habr!
About a month ago, I developed a feeling of constant anxiety. I began to eat poorly, sleep even worse, and constantly read a ton of news about the pandemic. According to it, the coronavirus had either captured or liberated our planet; it was either a conspiracy of world governments or the vengeance of the pangolin; the virus either threatened everyone at once or me personally, together with my sleeping cat…
Hundreds of articles, social media posts, and YouTube-Telegram-Instagram-TikTok content (yes, I sin) of varying quality did not lead me to anything but an even greater sense of anxiety.
But one day I bought some buckwheat and decided to end it all. As soon as possible!
I happen to live in Montreal, in a condo on the edge of the McGill Ghetto, close to Saint Laurent Boulevard, or the Main as locals call it, with all its attractions: bars, restaurants, night clubs, drunken students. Once, on a particularly lively night, listening to the sounds of McGill frosh students drunkenly heading home after a hard night of studying, I thought it might be a good idea to move into my own house, a little further away from the action.
Photo by Dugan Arnett on Boston Globe
Are you still looking for a new flat? Ready to make the last attempt? If so, follow me, and I will show you how to reach the finish line.
Have you thought about the influence of the nearest metro station on the price of your flat?
What about several kindergartens around your apartment? Are you ready to plunge into the world of geospatial data?
Have you ever looked for a flat? Would you like to add some machine learning and make the process more interesting?
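As a hedged sketch of the kind of geospatial feature these questions imply, here is a haversine-based "distance to the nearest metro station" computed in plain Python; the coordinates are made up for illustration:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# Hypothetical flat and metro station coordinates (lat, lon).
flat = (55.751, 37.618)
stations = [(55.757, 37.615), (55.745, 37.640), (55.760, 37.600)]

nearest_km = min(haversine_km(*flat, *s) for s in stations)
print(f"Nearest metro: {nearest_km:.2f} km")  # a feature for the price model
```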
Information