Big Data *

Everything about big data

ArticlesPostsNewsAuthors

kentavr009 Jul 29 2024 at 06:41

Data labeling – training on cats

Easy

8 min

608

Data Mining*Big Data*

Tutorial

Translation

At some point while diving deeper into automation processes you are faced with the need for data labeling, although just a couple of weeks ago, the phrases data labeling and you were standing at a party called "Earnings on the Internet" in different rooms. Or it would be better to say that you were standing by the pool, and the data labeling was on the third floor, smoking on the balcony with experts in the field of machine learning. How did we meet? Probably, someone pushed it off the balcony into the pool, and I helped it out, soaking my clothes along the way.

vldmrvslv Jun 29 2022 at 14:24

Detecting attempts of mass influencing via social networks using NLP. Part 2

3 min

1.2K

Python*Natural Language Processing*Twitter API*Data Mining*Big Data*

Tutorial

In Part 1 of this article, I built and compared two classifiers to detect trolls on Twitter. You can check it out here.

Now, time has come to look more deeply into the datasets to find some patterns using exploratory data analysis and topic modelling.

EDA

To do just that, I first created a word cloud of the most common words, which you can see below.

vldmrvslv Jun 29 2022 at 14:20

Detecting attempts of mass influencing via social networks using NLP. Part 1

5 min

1.6K

Twitter API*Natural Language Processing*Data Mining*Python*Big Data*

Tutorial

During the last decades, the world’s population has been developing as an information society, which means that information started to play a substantial end-to-end role in all life aspects and processes. In view of the growing demand for a free flow of information, social networks have become a force to be reckoned with. The ways of war-waging have also changed: instead of conventional weapons, governments now use political warfare, including fake news, a type of propaganda aimed at deliberate disinformation or hoaxes. And the lack of content control mechanisms makes it easy to spread any information as long as people believe in it.

Based on this premise, I’ve decided to experiment with different NLP approaches and build a classifier that could be used to detect either bots or fake content generated by trolls on Twitter in order to influence people.

In this first part of the article, I will cover the data collection process, preprocessing, feature extraction, classification itself and the evaluation of the models’ performance. In Part 2, I will dive deeper into the troll problem, conduct exploratory analysis to find patterns in the trolls’ behaviour and define the topics that seemed of great interest to them back in 2016.

Features for analysis

From all possible data to use (like hashtags, account language, tweet text, URLs, external links or references, tweet date and time), I settled upon English tweet text, Russian tweet text and hashtags. Tweet text is the main feature for analysis because it contains almost all essential characteristics that are typical for trolling activities in general, such as abuse, rudeness, external resources references, provocations and bullying. Hashtags were chosen as another source of textual information as they represent the central message of a tweet in one or two words.

Tott Apr 21 2021 at 10:37

You are standing at a red light at an empty intersection. How to make traffic lights smarter?

14 min

2.2K

Python*IT Infrastructure*Big Data*

Types of smart traffic lights: adaptive and neural networks

Adaptive works at relatively simple intersections, where the rules and possibilities for switching phases are quite obvious. Adaptive management is only applicable where there is no constant loading in all directions, otherwise it simply has nothing to adapt to – there are no free time windows. The first adaptive control intersections appeared in the United States in the early 70s of the last century. Unfortunately, they have reached Russia only now, their number according to some estimates does not exceed 3,000 in the country.

Neural networks – a higher level of traffic regulation. They take into account a lot of factors at once, which are not even always obvious. Their result is based on self-learning: the computer receives live data on the bandwidth and selects the maximum value by all possible algorithms, so that in total, as many vehicles as possible pass from all sides in a comfortable mode per unit of time. How this is done, usually programmers answer – we do not know, the neural network is a black box, but we will reveal the basic principles to you…

Adaptive traffic lights use, at least, leading companies in Russia, rather outdated technology for counting vehicles at intersections: physical sensors or video background detector. A capacitive sensor or an induction loop only sees the vehicle at the installation site-for a few meters, unless of course you spend millions on laying them along the entire length of the roadway. The video background detector shows only the filling of the roadway with vehicles relative to this roadway. The camera should clearly see this area, which is quite difficult at a long distance due to the perspective and is highly susceptible to atmospheric interference: even a light snowstorm will be diagnosed as the presence of traffic – the background video detector does not distinguish the type of detection.

olegchir Oct 6 2020 at 11:42

ZTools for Apache Zeppelin

8 min

1.4K

JetBrains corporate blogBig Data*Java*Scala*

Zeppelin is a web-based notebook for data engineers that enables data-driven, interactive data analytics with Spark, Scala, and more.

The project recently reached version 0.9.0-preview2 and is being actively developed, but there are still many things to be implemented.

One such thing is an API for getting comprehensive information about what's going on inside the notebook. There is already an API that completely solves the problems of high-level notebook management, but it doesn’t help if you want to do anything more complex.

S0mbre Sep 21 2020 at 06:34

Crime, Race and Lethal Force in the USA — Part 3

24 min

1.7K

Big Data*Data Mining*Open source*Python*

Translation

This is the concluding part of my article devoted to a statistical analysis of police shootings and criminality among the white and the black population of the United States. In the first part, we talked about the research background, goals, assumptions, and source data; in the second part, we investigated the national use-of-force and crime data and tracked their connection with race.

S0mbre Sep 18 2020 at 02:00

Crime, Race and Lethal Force in the USA — Part 2

14 min

2.3K

Big Data*Data Mining*Open source*Python*

Translation

In the previous part of this article, I talked about the research background, goals, assumptions, source data, and used tools. Today, without further ado, let's say together…

Farwit Aug 6 2020 at 06:14

The Project «Fabula»: How to find the desired video-fragment or person in a pile of video files?

2 min

1.7K

Спецлаб corporate blogBig Data*IPTV*Video equipmentDisplay advertising*

If a person is far over 20, then he has already accumulated a huge film library of his life, as well as videos from friends, relatives, and from his place of work… It is no longer possible to find someone or something specific there. Recently, I was preparing a video compilation for my daughter's anniversary – I spent a week. The media is all the more overloaded with video archives. And every day, millions of terabytes of video content appear in the world. And this is in the era of BIG DATA.

ShifaMartin May 5 2020 at 11:54

Reach Out Top Hadoop Consulting Companies To Leverage Big Data In 2020

7 min

1.2K

Big Data*Hadoop*

Hadoop is divided into different modules, each of which delivers a distinct task crucial for a computer system and is uniquely designed for big data analytics. Apache Software Foundation developed this incredible platform. It is extensively utilized by worldwide developers to build big data Hadoop solutions amazingly and easily.

Big data offers several perks, some of them are; examining root causes of failures, recognizing the potential of data-driven marketing, improving and enhancing customer engagement, and much more. By offering multiple solutions in a single stream it helps in lowering the cost of the organization.

In various industries such as Retail, Manufacturing, Financial insurance, Education, Transportation, Agriculture, Healthcare, Energy, etc big data is utilized and that’s why it’s demand is expanding day by day. The Global Hadoop Market is envisioned to grow to $84.6 billion by 2021, with an expected CAGR of 63.4%.

Mgrin Apr 11 2019 at 21:48

How to generate a huge financial graph with money laundering patterns?

4 min

Big Data*Python*Abnormal programming*Open data*

Couple of years ago my team (compliance in one of Swiss banks) and I had an interesting task to implement — we had to generate a huge random graph of financial transactions between clients, companies and ATMs. Moreover, we wanted this graph to contain some money-laundering and other financial crime patterns alongside with nodes description such as names, addresses, currencies etc. Obviously, all data should be randomly generated from scratch as long as we could not use any real data for obvious reasons.

As a solution we wrote a generator that I’d love to share with you. This article explains why we needed it and how this generator is working, but if you don’t want to read and want to try it on your own here is the code: https://github.com/MGrin/transactions-graph-generator. I hope that our experience will be helpful to any of you.

ArcaneGamingcom Dec 5 2024 at 15:45

How to Choose the Optimal Authentication Solution for Your Application

Medium

3 min

1.6K

API*Asterisk*Big Data*Data Engineering*Email-marketing*

Retrospective

In today's digital world, where applications process increasing amounts of sensitive data, ensuring reliable user authentication is critical. Authentication is the process of verifying the identity of a user who is trying to access a system. A properly chosen authentication method protects data from unauthorized access, prevents fraud, and increases user confidence.

However, with the development of technology, new authentication methods are emerging, and choosing the optimal solution can be difficult. This article will help developers and business owners understand the variety of authentication approaches and make informed choices.

profleaddev May 20 2024 at 15:21

New ChatGPT-4o: A Game-Changer That Could Replace Data Analysts, Demo Included

Easy

3 min

2.2K

Artificial IntelligenceData visualization*Big Data*

Opinion

In this article, I’m going to discuss something really important. If you’re a data analyst or you want to learn data analysis, please watch this video till the end because it’s really important.

profleaddev Feb 28 2024 at 15:06

Master Data Analysis with ChatGPT — How to Analyze Anything (Beginners Guide)

Easy

3 min

2.5K

Big Data*Data visualization*Artificial Intelligence

Tutorial

Today we’re diving into an exciting feature within ChatGPT that has the potential to enhance your productivity by 10, 20, 30, or even 40%. If you’re keen on learning how to leverage this feature to your advantage, make sure to read this article until the end. This feature stands out because it allows you to analyze almost anything by uploading your data and posing various questions to ChatGPT. Whether it's business data, your resume, or any other information you wish to explore, ChatGPT is here to deliver answers based on your specific dataset.

snakers4 Oct 6 2021 at 14:20

We have published a model for text repunctuation and recapitalization for four languages

7 min

7.3K

Machine learning*Python*Natural Language Processing*Big Data*

Working with speech recognition models we often encounter misconceptions among potential customers and users (mostly related to the fact that people have a hard time distinguishing substance over form). People also tend to believe that punctuation marks and spaces are somehow obviously present in spoken speech, when in fact real spoken speech and written speech are entirely different beasts.

Of course you can just start each sentence with a capital letter and put a full stop at the end. But it is preferable to have some relatively simple and universal solution for "restoring" punctuation marks and capital letters in sentences that our speech recognition system generates. And it would be really nice if such a system worked with any texts in general.

For this reason, we would like to share a system that:

Inserts capital letters and basic punctuation marks (dot, comma, hyphen, question mark, exclamation mark, dash for Russian);
Works for 4 languages (Russian, English, German, Spanish) and can be extended;
By design is domain agnostic and is not based on any hard-coded rules;
Has non-trivial metrics and succeeds in the task of improving text readability;

To reiterate — the purpose of such a system is only to improve the readability of the text. It does not add information to the text that did not originally exist.

m31 Jun 10 2021 at 09:48

DataScience Digest — 10.06.21

5 min

1.3K

Machine learning*Artificial IntelligenceBig Data*Algorithms*Python*

The new issue of DataScienceDigest is here!

Machine learning in healthcare, the top 10 TED talks on AI, fraud detection in Uber, DatasetGAN, Text-to-Image generation via transformers, and more…

Andrey2008 Jan 16 2020 at 12:11

Machine Learning in Static Analysis of Program Source Code

27 min

PVS-Studio corporate blogBig Data*Artificial IntelligenceMachine learning*Programming*

Machine Learning in Static Analysis of Program Source Code

Machine learning has firmly entrenched in a variety of human fields, from speech recognition to medical diagnosing. The popularity of this approach is so great that people try to use it wherever they can. Some attempts to replace classical approaches with neural networks turn up unsuccessful. This time we'll consider machine learning in terms of creating effective static code analyzers for finding bugs and potential vulnerabilities.

Falcon_eye Jul 24 2024 at 21:15

How to set up Apache Airflow for 10 minutes via Docker

Medium

2 min

Data Engineering*Python*Big Data*

Tutorial

Prerequisites:
1. Install Docker
2. Install VSCode

STEP BY STEP

1. Open VSCode that you previously installed and click on "Extensions" tab right on the menu bar, then type 'docker' to find proper extension and click "install":

Ninil Apr 1 2024 at 19:10

User-defined aggregation functions in Spark

Medium

6 min

Data Engineering*Big Data*Scala*

Below, we will discuss user-defined aggregation functions (UDAF) using org.apache.spark.sql.expressions.Aggregator, which can be used for aggregating groups of elements in a DataSet into a single value in any user-defined way.

Let’s start by examining an example from the official documentation that implements a simple aggregation

N-Cube Feb 15 2023 at 13:35

PyGMTSAR is Next Generation Interferometric Synthetic Aperture Radar (InSAR) Software for Everyone

6 min

3.1K

Do you need to produce satellite interferometry results for your work or study? Or should you find the way to process terabytes of radar data on your common laptop? Maybe you aren't confident about the installation and usage of the required software. Fortunately, there is the next generation of satellite interferometry products available for you. Beginners can build the results easily and advanced users might work on huge datasets. Open Source software PyGMTSAR is available on GitHub for developers and on DockerHub for advanced users and on Google Colab for everyone. This is the cloud-ready product, and it works the same as do you run it locally on your old laptop as on powerful cloud servers.

m31 Jan 26 2023 at 17:43

Data Phoenix Digest — ISSUE 2.2023

2 min

1.1K

Python*Big Data*Machine learning*DevOps*Artificial Intelligence

Digest

Video recording of our webinar about dstack and reproducible ML workflows, AVL binary tree operations, Ultralytics YOLOv8, training XGBoost, productionize ML models, introduction to forecasting ensembles, domain expansion of image generators, Muse, X-Decoder, Box2Mask, RoDynRF, AgileAvatar and more.