
Big Data

Everything about big data


Contextual Emotion Detection in Textual Conversations Using Neural Networks

Reading time: 10 min
Views: 3.9K

Nowadays, talking to conversational agents is becoming a daily routine, so it is crucial for dialogue systems to generate responses that are as human-like as possible. One of the main aspects deserving primary attention is providing emotionally aware responses to users. In this article, we describe the recurrent neural network architecture for emotion detection in textual conversations that participated in SemEval-2019 Task 3 "EmoContext", part of the annual workshop on semantic evaluation. The task objective is to classify emotion (happy, sad, angry, and others) in a 3-turn conversational data set.
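For a concrete picture, here is a minimal sketch in PyTorch of the kind of bidirectional LSTM classifier such a task calls for. The architecture, layer sizes, and names are our illustration, not the exact EmoContext model.

```python
# A toy bidirectional LSTM classifier for 4 emotion labels:
# happy, sad, angry, others. Sizes are illustrative.
import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, n_classes=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, n_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len), the three turns concatenated
        embedded = self.embedding(token_ids)
        _, (hidden, _) = self.lstm(embedded)
        # concatenate the final forward and backward hidden states
        final = torch.cat([hidden[-2], hidden[-1]], dim=1)
        return self.fc(final)

model = EmotionClassifier(vocab_size=10_000)
logits = model(torch.randint(1, 10_000, (8, 50)))  # a dummy batch
print(logits.shape)  # torch.Size([8, 4])
```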
Read more →

Google News and Leo Tolstoy: visualizing Word2Vec word embeddings using t-SNE

Reading time: 7 min
Views: 14K

Everyone perceives texts in their own way, whether they read news on the Internet or world-famous classic novels. This also applies to a variety of algorithms and machine learning techniques, which understand texts in a more mathematical way, namely through a high-dimensional vector space.

This article is devoted to visualizing high-dimensional Word2Vec word embeddings using t-SNE. The visualization can be useful to understand how Word2Vec works and how to interpret relations between vectors captured from your texts before using them in neural networks or other machine learning algorithms. As training data, we will use articles from Google News and classical literary works by Leo Tolstoy, the Russian writer who is regarded as one of the greatest authors of all time.

We go through a brief overview of the t-SNE algorithm, then move on to calculating word embeddings with Word2Vec, and finally proceed to visualizing the word vectors with t-SNE in 2D and 3D space. We will write our scripts in Python using Jupyter Notebook.
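To give a taste of the pipeline, here is a minimal sketch on a toy corpus, using the gensim 4.x and scikit-learn APIs; all parameters are illustrative.

```python
# Train Word2Vec with gensim, project to 2D with t-SNE, plot with matplotlib.
import matplotlib.pyplot as plt
from gensim.models import Word2Vec
from sklearn.manifold import TSNE

sentences = [["anna", "karenina", "is", "a", "novel", "by", "tolstoy"],
             ["tolstoy", "wrote", "war", "and", "peace"]]  # toy corpus

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
words = list(model.wv.index_to_key)
vectors = model.wv[words]

# perplexity must be smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=5, random_state=42)
coords = tsne.fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.show()
```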

Read more →

Analyzing the Code of ROOT, Scientific Data Analysis Framework

Reading time: 14 min
Views: 2.5K
While Stockholm was hosting the 118th Nobel Week, I was sitting in our office, where we develop the PVS-Studio static analyzer, working on an analysis review of the ROOT project, a big-data processing framework used in scientific research. This code wouldn't win a prize, of course, but the authors can definitely count on a detailed review of the most interesting defects, plus a free license to thoroughly check the project on their own.

Introduction



ROOT is a modular scientific software toolkit. It provides all the functionalities needed to deal with big data processing, statistical analysis, visualisation and storage. It is mainly written in C++. ROOT was born at CERN, at the heart of the research on high-energy physics. Every day, thousands of physicists use ROOT applications to analyze their data or to perform simulations.
Read more →

How to speed up LZ4 decompression in ClickHouse?

Reading time: 23 min
Views: 17K
When you run queries in ClickHouse, you might notice that the profiler often shows the LZ_decompress_fast function near the top. What is going on? This question had us wondering how to choose the best compression algorithm.

ClickHouse stores data in compressed form. When running queries, ClickHouse tries to do as little as possible, in order to conserve CPU resources. In many cases, all the potentially time-consuming computations are already well optimized, plus the user wrote a well thought-out query. Then all that's left to do is to perform decompression.



So why does LZ4 decompression become a bottleneck? LZ4 seems like an extremely light algorithm: the decompression rate is usually 1 to 3 GB/s per processor core, depending on the data. This is much faster than the typical disk subsystem. Moreover, we use all available CPU cores, and decompression scales linearly across all physical cores.
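If you want a rough feel for these rates on your own hardware, a quick single-core benchmark with the Python lz4 bindings (not ClickHouse's hand-tuned implementation, so expect different numbers) might look like this:

```python
# Measure LZ4 decompression throughput on a synthetic payload.
import time
import lz4.frame

data = b"some moderately repetitive payload " * 1_000_000  # ~35 MB
compressed = lz4.frame.compress(data)

start = time.perf_counter()
for _ in range(10):
    lz4.frame.decompress(compressed)
elapsed = time.perf_counter() - start

gb = 10 * len(data) / 1e9
print(f"decompressed {gb:.2f} GB in {elapsed:.2f} s -> {gb / elapsed:.2f} GB/s")
```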
Read more →

PVS-Studio Visits Apache Hive

Reading time: 12 min
Views: 1.2K

For the past ten years, the open-source movement has been one of the key drivers of the IT industry's development and a crucial component of it. The role of open-source projects is becoming more and more prominent not only in quantity but also in quality, which changes the very concept of how they are positioned in the IT market in general. Our courageous PVS-Studio team is not sitting idle and takes an active part in strengthening the presence of open-source software by finding hidden bugs in the enormous depths of codebases and offering free license options to the authors of such projects. This article is just another piece of that activity! Today we are going to talk about Apache Hive. I've got the report, and there are things worth looking at.
Read more →

Modern Google-level STT Models Released

Reading time: 2 min
Views: 5.4K


We are proud to announce that we have built from the ground up and released our high-quality speech-to-text models (i.e. on par with premium Google models) for the following languages:


  • English;
  • German;
  • Spanish;

You can find all of our models in our repository, together with examples and quality and performance benchmarks. We have also invested some time into making our models as accessible as possible: you can try our examples as well as the PyTorch, ONNX, and TensorFlow checkpoints. You can also load our model via TorchHub.
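Assuming these are the Silero models, loading one via TorchHub might look like the sketch below. The repository name, entrypoint, and helper names follow the silero-models README; check the repository for the current signature.

```python
# Load the English STT model through TorchHub and transcribe one file.
import torch

device = torch.device("cpu")
model, decoder, utils = torch.hub.load(
    "snakers4/silero-models", "silero_stt", language="en", device=device)
read_batch, split_into_batches, read_audio, prepare_model_input = utils

batches = split_into_batches(["speech_sample.wav"], batch_size=1)  # your file
inputs = prepare_model_input(read_batch(batches[0]), device=device)
for output in model(inputs):
    print(decoder(output.cpu()))  # the recognized text
```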


The original post includes a table linking, for each release (English en_v1, German de_v1, Spanish es_v1), the PyTorch, ONNX, and TensorFlow checkpoints, quality benchmarks, and a Colab notebook.
Read more →

Distributed File Systems

Reading time: 9 min
Views: 10K

The Big Data Tools plugin seamlessly integrates HDFS into your IDE and provides access to different cloud storage systems (AWS S3, Minio, Linode, DigitalOcean Spaces, GS, Azure). But is this the end? Have we implemented everything, and has progress now stopped? Of course not.


In this short digest, we'll take a look at 15 popular distributed file systems available on the market and try to get a sense of their individual advantages.


Almost all of these systems are free or open-source, and you can find the sources on GitHub. The sites of these projects, their documentation, and online reviews provide most of the information we’ll consider here. Other than HDFS, none of these technologies have been implemented yet in Big Data Tools. But who knows? Perhaps someday we'll see them in our plugin.


Read more →

Big Data Tools Update 11 Is Out

Reading time: 3 min
Views: 1.8K

EAP 11 of the Big Data Tools plugin for IntelliJ IDEA Ultimate, PyCharm, and DataGrip is available starting today. You can install it from the JetBrains Plugin Repository or inside your IDE.


Big Data Tools is a new JetBrains plugin that allows you to connect to Hadoop and Spark clusters and monitor nodes, applications, and jobs. It also brings support for editing and running Zeppelin notebooks inside IntelliJ IDEA and DataGrip, so you can create, edit, and run Zeppelin notebooks without ever having to leave your favorite IDE. The plugin offers smart navigation, code completion, inspections, quick-fixes, and refactoring inside notebooks.


Read more →

Introducing One Ring — an open-source pipeline for all your Spark applications

Reading time: 23 min
Views: 1.6K

If you use Apache Spark, you probably have a few applications that consume data from external sources and produce intermediate results, which are then consumed by other applications further down the processing chain, and so on until you get a final result.


We suspect so because we have a similar pipeline, with lots of processes like this one:


A process flowchart with more than 50 applications and about 70 datasets


Each rectangle is a Spark application with its own set of execution parameters, and each arrow is an equally parametrized dataset (externally stored datasets are highlighted with color; note the number of intermediate ones). This example is not the most complex of our processes; it's a fairly simple one. And we don't assemble such workflows manually, we generate them from Process Templates (outlined as groups on this flowchart).
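To make the pattern concrete, here is a generic PySpark sketch of two parametrized steps connected by an intermediate dataset. This is not One Ring's actual API; the paths, columns, and parameters are hypothetical.

```python
# Two "rectangles" (Spark apps) connected by an intermediate "arrow" (dataset).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("chained-steps").getOrCreate()

def filter_step(in_path, out_path, min_value):
    # first app: keep only rows above a parametrized threshold
    df = spark.read.parquet(in_path)
    df.where(F.col("value") >= min_value) \
      .write.mode("overwrite").parquet(out_path)

def aggregate_step(in_path, out_path):
    # second app: consume the intermediate dataset and aggregate it
    df = spark.read.parquet(in_path)
    df.groupBy("key").agg(F.sum("value").alias("total")) \
      .write.mode("overwrite").parquet(out_path)

filter_step("s3://bucket/raw", "s3://bucket/intermediate", min_value=10)
aggregate_step("s3://bucket/intermediate", "s3://bucket/final")
```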


So here comes One Ring, a Spark pipelining framework with very robust configuration abilities, which makes it easier to compose and execute even the most complex Process as a single large Spark job.


And we've just made it open source. Perhaps you're interested in the details? We've got you covered!

Five Methods For Database Obfuscation

Reading time: 20 min
Views: 7.6K
ClickHouse users already know that its biggest advantage is its high-speed processing of analytical queries. But claims like this need to be confirmed with reliable performance testing. That's what we want to talk about today.



We started running tests in 2013, long before the product was available as open source. Back then, just like now, our main concern was data processing speed in Yandex.Metrica. We had been storing that data in ClickHouse since January of 2009. Part of the data had been written to a database starting in 2012, and part was converted from OLAPServer and Metrage (data structures previously used by Yandex.Metrica). For testing, we took a random subset of the data covering 1 billion pageviews. Yandex.Metrica didn't have any queries at that point, so we came up with queries that interested us, using all the possible ways to filter, aggregate, and sort the data.
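For a flavor of that kind of query, here is a hypothetical example run through the clickhouse-driver Python package; the table and column names are illustrative, not the actual Yandex.Metrica schema.

```python
# Run a filter + aggregate + sort query against a local ClickHouse server.
from clickhouse_driver import Client

client = Client("localhost")
rows = client.execute("""
    SELECT CounterID, count() AS hits, uniq(UserID) AS visitors
    FROM hits
    WHERE EventDate >= '2013-07-01'
    GROUP BY CounterID
    ORDER BY hits DESC
    LIMIT 10
""")
print(rows[:3])
```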

ClickHouse performance was compared with similar systems like Vertica and MonetDB. To avoid bias, testing was performed by an employee who hadn't participated in ClickHouse development, and special cases in the code were not optimized until all the results were obtained. We used the same approach to get a data set for functional testing.

After ClickHouse was released as open source in 2016, people began questioning these tests.

Read more →

How we designed the user interface for an enterprise analytical system

Reading time: 5 min
Views: 1K

In 2021, we were contacted by an industrial plant that needed a system for analyzing its production processes. The enterprise team had studied ready-made solutions, but none of the available analytics systems fully covered the required functionality. So they asked us to develop a custom analytical system that would collect data from all machines and make it possible to analyze that data and spot bottlenecks in production. For this project, we created a data-driven UI/UX design and developed a web-based interface for the equipment monitoring system.

Read more

Millions of orders per second matching engine testing

Reading time: 4 min
Views: 10K

Some time ago I developed a matching engine for a cryptocurrency exchange. That was an interesting and challenging experience. I developed it from scratch in pure C++. Testing it is also quite a challenging task: you need to get data for testing, perform the testing, collect statistics, and finally analyze the collected data to find weak points and bottlenecks. I want to focus on testing the C++ matching engine and show how testing can yield insights for optimization even without changing the code. The matching engine I developed can do more than 1,000,000 TPS (transactions per second) and is 10x faster than the matching engine of the Binance cryptocurrency exchange (see a post on the Binance Blog).
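To illustrate the measurement side only, here is a toy Python micro-benchmark in that spirit. The trivial "matcher" below is a stand-in for the real C++ engine, so the absolute numbers are meaningless; the point is the TPS calculation.

```python
# Generate random orders, run them through a toy matcher, report TPS.
import random
import time

def make_orders(n):
    return [(random.choice(("buy", "sell")), random.randint(90, 110))
            for _ in range(n)]

def match(orders):
    bids, asks, trades = [], [], 0
    for side, price in orders:
        book, opposite = (bids, asks) if side == "buy" else (asks, bids)
        if opposite and ((side == "buy" and opposite[-1] <= price) or
                         (side == "sell" and opposite[-1] >= price)):
            opposite.pop()   # naive "fill" against the last resting order
            trades += 1
        else:
            book.append(price)
    return trades

orders = make_orders(1_000_000)
start = time.perf_counter()
trades = match(orders)
elapsed = time.perf_counter() - start
print(f"{len(orders) / elapsed:,.0f} TPS, {trades} trades")
```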

Read more

Playing with Nvidia's New Ampere GPUs and Trying MIG

Reading time: 11 min
Views: 4.4K


Every time the perennial question arises of whether or not to upgrade the cards in the server room, I look through similar articles and watch videos on the topic.


One channel with such a video is very underrated, but its author does not deal with ML. In general, when analyzing comparisons of accelerators for ML, several things usually catch your eye:


  • The authors usually take into account only the "adequacy" of new cards for the US market;
  • The ratings are detached from real-world use and are made on very standard networks (which is probably good overall) without much detail;
  • The popular mantra of training more and more gigantic models skews the comparison;

The answer to the question "which card is better?" is not rocket science: cards of the 20* series didn't gain much popularity, while 1080 Ti cards from Avito (the Russian Craigslist) are still very attractive (and, oddly enough, aren't getting cheaper, probably for that very reason).


All this is fine and dandy, and the standard benchmarks are unlikely to lie too much. But recently I learned about the existence of Multi-Instance GPU (MIG) technology for A100 video cards and native TF32 support on Ampere devices, and I got the idea to share my experience of actually testing cards on the Ampere architecture (the 3090 and the A100). In this short note, I will try to answer the following questions (a small TF32 timing sketch follows the list):


  • Is the upgrade to Ampere worth it? (spoiler for the impatient: yes);
  • Is the A100 worth the money? (spoiler: in general, no);
  • Are there any cases when the A100 is still interesting? (spoiler: yes);
  • Is MIG technology useful? (spoiler: yes, but for inference and for very specific training cases);
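As a hint of what the TF32 part of such testing looks like, here is a minimal PyTorch sketch; the allow_tf32 switch has existed since PyTorch 1.7 and only changes anything on Ampere-generation GPUs.

```python
# Time a large matmul with TF32 disabled and enabled (requires a CUDA GPU).
import time
import torch

a = torch.randn(8192, 8192, device="cuda")
b = torch.randn(8192, 8192, device="cuda")

def bench(label):
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(10):
        a @ b
    torch.cuda.synchronize()
    ms = (time.perf_counter() - start) / 10 * 1e3
    print(f"{label}: {ms:.1f} ms per matmul")

torch.backends.cuda.matmul.allow_tf32 = False  # plain FP32
bench("fp32")
torch.backends.cuda.matmul.allow_tf32 = True   # TF32 on tensor cores
bench("tf32")
```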
Read more →

How we created IoT system for managing solar energy usage

Reading time: 5 min
Views: 1.3K

If you have no idea about the architecture and the mechanical/electrical design behind IoT solutions, they can seem to have almost supernatural qualities or powers. For example, if you showed a working IoT system to people from the 18th century, they'd think it was magic. This article is meant to bust that myth, or, to put it more practically, to share hints for fine-tuning IoT development, drawn from an awesome project in the solar energy management area.

Read more →

Message broker selection cheat sheet: Kafka vs RabbitMQ vs Amazon SQS

Level of difficulty: Medium
Reading time: 6 min
Views: 12K

This is part of a series of articles dedicated to making the optimal choice between different systems on a real project or at an architectural interview.

At work or at a System Design interview, you often have to choose the best message broker. I plunged into this question and will tell you what to pick and why: what each system does best, what the advantages and disadvantages of these systems are, and which one to choose in each case, illustrated with several examples.

Read more

Apache Hadoop Code Quality: Production VS Test

Reading time: 11 min
Views: 702


To get high-quality production code, it's not enough just to ensure maximum test coverage. Without a doubt, great results require the main project code and the tests to work together efficiently. Therefore, tests deserve as much attention as the main code. A decent test is a key success factor, as it will catch regressions in production. Let's take a look at PVS-Studio static analyzer warnings to see why errors in tests are no less important than the ones in production. Today's focus: Apache Hadoop.
Read more →

What's new in ML.NET and Model Builder

Reading time: 2 min
Views: 1K
We are excited to announce updates to Model Builder and improvements in ML.NET. You can learn more in the "What's new in ML.NET?" session at .NET Conf.

ML.NET is an open-source and cross-platform machine learning framework (Windows, Linux, macOS) for .NET developers.

ML.NET offers Model Builder (a simple UI tool) and CLI to make it super easy to build custom ML Models using AutoML.

Using ML.NET, developers can leverage their existing tools and skill sets to develop and infuse custom AI into their applications by creating custom machine learning models for common scenarios like sentiment analysis, recommendation, image classification, and more!

Read more →

How to write the home address right?

Reading time: 16 min
Views: 1.4K

How the Tax Service, OpenStreetMap, and InterSystems IRIS could help developers get clean addresses


Pieter Brueghel the Younger, Paying the Tax (The Tax Collector), 1640

In my previous article, we just skimmed the surface of objects. Let's continue our reconnaissance. Today's topic is a tough one. It's not quite BIG DATA, but it's still data that is not easy to work with: we're talking about fairly large amounts of it. It won't all fit into RAM at once, and some of it won't even fit on the drive (not for lack of space, but because there's a lot of junk). The name of our subject is the FIAS DB: the Federal Information Address System database, the database of addresses in Russia. The archive is 5.5 GB, and it's a compressed XML file. After extraction, it will be a full 53 GB (set aside 110 GB for extraction). And when you start to parse and convert it, that 110 GB won't be enough. There won't be enough RAM either.
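A file that size essentially forces a streaming parse. Here is a minimal sketch using the standard library's iterparse; the file and tag names are hypothetical stand-ins for the FIAS dump.

```python
# Stream a multi-gigabyte XML file without loading it into RAM.
import xml.etree.ElementTree as ET

count = 0
for event, elem in ET.iterparse("AS_ADDROBJ.XML", events=("end",)):
    if elem.tag == "Object":
        count += 1   # process one address record here
    elem.clear()     # drop the element so memory stays bounded
print(count, "records")
```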
Read more →
