Big Data *

Everything about big data

ArticlesPostsNewsAuthors

SvyatoslavMC Oct 22 2019 at 07:05

Analyzing the Code of ROOT, Scientific Data Analysis Framework

14 min

2.5K

PVS-Studio corporate blogBig Data*C*C++*Open source*

While Stockholm was holding the 118th Nobel Week, I was sitting in our office, where we develop the PVS-Studio static analyzer, working on an analysis review of the ROOT project, a big-data processing framework used in scientific research. This code wouldn't win a prize, of course, but the authors can definitely count on a detailed review of the most interesting defects plus a free license to thoroughly check the project on their own.

Introduction

ROOT is a modular scientific software toolkit. It provides all the functionalities needed to deal with big data processing, statistical analysis, visualisation and storage. It is mainly written in C++. ROOT was born at CERN, at the heart of the research on high-energy physics. Every day, thousands of physicists use ROOT applications to analyze their data or to perform simulations.

+22

stefanbuzz Aug 15 2019 at 07:06

PVS-Studio Visits Apache Hive

12 min

1.2K

PVS-Studio corporate blogInformation Security*Open source*Java*Big Data*

For the past ten years, the open-source movement has been one of the key drivers of the IT industry's development, and its crucial component. The role of open-source projects is becoming more and more prominent not only in terms of quantity but also in terms of quality, which changes the very concept of how they are positioned on the IT market in general. Our courageous PVS-Studio team is not sitting idly and is taking an active part in strengthening the presence of open-source software by finding hidden bugs in the enormous depths of codebases and offering free license options to the authors of such projects. This article is just another piece of that activity! Today we are going to talk about Apache Hive. I've got the report — and there are things worth looking at.

+17

sismetanin Aug 1 2019 at 10:35

Contextual Emotion Detection in Textual Conversations Using Neural Networks

10 min

3.9K

VK corporate blogBig Data*Data Mining*Python*Machine learning*

Nowadays, talking to conversational agents is becoming a daily routine, and it is crucial for dialogue systems to generate responses as human-like as possible. As one of the main aspects, primary attention should be given to providing emotionally aware responses to users. In this article, we are going to describe the recurrent neural network architecture for emotion detection in textual conversations, that participated in SemEval-2019 Task 3 “EmoContext”, that is, an annual workshop on semantic evaluation. The task objective is to classify emotion (i.e. happy, sad, angry, and others) in a 3-turn conversational data set.

+37

o6CuFl2Q Jun 25 2019 at 14:42

How to speed up LZ4 decompression in ClickHouse?

23 min

17K

Яндекс corporate blogBig Data*C++*Open source*High performance*

When you run queries in ClickHouse, you might notice that the profiler often shows the LZ_decompress_fast function near the top. What is going on? This question had us wondering how to choose the best compression algorithm.

ClickHouse stores data in compressed form. When running queries, ClickHouse tries to do as little as possible, in order to conserve CPU resources. In many cases, all the potentially time-consuming computations are already well optimized, plus the user wrote a well thought-out query. Then all that's left to do is to perform decompression.

So why does LZ4 decompression becomes a bottleneck? LZ4 seems like an extremely light algorithm: the data decompression rate is usually from 1 to 3 GB/s per processor core, depending on the data. This is much faster than the typical disk subsystem. Moreover, we use all available CPU cores, and decompression scales linearly across all physical cores.

+19

sismetanin Apr 30 2019 at 08:42

Google News and Leo Tolstoy: visualizing Word2Vec word embeddings using t-SNE

7 min

14K

VK corporate blogBig Data*Python*Data visualization*Machine learning*

Everyone uniquely perceives texts, regardless of whether this person reads news on the Internet or world-known classic novels. This also applies to a variety of algorithms and machine learning techniques, which understand texts in a more mathematical way, namely, using high-dimensional vector space.

This article is devoted to visualizing high-dimensional Word2Vec word embeddings using t-SNE. The visualization can be useful to understand how Word2Vec works and how to interpret relations between vectors captured from your texts before using them in neural networks or other machine learning algorithms. As training data, we will use articles from Google News and classical literary works by Leo Tolstoy, the Russian writer who is regarded as one of the greatest authors of all time.

We go through the brief overview of t-SNE algorithm, then move to word embeddings calculation using Word2Vec, and finally, proceed to word vectors visualization with t-SNE in 2D and 3D space. We will write our scripts in Python using Jupyter Notebook.

+28