Big Data *

Everything about big data

Machine Learning in Static Analysis of Program Source Code

Reading time 27 min
Views 2.7K
PVS-Studio corporate blog Programming *Big Data *Machine learning *Artificial Intelligence

Machine learning has become firmly entrenched in a variety of fields, from speech recognition to medical diagnosis. The approach is so popular that people try to use it wherever they can. Some attempts to replace classical approaches with neural networks turn out to be unsuccessful. This time we'll consider machine learning in terms of creating effective static code analyzers for finding bugs and potential vulnerabilities.

How Ecommerce Fueled By the Pillars of AI Technology

Reading time 4 min
Views 730
Development of mobile applications *Big Data *Product Management *Software Artificial Intelligence

At present, we see artificial intelligence being implemented across the corridors of business operations and in the way we shop and trade online. To hit a home run in the retail game, genius AI applications, PIM solutions, and e-commerce development tools are now offering smart solutions: predictive analysis, recommendation engines, inventory management, and warehouse automation to create a more profitable shopping experience for consumers.

Now more than ever, e-commerce is an AI innovation game

Artificial intelligence often seems complicated to newcomers, but in reality it is simple to use and gives you the ability to predict customer needs. This paves the way for e-commerce companies to become a “big brand” or “big business” with revolutionary AI tools.

Now that AI algorithms are making way for consumer acceptance of AI like never before, how can you use them to create more profitable outcomes in e-commerce?

Interesting E-commerce Stats:

With an estimated global population of 7.7 billion, 25 percent of people shop through e-commerce stores. According to Statista, 52% of e-commerce stores will have omnichannel capabilities by 2020, which means they can communicate with and sell to their consumers via multiple channels. For example, they can use their e-commerce website, Facebook e-shop, email account, and Instagram account.

Examples of AI tools and PIM software that can help e-commerce businesses set a high bar for customer service and marketing:

Apache Hadoop Code Quality: Production VS Test

Reading time 11 min
Views 590
PVS-Studio corporate blog Open source *Java *Big Data *Hadoop *

To get high-quality production code, it's not enough just to ensure maximum test coverage. No doubt, great results require the main project code and the tests to work together efficiently. Therefore, tests deserve as much attention as the main code. A decent test is a key success factor, as it will catch regressions in production code. Let's take a look at PVS-Studio static analyzer warnings to see why errors in tests are no less harmful than those in production. Today's focus: Apache Hadoop.

Analyzing the Code of ROOT, Scientific Data Analysis Framework

Reading time 14 min
Views 2.3K
PVS-Studio corporate blog Open source *C++ *C *Big Data *
While Stockholm was holding the 118th Nobel Week, I was sitting in our office, where we develop the PVS-Studio static analyzer, working on an analysis review of the ROOT project, a big-data processing framework used in scientific research. This code wouldn't win a prize, of course, but the authors can definitely count on a detailed review of the most interesting defects plus a free license to thoroughly check the project on their own.

ROOT is a modular scientific software toolkit. It provides all the functionalities needed to deal with big data processing, statistical analysis, visualisation and storage. It is mainly written in C++. ROOT was born at CERN, at the heart of the research on high-energy physics. Every day, thousands of physicists use ROOT applications to analyze their data or to perform simulations.

What's new in ML.NET and Model Builder

Reading time 2 min
Views 912
Microsoft corporate blog .NET *Big Data *Machine learning *Artificial Intelligence
We are excited to announce updates to Model Builder and improvements in ML.NET. You can learn more in the "What's new in ML.NET?" session at .NET Conf.

ML.NET is an open-source and cross-platform machine learning framework (Windows, Linux, macOS) for .NET developers.

ML.NET offers Model Builder (a simple UI tool) and CLI to make it super easy to build custom ML Models using AutoML.

Using ML.NET, developers can leverage their existing tools and skillsets to develop and infuse custom AI into their applications by creating custom machine learning models for common scenarios like Sentiment Analysis, Recommendation, Image Classification, and more.


How we created IoT system for managing solar energy usage

Reading time 5 min
Views 1.2K
System Analysis and Design *IT Infrastructure *Big Data *Smart House IOT

If you have no idea about the development architecture and mechanical/electrical design behind IoT solutions, they could seem like "having seemingly supernatural qualities or powers". For example, if you showed a working IoT system to 18th-century people, they'd think it was magic. This article is about busting that myth. Or, to put it more technically, it offers hints for fine-tuning IoT development for an awesome project in the solar energy management area.


PVS-Studio Visits Apache Hive

Reading time 12 min
Views 1.1K
PVS-Studio corporate blog Information Security *Open source *Java *Big Data *

For the past ten years, the open-source movement has been one of the key drivers of the IT industry's development, and its crucial component. The role of open-source projects is becoming more and more prominent not only in terms of quantity but also in terms of quality, which changes the very concept of how they are positioned on the IT market in general. Our courageous PVS-Studio team is not sitting idly and is taking an active part in strengthening the presence of open-source software by finding hidden bugs in the enormous depths of codebases and offering free license options to the authors of such projects. This article is just another piece of that activity! Today we are going to talk about Apache Hive. I've got the report — and there are things worth looking at.

Contextual Emotion Detection in Textual Conversations Using Neural Networks

Reading time 10 min
Views 3.1K
VK corporate blog Python *Data Mining *Big Data *Machine learning *

Nowadays, talking to conversational agents is becoming a daily routine, and it is crucial for dialogue systems to generate responses that are as human-like as possible. One of the main aspects is providing emotionally aware responses to users. In this article, we describe a recurrent neural network architecture for emotion detection in textual conversations that took part in SemEval-2019 Task 3 “EmoContext”. SemEval is an annual workshop on semantic evaluation. The task objective is to classify emotion (i.e. happy, sad, angry, and others) in a 3-turn conversational data set.

How to speed up LZ4 decompression in ClickHouse?

Reading time 23 min
Views 13K
Yandex corporate blog High performance *Open source *C++ *Big Data *
When you run queries in ClickHouse, you might notice that the profiler often shows the LZ4_decompress_fast function near the top. What is going on? This question had us wondering how to choose the best compression algorithm.

ClickHouse stores data in compressed form. When running queries, ClickHouse tries to do as little as possible, in order to conserve CPU resources. In many cases, all the potentially time-consuming computations are already well optimized, plus the user wrote a well thought-out query. Then all that's left to do is to perform decompression.

So why does LZ4 decompression become a bottleneck? LZ4 seems like an extremely light algorithm: the data decompression rate is usually from 1 to 3 GB/s per processor core, depending on the data. This is much faster than the typical disk subsystem. Moreover, we use all available CPU cores, and decompression scales linearly across all physical cores.
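
The comparison above can be checked with back-of-the-envelope arithmetic. The sketch below is only an illustration: the 16-core machine and the 0.5 GB/s disk rate are assumptions made up for the example, and only the 1 to 3 GB/s per-core range comes from the text.

```python
def aggregate_decompression_rate(per_core_gb_s: float, cores: int) -> float:
    """Total decompression rate, assuming linear scaling across physical cores."""
    return per_core_gb_s * cores


CORES = 16        # assumption: number of physical cores, not from the article
DISK_GB_S = 0.5   # assumption: read rate of a hypothetical disk subsystem

# Lower and upper bound of the per-core range quoted in the text.
for per_core in (1.0, 3.0):
    total = aggregate_decompression_rate(per_core, CORES)
    ratio = total / DISK_GB_S
    print(f"{per_core} GB/s per core on {CORES} cores: "
          f"{total} GB/s total, {ratio:.0f}x the assumed disk rate")
```

Under these assumptions decompression outpaces the disk by a wide margin, which is what makes its appearance at the top of the profiler surprising.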

Google News and Leo Tolstoy: visualizing Word2Vec word embeddings using t-SNE

Reading time 7 min
Views 12K
VK corporate blog Python *Big Data *Data visualization *Machine learning *

Everyone perceives texts in a unique way, whether reading news on the Internet or world-famous classic novels. The same applies to various algorithms and machine learning techniques, which understand texts in a more mathematical way, namely, using a high-dimensional vector space.

This article is devoted to visualizing high-dimensional Word2Vec word embeddings using t-SNE. The visualization can be useful to understand how Word2Vec works and how to interpret relations between vectors captured from your texts before using them in neural networks or other machine learning algorithms. As training data, we will use articles from Google News and classical literary works by Leo Tolstoy, the Russian writer who is regarded as one of the greatest authors of all time.

We start with a brief overview of the t-SNE algorithm, then move on to computing word embeddings with Word2Vec, and finally visualize the word vectors with t-SNE in 2D and 3D space. We will write our scripts in Python using Jupyter Notebook.


How to generate a huge financial graph with money laundering patterns?

Reading time 4 min
Views 2.7K
Abnormal programming *Python *Big Data *Open data *

A couple of years ago, my team (compliance in one of the Swiss banks) and I had an interesting task to implement: we had to generate a huge random graph of financial transactions between clients, companies and ATMs. Moreover, we wanted this graph to contain some money-laundering and other financial crime patterns, along with node descriptions such as names, addresses, currencies etc. All data had to be randomly generated from scratch, since we could not use any real data for obvious reasons.

As a solution, we wrote a generator that I’d love to share with you. This article explains why we needed it and how the generator works, but if you don’t want to read and would rather try it on your own, here is the code: https://github.com/MGrin/transactions-graph-generator. I hope that our experience will be helpful to some of you.
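
To give a feel for the idea, here is a minimal sketch of generating random transactions and hiding one suspicious pattern among them. It is not the generator from the linked repository: the `Transaction` class, the function names, the "CHF" currency, and the choice of a cycle as the laundering pattern are all assumptions made for this illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    source: str
    target: str
    amount: float
    currency: str


def random_transactions(nodes, count, rng):
    """Background noise: `count` random transfers between two distinct nodes."""
    txs = []
    for _ in range(count):
        source, target = rng.sample(nodes, 2)  # sample() never repeats a node
        amount = round(rng.uniform(10, 5000), 2)
        txs.append(Transaction(source, target, amount, "CHF"))
    return txs


def inject_cycle(nodes, length, amount, rng):
    """Hypothetical pattern: money travels around a ring and returns to its origin."""
    ring = rng.sample(nodes, length)
    return [
        Transaction(ring[i], ring[(i + 1) % length], amount, "CHF")
        for i in range(length)
    ]


rng = random.Random(42)  # fixed seed so the same graph is produced on every run
clients = [f"client_{i}" for i in range(100)]
graph = random_transactions(clients, 1000, rng) + inject_cycle(clients, 4, 9500.0, rng)
rng.shuffle(graph)  # mix the pattern in with the ordinary transfers
print(len(graph), "transactions")
```

Keeping the pattern injection separate from the background noise makes it easy to add other patterns later and to know exactly which edges a detection algorithm is expected to find.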

How to write the home address right?

Reading time 16 min
Views 1.2K
XML *NoSQL *OpenStreetMap *Big Data *

How Tax Service, OpenStreetMap, and InterSystems IRIS
could help developers get clean addresses

Pieter Brueghel the Younger, Paying the Tax (The Tax Collector), 1640

In my previous article, we just skimmed the surface of objects. Let's continue our reconnaissance. Today's topic is a tough one. It's not quite BIG DATA, but the data is still not easy to work with: we're talking about fairly large amounts of it. It won't all fit into RAM at once, and some of it won't even fit on the drive (not due to lack of space, but because there's a lot of junk). Our subject is FIAS DB, the Federal Information Address System database: the database of addresses in Russia. The archive is a 5.5 GB compressed XML file. After extraction, it will be a full 53 GB (set aside 110 GB for extraction). And when you start to parse and convert it, that 110 GB won't be enough. There won't be enough RAM either.
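
One common way to handle XML that does not fit into RAM is to stream it element by element instead of loading the whole tree. The sketch below uses Python's standard `xml.etree.ElementTree.iterparse` for that. It is a generic illustration, not the approach taken in the article, and the `AddressObjects` and `Object` tag names are made up for the example rather than taken from the real FIAS schema.

```python
import io
import xml.etree.ElementTree as ET


def count_records(stream, tag):
    """Count `tag` elements while streaming, without keeping the whole tree in memory."""
    context = ET.iterparse(stream, events=("start", "end"))
    _event, root = next(context)  # the first event is the start of the root element
    count = 0
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            count += 1
            root.clear()  # drop already processed children so memory stays bounded
    return count


# A tiny in-memory stand-in for a multi-gigabyte file (hypothetical tag names).
sample = io.BytesIO(
    b"<AddressObjects>"
    b'<Object NAME="Street A"/>'
    b'<Object NAME="Street B"/>'
    b'<Object NAME="Street C"/>'
    b"</AddressObjects>"
)
print(count_records(sample, "Object"))
```

For a real file, the same function can be given an open binary file object instead of the `io.BytesIO` sample.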