
Big Data

Everything about big data


Reach Out To Top Hadoop Consulting Companies To Leverage Big Data In 2020

Reading time: 7 min
Views: 1.1K

Hadoop is divided into modules, each of which performs a distinct task crucial for big data analytics. The platform was developed by the Apache Software Foundation and is extensively used by developers worldwide to build big data solutions quickly and easily.

Big data offers several benefits: examining the root causes of failures, unlocking the potential of data-driven marketing, improving customer engagement, and much more. By offering multiple solutions in a single stream, it helps lower an organization's costs.

Big data is used across industries such as retail, manufacturing, financial insurance, education, transportation, agriculture, healthcare, and energy, which is why demand for it is expanding day by day. The global Hadoop market is projected to grow to $84.6 billion by 2021, at an expected CAGR of 63.4%.
Read more →
Total votes 3: ↑3 and ↓0 (+3)
Comments: 2

Could Quantum Computing Help Reverse Climate Change?

Reading time: 4 min
Views: 991
The unique powers of quantum computation may give humanity an important weapon — or several weapons — against climate change, according to one quantum computer pioneer.
One possible way to deal with the excess carbon in the atmosphere, and to reach global climate goals, is to suck it out of the air. It sounds pretty easy, but in fact the technology to do so cheaply and easily isn't quite here yet, according to Jeremy O'Brien, Chief Executive Officer of PsiQuantum, a quantum computing startup.

Currently, there is no way to simulate large, complex molecules like carbon dioxide: classical computers cannot handle them because the problem grows exponentially with the size and complexity of the simulated molecule, according to O'Brien, who wrote an article outlining the issue for the World Economic Forum's recent annual meeting.

“Crudely speaking, if simulating a molecule with 10 atoms takes a minute, a molecule with 11 takes two minutes, one with 12 atoms takes four minutes and so on,” he writes. “This exponential scaling quickly renders a traditional computer useless: simulating a molecule with just 70 atoms would take longer than the lifetime of the universe (13 billion years).”
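
That arithmetic is easy to check. Here's a quick back-of-the-envelope sketch in Python, assuming the simple doubling model from the quote (a toy cost model, not an actual quantum chemistry estimate):

```python
# Toy doubling model from the quote: 1 minute for 10 atoms,
# and the time doubles with every additional atom.
def sim_minutes(atoms: int) -> int:
    return 2 ** (atoms - 10)

MINUTES_PER_YEAR = 60 * 24 * 365
for atoms in (10, 11, 12, 70):
    years = sim_minutes(atoms) / MINUTES_PER_YEAR
    print(f"{atoms} atoms: {sim_minutes(atoms)} min (~{years:.2e} years)")
# 70 atoms -> 2**60 minutes, roughly 2.2e12 years:
# far beyond the ~1.3e10-year age of the universe.
```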
Read more →
Total votes 5: ↑3 and ↓2 (+1)
Comments: 3

The World’s Top 12 Quantum Computing Research Universities

Reading time: 5 min
Views: 4K
In just a few years, quantum computing and quantum information theory have gone from a fringe subject, offered in small classes at odd hours in a corner of the physics building annex, to a full complement of classes in well-funded programs held at quantum centers and institutes at leading universities.

The question now for many would-be quantum computing students is not "Are there universities that even offer classes in quantum computing?" but rather "Which universities are leaders in quantum computing research?"

We’ll look at some of the best right now:

The Institute for Quantum Computing — University of Waterloo


The University of Waterloo can proudly declare that, while many universities avoided offering quantum computing classes like cat adoption agencies avoided applications from the Schrödinger family, this Canadian university went all in.

And it paid off.
Read more →
Rating: 0
Comments: 0

Introducing One Ring — an open-source pipeline for all your Spark applications

Reading time: 23 min
Views: 1.4K

If you use Apache Spark, you probably have a few applications that consume data from external sources and produce intermediate results, which are then consumed by other applications further down the processing chain, and so on, until you get a final result.


We suspect that is your case, because we have a similar pipeline with lots of processes like this one:


A process flowchart with more than 50 applications and about 70 datasets


Each rectangle is a Spark application with its own set of execution parameters, and each arrow is an equally parametrized dataset (externally stored ones are highlighted with color; note the number of intermediate ones). This example is not the most complex of our processes; it's a fairly simple one. And we don't assemble such workflows manually, we generate them from Process Templates (outlined as groups on this flowchart).


So here comes One Ring, a Spark pipelining framework with very robust configuration abilities, which makes it easier to compose and execute even the most complex Process as a single large Spark job.
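
To make the "single large Spark job" idea concrete, here's a minimal PySpark sketch of chaining steps inside one job so that intermediate data stays in Spark instead of being written out and read back. The paths and column names are hypothetical, and this is not One Ring's actual configuration format:

```python
# A minimal sketch of several pipeline steps composed into one Spark job,
# so intermediate datasets stay inside Spark instead of being stored
# externally. Paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

raw = spark.read.parquet("s3://bucket/input")      # external source
cleaned = raw.dropna(subset=["user_id"])           # step 1: filter out bad records
per_user = (cleaned.groupBy("user_id")             # step 2: aggregate
            .agg(F.count("*").alias("events")))
per_user.write.parquet("s3://bucket/output")       # the only externally stored result
```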


And we've just made it open source. Perhaps you're interested in the details.

We got you covered!
Total votes 9: ↑8 and ↓1 (+7)
Comments: 0

Five Methods For Database Obfuscation

Reading time: 20 min
Views: 7.2K
ClickHouse users already know that its biggest advantage is its high-speed processing of analytical queries. But claims like this need to be confirmed with reliable performance testing. That's what we want to talk about today.



We started running tests in 2013, long before the product was available as open source. Back then, just like now, our main concern was data processing speed in Yandex.Metrica. We had been storing that data in ClickHouse since January of 2009. Part of the data had been written to the database starting in 2012, and part was converted from OLAPServer and Metrage (data structures previously used by Yandex.Metrica). For testing, we took a random subset of the data covering 1 billion pageviews. Yandex.Metrica didn't have any queries at that point, so we came up with queries that interested us, using all the possible ways to filter, aggregate, and sort the data.

ClickHouse performance was compared with similar systems like Vertica and MonetDB. To avoid bias, testing was performed by an employee who hadn't participated in ClickHouse development, and special cases in the code were not optimized until all the results were obtained. We used the same approach to get a data set for functional testing.
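
As an illustration of those "filter, aggregate, and sort" style tests, a benchmark query might look something like this with the clickhouse-driver Python package (the table and column names are illustrative, not Yandex.Metrica's actual schema):

```python
# A hypothetical query in the spirit of those tests: filter, aggregate,
# and sort over a pageview-style table. Table and column names are
# illustrative, not Yandex.Metrica's actual schema.
from clickhouse_driver import Client

client = Client("localhost")
rows = client.execute("""
    SELECT URL, count() AS hits, uniq(UserID) AS users
    FROM hits
    WHERE EventDate >= '2013-07-01'
    GROUP BY URL
    ORDER BY hits DESC
    LIMIT 10
""")
print(rows[:3])
```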

After ClickHouse was released as open source in 2016, people began questioning these tests.

Read more →
Total votes 11: ↑9 and ↓2 (+7)
Comments: 4

Machine Learning in Static Analysis of Program Source Code

Reading time: 27 min
Views: 2.9K


Machine learning has become firmly entrenched in a variety of fields, from speech recognition to medical diagnosis. The popularity of this approach is so great that people try to use it wherever they can. Some attempts to replace classical approaches with neural networks turn out unsuccessful. This time we'll consider machine learning in the context of creating effective static code analyzers for finding bugs and potential vulnerabilities.
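
As a deliberately tiny illustration of the idea, one could treat code lines as text and train a classifier to flag "suspicious" ones. Real analyzers work with ASTs and data flow rather than raw strings, and the samples and labels below are hand-made for the example:

```python
# Toy illustration: treat code lines as text and train a classifier to
# flag "suspicious" ones. Real analyzers work on ASTs and data flow;
# the samples and labels below are hand-made for the example.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

lines = ["if (x = 0) return;",                # assignment instead of comparison
         "if (x == 0) return;",
         "memcpy(dst, src, sizeof(dst));",    # size of pointer, not buffer
         "memcpy(dst, src, len);"]
labels = [1, 0, 1, 0]                         # 1 = bug-like pattern

vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
clf = LogisticRegression().fit(vec.fit_transform(lines), labels)
print(clf.predict(vec.transform(["if (y = 1) return;"])))
```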
Read more →
Total votes 2: ↑2 and ↓0 (+2)
Comments: 0

How Ecommerce Is Fueled By the Pillars of AI Technology

Reading time: 4 min
Views: 817


Today, artificial intelligence is implemented across the corridors of business operations and shapes the way we shop and trade online. To hit a home run in the retail game, AI applications, PIM solutions, and e-commerce development tools now offer smart capabilities: predictive analysis, recommendation engines, inventory management, and warehouse automation, all aimed at creating a more profitable shopping experience for consumers.

Now more than ever, e-commerce is an AI innovation game


Artificial intelligence often seems complicated to newcomers, but in reality it is simple to use and gives you the ability to predict customer needs. This paves the way for e-commerce companies to become a "big brand" or "big business" with revolutionary AI tools.

Now that AI algorithms are paving the way for consumer acceptance of AI like never before, how can you use them to create more profitable outcomes in e-commerce?

Interesting E-commerce Stats:


With an estimated global population of 7.7 billion, about 25 percent of people shop through e-commerce stores. According to Statista, 52% of e-commerce stores will have omnichannel capabilities by 2020, which means they can communicate with and sell to their consumers via multiple channels: for example, an e-commerce website, a Facebook e-shop, an email account, and an Instagram account.

Here are examples of AI tools and PIM software that can help e-commerce businesses set a high bar for customer service and marketing:
Read more →
Total votes 1: ↑0 and ↓1 (-1)
Comments: 0

Apache Hadoop Code Quality: Production VS Test

Reading time: 11 min
Views: 646


In order to get high-quality production code, it's not enough just to ensure maximum test coverage. Without a doubt, great results require the main project code and its tests to work together efficiently. Therefore, tests have to be given as much attention as the main code. A decent test is a key success factor, as it will catch regressions in production. Let's take a look at PVS-Studio static analyzer warnings to see why errors in tests matter no less than the ones in production code. Today's focus: Apache Hadoop.
Read more →
Total votes 4: ↑4 and ↓0 (+4)
Comments: 0

Analyzing the Code of ROOT, Scientific Data Analysis Framework

Reading time: 14 min
Views: 2.4K
While Stockholm was holding the 118th Nobel Week, I was sitting in our office, where we develop the PVS-Studio static analyzer, working on an analysis review of the ROOT project, a big-data processing framework used in scientific research. This code wouldn't win a prize, of course, but the authors can definitely count on a detailed review of the most interesting defects plus a free license to thoroughly check the project on their own.

Introduction



ROOT is a modular scientific software toolkit. It provides all the functionalities needed to deal with big data processing, statistical analysis, visualisation and storage. It is mainly written in C++. ROOT was born at CERN, at the heart of the research on high-energy physics. Every day, thousands of physicists use ROOT applications to analyze their data or to perform simulations.
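
For a first taste of what that looks like in practice, here is a minimal example using ROOT's Python bindings (PyROOT); it assumes a local ROOT installation:

```python
# A minimal taste of ROOT through its Python bindings (PyROOT).
# Assumes a local ROOT installation (e.g. from conda-forge).
import ROOT

h = ROOT.TH1F("h", "Gaussian sample;x;entries", 100, -4, 4)
h.FillRandom("gaus", 100000)        # fill with 100k Gaussian-distributed values
print("mean =", h.GetMean(), "stddev =", h.GetStdDev())

c = ROOT.TCanvas("c")
h.Draw()
c.SaveAs("hist.png")                # save the plot to disk
```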
Read more →
Total votes 22: ↑22 and ↓0 (+22)
Comments: 4

What's new in ML.NET and Model Builder

Reading time: 2 min
Views: 977
We are excited to announce updates to Model Builder and improvements in ML.NET. You can learn more in the "What's new in ML.NET?" session at .NET Conf.

ML.NET is an open-source and cross-platform machine learning framework (Windows, Linux, macOS) for .NET developers.

ML.NET offers Model Builder (a simple UI tool) and CLI to make it super easy to build custom ML Models using AutoML.

Using ML.NET, developers can leverage their existing tools and skill sets to develop and infuse custom AI into their applications by creating custom machine learning models for common scenarios like sentiment analysis, recommendation, image classification, and more.

Read more →
Total votes 4: ↑4 and ↓0 (+4)
Comments: 0

How we created an IoT system for managing solar energy usage

Reading time: 5 min
Views: 1.3K

If you have no idea about the development architecture and the mechanical/electrical design behind IoT solutions, they can seem to have almost supernatural qualities or powers. For example, if you showed a working IoT system to people from the 18th century, they'd think it was magic. This article is a bit of myth-busting: or, to put it more technically, a set of hints for fine-tuning IoT development for an awesome project in the solar energy management area.

Read more →
Total votes 9: ↑7 and ↓2 (+5)
Comments: 0

PVS-Studio Visits Apache Hive

Reading time: 12 min
Views: 1.2K

For the past ten years, the open-source movement has been one of the key drivers of the IT industry's development and a crucial component of it. The role of open-source projects is becoming more and more prominent not only in terms of quantity but also in terms of quality, which changes the very concept of how they are positioned in the IT market in general. Our courageous PVS-Studio team is not sitting idly by and is taking an active part in strengthening the presence of open-source software by finding hidden bugs in the enormous depths of codebases and offering free license options to the authors of such projects. This article is just another piece of that activity! Today we are going to talk about Apache Hive. I've got the report — and there are things worth looking at.
Read more →
Total votes 23: ↑20 and ↓3 (+17)
Comments: 0

Contextual Emotion Detection in Textual Conversations Using Neural Networks

Reading time: 10 min
Views: 3.7K

Nowadays, talking to conversational agents is becoming a daily routine, and it is crucial for dialogue systems to generate responses that are as human-like as possible. One of the main aspects is providing emotionally aware responses to users. In this article, we describe the recurrent neural network architecture for emotion detection in textual conversations that we submitted to SemEval-2019 Task 3 "EmoContext" (SemEval is an annual workshop on semantic evaluation). The task objective is to classify emotion (i.e., happy, sad, angry, and others) in a 3-turn conversational data set.
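
For a sense of what such a model looks like, here is a compact recurrent baseline (an illustrative Keras sketch, not the architecture described in the article):

```python
# A compact recurrent baseline for 3-turn emotion classification.
# Illustrative sketch only: not the architecture from the article.
import tensorflow as tf
from tensorflow.keras import layers

NUM_WORDS, MAX_LEN, NUM_CLASSES = 20000, 60, 4  # happy, sad, angry, others

model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LEN,)),                 # the 3 turns, concatenated and padded
    layers.Embedding(NUM_WORDS, 128),                 # token embeddings
    layers.Bidirectional(layers.LSTM(64)),            # recurrent encoder
    layers.Dense(NUM_CLASSES, activation="softmax"),  # emotion probabilities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```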
Read more →
Total votes 37: ↑37 and ↓0 (+37)
Comments: 0

How to speed up LZ4 decompression in ClickHouse?

Reading time: 23 min
Views: 15K
When you run queries in ClickHouse, you might notice that the profiler often shows the LZ_decompress_fast function near the top. What is going on? This question had us wondering how to choose the best compression algorithm.

ClickHouse stores data in compressed form. When running queries, ClickHouse tries to do as little as possible, in order to conserve CPU resources. In many cases, all the potentially time-consuming computations are already well optimized, plus the user wrote a well thought-out query. Then all that's left to do is to perform decompression.



So why does LZ4 decompression become a bottleneck? LZ4 seems like an extremely lightweight algorithm: the decompression rate is usually 1 to 3 GB/s per processor core, depending on the data. This is much faster than the typical disk subsystem. Moreover, we use all available CPU cores, and decompression scales linearly across all physical cores.
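
You can get a feel for those numbers on your own machine with a quick micro-benchmark using the lz4 Python package (this measures the stock frame codec on synthetic data, not ClickHouse's internal implementation):

```python
# Micro-benchmark of generic LZ4 decompression with the `lz4` package.
# Measures the stock frame codec on synthetic compressible data,
# not ClickHouse's internal implementation.
import time
import lz4.frame

data = b"ClickHouse stores data in compressed form. " * 200_000  # ~8.6 MB
comp = lz4.frame.compress(data)

t0 = time.perf_counter()
for _ in range(20):
    lz4.frame.decompress(comp)
elapsed = time.perf_counter() - t0
print(f"ratio {len(comp) / len(data):.3f}, "
      f"decompression {20 * len(data) / elapsed / 1e9:.2f} GB/s")
```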
Read more →
Total votes 23: ↑21 and ↓2 (+19)
Comments: 0

Google News and Leo Tolstoy: visualizing Word2Vec word embeddings using t-SNE

Reading time: 7 min
Views: 13K

Everyone perceives texts in their own way, whether they read news on the Internet or world-famous classic novels. This also applies to the variety of algorithms and machine learning techniques that understand texts in a more mathematical way, namely through a high-dimensional vector space.

This article is devoted to visualizing high-dimensional Word2Vec word embeddings using t-SNE. The visualization can be useful to understand how Word2Vec works and how to interpret relations between vectors captured from your texts before using them in neural networks or other machine learning algorithms. As training data, we will use articles from Google News and classical literary works by Leo Tolstoy, the Russian writer who is regarded as one of the greatest authors of all time.

We'll go through a brief overview of the t-SNE algorithm, then move on to word embedding calculation using Word2Vec, and finally proceed to word vector visualization with t-SNE in 2D and 3D space. We will write our scripts in Python using Jupyter Notebook.
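
In condensed form, the core of that pipeline fits in a few lines (a toy two-sentence corpus stands in for the Google News and Tolstoy data):

```python
# The core of the pipeline in miniature: train Word2Vec, then project
# the vectors to 2D with t-SNE. A toy two-sentence corpus stands in
# for the Google News and Tolstoy data.
from gensim.models import Word2Vec
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

sentences = [["anna", "karenina", "is", "a", "novel", "by", "tolstoy"],
             ["war", "and", "peace", "is", "a", "novel", "by", "tolstoy"]]
model = Word2Vec(sentences, vector_size=50, min_count=1, seed=42)

words = list(model.wv.index_to_key)
xy = TSNE(n_components=2, perplexity=5, random_state=42).fit_transform(model.wv[words])

plt.scatter(xy[:, 0], xy[:, 1])
for word, (x, y) in zip(words, xy):
    plt.annotate(word, (x, y))
plt.show()
```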

Read more →
Total votes 28: ↑28 and ↓0 (+28)
Comments: 0

How to generate a huge financial graph with money laundering patterns?

Reading time: 4 min
Views: 2.9K

A couple of years ago, my team (compliance in one of the Swiss banks) and I had an interesting task to implement: we had to generate a huge random graph of financial transactions between clients, companies, and ATMs. Moreover, we wanted this graph to contain some money-laundering and other financial crime patterns, along with node descriptions such as names, addresses, currencies, etc. Obviously, all the data had to be randomly generated from scratch, since we could not use any real data for obvious reasons.

As a solution, we wrote a generator that I'd love to share with you. This article explains why we needed it and how the generator works, but if you don't want to read and would rather try it on your own, here is the code: https://github.com/MGrin/transactions-graph-generator. I hope that our experience will be helpful to some of you.
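
To sketch the idea, here is a toy version: a random transaction graph with one planted circular money flow. This is illustrative networkx code, not the generator from the repository above:

```python
# Toy version of the idea: a random transaction graph with one planted
# circular money flow. Illustrative networkx code, not the generator
# from the repository above.
import random
import networkx as nx

random.seed(7)
g = nx.DiGraph()
clients = [f"client_{i}" for i in range(100)]

for _ in range(500):                         # random background transactions
    a, b = random.sample(clients, 2)
    g.add_edge(a, b, amount=round(random.uniform(10, 5000), 2))

ring = random.sample(clients, 5)             # planted pattern: money travels
for a, b in zip(ring, ring[1:] + ring[:1]):  # in a cycle and comes back
    g.add_edge(a, b, amount=9900.0)          # just under a 10k reporting threshold

print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges")
```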
Read more →
Total votes 3: ↑3 and ↓0 (+3)
Comments: 0

How to write the home address right?

Reading time: 16 min
Views: 1.3K

How Tax Service, OpenStreetMap, and InterSystems IRIS could help developers get clean addresses


Pieter Brueghel the Younger, Paying the Tax (The Tax Collector), 1640

In my previous article, we just skimmed the surface of objects. Let's continue our reconnaissance. Today's topic is a tough one. It's not quite BIG DATA, but it's still data that isn't easy to work with: we're talking about fairly large amounts of it. It won't all fit into RAM at once, and some of it won't even fit on the drive (not due to lack of space, but because there's a lot of junk). The name of our subject is the FIAS DB: the Federal Information Address System database, the database of addresses in Russia. The archive is 5.5 GB, and it's a compressed XML file. After extraction, it will be a full 53 GB (set aside 110 GB for extraction). And when you start to parse and convert it, that 110 GB won't be enough. There won't be enough RAM either.
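
The standard way to survive XML of that size is to stream it rather than build a tree in memory. A minimal sketch with iterparse (the file name and record tag are illustrative):

```python
# The standard way to handle XML that won't fit in RAM: stream it with
# iterparse and discard each element after processing. The file name
# and record tag below are illustrative.
import xml.etree.ElementTree as ET

count = 0
for _, elem in ET.iterparse("AS_ADDROBJ.XML", events=("end",)):
    if elem.tag == "Object":       # hypothetical record tag
        count += 1
        # ... convert elem.attrib and load it into the database here ...
    elem.clear()                   # free memory as we go
print("records:", count)
```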
Read more →
Total votes 8: ↑6 and ↓2 (+4)
Comments: 0