Data Engineering *

discuss data collection and preparation

64,23

Rating

ArticlesPostsNewsAuthors

gfx_pro Oct 9 2023 at 15:55

A (more) accurate camera sensor dynamic range measurement

7 min

2.5K

Photographic equipmentData Engineering *

Analytics

Translation

Hello, everyone! In this post, let's talk about how to (more) accurately measure the dynamic range of a camera sensor and what can be done with these measurements.

Of course, I am not an expert in computer vision, a programmer or a statistician, so please feel free to correct me in the comments if I make mistakes in this post. Here my interest was primarily focused on everyday and practical tasks, such as photography, but I believe the results may also be useful to computer vision professionals.

kirill702b Aug 3 2023 at 12:02

How to access real-time smart contract data from Python code (using Lido contract as an example)

Medium

7 min

3.5K

Decentralized networks * Python * Solidity * CryptocurrenciesData Engineering *

Tutorial

Let’s imagine you need access to the real-time data of some smart contracts on Ethereum (or Polygon, BSC, etc.) like Uniswap or even PEPE coin to analyze its data using the standard data scientist/analyst tools: Python, Pandas, Matplotlib, etc. In this tutorial, I’ll show you more sophisticated data access tools that are more like a surgical scalpel (The Graph subgraphs) than a well-known Swiss knife (RPC node access) or hammer (ready-to-use APIs). I hope my metaphors don’t scare you ?.

Z1at Jun 13 2023 at 17:51

Mathematical meaning of principal component analysis (PCA)

Medium

7 min

3.5K

Big Data * Data Engineering *

This article aims at explaining the mathematical sense of the Principal Component Analysis (PCA) in practice.

Z1at Jun 2 2023 at 18:13

Pixel image rotation

Easy

13 min

2.4K

C * Data Engineering *

Brief problem formulation

The program accepts as input the absolute path to the image in the bmp extension and the path where you save the result of the work. Then, it rotates the image by 90 degrees counterclockwise. Afterwards, the program saves the new image.

The program is executed on C.

Z1at May 31 2023 at 11:45

Blinking into Morse code

Easy

10 min

3.6K

Data Engineering * IT Infrastructure * Python *

From sandbox

Explaining main algorithm.

For a while I’ve been thinking of writing a scientific article. I wanted it to have certain utility.

Morse code is binary: it takes only two values – either dot (short) or hyphen (long). I figured out that short (s) can stand for two-eye blinking whilst long (l) can indicate left-eye blinking. Another question emerged: how to understand when does one-symbol recording stop?

Empty space between two symbols can be presented by right-eye blinking – r. If I input singly symbol of short (dot) and long (hyphen), I will blink my right eye once to indicate the space between two symbols.

To separate independent words, one has to blink her right eye twice and get rr.

Hence, I have collected an ordered set of symbols – r, l, s, - that can be converted into a full-fledged text. Once I accomplish the transformation, I get an answer.

AmiraB2 May 30 2023 at 07:53

Feature Engineering: Techniques and Best Practices for Data Scientists

8 min

Data Engineering * Big Data *

Tutorial

The most important stage in the data science process is feature engineering, which entails turning raw data into useful features that might enhance the performance of machine learning models. It calls for creativity, data-driven thinking, and domain expertise. Data scientists can improve the prediction capability of their models and find hidden patterns in the data by choosing, combining, and inventing relevant features. Handling missing data, scaling features, encoding categorical variables, constructing interaction terms, and other procedures are examples of feature engineering techniques. The best practises involve investigating the data, testing and improving features iteratively, and applying domain knowledge to draw out important information. The accuracy and effectiveness of machine learning models are significantly influenced by effective feature engineering.

Evrone Nov 16 2022 at 11:13

How we designed the user interface for an enterprise analytical system

5 min

1.9K

Singula Team corporate blogBig Data * CGI * Data Engineering *

In 2021, we were contacted by an industrial plant that was faced with the need to create a system for analyzing processes in its production. The enterprise team studied ready-made solutions, but none of the analytics system designs fully covered the required functionality. So they turned to us with a request to develop their own analytical system that would collect data from all machines and allow it to be analyzed to see bottlenecks in production. For this project, we created a data-driven UI/UX design and also developed a web-based interface for the equipment monitoring system.

IgKend Oct 21 2022 at 18:42

How Yandex Made Their Biggest Improvement in the Search Engine with the Help of Toloka

5 min

3.4K

Search engines * Data Mining * Machine learning * Artificial IntelligenceData Engineering *

Tutorial

Toloka is a crowdsourcing platform and microtasking project launched by Yandex to quickly markup large amounts of data. But how can such a simple concept play a crucial role in improving the work of neural networks?

Learn how

alexandervolchek Jul 11 2022 at 10:02

What are neural networks and what do we need them for?

4 min

6.3K

Mathematics * Machine learning * Data Engineering *

Explaining through simple examples

For a long time, people have been thinking on how to create a computer that could think like a person. The advent of artificial neural networks is a significant step in this direction. Our brain consists of neurons that receive information from sensory organs and process it: we recognize people we know by their faces, and we feel hungry when we see delicious food. All of this is the result of brain neurons working and interacting with each other. This is also the principle that artificial neural networks are based on, simulating the processes occurring in the human brain.

What are neural networks

Artificial neural networks are a software code that imitates the work of a brain and is capable of self-learning. Like a biological network, an artificial network also consists of neurons, but they have a simpler structure.

If you connect neurons into a sufficiently large network with controlled interaction, they will be able to perform quite complex tasks. For example, determining what is shown in a picture, or independently creating a photorealistic image based on a text description.

IvanSGlazunov Apr 1 2022 at 18:12

Math introduction to Deep Theory

Hard

4 min

Deep.Foundation corporate blogSQL * Mathematics * Data Engineering *

In this article, we would like to compare the core mathematical bases of the two most popular theories and associative theory.

Calculating deep

AlexZus Oct 1 2021 at 16:27

Millions of orders per second matching engine testing

4 min

13K

Big Data * C++ * Data Engineering * Data Mining *

From sandbox

I had some experience in the matching engine development for cryptocurrency exchange some time ago. That was an interesting and challenging experience. I developed it in clear C++ from scratch. The testing of it is also quite a challenging task. You need to get data for testing, perform testing, collect some statistics, and at last, analyze collected data to find weak points and bottlenecks. I want to focus on testing the C++ matching engine and show how testing can give insights for optimizations even without the need to change the code. The matching engine I developed can do more than 1’000’000 TPS (transactions per second) and is 10x times faster than the matching engine of the Binance cryptocurrency exchange (see one post on Binance Blog).

olegchir Sep 28 2021 at 05:46

Big Data Tools with IntelliJ IDEA Ultimate, PyCharm Professional, DataGrip 2021.3 EAP, and DataSpell Support

1 min

3.3K

JetBrains corporate blogProgramming * Big Data * Data Engineering *

Recently we released a new build of the Big Data Tools plugin that is compatible with the 2021.3 versions of IntelliJ IDEA and PyCharm. DataGrip 2021.3 support will be available immediately after the release in October. The plugin also supports our new data science IDE – JetBrains DataSpell. If you still use previous versions, now is the perfect time to upgrade both your IDE and the plugin.

This year, we introduced a number of new features as well as some features that have been there for a while, for example, running Spark Submit with a run configuration.

Here’s a list of the key improvements:

AnjaHolosova May 25 2021 at 10:39

One of the ways to dynamically deserialize a part of a JSON document with an unknown structure

7 min

17K

.NET * C# * Data Engineering *

Tutorial

In this topic, I will tell you how to dynamically parse and deserialize only part of the whole JSON document. We will create an implementation for .NET Core with C# as a language.

For example, we have the next JSON as a data source for the report. Notice that we will get this JSON in the runtime and at the compile step we don't know the structure of this document. And what if you need to select only several fields for processing?

Read this amazing post

NIX_Solutions May 6 2021 at 13:43

Benefits of Hybrid Data Lake: How to combine Data Warehouse with Data Lake

4 min

3.4K

NIX corporate blogData Mining * Data Engineering *

Hey, hey! I am Ilya Kalchenko, a Data Engineer at NIX, a fan of big and small data processing, and Python. In this article, I want to discuss the benefits of hybrid data lakes for efficient and secure data organization.

To begin with, I invite you to figure out the concepts of Data Warehouses and Data Lake. Let’s delve into the use cases and delimit areas of responsibility.

FizpokPak Feb 1 2021 at 10:51

Coins classifier Neural Network: Head or Tail?

14 min

2.2K

Big Data * Data Engineering * Data Mining * Python * TensorFlow *

Home of this article: https://robotics.snowcron.com/coins/02_head_or_tail.htm

The global objective of these articles is to build a coin classifier, capable of scanning your pocket change and find rare / valuable coins. This is a second article in a series, so let me remind you what happened earlier (https://habr.com/ru/post/538958/).

During previous step we got a rather large dataset composed of pairs of images, loaded from an online coins site meshok.ru. Those images were uploaded to the Internet by people we do not know, and though they are supposed to contain coin's head in one image and tail in the other, we can not rule out a situation when we have two heads and no tail and vice versa. Also at the moment we have no idea which image contains head and which contains tail: this might be important when we feed data to our final classifier.

So let's write a program to distinguish heads from tails. It is a rather simple task, involving a convolutional neural network that is using transfer learning.

Same way as before, we are going to use Google Colab environment, taking the advantage of a free video card they grant us an access to. We will store data on a Google Drive, so first thing we need is to allow Colab to access the Drive:

FizpokPak Jan 24 2021 at 18:59

Coins Classification using Neural Networks

19 min

3.6K

Python * Data Engineering * Big Data *

Tutorial

See more at robotics.snowcron.comThis is the first article in a serie dedicated to coins classification.Having countless "dogs vs cats" or "find a pedestrian on the street" classifiers all over the Internet, coins classification doesn't look like a difficult task. At first. Unfortunately, it is degree of magnitude harder - a formidable challenge indeed. You can easily tell heads of tails? Great. Can you figure out if the number is 1 mm shifted to the left? See, from classifier's view it is still the same head... while it can make a difference between a common coin priced according to the number on it and a rare one, 1000 times more expensive.Of course, we can do what we usually do in image classification: provide 10,000 sample images... No, wait, we can not. Some types of coins are rare indeed - you need to sort through a BASKET (10 liters) of coins to find one. Easy arithmetics suggests that to get 10000 images of DIFFERENT coins you will need 10,000 baskets of coins to start with. Well, and unlimited time.So it is not that easy.Anyway, we are going to begin with getting large number of images and work from there. We will use Russian coins as an example, as Russia had money reform in 1994 and so the number of coins one can expect to find in the pocket is limited. Unlike USA with its 200 years of monetary history. And yes, we are ONLY going to focus on current coins: the ultimate goal of our work is to write a program for smartphone to classify coins you have received in a grocery store as a change.Which makes things even worse, as we can not count on good lighting and quality cameras anymore. But we'll still try.In addition to "only Russian coins, beginning from 1994", we are going to add an extra limitation: no special occasion coins. Those coins look distinctive, so anyone can figure that this coin is special. We focus on REGULAR coins. Which limits their number severely.Don't take me wrong: if we need to apply the same approach to a full list of coins... it will work. But I got 15 GB of images for that limited set, can you imagine how large the complete set will be?!To get images, I am going to scan one of the largest Russian coins site "meshok.ru".This site allows buyers and sellers to find each other; sellers can upload images... just what we need. Unfortunately, a business-oriented seller can easily upload his 1 rouble image to 1, 2, 5, 10 roubles topics, just to increase the exposure.

So we can not count on the topic name, we have to determine what coin is on the photo ourselves.To scan the site, a simple scanner was written, based on the Python's Beautiful Soup library. In just few hours I got over 50,000 photos. Not a lot by Machine Learning standards, but definitely a start.After we got the images, we have to - unfortunately - revisit them by hand, looking for images we do not want in our training set, or for images that should be edited somehow. For example, someone could have uploaded a photo of his cat. We don't need a cat in our dataset.First, we delete all images, that can not be split to head/hail.

lukyanchikov Sep 20 2020 at 20:33

InterSystems IRIS – the All-Purpose Universal Platform for Real-Time AI/ML

22 min

1.6K

InterSystems corporate blogData Engineering * DevOps * Artificial IntelligenceMachine learning *

Author: Sergey Lukyanchikov, Sales Engineer at InterSystems

Challenges of real-time AI/ML computations

We will start from the examples that we faced as Data Science practice at InterSystems:

A “high-load” customer portal is integrated with an online recommendation system. The plan is to reconfigure promo campaigns at the level of the entire retail network (we will assume that instead of a “flat” promo campaign master there will be used a “segment-tactic” matrix). What will happen to the recommender mechanisms? What will happen to data feeds and updates into the recommender mechanisms (the volume of input data having increased 25000 times)? What will happen to recommendation rule generation setup (the need to reduce 1000 times the recommendation rule filtering threshold due to a thousandfold increase of the volume and “assortment” of the rules generated)?
An equipment health monitoring system uses “manual” data sample feeds. Now it is connected to a SCADA system that transmits thousands of process parameter readings each second. What will happen to the monitoring system (will it be able to handle equipment health monitoring on a second-by-second basis)? What will happen once the input data receives a new bloc of several hundreds of columns with data sensor readings recently implemented in the SCADA system (will it be necessary, and for how long, to shut down the monitoring system to integrate the new sensor data in the analysis)?
A complex of AI/ML mechanisms (recommendation, monitoring, forecasting) depend on each other’s results. How many man-hours will it take every month to adapt those AI/ML mechanisms’ functioning to changes in the input data? What is the overall “delay” in supporting business decision making by the AI/ML mechanisms (the refresh frequency of supporting information against the feed frequency of new input data)?

varunbhagat1 Aug 31 2020 at 07:22

Data Science vs AI: All You Need To Know

4 min

2.4K

.NET * Angular * Data Engineering * DevOps * Artificial Intelligence

What do these terms mean? And what is the difference?

Data Science and Artificial Intelligence are creating a lot of buzzes these days. But what do these terms mean? And what is the difference between them?

While the terms Data Science and Artificial Intelligence (AI) comes under the same domain and are inter-connected to each other, they have their specific applications and meaning.

There’s no slowing down the spread of AI and data science. Many big tech giants are extensively investing in these technologies. As per the recent survey, it is estimated that artificial intelligence could add $15.7 trillion to the global economy by 2030.

Through this piece of writing, I will be explaining about the AI and data science concepts and their differences in detail. So, without wasting any more time, let’s get started!

PastorGL Jan 31 2020 at 16:55

Introducing One Ring — an open-source pipeline for all your Spark applications

23 min

1.8K

Big Data * Data Engineering * Hadoop * Java * Open source *

If you utilize Apache Spark, you probably have a few applications that consume some data from external sources and produce some intermediate result, that is about to be consumed by some applications further down the processing chain, and so on until you get a final result.

We suspect that because we have a similar pipeline with lots of processes like this one:

A process flowchart with more than 50 applications and about 70 datasets
Click here for a bit larger version

Each rectangle is a Spark application with a set of their own execution parameters, and each arrow is an equally parametrized dataset (externally stored highlighted with a color; note the number of intermediate ones). This example is not the most complex of our processes, it’s fairly a simple one. And we don’t assemble such workflows manually, we generate them from Process Templates (outlined as groups on this flowchart).

So here comes the One Ring, a Spark pipelining framework with very robust configuration abilities, which makes it easier to compose and execute a most complex Process as a single large Spark job.

And we just made it open source. Perhaps, you’re interested in the details.

We got you covered!