Data Engineering *

discuss data collection and preparation

Rating: 69.79


SergeyProkhorenko Jun 16 at 12:00

UUIDv7 and RFC 9562 FAQ

Medium

3 min

Distributed systems * High performance * Data Engineering * IT Standards * System Analysis and Design *

FAQ

I am a contributor to RFC 9562, and I wrote this FAQ because I kept seeing the same concerns about UUIDv7 surfacing across Hacker News, Reddit, and developer blogs. The discussion would always circle back to the same objections, most of which are superficial and fall apart when you look closer. UUIDv7 is too big. The timestamp leaks creation time. It creates hot shards. It’s slower than BIGINT. Sorting breaks if clocks drift. After encountering these objections repeatedly, I decided to collect the most debated questions in one place and answer them based on real implementation experience.
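Several of these objections (timestamp leakage, sort order) are easier to weigh with the bit layout in view. The following is a minimal sketch, not taken from the article: `uuid7_sketch` is a hypothetical helper that packs a 48-bit Unix millisecond timestamp, the version and variant bits, and 74 random bits in the field order RFC 9562 describes. It makes no attempt at monotonicity within a single millisecond, which real implementations usually address.

```python
import os
import time
import uuid


def uuid7_sketch(unix_ms=None):
    """Build a UUIDv7-style value using the RFC 9562 field layout (illustrative helper)."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")   # 80 random bits, 74 of them used
    rand_a = rand >> 68                            # top 12 bits
    rand_b = rand & ((1 << 62) - 1)                # low 62 bits

    value = (unix_ms & ((1 << 48) - 1)) << 80      # bits 127..80: timestamp in ms
    value |= 0x7 << 76                             # bits 79..76: version 7
    value |= rand_a << 64                          # bits 75..64: rand_a
    value |= 0b10 << 62                            # bits 63..62: variant
    value |= rand_b                                # bits 61..0: rand_b
    return uuid.UUID(int=value)


earlier = uuid7_sketch(unix_ms=1_000)
later = uuid7_sketch(unix_ms=2_000)
print(earlier, later)
print(earlier < later)        # ordering follows the timestamp prefix
print(earlier.int >> 80)      # the creation time is recoverable from the value
```

The last two lines show both sides of the debate: values created in different milliseconds sort by creation time, and for the same reason anyone holding the value can read that time back.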


lukyanchikov Jun 10 at 13:27

jBPM as Quantum Orchestration Platform

Medium

6 min

4.5K

Visual programming * Openshift * Open source * Data Engineering * Quantum technologies

Review

Author: Sergey Lukyanchikov, C-NLTX/Open-Source

Disclaimer: The views expressed in this document reflect the author's subjective perspective on the current and potential capabilities of jBPM.

TL;DR: Zero "quantum supremacy". Zero "agentic orchestration". Zero other hype. Just an approach to achieving efficient quantum-assisted automation using 100% free open-source components (except for Azure).

In my previous article, I discussed the rationale for adopting jBPM as an AI orchestration platform. This article extends that discussion by examining jBPM’s ability to automate quantum computations and to incorporate their results into business processes and related analytical workflows:

ArcaneGamingcom May 5 at 11:16

A short guide on UX audit and how it can benefit any software product

Easy

4 min

6.7K

Big Data * Data Engineering * Mobile App Analytics * Mobile applications design * Usability *

FAQ

A UX audit is a professional review and evaluation of a software product's UX, aimed at identifying issues that hurt the product's performance and frustrate users. Its ultimate goal is to recommend which areas of the product to improve to make it more user-oriented and therefore more useful and profitable for the business. Let's discuss how to tell when a product needs a UX audit, how to prepare for the process as the product owner, which steps the process consists of, and what to do with the results.

ArcaneGamingcom Feb 26 at 15:48

AI Tools — A Real-World Look at Data, Development, and Analytics

Medium

5 min

17K

Data Engineering * Usability * CRM systems *

Retrospective

In this article, our team shares how artificial intelligence and modern analytics tools have shaped the way our projects are built, tested, and optimized — from code to player experience.

Artificial intelligence isn’t just hype for us — it’s a practical, everyday part of how we build and improve our projects. AI empowers us to accelerate development, improve reliability, personalize experiences, and make smarter decisions based on real player behavior. Below, we walk through how AI is integrated into key parts of our platform.

habrconnect Feb 18 at 06:30

Local Chatbot Without Limits: A Guide to LM Studio and Open LLMs

Easy

11 min

2.6K

Machine learning * Software * Artificial Intelligence * Data Engineering *

Tutorial

Translation

In this article, we will not only install a local (and free) alternative to ChatGPT, but also review several open LLMs, delve into the advanced settings of LM Studio, connect the chatbot to Visual Studio Code, and teach it to assist us with programming. We will also look at how to fine-tune the model's behavior using system prompts.

habrconnect Feb 18 at 05:42

SQL Window Functions in Simple Terms with Examples

4 min

613

SQL * Database Administration * Data Engineering *

Translation

Hello everyone!

I want to note right away that this article is written exclusively for people who are just starting their journey in learning SQL and window functions. It may not cover complex applications of functions or use complicated definitions—everything is written in the simplest language possible for a basic understanding.

P.S. If the author didn't cover or write about something, it means they considered it non-essential for this article :)

For the examples, we will use a small table that shows student grades in different subjects. In the database, the table looks like this:
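The table itself is not reproduced in this preview. As a rough illustration of the idea, the sketch below uses a hypothetical `student_grades` table (the table name, columns, and rows are assumptions, not the article's data) and runs two window functions through Python's built-in `sqlite3` module, which needs SQLite 3.25 or newer for `OVER` clauses.

```python
import sqlite3

# Hypothetical stand-in for the article's table of student grades.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student_grades (name TEXT, subject TEXT, grade INTEGER)")
conn.executemany(
    "INSERT INTO student_grades VALUES (?, ?, ?)",
    [
        ("Anna", "Math", 5),
        ("Boris", "Math", 3),
        ("Anna", "Physics", 4),
        ("Boris", "Physics", 4),
    ],
)

# Each row keeps its own identity while also seeing values computed
# over its "window" (all rows with the same subject).
rows = conn.execute(
    """
    SELECT name,
           subject,
           grade,
           AVG(grade) OVER (PARTITION BY subject) AS subject_avg,
           RANK() OVER (PARTITION BY subject ORDER BY grade DESC) AS rank_in_subject
    FROM student_grades
    ORDER BY subject, rank_in_subject, name
    """
).fetchall()

for row in rows:
    print(row)
```

Unlike `GROUP BY`, the query returns all four rows: the per-subject average and rank are attached to each row instead of collapsing the rows into one per subject.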

PhoenixLi Dec 24 2025 at 11:59

Delivering Faster Analytics at Pinterest

Medium

6 min

7.3K

Database Administration * Data Engineering * Big Data * Open source *

Case

Pinterest is a visual discovery platform where people can find ideas like recipes, home and style inspiration, and much more. With 500+ million monthly active users, the platform offers its partners shopping capabilities as well as a significant advertising opportunity. Advertisers can purchase ads directly on Pinterest or through partnerships with advertising agencies. At this scale, analytical data lets advertisers learn how their Pins perform and how Pinterest users interact with them, so they can make decisions that help their ads perform better on our platform.

kilyashenko Dec 7 2025 at 00:02

Domain-Specific system based on console JAVA applications

Medium

8 min

9.4K

Java * Development for Linux * Development for Windows * Data Engineering * Open source *

Review

Hello, Habr! I'd like to share my experience developing such a system.
The defining parameters of a domain-specific system are:

PhoenixLi Nov 4 2025 at 06:05

StarRocks vs. ClickHouse, Apache Druid, and Trino

Easy

8 min

9.4K

Data Engineering * Big Data * SQL *

Analytics

In the big data era, data is one of the most valuable assets for enterprises. The ultimate goal of data analytics is to power swift, agile business decision making. As database technologies have advanced at a breathtaking pace in recent years, a large number of excellent database systems have emerged. Some of them are impressive in wide-table queries but do not work well in complex queries. Some support flexible multi-table queries but are held back by slow query speed.

Each type of data has a data model that best represents it. However, in real business scenarios, there is no such thing as ultra-fast data analytics under the perfect data model. Big data engineers sometimes have to make compromises on data models. Such compromises may cause long latency in complex queries or damage real-time query performance because engineers must take the trouble to convert complex data models into flat tables.

New business requirements pose new challenges for database systems. A good OLAP database system must be able to deliver excellent performance in both wide-table and multi-table scenarios. This system must also reduce the workload of big data engineers and enable customers to query data of any dimension in real time without worrying about data construction.

PhoenixLi Oct 30 2025 at 03:18

Comparison: StarRocks vs Apache Druid

Easy

5 min

Data Engineering * Open source * Big Data * SQL *

Analytics

Apache Druid has been a staple for real-time analytics. However, with evolving and sophisticated analytics demands, it has faced challenges in satisfying modern data performance needs. Enter StarRocks, a high-performance, open-source analytical database, designed to adeptly meet the advanced analytics needs of contemporary enterprises by offering robust capabilities and performance.

In this article, we’ll explore the functionalities, strengths, and challenges of both Apache Druid and StarRocks. Using practical examples and benchmark results, we aim to guide you in identifying which database might best meet your data needs.

melanny20 Oct 22 2025 at 14:03

4 best tips to building high-quality data products from SYNQ

Easy

6 min

11K

Postgres Professional corporate blog * Data Engineering * Big Data *

Tutorial

Translation

The “test everything” principle doesn’t improve data quality — it destroys it. Hundreds of useless alerts create noise that drowns out truly important signals, and the team stops responding to them. Google and Monzo have already moved away from this approach.

Here’s how to shift from blanket testing to targeted checks at nodes with the greatest impact radius — and why one well-placed test at the source is worth more than a hundred checks downstream.

SergeyProkhorenko Sep 2 2025 at 10:31

6NF File Format

Medium

2 min

29K

SQL * ERP-systems * Big Data * Data Engineering *

Analytics

Filename Extension: .6nf

6NF File Format is a new bitemporal, sixth-normal-form (6NF)-inspired data exchange format designed for DWH and for reporting. It replaces complex hierarchical formats like XBRL, XML, JSON, and YAML.

Konard Apr 1 2025 at 12:15

The Links Theory 0.0.2

Medium

27 min

6.7K

Data Engineering * Open source * Mathematics * Abnormal programming * Programming *

Translation

This world needs a new theory — a theory that could describe all the theories on the planet. A theory that could easily describe philosophy, mathematics, physics, and psychology. The one that makes all kinds of sciences computable.

This is exactly what we are working on. If we succeed, this theory will become the unified meta-theory of everything.

A year has passed since our last publication, and our task is to share the progress with our English-speaking audience. This is still not a stable version; it’s a draft. Therefore, we welcome any feedback, as well as your participation in the development of the links theory.

As with everything we have done before, the links theory is published and released into the public domain: it belongs to humanity, which means it is yours. This work has many authors, but the work itself is far more important than any specific authorship. We hope that today it can become useful to more people.

We invite you to become a part of this exciting adventure.


lukyanchikov Mar 13 2025 at 10:46

jBPM as AI Orchestration Platform

Easy

4 min

1.6K

Artificial Intelligence * Data Engineering * Open source * Openshift * Visual programming *

Review

Author: Sergey Lukyanchikov, C-NLTX/Open-Source

Disclaimer: The views expressed in this document reflect the author's subjective perspective on the current and potential capabilities of jBPM.

TL;DR: Zero "agentic AI". Zero "cloud native". Zero other hype. Just an approach to achieving efficient AI-centric automation using 100% free open-source components.

This text presents jBPM as a platform for orchestrating external AI-centric environments, such as Python, used for designing and running AI solutions. We will provide an overview of jBPM’s most relevant functionalities for AI orchestration and walk you through a practical example that demonstrates its effectiveness as an AI orchestration platform:

ValRakitine Feb 9 2025 at 14:53

Eco-Methodological Sustainability

6 min

1.7K

Data Engineering * Developer Relations * IT Infrastructure * System Analysis and Design * Abnormal programming *

Analytics

Recovery Mode

In recent years, discussions about the environmental impact of information and communication technologies (ICTs) have largely revolved around hardware — data centers, electronic waste, and energy consumption. However, an equally important factor has been overlooked: the software development methodologies themselves.

When I read the UNCTAD “Digital Economy Report 2024”, I was struck by the complete absence of any mention of how programming methodologies impact sustainability. There was no discussion of whether developers use algorithm-centric or code-centric methodologies when creating software, nor how these choices affect the environment.

This realization led me to introduce the concept of Eco-Methodological Sustainability — a new approach that highlights the role of structured software development methodologies in shaping an environmentally sustainable future for the digital economy.

Falcon_eye Jan 11 2025 at 14:55

Apache Kafka… Basics to drive

Medium

5 min

4.3K

Data Engineering * Data storaging * Big Data *

Review

Apache Kafka is a distributed event-streaming platform designed to handle real-time data feeds. It allows applications to publish, process, and subscribe to streams of data in a highly scalable, fault-tolerant manner.

ArcaneGamingcom Dec 5 2024 at 15:45

How to Choose the Optimal Authentication Solution for Your Application

Medium

3 min

API * Asterisk * Big Data * Data Engineering * Email-marketing *

Retrospective

In today's digital world, where applications process increasing amounts of sensitive data, ensuring reliable user authentication is critical. Authentication is the process of verifying the identity of a user who is trying to access a system. A properly chosen authentication method protects data from unauthorized access, prevents fraud, and increases user confidence.

However, with the development of technology, new authentication methods are emerging, and choosing the optimal solution can be difficult. This article will help developers and business owners understand the variety of authentication approaches and make informed choices.

Falcon_eye Jul 24 2024 at 21:15

How to set up Apache Airflow for 10 minutes via Docker

Medium

2 min

6.6K

Data Engineering * Python * Big Data *

Tutorial

Prerequisites:
1. Install Docker
2. Install VSCode

STEP BY STEP

1. Open VSCode, which you installed earlier, and click the "Extensions" tab on the menu bar, then type 'docker' to find the right extension and click "Install":

Nikiz May 24 2024 at 09:47

Utilizing Wearable Digital Health Technologies for Cardiovascular Monitoring

Medium

17 min

1.6K

Manufacture and development of electronics * Biotechnologies * IOT * Data Engineering *

Case

Wearable Digital Health Technologies for Monitoring in Cardiovascular Medicine

This review article presents a three-part true-life clinical vignette that illustrates how digital health technology can aid providers caring for patients with cardiovascular disease. Specific information that would identify real patients has been removed or altered. Each vignette is followed by a discussion of how these methods were used in the care of the patient.

Ninil Apr 1 2024 at 19:10

User-defined aggregation functions in Spark

Medium

6 min

2.6K

Data Engineering * Big Data * Scala *

Below, we will discuss user-defined aggregation functions (UDAFs) built with org.apache.spark.sql.expressions.Aggregator, which can aggregate groups of elements in a Dataset into a single value in any user-defined way.

Let’s start by examining an example from the official documentation that implements a simple aggregation.
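The documentation example is not reproduced in this preview. To show the shape of the contract, here is a plain-Python imitation of the four methods a Spark `Aggregator` defines (`zero`, `reduce`, `merge`, `finish`). It does not use Spark at all: `AverageBuffer`, `AverageAggregator`, and `aggregate` are hypothetical names, and the lists stand in for partitions of a Dataset.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class AverageBuffer:
    """Intermediate state carried between rows (assumed shape for this sketch)."""
    total: float = 0.0
    count: int = 0


class AverageAggregator:
    """Mimics the zero/reduce/merge/finish contract for a simple average."""

    def zero(self):
        # Neutral starting buffer: merging it with any buffer changes nothing.
        return AverageBuffer()

    def reduce(self, buf, value):
        # Fold one input element into a buffer.
        return AverageBuffer(buf.total + value, buf.count + 1)

    def merge(self, left, right):
        # Combine two partial buffers, e.g. from different partitions.
        return AverageBuffer(left.total + right.total, left.count + right.count)

    def finish(self, buf):
        # Turn the final buffer into the output value.
        return buf.total / buf.count if buf.count else None


def aggregate(agg, partitions):
    """Reduce each partition independently, then merge the partial buffers."""
    partials = [reduce(agg.reduce, part, agg.zero()) for part in partitions]
    return agg.finish(reduce(agg.merge, partials, agg.zero()))


print(aggregate(AverageAggregator(), [[1.0, 2.0], [3.0], []]))
```

Splitting the work into `reduce` and `merge` is what lets each partition be processed on its own before the partial results are combined; the empty partition contributes only the zero buffer.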
