Big Data *

Everything about big data

@PhoenixLi Nov 4 at 06:05

StarRocks vs. ClickHouse, Apache Druid, and Trino

Easy

8 min

7.4K

Data Engineering * Big Data * SQL *

Analytics

In the big data era, data is one of the most valuable assets for enterprises. The ultimate goal of data analytics is to power swift, agile business decision making. As database technologies have advanced at a breathtaking pace in recent years, a large number of excellent database systems have emerged. Some of them are impressive in wide-table queries but do not work well in complex queries. Some support flexible multi-table queries but are held back by slow query speed.

Each type of data has a data model that best represents it. However, in real business scenarios, there is no such thing as ultra-fast data analytics under the perfect data model. Big data engineers sometimes have to make compromises on data models. Such compromises may cause long latency in complex queries or damage real-time query performance, because engineers must take the trouble to convert complex data models into flat tables.

New business requirements pose new challenges for database systems. A good OLAP database system must be able to deliver excellent performance in both wide-table and multi-table scenarios. This system must also reduce the workload of big data engineers and enable customers to query data of any dimension in real time without worrying about data construction.

@PhoenixLi Oct 30 at 03:18

Comparison: StarRocks vs Apache Druid

Easy

5 min

8.2K

Data Engineering * Open source * Big Data * SQL *

Analytics

Apache Druid has been a staple for real-time analytics. However, with evolving and sophisticated analytics demands, it has faced challenges in satisfying modern data performance needs. Enter StarRocks, a high-performance, open-source analytical database, designed to adeptly meet the advanced analytics needs of contemporary enterprises by offering robust capabilities and performance.

In this article, we’ll explore the functionalities, strengths, and challenges of both Apache Druid and StarRocks. Using practical examples and benchmark results, we aim to guide you in identifying which database might best meet your data needs.

@melanny20 Oct 22 at 14:03

4 best tips to building high-quality data products from SYNQ

Easy

6 min

11K

Postgres Professional corporate blog * Data Engineering * Big Data *

Tutorial

Translation

The “test everything” principle doesn’t improve data quality — it destroys it. Hundreds of useless alerts create noise that drowns out truly important signals, and the team stops responding to them. Google and Monzo have already moved away from this approach.

Here’s how to shift from blanket testing to targeted checks at nodes with the greatest impact radius — and why one well-placed test at the source is worth more than a hundred checks downstream.

@SergeyProkhorenko Sep 2 at 10:31

6NF File Format

Medium

2 min

16K

Data Engineering * Big Data * ERP-systems * SQL *

Analytics

Filename Extension: .6nf

6NF File Format is a new bitemporal, sixth-normal-form (6NF)-inspired data exchange format designed for DWH and for reporting. It replaces complex hierarchical formats like XBRL, XML, JSON, and YAML.

@Falcon_eye Jan 11 at 14:55

Apache Kafka… Basics to drive

Medium

5 min

1.5K

Big Data * Data storaging * Data Engineering *

Review

Apache Kafka is a distributed event-streaming platform designed to handle real-time data feeds. It allows applications to publish, process, and subscribe to streams of data in a highly scalable, fault-tolerant manner.
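
The publish/subscribe idea in this teaser can be illustrated with a toy, in-memory sketch. This is not Kafka's client API: `TopicLog`, `publish`, and `poll` are hypothetical names chosen for illustration, and the class only mimics one idea, an append-only log where each consumer group tracks its own read offset.

```python
from collections import defaultdict


class TopicLog:
    """Toy append-only log standing in for a single topic partition (hypothetical)."""

    def __init__(self):
        self._records = []
        # consumer group name -> offset of the next record that group will read
        self._offsets = defaultdict(int)

    def publish(self, record):
        """Append a record and return its offset in the log."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, group, max_records=10):
        """Return the next unread records for a group and advance its offset."""
        start = self._offsets[group]
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch


log = TopicLog()
for event in ("signup", "login", "purchase"):
    log.publish(event)

# Each group reads the full stream independently of the others.
billing = log.poll("billing")
analytics_first = log.poll("analytics", max_records=2)
analytics_rest = log.poll("analytics")
```

Because records are never removed when read, adding another subscriber does not affect existing ones; a real deployment adds partitioning, replication, and persistence on top of this idea.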

@ArcaneGamingcom Dec 5 2024 at 15:45

How to Choose the Optimal Authentication Solution for Your Application

Medium

3 min

1.1K

API * Asterisk * Big Data * Data Engineering * Email-marketing *

Retrospective

In today's digital world, where applications process increasing amounts of sensitive data, ensuring reliable user authentication is critical. Authentication is the process of verifying the identity of a user who is trying to access a system. A properly chosen authentication method protects data from unauthorized access, prevents fraud, and increases user confidence.

However, with the development of technology, new authentication methods are emerging, and choosing the optimal solution can be difficult. This article will help developers and business owners understand the variety of authentication approaches and make informed choices.

@Falcon_eye Jul 24 2024 at 21:15

How to set up Apache Airflow in 10 minutes via Docker

Medium

2 min

2.5K

Data Engineering * Python * Big Data *

Tutorial

Prerequisites:
1. Install Docker
2. Install VSCode

STEP BY STEP

1. Open the VSCode you installed earlier and click the "Extensions" tab on the menu bar, then type 'docker' to find the appropriate extension and click "Install":

@profleaddev May 20 2024 at 15:21

New ChatGPT-4o: A Game-Changer That Could Replace Data Analysts, Demo Included

Easy

3 min

3.5K

Artificial Intelligence * Data visualization * Big Data *

Opinion

In this article, I’m going to discuss something really important. If you’re a data analyst or you want to learn data analysis, please watch this video till the end because it’s really important.

@Ninil Apr 1 2024 at 19:10

User-defined aggregation functions in Spark

Medium

6 min

Data Engineering * Big Data * Scala *

Below, we will discuss user-defined aggregation functions (UDAF) using org.apache.spark.sql.expressions.Aggregator, which can be used for aggregating groups of elements in a Dataset into a single value in any user-defined way.

Let’s start by examining an example from the official documentation that implements a simple aggregation.
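
As a rough illustration of the contract such an aggregator follows, here is a plain-Python sketch, not Spark code. The method names `zero`, `reduce`, `merge`, and `finish` mirror those of `Aggregator`; the `AverageAggregator` class, the `aggregate` driver, and the sample values are assumptions made for this sketch, and the encoder methods that a real `Aggregator` also needs are left out.

```python
from functools import reduce as fold


class AverageAggregator:
    """Averages numbers using a (sum, count) buffer."""

    def zero(self):
        # Neutral starting buffer for any partition.
        return (0.0, 0)

    def reduce(self, buffer, value):
        # Fold one input value into a partition-local buffer.
        total, count = buffer
        return (total + value, count + 1)

    def merge(self, left, right):
        # Combine two partial buffers from different partitions.
        return (left[0] + right[0], left[1] + right[1])

    def finish(self, buffer):
        # Turn the final buffer into the output value.
        total, count = buffer
        return total / count if count else None


def aggregate(agg, partitions):
    """Hypothetical driver: reduce within each partition, then merge the partials."""
    partials = [fold(agg.reduce, part, agg.zero()) for part in partitions]
    return agg.finish(fold(agg.merge, partials, agg.zero()))


result = aggregate(AverageAggregator(), [[3000.0, 4500.0], [3500.0], [4000.0]])
```

Keeping a (sum, count) pair rather than a running average is what makes `merge` correct: partial results from partitions of different sizes can be combined without losing weight information.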

@profleaddev Feb 28 2024 at 15:06

Master Data Analysis with ChatGPT — How to Analyze Anything (Beginners Guide)

Easy

3 min

3.5K

Big Data * Data visualization * Artificial Intelligence

Tutorial

Today we’re diving into an exciting feature within ChatGPT that has the potential to enhance your productivity by 10, 20, 30, or even 40%. If you’re keen on learning how to leverage this feature to your advantage, make sure to read this article until the end. This feature stands out because it allows you to analyze almost anything by uploading your data and posing various questions to ChatGPT. Whether it's business data, your resume, or any other information you wish to explore, ChatGPT is here to deliver answers based on your specific dataset.

@Hayk-Asoyan Dec 4 2023 at 08:35

TeleDrive: Unleash Unlimited Cloud Storage with Telegram

Medium

2 min

13K

Big Data * Open data * Data storage *

Hey everyone! Today, I'll guide you through creating a boundless cloud storage solution on Telegram using TeleDrive. This open-source project is a game-changer, offering functionalities like Google Drive/OneDrive via the Telegram API.

@vladimirusmith Jun 29 2023 at 12:36

ChatGPT to Help You Become a 10x Programmer

Easy

2 min

7.7K

Artificial Intelligence * Programming * Big Data *

Tutorial

I believe that every programmer has at least once heard about ChatGPT and its marvelous abilities to process, calculate and create huge amounts of data; if not, go check out this Wikipedia article - https://en.wikipedia.org/wiki/ChatGPT.

Can you imagine that some 50 years ago people could not even believe that there may be something artificial surpassing humans in so many areas? Nowadays, we have this marvel at the distance of a few taps on a phone screen or a keyboard; however, there is still a sadly large number of people who do not fully, if at all, utilize all the perks of ChatGPT in their lines of work. This is mostly related either to people's reluctance to learn new technologies or to the fear of losing coding skills they have previously gained, which is not the case when ChatGPT is used properly.

In this article I want to give you some of the most useful applications of ChatGPT for your coding work. Remember, there is nothing shameful in using AI, since its development and further integration into our day-to-day life is inevitable, so we should start adapting to it as early as we can to take full advantage of this "magical" technology. Let's get started.

@Z1at Jun 13 2023 at 17:51

Mathematical meaning of principal component analysis (PCA)

Medium

7 min

3.1K

Big Data * Data Engineering *

This article aims to explain the mathematical meaning of Principal Component Analysis (PCA) in practice.

@AmiraB2 May 30 2023 at 07:53

Feature Engineering: Techniques and Best Practices for Data Scientists

8 min

5.3K

Data Engineering * Big Data *

Tutorial

The most important stage in the data science process is feature engineering, which entails turning raw data into useful features that might enhance the performance of machine learning models. It calls for creativity, data-driven thinking, and domain expertise. Data scientists can improve the prediction capability of their models and find hidden patterns in the data by choosing, combining, and inventing relevant features. Handling missing data, scaling features, encoding categorical variables, constructing interaction terms, and other procedures are examples of feature engineering techniques. The best practises involve investigating the data, testing and improving features iteratively, and applying domain knowledge to draw out important information. The accuracy and effectiveness of machine learning models are significantly influenced by effective feature engineering.
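
The techniques listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the three sample rows, the column names, and the helper functions `impute_mean`, `min_max_scale`, and `one_hot` are all hypothetical, and a real project would more likely rely on a dedicated library.

```python
# Hypothetical raw records with one missing value.
rows = [
    {"age": 25, "income": 40000.0, "city": "Oslo"},
    {"age": None, "income": 60000.0, "city": "Rome"},
    {"age": 35, "income": 80000.0, "city": "Oslo"},
]


def impute_mean(values):
    """Replace missing entries (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values, prefix):
    """Encode a categorical column as one 0/1 indicator per category."""
    categories = sorted(set(values))
    return [{f"{prefix}={c}": int(v == c) for c in categories} for v in values]


age = min_max_scale(impute_mean([r["age"] for r in rows]))
income = min_max_scale([r["income"] for r in rows])
city = one_hot([r["city"] for r in rows], "city")

features = []
for a, i, c in zip(age, income, city):
    # The interaction term is the product of the two scaled features.
    features.append({"age": a, "income": i, "age_x_income": a * i, **c})
```

Imputation runs before scaling so that the filled-in value is rescaled together with the observed ones.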

@N-Cube Feb 15 2023 at 13:35

PyGMTSAR is Next Generation Interferometric Synthetic Aperture Radar (InSAR) Software for Everyone

6 min

3.9K

Do you need to produce satellite interferometry results for your work or study? Or do you need to find a way to process terabytes of radar data on an ordinary laptop? Maybe you aren't confident about the installation and usage of the required software. Fortunately, the next generation of satellite interferometry products is available to you. Beginners can build results easily, and advanced users can work on huge datasets. The open-source software PyGMTSAR is available on GitHub for developers, on DockerHub for advanced users, and on Google Colab for everyone. It is a cloud-ready product, and it works the same whether you run it locally on your old laptop or on powerful cloud servers.

@kotsev96 Feb 10 2023 at 13:38

Message broker selection cheat sheet: Kafka vs RabbitMQ vs Amazon SQS

Medium

6 min

15K

Java * Go * Big Data *

This is a series of articles on choosing between different systems, whether on a real project or in an architecture interview.

At work or in a System Design interview, you often have to choose the best message broker. I dug into this question and will explain what to choose and why. Using several examples, I will show which system is better in each case and what the advantages and disadvantages of each are.

@m31 Jan 26 2023 at 17:43

Data Phoenix Digest — ISSUE 2.2023

2 min

1.7K

Python * Big Data * Machine learning * DevOps * Artificial Intelligence

Digest

Video recording of our webinar about dstack and reproducible ML workflows, AVL binary tree operations, Ultralytics YOLOv8, training XGBoost, productionizing ML models, introduction to forecasting ensembles, domain expansion of image generators, Muse, X-Decoder, Box2Mask, RoDynRF, AgileAvatar and more.

@Evrone Nov 16 2022 at 11:13

How we designed the user interface for an enterprise analytical system

5 min

1.5K

Singula Team corporate blog * Big Data * CGI * Data Engineering *

In 2021, we were contacted by an industrial plant that was faced with the need to create a system for analyzing processes in its production. The enterprise team studied ready-made solutions, but none of the analytics system designs fully covered the required functionality. So they turned to us with a request to develop their own analytical system that would collect data from all machines and allow it to be analyzed to see bottlenecks in production. For this project, we created a data-driven UI/UX design and also developed a web-based interface for the equipment monitoring system.

@vldmrvslv Jun 29 2022 at 14:24

Detecting attempts of mass influencing via social networks using NLP. Part 2

3 min

1.6K

Big Data * Data Mining * Twitter API * Natural Language Processing * Python *

Tutorial

In Part 1 of this article, I built and compared two classifiers to detect trolls on Twitter. You can check it out here.

Now the time has come to look more deeply into the datasets to find some patterns using exploratory data analysis and topic modelling.

EDA

To do just that, I first created a word cloud of the most common words, which you can see below.
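
A word cloud is driven by word frequencies, so the underlying counting step can be sketched as follows. This is only an illustration: the `top_words` helper, the tiny stopword set, and the sample texts are assumptions made for this sketch, not the author's code or data.

```python
import re
from collections import Counter

# Deliberately small, hypothetical stopword list for the sketch.
STOPWORDS = {"the", "a", "to", "is", "and", "of", "in"}


def top_words(texts, n=10):
    """Count lowercase word tokens across texts, skipping stopwords."""
    counts = Counter()
    for text in texts:
        for token in re.findall(r"[a-z']+", text.lower()):
            if token not in STOPWORDS:
                counts[token] += 1
    return counts.most_common(n)


# Made-up sample texts, not taken from the dataset.
sample = ["Vote now", "The vote is rigged", "Rigged media and the vote"]
top = top_words(sample, n=2)
```

The resulting (word, count) pairs are what a word cloud renderer would scale font sizes by.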

@vldmrvslv Jun 29 2022 at 14:20

Detecting attempts of mass influencing via social networks using NLP. Part 1

5 min

2.1K

Big Data * Python * Data Mining * Natural Language Processing * Twitter API *

Tutorial

Over the last few decades, the world’s population has been developing as an information society, which means that information has come to play a substantial end-to-end role in all aspects and processes of life. In view of the growing demand for a free flow of information, social networks have become a force to be reckoned with. The ways of waging war have also changed: instead of conventional weapons, governments now use political warfare, including fake news, a type of propaganda aimed at deliberate disinformation or hoaxes. And the lack of content control mechanisms makes it easy to spread any information as long as people believe in it.

Based on this premise, I’ve decided to experiment with different NLP approaches and build a classifier that could be used to detect either bots or fake content generated by trolls on Twitter in order to influence people.

In this first part of the article, I will cover the data collection process, preprocessing, feature extraction, classification itself and the evaluation of the models’ performance. In Part 2, I will dive deeper into the troll problem, conduct exploratory analysis to find patterns in the trolls’ behaviour and define the topics that seemed of great interest to them back in 2016.

Features for analysis

From all the data available (such as hashtags, account language, tweet text, URLs, external links or references, tweet date and time), I settled upon English tweet text, Russian tweet text and hashtags. Tweet text is the main feature for analysis because it contains almost all the essential characteristics that are typical of trolling activities in general, such as abuse, rudeness, references to external resources, provocations and bullying. Hashtags were chosen as another source of textual information as they represent the central message of a tweet in one or two words.
