Pull to refresh
80.99

Big Data *

Everything about big data

Show first
Rating limit
Level of difficulty

TeleDrive: Unleash Unlimited Cloud Storage with Telegram

Level of difficulty Medium
Reading time 2 min
Views 3.2K

Hey everyone! Today, I'll guide you through creating a boundless cloud storage solution on Telegram using TeleDrive. This open-source project is a game-changer, offering functionalities like Google Drive/OneDrive via the Telegram API.

Read more
Total votes 4: ↑4 and ↓0 +4
Comments 2

ChatGPT to Help You Become a 10x Programmer

Level of difficulty Easy
Reading time 2 min
Views 7K

I believe that every programmer has at least once heard about ChatGPT and its marvelous abilities to process, calculate and create huge amounts of data; if not, go check out this Wikipedia article - https://en.wikipedia.org/wiki/ChatGPT.

Can you imagine that some 50 years ago people could not even believe that there may be something artificial surpassing humans in so many areas? Nowadays, we have this marvel at the distance of a few tabs on a phone screen or a keyboard; however, there is still a sadly large number of people who do not fully—if at all— utilize all the perks of ChatGPT in their lines of work. This is mostly related either to people's reluctance to learn new technologies or the fear of losing coding skills they have previously gained—which is not the case with using ChatGPT properly.

In this article I want to give you some of the most useful uses of ChatGPT for your coding work. Remember, there is nothing shameful in using the AI, since this the development and further implementation of it in our day-to-day life is inevitable, so we should start adapting to it as early as we can to take the full advantage of this "magical" technology. Let's get started.

Read more
Rating 0
Comments 5

Feature Engineering: Techniques and Best Practices for Data Scientists

Reading time 8 min
Views 1.7K

The most important stage in the data science process is feature engineering, which entails turning raw data into useful features that might enhance the performance of machine learning models. It calls for creativity, data-driven thinking, and domain expertise. Data scientists can improve the prediction capability of their models and find hidden patterns in the data by choosing, combining, and inventing relevant features. Handling missing data, scaling features, encoding categorical variables, constructing interaction terms, and other procedures are examples of feature engineering techniques. The best practises involve investigating the data, testing and improving features iteratively, and applying domain knowledge to draw out important information. The accuracy and effectiveness of machine learning models are significantly influenced by effective feature engineering.

Read more
Rating 0
Comments 0

PyGMTSAR is Next Generation Interferometric Synthetic Aperture Radar (InSAR) Software for Everyone

Reading time 6 min
Views 2.4K

Do you need to produce satellite interferometry results for your work or study? Or should you find the way to process terabytes of radar data on your common laptop? Maybe you aren't confident about the installation and usage of the required software. Fortunately, there is the next generation of satellite interferometry products available for you. Beginners can build the results easily and advanced users might work on huge datasets. Open Source software PyGMTSAR is available on GitHub for developers and on DockerHub for advanced users and on Google Colab for everyone. This is the cloud-ready product, and it works the same as do you run it locally on your old laptop as on powerful cloud servers.


Read more →
Total votes 1: ↑1 and ↓0 +1
Comments 0

Message broker selection cheat sheet: Kafka vs RabbitMQ vs Amazon SQS

Level of difficulty Medium
Reading time 6 min
Views 7.7K

This is a series of articles dedicated to the optimal choice between different systems on a real project or an architectural interview.

At work or at a System Design interview, you often have to choose the best message broker. I plunged into this issue and will tell you what and why. What is better in each case, what are the advantages and disadvantages of these systems, and which one to choose, I will show with several examples.

Read more
Total votes 6: ↑5 and ↓1 +4
Comments 0

Data Phoenix Digest — ISSUE 2.2023

Reading time 2 min
Views 938

Video recording of our webinar about dstack and reproducible ML workflows, AVL binary tree operations, Ultralytics YOLOv8, training XGBoost, productionize ML models, introduction to forecasting ensembles, domain expansion of image generators, Muse, X-Decoder, Box2Mask, RoDynRF, AgileAvatar and more.

Read more
Total votes 1: ↑1 and ↓0 +1
Comments 0

How we designed the user interface for an enterprise analytical system

Reading time 5 min
Views 884

In 2021, we were contacted by an industrial plant that was faced with the need to create a system for analyzing processes in its production. The enterprise team studied ready-made solutions, but none of the analytics system designs fully covered the required functionality. So they turned to us with a request to develop their own analytical system that would collect data from all machines and allow it to be analyzed to see bottlenecks in production. For this project, we created a data-driven UI/UX design and also developed a web-based interface for the equipment monitoring system.

Read more
Total votes 5: ↑5 and ↓0 +5
Comments 0

Detecting attempts of mass influencing via social networks using NLP. Part 2

Reading time 3 min
Views 1K

In Part 1 of this article, I built and compared two classifiers to detect trolls on Twitter. You can check it out here.

Now, time has come to look more deeply into the datasets to find some patterns using exploratory data analysis and topic modelling.

EDA

To do just that, I first created a word cloud of the most common words, which you can see below.

Read more
Total votes 3: ↑3 and ↓0 +3
Comments 0

Detecting attempts of mass influencing via social networks using NLP. Part 1

Reading time 5 min
Views 1.4K

During the last decades, the world’s population has been developing as an information society, which means that information started to play a substantial end-to-end role in all life aspects and processes. In view of the growing demand for a free flow of information, social networks have become a force to be reckoned with. The ways of war-waging have also changed: instead of conventional weapons, governments now use political warfare, including fake news, a type of propaganda aimed at deliberate disinformation or hoaxes. And the lack of content control mechanisms makes it easy to spread any information as long as people believe in it.  

Based on this premise, I’ve decided to experiment with different NLP approaches and build a classifier that could be used to detect either bots or fake content generated by trolls on Twitter in order to influence people. 

In this first part of the article, I will cover the data collection process, preprocessing, feature extraction, classification itself and the evaluation of the models’ performance. In Part 2, I will dive deeper into the troll problem, conduct exploratory analysis to find patterns in the trolls’ behaviour and define the topics that seemed of great interest to them back in 2016.

Features for analysis

From all possible data to use (like hashtags, account language, tweet text, URLs, external links or references, tweet date and time), I settled upon English tweet text, Russian tweet text and hashtags. Tweet text is the main feature for analysis because it contains almost all essential characteristics that are typical for trolling activities in general, such as abuse, rudeness, external resources references, provocations and bullying. Hashtags were chosen as another source of textual information as they represent the central message of a tweet in one or two words. 

Read more
Total votes 3: ↑3 and ↓0 +3
Comments 0

Extending and moving a ZooKeeper ensemble

Reading time 3 min
Views 2K

    Once upon a time our DBA team had a task. We had to move a ZooKeeper ensemble which we had been using for Clickhouse cluster. Everyone is used to moving an ensemble by moving its data files. It seems easy and obvious but our Clickhouse cluster had more than 400 TB replicated data. All replication information had been collected in ZooKeeper cluster from the very beginning. At the end of the day we couldn’t miss even a row of data. Then we looked for information on the internet. Unfortunately there was a good tutorial about 3.4.5 and didn’t fit our version 3.6.2. So we decided to use “the extending” for moving our ensemble.

Read more
Rating 0
Comments 0

We have published a model for text repunctuation and recapitalization for four languages

Reading time 7 min
Views 5.9K


Open In Colab


Working with speech recognition models we often encounter misconceptions among potential customers and users (mostly related to the fact that people have a hard time distinguishing substance over form). People also tend to believe that punctuation marks and spaces are somehow obviously present in spoken speech, when in fact real spoken speech and written speech are entirely different beasts.


Of course you can just start each sentence with a capital letter and put a full stop at the end. But it is preferable to have some relatively simple and universal solution for "restoring" punctuation marks and capital letters in sentences that our speech recognition system generates. And it would be really nice if such a system worked with any texts in general.


For this reason, we would like to share a system that:


  • Inserts capital letters and basic punctuation marks (dot, comma, hyphen, question mark, exclamation mark, dash for Russian);
  • Works for 4 languages (Russian, English, German, Spanish) and can be extended;
  • By design is domain agnostic and is not based on any hard-coded rules;
  • Has non-trivial metrics and succeeds in the task of improving text readability;

To reiterate — the purpose of such a system is only to improve the readability of the text. It does not add information to the text that did not originally exist.

Read more →
Total votes 4: ↑3 and ↓1 +2
Comments 0

Millions of orders per second matching engine testing

Reading time 4 min
Views 8.1K

I had some experience in the matching engine development for cryptocurrency exchange some time ago. That was an interesting and challenging experience. I developed it in clear C++ from scratch. The testing of it is also quite a challenging task. You need to get data for testing, perform testing, collect some statistics, and at last, analyze collected data to find weak points and bottlenecks. I want to focus on testing the C++ matching engine and show how testing can give insights for optimizations even without the need to change the code. The matching engine I developed can do more than 1’000’000 TPS (transactions per second) and is 10x times faster than the matching engine of the Binance cryptocurrency exchange (see one post on Binance Blog).

Read more
Total votes 5: ↑5 and ↓0 +5
Comments 1

Big Data Tools with IntelliJ IDEA Ultimate, PyCharm Professional, DataGrip 2021.3 EAP, and DataSpell Support

Reading time 1 min
Views 1.8K

Recently we released a new build of the Big Data Tools plugin that is compatible with the 2021.3 versions of IntelliJ IDEA and PyCharm. DataGrip 2021.3 support will be available immediately after the release in October. The plugin also supports our new data science IDE – JetBrains DataSpell. If you still use previous versions, now is the perfect time to upgrade both your IDE and the plugin. 

This year, we introduced a number of new features as well as some features that have been there for a while, for example, running Spark Submit with a run configuration.

Here’s a list of the key improvements:

Read more
Rating 0
Comments 0

Data Phoenix Digest — 01.07.2021

Reading time 5 min
Views 1.9K

We at Data Science Digest have always strived to ignite the fire of knowledge in the AI community. We’re proud to have helped thousands of people to learn something new and give you the tools to push ahead. And we’ve not been standing still, either.

Please meet Data Phoenix, a Data Science Digest rebranded and risen anew from our own flame. Our mission is to help everyone interested in Data Science and AI/ML to expand the frontiers of knowledge. More news, more updates, and webinars(!) are coming. Stay tuned!

The new issue of the new Data Phoenix Digest is here! AI that helps write code, EU’s ban on biometric surveillance, genetic algorithms for NLP, multivariate probabilistic regression with NGBoosting, alias-free GAN, MLOps toys, and more…

If you’re more used to getting updates every day, subscribe to our Telegram channel or follow us on social media: TwitterFacebook.

Read more
Total votes 1: ↑0 and ↓1 -1
Comments 0

DataScience Digest — 24.06.21

Reading time 5 min
Views 1.8K

The new issue of DataScienceDigest is here!

The impact of NLP and the growing budgets to drive AI transformations. How Airbnb standardized metric computation at scale. Cross-Validation, MASA-SR, AgileGAN, EfficientNetV2, and more.

If you’re more used to getting updates every day, subscribe to our Telegram channel or follow us on social media: Twitter, LinkedIn, Facebook.

Read more
Total votes 2: ↑1 and ↓1 0
Comments 0

You are standing at a red light at an empty intersection. How to make traffic lights smarter?

Reading time 14 min
Views 2K

Types of smart traffic lights: adaptive and neural networks

Adaptive works at relatively simple intersections, where the rules and possibilities for switching phases are quite obvious. Adaptive management is only applicable where there is no constant loading in all directions, otherwise it simply has nothing to adapt to – there are no free time windows. The first adaptive control intersections appeared in the United States in the early 70s of the last century. Unfortunately, they have reached Russia only now, their number according to some estimates does not exceed 3,000 in the country.

Neural networks – a higher level of traffic regulation. They take into account a lot of factors at once, which are not even always obvious. Their result is based on self-learning: the computer receives live data on the bandwidth and selects the maximum value by all possible algorithms, so that in total, as many vehicles as possible pass from all sides in a comfortable mode per unit of time. How this is done, usually programmers answer – we do not know, the neural network is a black box, but we will reveal the basic principles to you…

Adaptive traffic lights use, at least, leading companies in Russia, rather outdated technology for counting vehicles at intersections: physical sensors or video background detector. A capacitive sensor or an induction loop only sees the vehicle at the installation site-for a few meters, unless of course you spend millions on laying them along the entire length of the roadway. The video background detector shows only the filling of the roadway with vehicles relative to this roadway. The camera should clearly see this area, which is quite difficult at a long distance due to the perspective and is highly susceptible to atmospheric interference: even a light snowstorm will be diagnosed as the presence of traffic – the background video detector does not distinguish the type of detection.

Read more
Total votes 3: ↑3 and ↓0 +3
Comments 0

Authors' contribution