Data storage *

What we have, we store



habrconnect Feb 18 at 12:48

A VPS server for the price of a bag of chips: a review of the cheapest plans from Russian hosting providers

Easy

7 min

2.8K

Hosting * System administration * Server Administration * Data storage * Finance in IT

Review

Translation

Hello, Habr! I once conducted a small test of virtual machines from various hosting providers and compared them with each other — it turns out that five years have passed since then. And in that test, the conditions for all servers were the same, as similar configurations were being tested.

Today I'd like to talk about how the cheapest (in the price range of 100 to 300 rubles) offers from popular hosting providers behave.

habrconnect Feb 18 at 10:04

The Best Free Programs for Finding Duplicate Photos

Easy

7 min

1.1K

Image processing * Data storage * Software

Translation

Are you familiar with that feeling of slight panic when your laptop suddenly starts beeping plaintively, and a sinister warning appears on the screen: 'Disk almost full'? This happened to me recently too. I opened File Explorer and was stunned – my 1 TB external drive was filled to the brim – 95% full!

The culprits weren't movies or games, but a giant graveyard of photos. Twelve folders with the generic name 'DCIM,' mountains of screenshots I had copied five times 'just in case,' and heaps of nearly identical sunset shots taken in burst mode. Trying to manually find identical photos was like looking for a needle in a haystack the size of Siberia.

In a previous article, I discussed how to best sort photos, and even then I realized it was time to declare war on duplicates. And that moment has come. After testing more than 15 tools (and wasting a lot of nerves), I've selected 5 free programs that really help solve the problem. I'll share this experience with you.

Hayk-Asoyan Dec 4 2023 at 08:35

TeleDrive: Unleash Unlimited Cloud Storage with Telegram

Medium

2 min

14K

Big Data * Open data * Data storage *

Hey everyone! Today, I'll guide you through creating a boundless cloud storage solution on Telegram using TeleDrive. This open-source project is a game-changer, offering Google Drive/OneDrive-like functionality via the Telegram API.

trusted Nov 16 2022 at 07:00

Understanding the Differences Between Kafka and RabbitMQ: in Simple Terms

7 min

6.6K

Innotech corporate blog * Programming * IT Infrastructure * Data storage * DevOps *

Translation

Software message brokers have become the standard for building complex systems. However, not all IT specialists understand how these tools work. Pavel Malygin, Lead System Analyst at Innotech, dives into the topic of message brokers and explains how they are used.

Master255 Mar 13 2021 at 00:18

Decentralized Torrent storage in DHT

5 min

2.9K

Hosting * Decentralized networks * Distributed systems * Data storage *

The DHT system has existed for many years now, and torrents along with it, which we successfully use to get any information we want.

The system comes with a small set of commands for interacting with it, and only two of them are needed to create a decentralized database: put and get.

This is what will be discussed below...
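
As a rough illustration of those two commands, here is a toy in-memory sketch of put and get. The `ToyDHT` class, the node count, the replication factor, and the choice of a SHA-1 content key are all assumptions made for this example; a real torrent DHT exchanges messages between peers over the network, which is not modelled here.

```python
import hashlib
from typing import List, Optional


def key_for(value: bytes) -> bytes:
    """Content-derived key: SHA-1 digest of the value (an assumption of this sketch)."""
    return hashlib.sha1(value).digest()


def xor_distance(a: bytes, b: bytes) -> int:
    """Kademlia-style XOR distance between two 160-bit identifiers."""
    return int.from_bytes(a, "big") ^ int.from_bytes(b, "big")


class ToyDHT:
    """Hypothetical in-memory stand-in for a DHT; no networking involved."""

    def __init__(self, node_count: int = 8, replicas: int = 3) -> None:
        # Node IDs are derived from an index so the sketch is deterministic.
        self.nodes = {
            hashlib.sha1(f"node-{i}".encode()).digest(): {}
            for i in range(node_count)
        }
        self.replicas = replicas

    def _closest(self, key: bytes) -> List[bytes]:
        """IDs of the nodes closest to the key by XOR distance."""
        ranked = sorted(self.nodes, key=lambda node_id: xor_distance(node_id, key))
        return ranked[: self.replicas]

    def put(self, value: bytes) -> bytes:
        """Store the value on the closest nodes and return its key."""
        key = key_for(value)
        for node_id in self._closest(key):
            self.nodes[node_id][key] = value
        return key

    def get(self, key: bytes) -> Optional[bytes]:
        """Ask the closest nodes for the key; None if nobody has it."""
        for node_id in self._closest(key):
            value = self.nodes[node_id].get(key)
            if value is not None:
                return value
        return None


dht = ToyDHT()
key = dht.put(b"hello, swarm")
print(key.hex(), dht.get(key))
```

Because the key is derived from the content, anyone who knows the key can check that the returned value has not been altered.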

Jessy_James Feb 27 2021 at 17:02

HDB++ TANGO Archiving System

3 min

1.6K

Open source * Python * IT Infrastructure * Data storage *

Tutorial

Translation

What is HDB++?

It is a TANGO archiving system that lets you save data received from devices in the TANGO system.

This article covers working on Linux (TangoBox 9.3, based on Ubuntu 18.04), a ready-made system where everything is already configured.

What is the article about?

System architecture.
How to set up archiving.

It took me about 2 weeks to understand the architecture and write my own Python scripts for this case.

What is it for?

It lets you store the history of your equipment's readings.

You don't need to think about how to store data in the database.
You just need to specify which attributes to archive from which equipment.

VlK Dec 8 2020 at 16:02

The Rules for Data Processing Pipeline Builders

5 min

4.1K

Badoo corporate blog * Programming * DevOps * Data storage *

"Come, let us make bricks, and burn them thoroughly."
– legendary builders

You may have noticed by 2020 that data is eating the world. And whenever any reasonable amount of data needs processing, a complicated multi-stage data processing pipeline will be involved.

At Bumble — the parent company operating Badoo and Bumble apps — we apply hundreds of data transforming steps while processing our data sources: a high volume of user-generated events, production databases and external systems. This all adds up to quite a complex system! And just as with any other engineering system, unless carefully maintained, pipelines tend to turn into a house of cards — failing daily, requiring manual data fixes and constant monitoring.

For this reason, I want to share certain good engineering practises with you, ones that make it possible to build scalable data processing pipelines from composable steps. While some engineers understand such rules intuitively, I had to learn them by doing, making mistakes, fixing, sweating and fixing things again…

So behold! I bring you my favourite Rules for Data Processing Pipeline Builders.

AnnaPhc Aug 11 2020 at 16:05

IIoT platform databases – How Mail.ru Cloud Solutions deals with petabytes of data coming from a multitude of devices

11 min

2.3K

VK corporate blog * Data storage * IOT * Database Administration * Tarantool *

Hello, my name is Andrey Sergeyev and I am Head of IoT Solution Development at Mail.ru Cloud Solutions. We all know there is no such thing as a universal database, especially when the task is to build an IoT platform capable of processing millions of events from various sensors in near real-time.

Our product, Mail.ru IoT Platform, started as a Tarantool-based prototype. I’m going to tell you about our journey, the problems we faced and the solutions we found. I will also show you the current architecture of a modern Industrial Internet of Things platform. In this article we will look into:

our requirements for the database, universal solutions, and the CAP theorem
whether the "database + application server in one" approach is a silver bullet
the evolution of the platform and the databases used in it
the number of Tarantools we use and how we came to this

alexey_zz May 7 2020 at 11:00

Bcache against Flashcache for Ceph Object Storage

11 min

3.6K

Selectel corporate blog * Data storage * Server Administration * IT Infrastructure *

Fast SSDs are getting cheaper every year, but they are still smaller and more expensive than traditional HDDs. HDDs, in turn, have much higher latency and are easily saturated. We want the storage system to have both low latency and high capacity. There’s a well-known practice of optimizing performance for big and slow devices: caching. As most of the data on a disk is not accessed most of the time but some percentage of it is accessed frequently, we can achieve a higher quality of service by using a small cache.

Server hardware and operating systems have a lot of caches working on different levels. Linux has a page cache for block devices, a dirent cache and an inode cache on the filesystem layer. Disks have their own cache inside. CPUs have caches. So, why not add one more persistent cache layer for a slow disk?
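
To make the idea concrete, here is a minimal sketch of a small cache in front of a slow device. It uses a plain least-recently-used (LRU) read cache, which is a simplification chosen for this example and not how Bcache or Flashcache are implemented; `LRUCache` and `read_from_hdd` are hypothetical names.

```python
from collections import OrderedDict


class LRUCache:
    """Minimal least-recently-used read cache in front of a slow backing store."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, block_id: int, slow_read) -> bytes:
        if block_id in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(block_id)  # mark as most recently used
            return self.blocks[block_id]
        self.misses += 1
        data = slow_read(block_id)  # fall through to the slow device
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block
        return data


def read_from_hdd(block_id: int) -> bytes:
    """Stand-in for a slow disk read."""
    return f"block-{block_id}".encode()


cache = LRUCache(capacity=2)
# Blocks 1 and 2 are "hot"; block 9 is touched only once.
for block_id in [1, 2, 1, 2, 1, 9, 1, 1]:
    cache.read(block_id, read_from_hdd)
print(f"hits={cache.hits} misses={cache.misses}")
```

With a skewed access pattern like this one, most reads are served from the two-block cache even though the backing device holds far more blocks.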

parthiba Mar 24 2020 at 10:05

Why Enterprise Chat Apps isn’t built on Server-side Database like Hangouts, Slack, & Hip chat?

3 min

3.5K

IT Infrastructure * Server optimization * Data storage *

From sandbox

For any organization, a real-time chat application is one of the most important tools for smooth collaboration, whether the conversation takes place on mobile or desktop. Businesses of every size, from small companies to enterprises, have used Hangouts, Slack and HipChat for conversations among employees and with clients.

These big players come into play wherever team collaboration is required. They are built on a server-side database: messages sent from one device to another are stored in the provider's server database. Ultimately, this results in a huge amount of data accumulating in the server-side (cloud) database.

Cloud storage consumption is therefore high. A client-side database is more efficient because relayed messages are stored on the client device. Messages are queued to minimize the amount of data held on the server.

mrospax Feb 10 2020 at 11:36

A Brief Comparison of the SDS Architectures for Virtualization

6 min

3.8K

Open source * IT Infrastructure * Data storage * Development for Linux *

Translation

The search for a suitable storage platform: GlusterFS vs. Ceph vs. Virtuozzo Storage

This article outlines the key features and differences of such software-defined storage (SDS) solutions as GlusterFS, Ceph, and Virtuozzo Storage. Its goal is to help you find a suitable storage platform.

Gluster

Let’s start with GlusterFS, which is often used as storage for virtual environments in open-source-based hyper-converged products with SDS. It is also offered by Red Hat alongside Ceph.
GlusterFS employs a stack of translators: services that handle file distribution and other tasks. It also uses services such as Brick, which handles disks, and Volume, which handles pools of bricks. Next, the DHT (distributed hash table) service distributes files into groups based on hashes.
Note: We’ll skip the sharding service due to issues related to it, which are described in linked articles.
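
The hash-based distribution of whole files into groups can be sketched roughly as follows. This is a simplified illustration, not GlusterFS code: the brick names, the use of SHA-256, and the modulo mapping are assumptions made for the example, and each group is assumed to be a pair of bricks on different servers.

```python
import hashlib
from collections import Counter

# Hypothetical layout: three groups, each a pair of bricks on different servers.
REPLICA_GROUPS = [
    ("server1:/brick1", "server2:/brick1"),
    ("server3:/brick1", "server4:/brick1"),
    ("server5:/brick1", "server6:/brick1"),
]


def group_for(filename: str) -> tuple:
    """Pick a group from a hash of the file name (simplified to a modulo)."""
    digest = hashlib.sha256(filename.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(REPLICA_GROUPS)
    return REPLICA_GROUPS[index]


# Placement looks only at the name, so a 1 GiB file and a 500 GiB file
# are treated exactly the same when a group is chosen.
file_sizes_gib = {f"vm-{i}.img": size for i, size in enumerate([1, 1, 1, 500, 1, 1])}

used = Counter()
for name, size_gib in file_sizes_gib.items():
    used[group_for(name)] += size_gib  # the whole file lands on one group

for group in REPLICA_GROUPS:
    print(group, used[group], "GiB")
```

Since each file is stored in one piece, whichever group receives the large image ends up far fuller than the others, regardless of how evenly the hashes spread the file names.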

When a file is written onto GlusterFS storage, it is placed on a brick in one piece and copied to another brick on another server. The next file will be placed on two or more other bricks. This works well if the files are of about the same size and the volume consists of a single group of bricks. Otherwise the following issues may arise:

mt144 Oct 16 2019 at 12:10

Tarantool Data Grid: Architecture and Features

6 min

2.9K

VK corporate blog * Data storage * High performance * Tarantool * Lua *

In 2017, we won the competition for the development of the transaction core for Alfa-Bank's investment business and started working at once. (Vladimir Drynkin, Development Team Lead for Alfa-Bank's Investment Business Transaction Core, spoke about the investment business core at HighLoad++ 2018.) This system was supposed to aggregate transaction data in different formats from various sources, unify the data, save it, and provide access to it.

In the process of development, the system evolved and extended its functions. At some point, we realized that we had created something much more than just application software designed for a well-defined scope of tasks: we had created a system for building distributed applications with persistent storage. Our experience served as a basis for the new product, Tarantool Data Grid (TDG).

I want to talk about TDG architecture and the solutions that we worked out during the development. I will introduce the basic functions and show how our product could become the basis for building turnkey solutions.

NikZanyat Oct 1 2019 at 05:48

Quintet instead of Byte — data storage and retrieval approach

13 min

2.6K

Data storage * Programming * System Analysis and Design * SQL * IT Standards *

A quintet is a way to present atomic pieces of data together with their role in the business area. Quintets can describe any item, and each of them contains complete information about itself and its relations to other quintets. Such a description does not depend on the platform used. Its objective is to simplify data storage and to make the data's presentation clearer.

We will discuss an approach to storing and processing information and share some thoughts on creating a development platform in this new paradigm. What for? To develop faster and in shorter iterations: sketch your project, make sure it is what you thought of, refine it, and then keep refining the result.

A quintet has four properties: type, value, parent, and order among its peers. Together with the identifier, that makes five components. This is the simplest universal form to record information, a new standard that could potentially fit any programming demands. Quintets are stored in a file system with a unified structure, as a continuous, homogeneous, indexed body of data. The quintet data model describes any data structure as a single interconnected list of basic types and terms based on them (metadata), together with instances of objects stored according to this metadata (data).
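
One way to picture the five components is the sketch below. It is only an illustration under assumptions: the field semantics, the `BASE_TYPE` marker, and the sample "Customer"/"Phone" records are hypothetical and not taken from the article's platform.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Quintet:
    id: int                # unique identifier
    type: int              # id of the quintet that defines this item's type
    value: str             # the stored value itself
    parent: Optional[int]  # id of the owning quintet, None for top-level items
    order: int             # position among peers under the same parent


BASE_TYPE = 0  # hypothetical marker meaning "this quintet is a term (metadata)"

store: List[Quintet] = [
    # Metadata: terms that describe the business area.
    Quintet(id=1, type=BASE_TYPE, value="Customer", parent=None, order=1),
    Quintet(id=2, type=BASE_TYPE, value="Phone", parent=1, order=1),
    # Data: instances stored according to that metadata.
    Quintet(id=10, type=1, value="Acme Ltd", parent=None, order=1),
    Quintet(id=12, type=2, value="+1 555 0199", parent=10, order=2),
    Quintet(id=11, type=2, value="+1 555 0100", parent=10, order=1),
]


def children(quintets: List[Quintet], parent_id: int) -> List[Quintet]:
    """Return the quintets owned by parent_id, sorted by their order field."""
    owned = [q for q in quintets if q.parent == parent_id]
    return sorted(owned, key=lambda q: q.order)


for q in children(store, 10):
    print(q.order, q.value)
```

Metadata and data live in the same flat list here; only the `type` and `parent` references distinguish a term from an instance.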

adam4leos Sep 10 2019 at 12:05

Bypassing LinkedIn Search Limit by Playing With API

7 min

18K

API * Data storage * JavaScript * Reverse engineering * Social networks and communities

Translation

[Because my extension got a lot of attention from the foreign audience, I translated my original article into English].

Limit

LinkedIn is a top-rated professional network, but unfortunately, free accounts are subject to a restriction called the Commercial Use Limit (CUL). Most likely you, like me until recently, have never encountered or even heard of it.

The point of the CUL is that when you search for people outside your connections/network too often, your search results are limited to only 3 profiles instead of 1000 (100 pages with 10 profiles per page by default). Nobody knows how ‘often’ is measured; there are no precise metrics. The algorithm decides based on your actions: how frequently you’ve been searching and how many connections you’ve been adding. The free CUL resets at midnight PST on the 1st of each calendar month, and you get your 1000 search results again, for who knows how long. Of course, Premium accounts have no such limit.

However, not long ago I started messing around with LinkedIn search for a pet project and suddenly hit the CUL. Obviously, I didn’t like that much; after all, I hadn’t been using the search for any commercial purposes. So my first thought was to explore this limit and try to bypass it.

[Important clarification — all source materials in this article are presented solely for informational and educational purposes. The author doesn't encourage their use for commercial purposes.]

msgeek Jul 11 2019 at 07:00

GitHub Package Registry will support Swift packages

1 min

Microsoft corporate blog * Git * GitHub * Swift * Data storage *

On May 10, we announced the limited beta of GitHub Package Registry, a package management service that makes it easy to publish public or private packages next to your source code. It currently supports familiar package management tools: JavaScript (npm), Java (Maven), Ruby (RubyGems), .NET (NuGet), and Docker images, with more to come.

Today we’re excited to announce that we’ll be adding support for Swift packages to GitHub Package Registry. Swift packages make it easy to share your libraries and source code across your projects and with the Swift community.

gerold103 Mar 7 2019 at 06:57

VShard — horizontal scaling in Tarantool

14 min

3.1K

VK corporate blog * Programming * Lua * Tarantool * Data storage *

Hi, my name is Vladislav, and I am a member of the Tarantool development team. Tarantool is a DBMS and an application server all in one. Today I am going to tell the story of how we implemented horizontal scaling in Tarantool by means of the VShard module.

Some basic knowledge first.

There are two types of scaling: horizontal and vertical. And there are two types of horizontal scaling: replication and sharding. Replication ensures computational scaling whereas sharding is used for data scaling.

Sharding is also subdivided into two types: range-based sharding and hash-based sharding.

Range-based sharding implies that some shard key is computed for each cluster record. The shard keys are projected onto a straight line that is separated into ranges and allocated to different physical nodes.

Hash-based sharding is less complicated: a hash value is calculated for each record in a cluster; records with the same hash value are allocated to the same physical node.

I will focus on horizontal scaling using hash-based sharding.
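
The two sharding types described above can be sketched side by side. This is a generic illustration, not the VShard implementation: the node names, range boundaries, and the CRC32-modulo mapping are assumptions made for the example.

```python
import bisect
import zlib

NODES = ["node-a", "node-b", "node-c"]

# Range-based: boundaries that split the line of shard keys into ranges.
# Keys below 1000 go to node-a, keys below 2000 to node-b, the rest to node-c.
RANGE_BOUNDS = [1000, 2000]


def range_shard(shard_key: int) -> str:
    """Find the range the shard key falls into and return its node."""
    return NODES[bisect.bisect_right(RANGE_BOUNDS, shard_key)]


def hash_shard(record_key: str) -> str:
    """Hash the record key and map the hash value onto a node."""
    return NODES[zlib.crc32(record_key.encode()) % len(NODES)]


for key in (42, 1000, 1999, 5000):
    print("range", key, "->", range_shard(key))

for key in ("user:1", "user:2", "user:3"):
    print("hash ", key, "->", hash_shard(key))
```

Range-based sharding keeps neighbouring keys on the same node, which requires maintaining the boundaries; hash-based sharding needs only the hash function and the node count.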
