Monitoring systems are a vital tool for any system administrator, because they can be used to extract specific information from services, such that:
IT Infrastructure *
Infocenters + databases + communication systems
Network architecture at hyperscalers is a subject to constant innovation and is ever evolving to meet the demand. Network operators are constantly experimenting with solutions and finding new ways to keep it reliable and cost effective. Hyperscalers are periodically publishing their findings and innovations in a variety of scientific and technical groups.
The purpose of this article is to summarize the information about how hyperscalers design and manage networks. The goal here is to help connecting the dots, dissect and digest the data from a variety of sources including my personal experience working with hyperscalers.
DISCLAIMER: All information in this article is acquired from public resources. This article contains my own opinion which might not match and does not represent the opinion of my employer.
Microservices Architecture is a well-known pattern for building a complex system that consists of loosely coupled modules. It provides better scalability, and it is easier to develop a system in multiple teams so that they don’t interfere with each other too much. However, it is important to choose the right way of communication between the services. Otherwise, this kind of architecture can do more harm than good.
The PVS-Studio website turns 15 this year. This is quite significant for any internet resource. Back then, when our website was born, Russia announced 2006 as a year of humanities. That same year, in June, Denis Kryuchkov established a new platform, "Habrhabr" (now known as Habr). In November, Microsoft officially completed OS Windows Vista. That same month we registered the viva64.com domain.
We celebrated our domain's 10th anniversary with the website's redesign. After that, we would only change the resource capacity and features, but we'd never touch the design in any way. During this time, the number of articles grew so much that we needed to add tags to facilitate navigation. Right now we are also working on our YouTube channel. This means, you will see more and more new videos on our website as well. We keep adding new web pages at a tremendous rate, while the website's usability stays the same.
Time has come for big changes!
I have already written about AIOps and machine learning methods in working with IT incidents, about hybrid umbrella monitoring and various approaches to service management. Now I would like to share a very specific algorithm, how one can quickly get information about functioning conditions of business applications using synthetic monitoring and how to build, on this basis, the health metric of business services at no special cost. The story is based on a real case of implementing the algorithm into the IT system of one of the airlines.
Currently there are many APM systems, such as Appdynamics, Dynatrace, and others, having a UX control module inside that uses synthetic checks. And if the task is to learn about failures quicker than customers, I will tell you why all these APM systems are not needed. Also, nowadays health metrics are a fashionable feature of APM and I will show how you can build them without APM.
The year 2021 started on such a high note for Qrator Labs: on January 19, our company celebrated its 10th anniversary. Shortly after, in February, our network mitigated quite an impressive 750 Gbps DDoS attack based on old and well known DNS amplification. Furthermore, there is a constant flow of BGP incidents; some are becoming global routing anomalies. We started reporting in our newly made Twitter account for Qrator.Radar.
Nevertheless, with the first quarter of the year being over, we can take a closer look at DDoS attacks statistics and BGP incidents for January - March 2021.
In this article, we benchmark the performance of MySQL 8 default configuration vs. innodb_dedicated_server enabled configuration vs. the configuration recommended by MySQL Performance Tuning Service.
Types of smart traffic lights: adaptive and neural networks
Adaptive works at relatively simple intersections, where the rules and possibilities for switching phases are quite obvious. Adaptive management is only applicable where there is no constant loading in all directions, otherwise it simply has nothing to adapt to – there are no free time windows. The first adaptive control intersections appeared in the United States in the early 70s of the last century. Unfortunately, they have reached Russia only now, their number according to some estimates does not exceed 3,000 in the country.
Neural networks – a higher level of traffic regulation. They take into account a lot of factors at once, which are not even always obvious. Their result is based on self-learning: the computer receives live data on the bandwidth and selects the maximum value by all possible algorithms, so that in total, as many vehicles as possible pass from all sides in a comfortable mode per unit of time. How this is done, usually programmers answer – we do not know, the neural network is a black box, but we will reveal the basic principles to you…
Adaptive traffic lights use, at least, leading companies in Russia, rather outdated technology for counting vehicles at intersections: physical sensors or video background detector. A capacitive sensor or an induction loop only sees the vehicle at the installation site-for a few meters, unless of course you spend millions on laying them along the entire length of the roadway. The video background detector shows only the filling of the roadway with vehicles relative to this roadway. The camera should clearly see this area, which is quite difficult at a long distance due to the perspective and is highly susceptible to atmospheric interference: even a light snowstorm will be diagnosed as the presence of traffic – the background video detector does not distinguish the type of detection.
By the beginning of 2021, Qrator Labs filtering network expands to 14 scrubbing centers and a total of 3 Tbps filtering bandwidth capacity, with the San Paolo scrubbing facility fully operational in early 2021;
New partner services fully integrated into Qrator Labs infrastructure and customer dashboard throughout 2020: SolidWall WAF and RuGeeks CDN;
Upgraded filtering logic allows Qrator Labs to serve even bigger infrastructures with full-scale cybersecurity protection and DDoS attacks mitigation;
The newest AMD processors are now widely used by Qrator Labs in packet processing.
DDoS attacks were on the rise during 2020, with the most relentless attacks described as short and overwhelmingly intensive.
However, BGP incidents were an area where it was evident that some change was and still is needed, as there was a significant amount of devastating hijacks and route leaks.
In 2020, we began providing our services in Singapore under a new partnership and opened a new scrubbing center in Dubai, where our fully functioning branch is staffed by the best professionals to serve local customers.
What is HDB++?
This is a TANGO archiving system, allows you to save data received from devices in the TANGO system.
Working with Linux will be described here (TangoBox 9.3 on base Ubuntu 18.04), this is a ready-made system where everything is configured.
What is the article about?
- System architecture.
- How to set up archiving.
It took me ~ 2 weeks to understand the architecture and write my own scripts for python for this case.
What is it for?
Allows you to store the history of the readings of your equipment.
- You don't need to think about how to store data in the database.
- You just need to specify which attributes to archive from which equipment.
Before we start, I'd like to get on the same page with you. So, could you please answer? How much time will it take to:
- Create a new environment for testing?
- Update java & OS in the docker image?
- Grant access to servers?
It will take longer than you expect. I will explain why.
The National Internet Segment Reliability Research explains how the outage of a single Autonomous System might affect the connectivity of the impacted region with the rest of the world. Most of the time, the most critical AS in the region is the dominant ISP on the market, but not always.
As the number of alternate routes between AS’s increases (and do not forget that the Internet stands for “interconnected network” — and each network is an AS), so does the fault-tolerance and stability of the Internet across the globe. Although some paths are from the beginning more important than others, establishing as many alternate routes as possible is the only viable way to ensure an adequately robust network.
The global connectivity of any given AS, regardless of whether it is an international giant or regional player, depends on the quantity and quality of its path to Tier-1 ISPs.
Usually, Tier-1 implies an international company offering global IP transit service over connections with other Tier-1 providers. Nevertheless, there is no guarantee that such connectivity will be maintained all the time. For many ISPs at all “tiers”, losing connection to just one Tier-1 peer would likely render them unreachable from some parts of the world.
There would be no TL;DR in this article, sorry.
Those have been three months that genuinely changed the world. An entire lifeline passed from February, 1, when the coronavirus pandemics just started to spread outside of China and European countries were about to react, to April, 30, when nations were locked down in quarantine measures almost all over the entire world. We want to take a look at the repercussions, cyclic nature of the reaction and, of course, provide DDoS attacks and BGP incidents overview on a timeframe of three months.
In general, there seems to be an objective pattern in almost every country’s shift into the quarantine lockdown.
Fast SSDs are getting cheaper every year, but they are still smaller and more expensive than traditional HDD drives. But HDDs have much higher latency and are easily saturated. However, we want to achieve low latency for the storage system, and a high capacity too. There’s a well-known practice of optimizing performance for big and slow devices — caching. As most of the data on a disk is not accessed most of the time but some percentage of it is accessed frequently, we can achieve a higher quality of service by using a small cache.
Server hardware and operating systems have a lot of caches working on different levels. Linux has a page cache for block devices, a dirent cache and an inode cache on the filesystem layer. Disks have their own cache inside. CPUs have caches. So, why not add one more persistent cache layer for a slow disk?
After the second commit, each code becomes legacy. It happens because the original ideas do not meet actual requirements for the system. It is not bad or good thing. It is the nature of infrastructure & agreements between people. Refactoring should align requirements & actual state. Let me call it Infrastructure as Code refactoring.
Those big players come into play where there requires team collaboration. The big players are built on a server-side database where the messages shared from one device to another is stored in their server database. Ultimately, this results in storing a huge amount of data within the server-side database (Cloud-database).
The consumption of cloud storage will be pretty high. The client-side database is more efficient where the messages relayed is stored in the client device. The messages will be queued to minimize the consumption of data in the server.
Today, Microsoft and partners across 35 countries took coordinated legal and technical steps to disrupt one of the world’s most prolific botnets, called Necurs, which has infected more than nine million computers globally. This disruption is the result of eight years of tracking and planning and will help ensure the criminals behind this network are no longer able to use key elements of its infrastructure to execute cyberattacks.
A botnet is a network of computers that a cybercriminal has infected with malicious software, or malware. Once infected, criminals can control those computers remotely and use them to commit crimes. Microsoft’s Digital Crimes Unit, BitSight and others in the security community first observed the Necurs botnet in 2012 and have seen it distribute several forms of malware, including the GameOver Zeus banking trojan.
This is small cross-platform linux-distro with zabbix server. It's a simple way to deploy powerful monitoring system on ARM platfornms and x86_64.
Worked as firmware (non-changeable systemd image with config files), have web-interface for system management like network settings, password and other.
Who is interested
- System admins/engineers who need to fast deploy of zabbix server.
- Everyone, who want to deploy zabbix on ARM.