Dual Reliability Requests


We are requesting 20,000,000 euro to place 900 orders on the exchange. What could go wrong?

Today, I will explain how to avoid losing billions in client money when executing large transactions on the exchange. This discussion focuses on an often overlooked and largely invisible problem that arises when handling large requests, particularly HTTP requests, which may not be fully executed. It's surprising how little attention is given to this issue and how few tools there are to address it.

Our task was to implement large-scale management of exchange orders, not just within a single exchange but globally, and to ensure it operates reliably. In this story, you'll encounter clients, servers, and cats. Stories are always more enjoyable with cats.

Request Variants

On Requests

First, let's define the existing request types, from the client's perspective.

Let's start with GET requests. These are usually straightforward. Typically, something obtained once can be requested again. In very rare cases, there are special links that can be requested only once—they contain unique recovery codes or similar items. However, because of search robots and other data-collection systems, such links are usually limited by lifetime rather than by the number of visits. As a result, GET requests are generally quite safe and stable, so advanced reliability measures are not particularly necessary here. If a client sends a GET request and something goes wrong, they can simply send the request again.

However, there are requests that modify our data, usually POST, PUT, PATCH, and DELETE requests. Resending these requests can cause problems. What if it results in the repeated purchase of shares on the stock exchange? That's definitely not what we wanted.

Nuances of DELETE

The risks of changes are clear, but why are deletions dangerous? It might seem that DELETE requests are harmless if accidentally sent twice, but there are dangers to consider.

For example, consider task queues. Everything is fine if we need to delete a task using a specific, unique number. However, if the client sends a request to delete the first element of the queue and something goes wrong, causing the request to be sent again, there's a chance we will delete two tasks instead of one. If the task is financially sensitive, this is unacceptable. This is why DELETE requests are also potentially dangerous if executed twice.
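
A minimal sketch of the difference, using an in-memory queue and hypothetical task IDs: deleting "the first element" is not safe to repeat, while deleting a specific, unique ID is.

```python
from collections import deque

queue = deque(["task-17", "task-18", "task-19"])  # hypothetical task IDs

def delete_first():
    """Non-idempotent: each call removes whichever task is currently first."""
    return queue.popleft()

def delete_by_id(task_id: str) -> bool:
    """Idempotent: deleting the same ID twice removes at most one task."""
    try:
        queue.remove(task_id)
        return True
    except ValueError:
        return False          # already gone, nothing else is touched

delete_first()                # removes task-17
delete_first()                # an accidental retry removes task-18 as well

delete_by_id("task-19")       # removes task-19
delete_by_id("task-19")       # a retry is harmless
```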

On Data-Changing Requests

There are many nuances, even if the client provides an ID.

Let's revisit the task queue example and consider editing a task. How can double editing cause issues? It seems that the task ID should protect us—editing the same task twice should yield the same result.

Unfortunately, it's not that simple. The server may not only change the data but also perform additional actions. While double logging might not be a concern, actions that alter behaviour are problematic. What if one of the fields is the task status and we have business logic prepared for status changes? The code will run twice.

We could prevent code execution if the status doesn't change, and that would work fine until we encounter a race condition. The real trouble begins when a third request from another client manages to slip in between our two requests and changes the data to a different value. If our business logic involves moving tasks within a queue, this can cause significant issues.
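
Here is a small, purely illustrative simulation of that race. The on_status_change hook is a hypothetical stand-in for the real business logic, and the interleaving is simulated in-process rather than over the network.

```python
# Hypothetical in-memory task store; in reality this is a shared database row.
task = {"id": 42, "status": "open"}
side_effects = []

def on_status_change(new_status: str) -> None:
    # Business logic that must run exactly once per real change,
    # e.g. moving the task to another queue.
    side_effects.append(new_status)

def edit_status(new_status: str) -> None:
    # The naive guard: skip the hook if the status is already the same.
    if task["status"] != new_status:
        task["status"] = new_status
        on_status_change(new_status)

edit_status("done")   # original request: guard passes, hook runs
edit_status("open")   # a third party's request slips in between
edit_status("done")   # the retry of the original request: guard passes again

print(side_effects)   # ['done', 'open', 'done'] - the "done" logic ran twice
```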

In short, we can devise workarounds for a while, but the only truly safe solution is to eliminate the possibility of erroneous double execution entirely. We need to ensure that the problem of double execution is resolved at its core.

Error Variants

Two Out of Three

Let's talk about error scenarios from the client's perspective.

There are two main types of errors to consider:

  1. Client-Side Error: The client sends a request, but it doesn't go through due to network drops, packet corruption, or excessive data traffic.

  2. Server-Side Error: The request reaches the server, but an internal error or similar issue prevents it from being processed.

At first glance, these errors don't seem problematic. If the client couldn't send the request, they simply send it again. Network issues are common and usually resolved with a retry. If the request reaches the server but isn't processed, the client receives an error response and can resend the request once the server is operational. The general concept is straightforward: either the request didn't reach the server, or it wasn't processed, so no double requests should occur.

However, the devil is in the details.

Details

The main pain point lies in a subtle, often unnoticed step that can drive those unfamiliar with such errors crazy. These errors are rare and obvious in hindsight, but they usually hide among other issues.

Here's what can happen: the client sends a request that successfully reaches the server. The server processes the request and sends a response, but the response doesn't reach the client.

As a result, the client receives a network error, assumes the request wasn't processed, and sends it again. After all, it's just a network issue, what could go wrong? But here's the catch.

The request might be entirely valid, and according to business logic, we might be allowed to send requests twice with the same parameters. This was the case in the example of deleting the first task from a queue. The request parameters don't change, but we only wanted to delete one task.

How do we distinguish between a repeated request due to an error and a repeated request for a separate action? 

On Out-of-the-Box Solutions

Tools to solve this problem are almost nonexistent in web frameworks and network request libraries. Can you think of a mechanism that protects against this? It's a very painful error, though rare; too rare to matter for most websites.

In the "ordinary internet," this might be something to overlook. Sending two cat pictures to a social media feed instead of one isn't a fatal problem. However, buying shares twice for a significant amount of money is a serious issue that demands attention.

Solutions

The good news is that there is a solution to these problems, and it has been thoroughly tested for use in industrial software. This solution consists of two parts: request uniquification and response repetition for the client. Let's break down both parts of the solution.

Uniquification

Request uniquification means that each request must be unique within a single interaction. An interaction is considered complete when a response is received, indicating that the request has been processed.

It doesn't matter if an error occurred; what matters is that the server explicitly responded, signaling the completion of the interaction. Network errors, or 500 errors with an unknown result, do not count as complete. However, 4xx errors, or 5xx errors indicating the request was processed but the server didn't take any action, do count.

This is crucial because if the interaction isn't complete, the request must be resent until a result is received. Alternatively, we may acknowledge that we are in an indeterminate state, handle it differently and stop the requests.

The easiest solution is to include a UUID in each request as a parameter or header. We can always track the outcome by tagging the requests with a UUID. If the server receives a request with an already processed UUID, it doesn't process it again but sends a cached response. This brings us to the second part of the solution—response repetition.
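
As a rough illustration, here is how a client might attach such a UUID, either as a header or as a body field. The endpoint, payload, header name, and the use of the third-party requests library are all assumptions.

```python
import uuid
import requests  # assumes the requests library; URL and payload are illustrative

request_id = str(uuid.uuid4())   # generated once per logical action, reused on every retry
payload = {"symbol": "AAPL", "quantity": 100}

# Variant 1: the ID travels in a header.
resp = requests.post("https://api.example.com/orders",
                     json=payload,
                     headers={"X-Request-ID": request_id},
                     timeout=5)

# Variant 2: the ID travels in the body, in case intermediaries drop headers.
resp = requests.post("https://api.example.com/orders",
                     json={"request_id": request_id, **payload},
                     timeout=5)
```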

Response Repetition

Response repetition means that if a request with the same UUID arrives within the same interaction (the same quantum of exchange), we don't send an error response. Instead, we recognize that the previous response wasn't delivered and resend it.

This approach is very useful. It spares the client a confusing error in the situation where the data actually reached the server and was processed, but the result never made it back. Avoiding such errors significantly improves the interaction.

We can log the repeated request or send a metric, for instance, to alert support engineers if a large number of repeated requests suddenly start hitting the server. This could indicate a network issue, but the client transparently receives the response, and we continue as normal.

It's not necessary to store responses for repeated requests indefinitely. Depending on the situation, a business-case-specific timeout can be defined, ranging from 10 minutes to an hour or even a day. The general idea is to store responses for a certain period. If the client resends the same ID within this timeframe, we return the stored response, even if it happens multiple times.
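
A minimal, framework-agnostic sketch of such a response cache, kept in memory; the process() function and the 10-minute TTL are placeholders for the real business logic and the business-case-specific timeout.

```python
import time

RESPONSE_TTL = 600           # e.g. 10 minutes; choose per business case
_response_cache = {}         # request_id -> (expires_at, response)

def handle(request_id: str, payload: dict) -> dict:
    now = time.monotonic()
    cached = _response_cache.get(request_id)
    if cached and cached[0] > now:
        return cached[1]               # repeated request: resend the stored response

    response = process(payload)        # the actual business logic (hypothetical)
    _response_cache[request_id] = (now + RESPONSE_TTL, response)
    # In a real system, expired entries would be evicted by a background job.
    return response

def process(payload: dict) -> dict:
    return {"status": "ok", "echo": payload}
```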

Layers and Security

Bonus: two-layer storage. It's possible to implement a dual-cache system where the UUID storage duration is longer than that of the response.

In this setup, we can promptly respond to client requests within a defined timeframe. If a request arrives after the response has been purged from the cache but before the UUID is deleted, we notify the client of an invalid request. It's also a good idea to notify the security team in such cases.

This situation could indicate a bug or a potential Man-In-The-Middle (MITM) attack. Such behaviour may be considered suspicious—someone might have intercepted the data and attempted to reuse it, leveraging sessions or other methods. This can serve as an additional layer in the security system.
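
A sketch of the two-layer variant, again with in-memory storage and made-up TTLs: responses live for ten minutes, IDs for a day, and a request whose ID is still known but whose response is already gone triggers an alert.

```python
import time

RESPONSE_TTL = 600        # responses are replayable for 10 minutes (assumption)
ID_TTL = 86_400           # IDs are remembered for a day (assumption)

_responses = {}           # request_id -> (expires_at, response)
_seen_ids = {}            # request_id -> expires_at

def handle(request_id: str, payload: dict) -> dict:
    now = time.monotonic()

    entry = _responses.get(request_id)
    if entry and entry[0] > now:
        return entry[1]                           # normal replay of a stored response

    if _seen_ids.get(request_id, 0) > now:
        alert_security(request_id)                # response already purged: suspicious
        return {"error": "invalid request"}

    response = process(payload)                   # hypothetical business logic
    _responses[request_id] = (now + RESPONSE_TTL, response)
    _seen_ids[request_id] = now + ID_TTL
    return response

def process(payload: dict) -> dict:
    return {"status": "ok"}

def alert_security(request_id: str) -> None:
    print(f"possible replay of {request_id}")     # stand-in for a real alert
```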

Additional Variants

UNIX-time

There are alternative solutions to this problem. Sometimes, instead of using UUIDs, clients send requests with a timestamp—the number of seconds or milliseconds since the Unix epoch. This approach uses fewer bytes of data than UUIDs.

One drawback is the potential for collisions if requests are sent too frequently. However, this can be mitigated if the server has rate limiting and other load protection systems to control the number of requests per second.
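
A tiny sketch of this scheme; millisecond resolution is an assumption, and the example deliberately shows how quickly collisions appear when requests are generated back to back.

```python
import time

def timestamp_id() -> int:
    # Milliseconds since the Unix epoch: 8 bytes versus 16 for a UUID.
    return time.time_ns() // 1_000_000

a = timestamp_id()
b = timestamp_id()
# Two requests generated within the same millisecond collide,
# which is why server-side rate limiting matters with this scheme.
print(a, b, a == b)
```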

Cyclic Overflow

Another approach is to use an incrementing number that resets periodically to zero, which can further reduce data traffic. However, this method requires the client to keep track of what has already been sent using a counter.

This option is highly economical and can save a significant number of bytes, especially when dealing with protocols other than HTTP. However, it comes with drawbacks—it requires additional control. The potential for collisions must be monitored throughout execution and across the flow of requests. The smaller the number range before it resets, the greater the risk of collisions.
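
A minimal sketch of a wrapping counter on the client side; the modulus is deliberately tiny here so the wrap-around, and therefore the collision risk, is visible.

```python
class CyclicRequestId:
    """Client-side counter that wraps around; the modulus is an assumption."""

    def __init__(self, modulus: int = 2**16):
        self.modulus = modulus
        self.counter = 0

    def next_id(self) -> int:
        current = self.counter
        self.counter = (self.counter + 1) % self.modulus
        return current

ids = CyclicRequestId(modulus=8)             # tiny modulus to show the wrap-around
print([ids.next_id() for _ in range(10)])    # 0..7, then 0, 1 again: collision risk
```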

Transactional Integrity

Sometimes, the request ID also determines the sequence of operations. A request may not be executed immediately and may travel through different channels with varying delivery times. Alternatively, it might sit for some time in a buffer or pool of incoming requests. When execution finally happens, requests are ordered by their sequence number and processed sequentially.

In large distributed systems, we can ensure unique execution by rejecting requests with the same ID and allowing only the first or last identical request to proceed. While transactionality is the primary concern in such systems, addressing the issue of duplicate requests becomes an added benefit.
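
One possible shape of such a buffer, sketched with a min-heap: requests execute strictly in sequence order, and a second copy of an already-seen sequence number is simply dropped. The class and field names are illustrative, not a real system's API.

```python
import heapq

class OrderedExecutor:
    """Buffers requests, executes them strictly in sequence order,
    and drops duplicates of already-seen sequence numbers."""

    def __init__(self):
        self.next_seq = 0
        self.buffer = []          # min-heap of (seq, payload)
        self.seen = set()

    def submit(self, seq: int, payload: str) -> None:
        if seq in self.seen or seq < self.next_seq:
            return                # duplicate ID: only the first copy is executed
        self.seen.add(seq)
        heapq.heappush(self.buffer, (seq, payload))
        while self.buffer and self.buffer[0][0] == self.next_seq:
            _, ready = heapq.heappop(self.buffer)
            print("executing", ready)
            self.next_seq += 1

ex = OrderedExecutor()
ex.submit(1, "sell")      # buffered: waiting for seq 0
ex.submit(0, "buy")       # executes 0, then 1
ex.submit(1, "sell")      # duplicate, ignored
```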

Without Caches?

We might consider a solution that omits response caching. In this approach, we send the request, receive confirmation from the server that it accepted the request, and then wait for the server to acknowledge that it received our acknowledgement of the successful request response... However, no matter how many such clarifying requests we send, this method will not work, because the last message can always be lost and leave one side in an uncertain state. Therefore, caches are definitely necessary.

On the Client Side

By the way, we can also apply our approach to the client side, whether it's a website, mobile app, or desktop application. One potential issue is handling unreliable backends that may inadvertently send duplicate data, such as notifications. While we may not have control over the sending process, we can manage how the client receives and processes this data. The root causes for these duplicates can often be traced to issues like loss of information about successful delivery somewhere along the way.
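
A possible client-side deduplicator for such notifications, assuming each notification carries an id field; the cache size is arbitrary.

```python
from collections import OrderedDict

class NotificationDeduplicator:
    """Remembers the IDs of recently handled notifications and drops repeats."""

    def __init__(self, max_remembered: int = 1000):
        self.max_remembered = max_remembered
        self.seen = OrderedDict()              # notification_id -> None, in arrival order

    def should_show(self, notification: dict) -> bool:
        nid = notification["id"]
        if nid in self.seen:
            return False                       # backend resent it; ignore silently
        self.seen[nid] = None
        if len(self.seen) > self.max_remembered:
            self.seen.popitem(last=False)      # forget the oldest ID
        return True

dedup = NotificationDeduplicator()
print(dedup.should_show({"id": "n-1", "text": "Order filled"}))  # True
print(dedup.should_show({"id": "n-1", "text": "Order filled"}))  # False (duplicate)
```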

Standards

Within HTTP, a de facto standard header called X-Request-ID is used for request identification. This header can be included in both requests and responses. However, it's also common to include an ID within the request body. This flexibility is useful when not working exclusively over HTTP or when business logic revolves around reading the request body alone.

Proxies might sometimes lose headers during transmission or overwrite our custom headers. The choice of working with IDs should align with the specific requirements of the business case at hand.

Frameworks

The solution might appear straightforward—use built-in framework features or an external library. However, it's not that simple.

Not all frameworks support such capabilities, and libraries addressing this issue are not widely adopted. Often, these libraries only handle passing the parameter in headers, without addressing the broader context of sending requests from the client, processing them on the server, and managing response caching in a database. This complexity makes it challenging for a single solution to effectively address all aspects, leading to a lack of comprehensive out-of-the-box solutions in many cases.

Despite these challenges, there is good news. If you are looking for an opportunity to contribute to open-source development in this area, it's a relatively open niche. Your solution could potentially become the standard. Currently, the field is chaotic, the problem remains unsolved, and the consequences are costly.

To summarise

Bringing Everything Together

From the Client's Perspective:

At its core, we attach a unique request ID to each request. Upon receiving confirmation that the request has been accepted or processed, we proceed with our logic, marking the interaction as complete. If no response arrives or if the response indicates an incomplete interaction, we retry. If there is no response for a long time, we notify the user or handle it programmatically as an indeterminate result.
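
Putting the client side together, a hedged sketch might look like this; the use of the requests library, the retry count, and the decision to retry on a bare 500 are all assumptions to be adjusted per business case.

```python
import uuid
import requests   # assumes the requests library; URL and payload are illustrative

class IndeterminateResult(Exception):
    """Raised when we never got an explicit answer and must escalate."""

def send_reliably(url: str, payload: dict, attempts: int = 5):
    request_id = str(uuid.uuid4())              # one ID per logical action
    headers = {"X-Request-ID": request_id}

    for _ in range(attempts):
        try:
            response = requests.post(url, json=payload, headers=headers, timeout=5)
        except requests.RequestException:
            continue                            # no answer at all: retry with the SAME ID
        if response.status_code == 500:
            continue                            # unknown outcome: retry with the SAME ID
        return response                         # explicit answer: interaction complete

    # Still nothing definite: surface it to the user or to monitoring.
    raise IndeterminateResult(f"request {request_id} may or may not have been executed")
```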

From the Server's Perspective:

On the server side, we validate the request ID. If it's new, we execute the request, store the response in the cache with this ID as the key, and send the response back. If a request with an existing ID is received, we return the cached response if it is ready. If not, we queue the request and respond once processing is finished. We respond only to the last request with the same ID, treating earlier requests as lost. After a safe period, old cached responses are deleted.
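
A simplified server-side sketch of this flow using threads and in-memory dictionaries. Unlike the description above, it answers every waiting duplicate rather than only the last one, and response expiry is left out for brevity.

```python
import threading

_lock = threading.Lock()
_results = {}    # request_id -> stored response, once processing has finished
_pending = {}    # request_id -> threading.Event, while processing is in flight

def handle(request_id: str, payload: dict) -> dict:
    with _lock:
        if request_id in _results:
            return _results[request_id]          # known ID, response ready: replay it
        if request_id in _pending:
            event = _pending[request_id]         # known ID, still being processed
        else:
            event = None
            _pending[request_id] = threading.Event()   # new ID: we will process it

    if event is not None:
        event.wait()                             # answer once the first copy finishes
        return _results[request_id]

    response = process(payload)                  # hypothetical business logic
    with _lock:
        _results[request_id] = response          # in a real system, expire after a safe period
        _pending.pop(request_id).set()
    return response

def process(payload: dict) -> dict:
    return {"status": "executed", "echo": payload}
```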

And there you have it—all issues are resolved, eliminating the threat of double requests caused by lost responses. Success!
