Active Restore: Can we Recover Faster? Much Faster?
Hi! My name is Daulet Tymbayev, and today I want to share my experience of developing a system that (theoretically) is able to recover a disk much faster than traditional recovery. Let’s start from the beginning to cover all the project stages.
Before joining Acronis I pursued my master's degree at Innopolis University (MSIT-SE program). Innopolis is a relatively new university and the MSIT-SE program is even newer. Nevertheless, it is built upon Carnegie Mellon University programs, and therefore it includes such things as industrial projects.
The ultimate goal of an industrial project is to involve students in a real software development process and let them put their newly acquired theoretical knowledge into practice. To do so, the university partners with companies such as Yandex, Acronis, MTS and dozens of others (as of 2018, the university had 144 partners). Companies propose their projects to the university, and students choose one of them according to their interests and technical skill level.
Two years ago, I was on “the other side”. I was a student working on another Acronis project, and last year I was appointed technical consultant to a new team of students. I presented the Active Restore project to the university. The idea behind Active Restore came from the Kernel team at Acronis, but development started together with Innopolis University.
Why do we need Active Restore?
The traditional recovery process goes as follows: after a problem compromises the computer, the user opens the backup system interface and clicks «the emergency button» to restore to a saved state. Then, after N minutes, the system is ready to resume working.
As you can see, N has a direct impact on the business. This value corresponds to the recovery time objective (from now on, RTO), and it depends on several factors, such as connection speed (if the selected solution implements cloud recovery), hard drive bandwidth, the size of the recovered files and many others.
But the traditional approach restores everything without prioritizing files. If we do prioritize, we can first restore only the files needed to boot the system, and bring back other things, such as pictures and videos, later.
The operating system expects to start up with a fully ready drive and performs a series of checks to ensure drive consistency. If one or more system files are absent or corrupted, it simply won't boot. To solve this problem, we decided to put file redirectors on the disk to stand in for the absent or corrupted files. These file redirectors are empty, so creating them takes very little time.
The recovery continues in the background. While the operating system works, “empty” files are filled with data. The background process considers the disk load and does not exceed preset limits. But the user or the OS can request some not-yet-recovered file. In this case, we launch the second recovery mode. The priority of the requested file is raised to the maximum and the recovery system transfers it to the disk as fast as possible. This way the OS gets the needed file, but with latency.
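The two recovery modes described above can be modeled as a priority queue in which an on-demand request moves a file to the front. This is only a minimal sketch, not the actual service code: the class `RestoreQueue`, its methods and the two priority constants are hypothetical names, and the disk load limits mentioned above are left out.

```python
import heapq
import itertools

BACKGROUND = 1   # assumed default priority for background recovery
ON_DEMAND = 0    # assumed top priority: the OS or user is waiting


class RestoreQueue:
    """Orders files for recovery; a lower number is restored sooner."""

    def __init__(self, paths):
        self._counter = itertools.count()   # tie-breaker keeps FIFO order
        self._heap = []
        self._priority = {}                 # path -> current priority
        for path in paths:
            self._push(path, BACKGROUND)

    def _push(self, path, priority):
        self._priority[path] = priority
        heapq.heappush(self._heap, (priority, next(self._counter), path))

    def request(self, path):
        """Raise the priority of a file that is not restored yet."""
        if path in self._priority and self._priority[path] != ON_DEMAND:
            self._push(path, ON_DEMAND)     # the old heap entry goes stale

    def next_file(self):
        """Return the next file to restore, or None when nothing is left."""
        while self._heap:
            priority, _, path = heapq.heappop(self._heap)
            # Skip stale entries left behind by a priority bump.
            if self._priority.get(path) == priority:
                del self._priority[path]
                return path
        return None


queue = RestoreQueue(["boot.sys", "photo.jpg", "video.mp4"])
queue.request("video.mp4")                  # the user opens a video early
order = [queue.next_file() for _ in range(3)]
```

Here `order` starts with `video.mp4`, followed by the remaining files in their original background order.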
That’s the ideal situation. In the real world, there are a lot of problems and potential deadlocks. Together with Innopolis undergraduate students, we decided to research this recovery scenario, evaluate the RTO advantages and find out whether such an approach is possible at all. At that time, there were no such solutions on the market.
I decided to leave service development to the Innopolis students. At Acronis, the Windows Kernel team started developing a mini-filter file system driver. We had a plan:
- Launch a driver at the early OS startup stage;
- Launch a service when userspace is ready;
- The service processes driver requests and coordinates the further recovery operation.
Driver construction details
My colleagues will describe the service in detail in the next post. In this post, we will disclose some details about our driver development. Our mini-filter driver has two operation modes: one for when the system starts up in a normal state, and one for when there was a fault and a recovery has been launched. Before user-space libraries and applications (and our service) are loaded, the driver acts the same way in either situation, because it cannot yet know which state the system is in. That’s why every create, read and write operation is logged with all its metadata. When the service goes online, the driver provides these logs for further analysis.
In the case of normal operation, the service tells the driver to work in «relax mode», which stops it from logging all metadata. The driver then logs only disk changes and provides the service with these updates. Other Acronis tools keep the backup in its most up-to-date state on the user-defined media. It can be a cloud, remote, incremental or night-only backup, but that is another story.
In the case of recovery, the service tells the driver to work in “Recovery” mode. During the recovery process, the driver intercepts requests to partially recovered files and checks whether those files are on disk and readable.
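The mode switching described above can be pictured as a small state machine. The sketch below is a user-space model only, not kernel code: `Mode`, `FilterModel` and its methods are hypothetical names, and treating "create" and "write" as the only disk-changing operations is an assumption made for illustration.

```python
from enum import Enum, auto


class Mode(Enum):
    EARLY_BOOT = auto()   # service not running yet, system state unknown
    RELAXED = auto()      # normal operation
    RECOVERY = auto()     # restore in progress


class FilterModel:
    def __init__(self):
        self.mode = Mode.EARLY_BOOT
        self.log = []

    def on_io(self, op, path):
        """Record an I/O event according to the current mode."""
        if self.mode is Mode.EARLY_BOOT:
            self.log.append((op, path))     # log create, read and write
        elif self.mode is Mode.RELAXED and op in ("create", "write"):
            self.log.append((op, path))     # log only disk changes
        # Interception in RECOVERY mode is outside this sketch.

    def service_online(self, recovering):
        """The service collects the early log and selects the mode."""
        early_log, self.log = self.log, []
        self.mode = Mode.RECOVERY if recovering else Mode.RELAXED
        return early_log


model = FilterModel()
model.on_io("read", "kernel32.dll")
early = model.service_online(recovering=False)
```

After the call to `service_online`, `early` holds everything logged during early boot and the model only records changes from then on.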
If the file is absent, the mini-filter sends this information to the service, which raises the recovery priority for that file (because the recovery process is also running in the background). So, the file jumps to the beginning of the queue. The service recovers the file (by itself or using other Acronis tools) and reports “OK” to the driver. The operating system can now access the data, and the driver “releases” the original request to the disk.
If recovery is not possible because there is no such file in the backup, the service reports this to the driver. The mini-filter driver then stops holding the request and releases it, and the OS or application receives a “file not found” error. That is fine: if the file is really absent from both the disk and the backup, the user simply asked for a nonexistent file by mistake.
Of course, the OS will work much slower, because reading any file or library takes several steps, possibly with remote data access. That is the price we pay to be able to start working earlier, despite the ongoing restoration process.
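The decision flow for a single intercepted request can be summarized in a few lines. This is a simplified sketch under stated assumptions: the function `handle_open` is hypothetical, and plain dictionaries stand in for the real volume and the backup archive.

```python
def handle_open(path, disk, backup):
    """Model of one intercepted request during recovery.

    disk and backup are dicts (path -> bytes) used as stand-ins for the
    real volume and the backup archive.
    """
    if path in disk:
        return disk[path]             # already restored, pass through
    if path in backup:
        disk[path] = backup[path]     # the service restores it on demand
        return disk[path]             # the request is released to the disk
    raise FileNotFoundError(path)     # absent from both disk and backup


disk = {"boot.sys": b"boot"}
backup = {"boot.sys": b"boot", "report.doc": b"text"}
data = handle_open("report.doc", disk, backup)
```

After the call, `report.doc` is present on the modeled disk, so a second request for it passes straight through.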
We need to move deeper, much deeper...
The prototype proved the concept, but we discovered that we needed to dive deeper to avoid deadlocks. They appeared, for example, when the OS requested different libraries in several threads and the service looped back.
I’m currently working on finding a way to raise the Active Restore speed and enhance system security. For the case where the system requests only part of a file, we developed an additional driver: a storage filter driver operating at the block level. The principle of operation is the same. In standard mode, the driver just logs block changes on the disk. In restore mode, however, it tries to read blocks and requests reprioritization from the service in case of failure. All the other parts of the system remain the same. The OS-level service doesn’t even know that there is another driver. Our main goal is to provide the OS with the necessary data, but there is room for further development, because the service is still not able to operate at the block level.
The next phase is to go down to the UEFI level with the driver and to run the service as a Native Windows application, so that we can start even faster. For that reason, we have developed a UEFI boot driver (DXE driver), which starts and terminates even before the OS starts up. UEFI drivers, their construction and installation will be discussed in the following posts. Subscribe to our blog, and I will be happy to see your comments!
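To illustrate why block-level tracking helps with partial reads, here is a minimal sketch that maps a byte range to the blocks it touches and reports which of them still need to be restored. The block size and the function names are assumptions for illustration, not details of the actual storage filter driver.

```python
BLOCK_SIZE = 4096  # assumed block size for this sketch


def blocks_for_range(offset, length):
    """Block indexes touched by a read of `length` bytes at `offset`."""
    if length <= 0:
        return range(0)
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    return range(first, last + 1)


def missing_blocks(offset, length, restored):
    """Blocks to reprioritize before the read can complete."""
    return [b for b in blocks_for_range(offset, length) if b not in restored]


# A 200-byte read that crosses from block 0 into block 1,
# with only block 0 restored so far.
todo = missing_blocks(4000, 200, restored={0})
```

Only the blocks in `todo` would need to be reprioritized, instead of the whole file.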