Managing hundreds of servers for load testing: autoscaling, custom monitoring, DevOps culture

    In the previous article, I talked about our load testing infrastructure. On average, we use about 100 servers to generate load and about 150 servers to run our service. All of these servers need to be created, configured, started, and deleted. To do this, we use the same tools as in the production environment to reduce the amount of manual work:

    • Terraform scripts for creating and deleting a test environment;
    • Ansible scripts for configuring, updating, starting servers;
    • In-house Python scripts for dynamic scaling, depending on the load.

    Thanks to the Terraform and Ansible scripts, all operations, from creating instances to starting servers, are performed with just a handful of commands:

    terraform apply #launch the required instances in the AWS console
    ansible-playbook deploy-config.yml #update server versions
    ansible-playbook start-application.yml #start our app on these servers
    ansible-playbook update-test-scenario.yml --ask-vault-pass #update the JMeter test scenario if it was changed
    infrastructure-aws-cluster/jmeter_clients:~# terraform apply #create JMeter servers for creating the load
    ansible-playbook start-jmeter-server-cluster.yml #start the JMeter cluster
    ansible-playbook start-stress-test.yml #start the test
    


    Dynamic server scaling


    We have more than a hundred thousand simultaneously active online users during peak hours. There is no point in keeping the full number of servers running all the time, so we set up autoscaling for the board servers, which handle requests made when a user opens a whiteboard, and for the API servers, which handle all other API requests. Servers are now created and deleted as needed.

    This mechanism is handy for load testing: we can keep just the minimum required number of servers by default, and when we run a test, this number is automatically increased as needed. We may have four board servers at the start and up to forty at the peak. Additionally, new servers are not created immediately, but only after the current servers are fully loaded; an example rule for creating new instances could be reaching 50 percent CPU utilization. This way, we don't slow down the growth of virtual users in a test scenario, and we don't create unnecessary servers.
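    The scale-out rule above can be sketched in a few lines of Python. This is an illustrative sketch, not our actual scaling script: the threshold, limits, and function name are assumptions chosen to mirror the example rule in the text.

    ```python
    # Illustrative sketch of the scale-out decision, not the real in-house script.
    CPU_SCALE_OUT_THRESHOLD = 50.0  # percent, the example rule from the text
    MAX_SERVERS = 40                # peak fleet size from the text

    def should_scale_out(cpu_usages, max_servers=MAX_SERVERS):
        """Add a server only when the current fleet is fully loaded."""
        if len(cpu_usages) >= max_servers:
            return False  # already at the allowed maximum
        avg_cpu = sum(cpu_usages) / len(cpu_usages)
        return avg_cpu >= CPU_SCALE_OUT_THRESHOLD

    # Four board servers, average CPU above the threshold: scale out.
    print(should_scale_out([55.0, 62.0, 48.0, 58.0]))  # True
    # Plenty of headroom: keep the fleet as is.
    print(should_scale_out([20.0, 15.0, 25.0, 18.0]))  # False
    ```

    Because new instances are added only once the fleet crosses the threshold, a ramping test scenario grows the fleet gradually instead of all at once.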

    An additional advantage of this approach is that thanks to dynamic scaling, we learn how much capacity we need for different numbers of users that we haven’t yet seen in the production environment.

    Collecting production-like metrics


    There are many tools and approaches for monitoring load tests, but we went our own way.

    We monitor the production environment using a standard technology stack: Logstash, Elasticsearch, Kibana, Prometheus, and Grafana. Our testing cluster is similar to the production cluster, so we decided to make the monitoring the same as in the production environment, with the same metrics. There are two reasons for that:

    • There’s no need to build a monitoring system from scratch; we already have a complete system;
    • We get additional testing of the production monitoring itself: if, while monitoring the test environment, we find that we don't have enough data to analyze a problem, then we won't have enough data when that problem occurs in the production environment either.


    What we include in the reports


    • Technical specification of the test bench;
    • Test scenario in a human-readable format;
    • A result that is understandable by both developers and managers;
    • General condition charts;
    • Charts that show a bottleneck or something affected by the optimization under test.

    It is crucial to store all results in one place. This way, they can be easily compared with each other from test run to test run.

    We create reports in our product automatically by using our public API.
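    As a rough sketch, automatic report creation boils down to assembling everything listed above into one payload and pushing it through the API. The structure, field names, and endpoint below are hypothetical, not our actual public API:

    ```python
    # Hedged sketch of assembling a load-test report for upload via a public API.
    # All field names and the example data are illustrative assumptions.
    import json

    def build_report_payload(test_name, scenario, virtual_users, summary, chart_urls):
        """Collect everything a reader of the report needs in one document."""
        return {
            "title": f"Load test: {test_name}",
            "scenario": scenario,        # test scenario in a human-readable format
            "virtual_users": virtual_users,
            "summary": summary,          # understandable by developers and managers
            "charts": chart_urls,        # general condition and bottleneck charts
        }

    payload = build_report_payload(
        test_name="board-open-peak",
        scenario="Ramp to 10k virtual users over 20 min, hold for 30 min",
        virtual_users=10_000,
        summary="p99 latency stayed under the target; no errors",
        chart_urls=["https://grafana.example/d/abc123"],
    )
    body = json.dumps(payload)
    # The upload step would POST `body` to the reporting endpoint.
    ```

    Keeping every report in this uniform shape is what makes run-to-run comparison trivial.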



    Infrastructure as code (IaC)


    In our case, product quality is not the responsibility of QA engineers alone but of the entire team. Load tests are just one of the quality assurance tools. It's great if the team understands that it is important to test new changes under load. To start thinking about it, the team has to take responsibility for the well-being of the production environment. Here we are helped by the principles of DevOps culture, which we have started to implement in our work.

    But starting to think about conducting load tests is only the first step. The team will not be able to create thorough test cases without understanding the structure of the production environment. We encountered this problem when we began to set up the process of conducting load tests in teams. At that time, the teams had no means of understanding the production environment, so it was difficult for them to work on the design of the tests. There were two reasons for this: the lack of up-to-date documentation (or of somebody who kept the whole schema of the production environment in their head), and the dramatic growth of the development team.

    The Infrastructure-as-Code approach, which we now use in the development team, can help the team to understand the production environment.

    What we have already achieved using that approach:

    • Everything is automated and ready to be launched at any moment. This significantly reduces recovery time in case of an accident in the data center and allows us to keep the right number of relevant test environments;
    • Reasonable savings: when we can, we deploy environments using OpenStack to replace expensive platforms like AWS;
    • Teams create load tests on their own because they understand the production environment;
    • Code replaces the documentation, so there is no need to update the documentation continually; it is always complete and up-to-date;
    • No need for a dedicated narrow-field expert to do ordinary tasks. Any engineer can figure out the code;
    • Having a clear production environment structure makes it much easier to plan investigative load tests, like chaos monkey testing or long memory leak tests.

    We want to extend this approach beyond creating the infrastructure to support various tools. For example, we have successfully converted the database test that I talked about in the previous article to code. Thanks to this, instead of a pre-prepared site, we have a set of scripts that we can use to create, in seven minutes, a fully configured environment in an empty AWS account and start the test. For the same reason, we are now looking closely at Gatling, which is positioned by its authors as a “Load test as code” tool.

    Infrastructure as Code entails a similar approach to infrastructure testing: all new scripts written by the team to create infrastructure for new features must be covered by tests. There are various testing frameworks for this, such as Molecule. There are also tools for chaos monkey testing: paid tools for AWS, Pumba for Docker, etc. They will allow us to solve different types of problems:

    • How can we check, in case one of the AWS instances fails, that the load is rebalanced among the remaining instances and that the service survives this sudden redirection of requests?
    • How can we simulate slow network connections, disconnects, and other technical problems that should not break the logic of the service’s infrastructure?
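    The first question can be explored with a toy model even before any chaos tooling is in place. The sketch below is an illustrative simulation, not a real test against AWS; the capacity numbers and function name are assumptions:

    ```python
    # Toy model: when one instance fails, is its load rebalanced among the
    # survivors without exceeding their capacity? Numbers are illustrative.
    def survives_instance_failure(loads, capacity):
        """Remove each instance in turn and spread its requests evenly over the rest."""
        for i in range(len(loads)):
            survivors = loads[:i] + loads[i + 1:]
            extra = loads[i] / len(survivors)
            if any(load + extra > capacity for load in survivors):
                return False  # some survivor would be overloaded
        return True

    # Four instances at 50% of a 100 req/s capacity: the service survives.
    print(survives_instance_failure([50, 50, 50, 50], capacity=100))  # True
    # At 80% load, losing any instance overloads the remaining three.
    print(survives_instance_failure([80, 80, 80, 80], capacity=100))  # False
    ```

    A real chaos test would kill an actual instance and watch the same invariant in the monitoring, but the model is useful for deciding how much headroom the fleet needs.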

    We plan to solve these problems soon.

    Conclusions


    • Do not waste your time on manual infrastructure orchestration. It is better to automate these actions to more reliably control all environments, including the production environment;
    • Dynamic scaling significantly reduces the cost of maintaining the production and large test environments while also reducing the human factor effect on scaling;
    • You don’t have to have a separate monitoring system for tests. Instead, use an existing system from the production environment;
    • Load test reports must be automatically collected in one place and have a uniform look. This will allow you to compare them and analyze the changes quickly;
    • Load testing will become a normal process in the company as teams start to feel responsible for the well-being of the production environment;
    • Load tests are infrastructure tests. If a load test finished successfully, perhaps it was written incorrectly. Validating the correctness of a test requires a thorough understanding of the production environment, and teams should have the means to gain that understanding by themselves. We solve this problem using the IaC approach;
    • Scripts that create the infrastructure also require testing like any other code.

    P.S.: This article was first published on Medium.