My name is Anatoly Bobunov, and I work as a Software Development Engineer in Test - or SDET for short - at EXANTE. When I joined one of our projects, I discovered that several of our test suites took more than an hour to run - painfully slow, to the point where running them for every merge request was simply unrealistic. We wanted fast feedback on each commit, but at that speed, it just wasn’t going to happen.
Eventually, through a series of small but precise improvements, I managed to make the runs 8.5× faster without rewriting the tests from scratch. In this article, I’ll walk through the bottlenecks we found and how we fixed them.
Why slow tests hurt teams
Slow automated tests directly affect both developer productivity and testing efficiency. The longer the feedback loop, the more motivation and focus drop. And beyond mere inconvenience, slow tests create organisational and psychological friction.
Context switching. A developer pushes code, triggers the pipeline, and moves to another task because the wait is too long. When the results finally arrive, it’s painful to return to the old context.
Huge merge requests (MRs). When tests run forever, people tend to batch changes into large MRs “to save time.”
This makes code review slower, riskier, and more error-prone.
Loss of trust in tests. When Continuous Integration (CI) takes hours, tests stop feeling like a tool and start feeling like a bottleneck.
Over time, the technical issue becomes a productivity issue. Releases slow down, regressions surface late, and the team loses momentum. That’s exactly what we faced — and we knew we needed to bring back a sense of a “live” development cycle.
Where the optimisation started
As in many young projects, the priority was coverage and speed of delivery, not long-term test health. Quick solutions like that eventually turn into structural problems.
By the time I joined, some suites took 20–25 minutes on average and could exceed an hour on bad days. Our goal was simple but ambitious: to run tests on every commit in every merge request.
A quick analysis revealed multiple architectural issues:
- slow and inconsistent test data setup;
- huge monolithic test files preventing parallelization;
- hardcoded timeouts and sleep() calls;
- error-handling patterns that triggered redundant retries and waits.
After refactoring the first two suites and reviewing patterns across the repo, I set a target: no suite should run longer than three minutes.
Measuring test performance: what was actually slow
Before optimisation, we needed data.
I built a small custom Pytest hook to track execution time for every test and file, and print aggregated stats after each run. It also worked well with pytest-xdist, letting us see timing per worker.
To understand performance trends over time, I pushed run metrics into InfluxDB and visualised them in Grafana via dashboards - average suite duration, distribution across workers, and optimisation trends. This gave the entire team visibility into improvements.
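For a rough idea of the approach, here is a minimal sketch of such a hook in a conftest.py. It is not our exact implementation (and the InfluxDB export is omitted); it simply aggregates per-file durations and prints the slowest files at the end of the run:

import collections

_file_durations: dict[str, float] = collections.defaultdict(float)


def pytest_runtest_logreport(report):
    # Count only the "call" phase so setup/teardown noise is excluded.
    if report.when == "call":
        file_path = report.nodeid.split("::", 1)[0]
        _file_durations[file_path] += report.duration


def pytest_terminal_summary(terminalreporter):
    # Print the slowest files after the run - these are the parallelisation bottlenecks.
    terminalreporter.section("slowest test files")
    slowest = sorted(_file_durations.items(), key=lambda kv: kv[1], reverse=True)[:10]
    for path, total in slowest:
        terminalreporter.write_line(f"{total:8.2f}s  {path}")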
Parallelising tests with pytest-xdist: the first big win
The obvious first step was to run tests across multiple processes using pytest-xdist.
xdist dynamically balances tests across workers using --dist=load by default. But our tests often shared data within a file, so we couldn’t parallelise at the test level - only at the file level.
That's why I used:
pytest tests/your_suite -n auto --dist loadfile
This ensures all tests within one file run on the same worker.
However, two major problems surfaced:
1. Uneven test distribution. Some files had dozens of quick tests; others had only a couple of heavy scenarios. That led to idle workers while one worker processed a giant file.
2. Shared state and non-unique test data. Many tests reused the same data objects, causing collisions when run in parallel.
Fixing these required architectural changes in the test organisation and data setup.
Breaking down monolithic test suites
Our test base had grown over many years. Files were organised by business logic, which resulted in a mix of huge 1,500-line files and much smaller ones.
I targeted the large files first - splitting them into smaller logical units dramatically improved workload distribution across workers.
While refactoring, we also audited the tests themselves. With input from QA teams, we removed outdated scenarios and split “do-everything” tests into smaller, clearer ones. This made tests easier to maintain, easier to parallelize, and more predictable.
After these iterations, the test structure became much cleaner and much easier to distribute.
From sleep() to smart waiting: cutting run time dramatically
Hardcoded sleeps were one of the biggest hidden slowdowns.
Removing a single sleep(30) often turned a multi-minute test into a multi-second one.
But removing waits blindly caused failures, especially when interacting with external services.
The solution was to replace all rigid sleeps with intelligent retry-based waiting.
A teammate built a set of universal decorators for retry logic. Testers wrote atomic wait_* functions and simply applied the appropriate decorators.
This gave us:
- controlled retry logic;
- custom backoff strategies;
- fail-fast behaviour on unexpected responses;
- minimal total wait time.
from time import sleep, time
from typing import Callable

# Note: RetryException and the step decorator come from our shared test helpers.


def retry_if_code(
    status_code: int,
    text: str | None = None,
    timeout: float = 5.0,
    time_step: float = 0.5,
    *,
    backoff: float = 1.0,
    max_time_step: float | None = None,
):
    """
    Retry until an expected HTTP status code appears in an AssertionError message.

    Args:
        status_code: Expected HTTP code that should appear in the assertion message.
        text: Optional substring that must also be present in the message.
        timeout: Maximum total time to wait (in seconds).
        time_step: Initial sleep interval between retries.
        backoff: Multiplier to increase the delay after each failed attempt (1.0 = constant).
        max_time_step: Optional cap for the sleep interval.

    Example:
        @retry_if_code(202, timeout=10, time_step=0.5, backoff=2.0, max_time_step=4.0)
        def wait_for_accepted(...):
            assert r.code == 202, f"{r.code}: {r.data}"
    """
    def retry_for_code_decorator(f: Callable):
        def func_with_retries(*args, **kwargs):
            end = time() + timeout
            last_msg = "..."
            pause = time_step
            while time() < end:
                try:
                    return f(*args, **kwargs)
                except AssertionError as err:
                    last_msg = err.args[0] if err.args else "Assertion failed ..."
                    # Fail fast if the message does not contain the expected code or text
                    if str(status_code) not in str(last_msg):
                        raise RetryException(f"Unexpected status code: {last_msg}")
                    if text and text not in str(last_msg):
                        raise RetryException(f"Unexpected error description: {last_msg}")
                    sleep(pause)
                    pause *= backoff
                    if max_time_step is not None:
                        pause = min(pause, max_time_step)
            raise AssertionError(f"Timeout reached; expected {status_code}. Last message: {last_msg}")
        return func_with_retries
    return retry_for_code_decorator


@retry_if_code(404, timeout=database_replication_timeout)
@step("Wait cash conversion settings")
def wait_cash_conversion_settings(client: Core, account: str) -> CashConversionSettings:
    r = client.get_cash_conversion_settings(account)
    assert r.is200, f"{r.code}: {r.data}"
    return CashConversionSettings.from_json(r.data)

As a result, tests started waiting exactly as long as necessary, no more.
Fixing non-unique test data: the hidden enemy of parallel runs
Shared test data rarely caused issues in single-threaded runs, but in parallel mode the collisions became obvious.
Sometimes testers copied constants and reused them across files, causing multiple tests to operate on the same objects.
We introduced new rules:
- shared entities must live in a clear hierarchy: global → suite → module;
- each test file must use unique data;
- isolation should be at file level, not at test level (too costly otherwise).
This made parallel runs predictable and greatly simplified debugging.
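As an illustration, here is a minimal sketch of what file-level uniqueness can look like; the fixture names and the account naming scheme are hypothetical, not our actual helpers:

import uuid

import pytest


@pytest.fixture(scope="module")
def unique_suffix() -> str:
    # One suffix per test file, so two xdist workers never touch the same entities.
    return uuid.uuid4().hex[:8]


@pytest.fixture(scope="module")
def test_account(unique_suffix: str) -> str:
    # Hypothetical naming scheme; a real setup would register the account via the API.
    return f"autotest-account-{unique_suffix}"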
Presetup: automating data preparation
Historically, test data came from manual QA artifacts. Eventually this became a major pain point.
A teammate proposed automating data preparation - and that’s how Presetup was born.
@pytest.fixture(scope="session", autouse=True)
def _pre_setup_global_entities():
...
@pytest.fixture(scope="package", autouse=True)
def _pre_setup_core_entities():
...
@pytest.fixture(scope="module", autouse=True)
def _pre_setup_module():
...The idea:
- check if required entities exist;
- create or update them if needed;
- integrate this logic into Pytest fixtures at various scopes (session, package, module);
- run nightly jobs to restore data to a clean canonical state.
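To make the idea concrete, here is a hedged sketch of what a module-level presetup fixture might look like; the entity names and the api_client calls are placeholders, not the real Presetup code:

import pytest

# Illustrative only: the account list and the api_client fixture are placeholders.
REQUIRED_MODULE_ACCOUNTS = ["autotest-module-account-1", "autotest-module-account-2"]


@pytest.fixture(scope="module", autouse=True)
def _pre_setup_module(api_client):
    """Make sure the entities this module relies on exist before any test runs."""
    for account in REQUIRED_MODULE_ACCOUNTS:
        if not api_client.account_exists(account):  # idempotent existence check
            api_client.create_account(account)      # create or restore to the canonical state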
Presetup eliminated routine data preparation, improved consistency, and reduced flaky tests caused by inconsistent environments.
Results: 88% faster test runs
When we first set out to speed up our test runs, the average suite took around 20–25 minutes. And if tests began to fail, things got even worse: retries, rigid waits, and cascading timeouts easily pushed the total runtime toward GitLab CI’s two-hour limit.
After a series of focused improvements, the average runtime stabilised at roughly two minutes per suite. That’s an 88% reduction - from 17 minutes down to two - and an overall speedup of 8.5×. More importantly, runs became far more predictable, and tracking down issues became noticeably faster.
Practical ways to speed up your Pytest suites
- Leverage pytest-xdist and pick the distribution strategy that fits your project.
- Balance the workload across xdist workers to avoid idle processes.
- Replace sleep() with targeted wait functions (wait_*) backed by retry decorators.
- Ensure test data is unique and isolated, so parallel runs don’t collide.
With the faster pipeline, developers started receiving feedback on merge requests almost immediately. That reduced context switching and naturally led to smaller, cleaner commits. Test failures became rarer and easier to reason about - fewer flakes, quicker debugging, and a renewed sense of trust in the test suite.
CI pipelines also became shorter and more stable, which sped up releases and made code reviews calmer and more predictable. In short, the development cycle began to feel “alive” again.
If you’re curious about how our GitLab CI journey started and how we built the foundation of our framework, I covered those early steps in a previous article.