Hi! My name is Evgeny, I’m Lead DevOps at EXANTE. In this article, I’ll share what happened when we tried to quickly become Cloud Native and shoved everything into Kubernetes. You’ll learn what mistakes we made, how they affected development speed and releases, and why we now look at infrastructure in a completely different way.

Who We Are
EXANTE is a brokerage company with its own trading platform. Through it, you can trade on more than 50 markets worldwide from a single multi-currency account. Our main clients are professional traders and institutional investors.
Our services vary drastically: from lightweight applications to heavy legacy systems running as StatefulSets that consume dozens of CPUs and hundreds of gigabytes of memory, take more than 15 minutes to start, and keep huge caches on PVCs.
All this imposes constraints and makes infrastructure decisions not abstract experiments but a matter of real engineering necessity. The core of our business is millisecond-level response times, and the infrastructure has to guarantee that latency and keep it under control.
It’s normal practice for us that any technically savvy person, whether a developer or a tester, comes to us with infrastructure ideas. Sometimes these discussions turn into heated debates, but more often they push us forward. Every decision is well-argued: we discuss, analyze, calculate value, and evaluate consequences.
Expectations: Becoming Cloud Native
At the end of 2023, management and I formulated our goals:
Our product becomes Cloud Native.
Infrastructure can be easily deployed in any cloud.
Financials become transparent — we want to understand what we pay for and where it’s more expensive.
Time to market for new features gets shorter.
At that time, we had a real technology zoo: OVH and HTZ, Avela, managed services in GCP and AWS. Each technology came with its own rules and constraints, so managing this zoo was like trying to tame a pack of wild animals.
Next, I’ll talk about GCP, because that’s where we made most of our mistakes.
Reality
We wanted to accelerate time-to-market. We already had Terraform, Chef for VMs, and Flux CD for GitOps. On paper, it sounded beautiful: migrate services into GKE and profit instantly. So we decided: even legacy goes into GKE.
Among those services were true monsters:
StatefulSets with 36 CPUs and 70 GB RAM.
Startup times of 15+ minutes, 25–30 GB caches downloaded to a PVC on every start, plus the need to periodically clean those PVCs.
Pure protobuf streaming, hacks, and kludges (a rough manifest sketch of such a workload follows this list).
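To give a sense of scale, here is a minimal sketch of what one of these workloads looks like in manifest form. The names, image, and paths are hypothetical; only the resource figures mirror the numbers above.

```yaml
# Illustrative sketch only: names, image, and paths are placeholders.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: legacy-core                 # hypothetical service name
spec:
  serviceName: legacy-core
  replicas: 2
  selector:
    matchLabels:
      app: legacy-core
  template:
    metadata:
      labels:
        app: legacy-core
    spec:
      containers:
        - name: legacy-core
          image: registry.example.com/legacy-core:1.0     # placeholder image
          resources:
            requests:
              cpu: "36"
              memory: 70Gi
            limits:
              cpu: "36"
              memory: 70Gi
          volumeMounts:
            - name: cache
              mountPath: /var/cache/legacy                # hypothetical cache path
          # Startup takes 15+ minutes, so the probe budget has to be generous.
          startupProbe:
            tcpSocket:
              port: 8080
            periodSeconds: 30
            failureThreshold: 40                          # roughly 20 minutes of grace
  volumeClaimTemplates:
    - metadata:
        name: cache
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi                                 # room for the 25-30 GB cache
```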
We thought: “It’s fine, Kubernetes will handle it.” The reality looked like this:
Time to market slowed down. Dev and QA teams weren’t ready to rebuild pipelines for GKE.
Releases got complicated. Release engineers were shocked by the manual steps and long startup times.
Resource utilization dropped. Startup brought explosive consumption, followed by a sharp drop once services were up and running.
GitOps broke down. Flux CD suffocated if more than six heavy services were deployed at once. Queues, freezes, and manual interventions.
Incident analysis got harder. New infrastructure brought new pain for devs and support.
Communication broke. DevOps demanded changes in service code, while developers reasonably replied: “We have a product roadmap, we can’t spare half the team for refactoring just for infrastructure.”
What We Learned
Our push for Cloud Native turned into following the path of the samurai for its own sake. We chased fast results, but without a long-term plan, those quick wins turned into hard fails.
Here’s what we realized:
Strategy matters more than hype. Kubernetes is a tool, not the goal. Not every legacy service belongs in it. Some are simpler and cheaper to keep outside, with automated deployments.
Infrastructure ≠ DevOps toy. It’s part of the product. Changing it requires alignment with development and business plans. DevOps and dev must synchronize: migrating services should go hand in hand with their evolution.
The mistakes and painful experiments at the start forced us to stop and rethink. In the end, failure became a growth point. We now clearly understand: Cloud Native doesn’t mean shoving everything into Kubernetes — it’s about finding balance between technical drive, common sense, and real business value.
Infrastructure 2.0: Rebooting the Platform
We approached the Infrastructure 2.0 project with greater maturity and pragmatism. It has deadlines, a roadmap, responsible owners, and most importantly, tasks with real value for the product and the team. It’s not migration for the sake of migration, but a coherent development strategy where every decision is justified. We’re currently in the process of rolling it out.
1. Network and Foundation
We started with the most vulnerable spot: networking. Previously, any failure rippled across the entire system. That’s why we’re implementing resilient interconnects between OVH and clouds (AWS/GCP) with BGP/OSPF/BFD.
Step two is a distributed Consul cluster, a “single directory of services.” Thanks to it, Kafka, infrastructure components, and other services will find each other through consistent DNS names instead of hacks.
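As a rough illustration of what consistent DNS names give us: if the cluster’s DNS forwards the .consul zone to Consul, a pod can reach Kafka as kafka.service.consul no matter where the broker actually runs. A minimal sketch, assuming kube-dns (the GKE default) with a stub domain and a hypothetical Consul DNS endpoint:

```yaml
# Sketch only: 10.128.0.53 stands in for a real Consul DNS endpoint.
# The stub domain delegates the .consul zone so that names like
# kafka.service.consul resolve from inside the cluster.
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns
  namespace: kube-system
data:
  stubDomains: |
    {"consul": ["10.128.0.53"]}
```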
2. DevOps Platform
Flux CD showed its limits. We’re introducing ArgoCD for transparency and predictability. Developers will see deployments, testers will roll back changes, and ApplicationSet will allow spinning up temporary environments in minutes.
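To show roughly how those temporary environments appear, here is a sketch of an ApplicationSet built on the pull-request generator: every open PR gets its own namespace and Helm release, and the environment is pruned when the PR is closed. The repository, chart path, and values file are hypothetical.

```yaml
# Sketch only: repository, chart path, and values file are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: preview-envs
  namespace: argocd
spec:
  generators:
    - pullRequest:
        github:
          owner: example-org              # hypothetical organization
          repo: trading-service           # hypothetical repository
          # a tokenRef would normally be needed for a private repository
        requeueAfterSeconds: 300
  template:
    metadata:
      name: 'preview-{{number}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/example-org/trading-service   # placeholder
        targetRevision: '{{head_sha}}'
        path: deploy/helm
        helm:
          valueFiles:
            - values-preview.yaml
      destination:
        server: https://kubernetes.default.svc
        namespace: 'preview-{{number}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true
```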
Istio wasn’t chosen randomly: we need a single entry point, mTLS by default, and network-level access control.
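In Istio terms, “mTLS by default” and network-level access control usually come down to two small resources: a mesh-wide PeerAuthentication in the root namespace and per-service AuthorizationPolicies. A minimal sketch with hypothetical namespaces and service accounts:

```yaml
# Mesh-wide strict mTLS (assumes istio-system is the mesh root namespace).
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
---
# Only the hypothetical "gateway" workload may call the "orders" service.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: orders-allow-gateway
  namespace: orders
spec:
  selector:
    matchLabels:
      app: orders
  action: ALLOW
  rules:
    - from:
        - source:
            principals: ["cluster.local/ns/gateway/sa/gateway"]
```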
We’re adding Karpenter for automatic node management, so the cluster itself decides when new nodes are needed and how many.
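A rough sketch of what that looks like in Karpenter terms: a NodePool declares which kinds of nodes may be created and when underutilized ones get consolidated. The example assumes Karpenter v1 with the AWS provider; the limits are illustrative.

```yaml
# Sketch only: assumes the AWS provider (EC2NodeClass); values are illustrative.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "256"                     # hard cap on what this pool may provision
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m
```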
We’re preparing Nomad for legacy workloads we don’t want to break. It’s a compromise: containers without Kubernetes pain.
3. Requirements for Development
We’re introducing unified rules: containers, stateless design, JSON logs, Prometheus metrics. Every service must have a passport and a runbook. It might seem like bureaucracy, but in practice, it’s a way out of chaos. Developers get clear rails to follow, and DevOps won’t drown in exceptions.
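What a passport might look like in practice (a hypothetical template, not our actual format): a short machine-readable file that lives next to the service and records ownership, runtime expectations, and where the runbook is.

```yaml
# Hypothetical service passport template; every field and value is illustrative.
service: orders-gateway            # hypothetical service name
owner: team-trading                # hypothetical owning team
tier: business-critical
stateless: true
logging:
  format: json
metrics:
  format: prometheus
  path: /metrics
  port: 9102
runbook: docs/runbook.md           # placeholder path to the runbook
```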
4. Observability and Finance
We’re implementing full metric collection and FinOps dashboards. Pilot reports already show that some services consume many times more resources than we thought. This allows us to build cost models in advance.
5. Security
Security has always been with us, but now we’re leveling it up:
mTLS everywhere (via Istio).
Centralized Vault (see the injection sketch after this list).
Base Docker images reviewed by SecOps.
Images scanned by AquaSecurity.
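To make the centralized Vault item concrete for workloads running in Kubernetes, here is a minimal sketch assuming the Vault Agent Sidecar Injector: the pod annotations ask the injector to render a secret into the pod, with a hypothetical role and secret path.

```yaml
# Sketch only: assumes the Vault Agent Sidecar Injector is installed;
# the role name, secret path, and image are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-gateway
spec:
  replicas: 2
  selector:
    matchLabels:
      app: orders-gateway
  template:
    metadata:
      labels:
        app: orders-gateway
      annotations:
        vault.hashicorp.com/agent-inject: "true"
        vault.hashicorp.com/role: "orders-gateway"
        vault.hashicorp.com/agent-inject-secret-db-creds: "kv/data/orders-gateway/db"
    spec:
      containers:
        - name: app
          image: registry.example.com/orders-gateway:1.0   # placeholder image
```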
6. Operations and Incidents
We’re preparing a DR environment in GCP. Secrets and GitOps states will sync automatically. Documentation and runbooks are being written in parallel, so teams can work by instruction, not improvisation.
What Infrastructure 2.0 Will Bring
The transition to Infrastructure 2.0 is not only engineering work but also a cultural transformation: transparency, maturity, focus on real value. It’s a systematic change in how we design, evolve, and maintain infrastructure.
Its key principles:
Speed and flexibility. We can now spin up feature environments quickly, test hypotheses, and roll back changes without weeks of approvals. This directly shortens time-to-market and gives development more freedom.
Reliability and resilience. We have a hot standby in GCP and automatic synchronization of secrets and GitOps state. Even serious incidents won’t paralyze the business.
Cost transparency. FinOps dashboards show which services and environments really consume resources, helping us make balanced decisions — where to optimize and where to invest.
Developer convenience. Standards for metrics, logs, environment variables, and universal Helm charts reduce chaos and lower the entry threshold. A new service can now be launched along ready-made rails, not by reinventing the wheel.
Security by default. mTLS, centralized Vault, vetted base images, and secret rotation policies close many vulnerabilities before they ever reach production.
Knowledge accumulation. Runbooks, documentation, and regular post-mortems form a knowledge base that helps newcomers onboard quickly and keeps experienced engineers from repeating the same mistakes.
Most importantly, we no longer see infrastructure as a disposable resource. For us, it’s a full-fledged product that must be convenient for business, development, support, and SecOps. We’re confident it will help us release new features faster, minimize downtime costs, reduce expenses through transparent resource management, and, most importantly, create a platform that grows with the business instead of holding it back.
And yes, Kubernetes is still an important part of this picture. But now it’s not the lone path of a samurai — it’s part of a balanced ecosystem where every technology has its place and serves a common goal.
Follow our progress in future articles. There are many more experiments, failures, and victories ahead!