
Sometimes your work chat won’t stop flooding with alerts, tests have been red for a month, and no one really knows who’s responsible for what — or what even matters.
Mikkel Dengsø and Petr Janda, together with data teams from hundreds of startups and Fortune 500 companies, put together a Guide to Building High-Quality Data Products. It’s for those who are tired of the chaos and want to bring order to testing, quality management, and incident response.
Here are the key insights and real-world cases.
Escaping the spaghetti lineage
Aiven had more than 900 dbt models. The dependencies criss-crossed so heavily that the lineage turned into a tangled spaghetti mess. Critical calculations like ARR had hundreds of upstream dependencies: any change shook half the stack.
The categories were too broad: Marketing, Sales, Product. When data broke, alerts would simply say “Marketing is down.” Nobody knew:
What exactly broke?
How critical is it?
Who owns the fix?
Which business processes are affected?
Diagnosis was slow, fixes even slower.
By methodically breaking big buckets like Marketing into smaller products like Marketing Attribution, Aiven sped up error detection and clarified ownership.
Rule 1: think in products, not departments
Split vague categories like Marketing into focused products: Attribution, CLTV, Market Research. That way, when something fails, it’s obvious what exactly broke and how urgent the fix is (attribution = immediate, research = later).
Each product should have a short card kept next to the code: task, owner (team and channel), priority (P1–P4), inputs/outputs, and a list of key tables/models.
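One way to keep such a card machine-readable is a small descriptor stored next to the models it covers. A minimal sketch in Python; the field names and the example values are illustrative assumptions, not a format the guide prescribes:

```python
from dataclasses import dataclass, field

@dataclass
class DataProductCard:
    """Short card kept next to the code of a data product."""
    name: str                 # e.g. "Marketing Attribution"
    purpose: str              # what business question it answers
    owner_team: str           # team accountable for fixes
    owner_channel: str        # where to escalate (chat channel)
    priority: str             # "P1".."P4"
    inputs: list[str] = field(default_factory=list)   # upstream sources
    outputs: list[str] = field(default_factory=list)  # marts / dashboards
    key_models: list[str] = field(default_factory=list)

# Hypothetical card for one focused product.
attribution = DataProductCard(
    name="Marketing Attribution",
    purpose="Attribute signups and revenue to campaigns",
    owner_team="growth-data",
    owner_channel="#growth-data-alerts",
    priority="P1",
    inputs=["raw.ad_spend", "raw.web_events"],
    outputs=["marts.attribution_daily"],
    key_models=["stg_ad_spend", "fct_attribution"],
)
```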
Rule 2: keep the right depth
Overly large products don’t help with prioritization. At Aiven they drilled down within domains: inside Marketing they split out Attribution (P1) and Market Research (P3). Suddenly priorities stopped being debatable.
Rule 3: assign ownership like adults
Not “the whole data team” and not “one hero.” Tie ownership to specific teams and their channels, store it in metadata (owner tags, dbt groups), and check in CI that every model has an owner. Escalations then become predictable.
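The CI gate can be as small as a script that walks dbt's manifest and fails the build when a model has no owner. A sketch under the assumption that owners live in each model's meta (or config.meta) as an owner key; adjust the path and key to however your project actually stores ownership:

```python
import json
import sys

# Assumption: dbt has compiled the project, so target/manifest.json exists,
# and each model declares its owner as meta: {owner: "..."} in its YAML config.
MANIFEST_PATH = "target/manifest.json"

def models_without_owner(manifest_path: str) -> list[str]:
    with open(manifest_path) as f:
        manifest = json.load(f)
    missing = []
    for node_id, node in manifest["nodes"].items():
        if node.get("resource_type") != "model":
            continue
        meta = node.get("meta") or node.get("config", {}).get("meta", {})
        if not meta.get("owner"):
            missing.append(node_id)
    return missing

if __name__ == "__main__":
    missing = models_without_owner(MANIFEST_PATH)
    if missing:
        print("Models with no owner:\n  " + "\n  ".join(missing))
        sys.exit(1)  # fail the CI job so ownerless models never ship
    print("All models have an owner.")
```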
If everything is high priority, nothing is
It’s like with torrents: if you set high priority for all downloads at once, nothing will change, because their relative weight stays the same.
That’s what happened at Lunar, where every team thought their data was the most important. And each of them was right, but only from their own vantage point: no single part of a system can judge its own relative importance.
The solution was simple: once per quarter, leadership across teams started meeting to define the most critical data products for the upcoming period and set clear SLAs, like response times to failures.
Rule 1: set priorities from the top down
If every team marks their data as P1, you can’t manage anything: everything important = nothing important.
How to do it:
Once a quarter, run a “priority council” (C-level + product and data leads).
Build a unified top-10 list of critical data products for the quarter.
Decide by simple criteria: money at risk, customer impact, regulatory risk.
For the rest, assign P2/P3/P4 in descending importance.
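If the council wants something more repeatable than a show of hands, the three criteria can be folded into a rough score. A hypothetical sketch; the weights and cut-offs are placeholders to agree on, not something the guide prescribes:

```python
# Rough top-down scoring: each criterion is rated 0-3 by the priority council,
# and the weighted total maps to a P-level. Weights and thresholds are illustrative.
def priority_from_scores(money_at_risk: int, customer_impact: int, regulatory_risk: int) -> str:
    total = 2 * money_at_risk + 2 * customer_impact + 3 * regulatory_risk
    if total >= 12:
        return "P1"
    if total >= 8:
        return "P2"
    if total >= 4:
        return "P3"
    return "P4"

# Example: high customer impact, some revenue exposure, no regulatory angle.
print(priority_from_scores(money_at_risk=2, customer_impact=3, regulatory_risk=0))  # -> "P2"
```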
Rule 2: tie priorities to actions and deadlines
A P1 tag without deadlines is just a label. Each level must have clear response and recovery goals.
How to do it:
P1 (customer-facing): acknowledgment — 5 min, detection — within 10 min, customer notification — within 30 min, temp workaround immediately, fix at top priority.
P2 (critical exports/background): acknowledgment — 30 min, fix same day.
P3 (analytics/BI): by end of next business day.
P4 (low risk): backlog.
Each P-level needs an assigned owner and an on-call rotation.
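These targets are easier to keep honest when they live in config rather than on a slide. A minimal sketch of the levels above as a Python mapping; the structure is an assumption, the numbers come from the list:

```python
from datetime import timedelta

# Response targets per priority level, mirroring the list above.
SLA = {
    "P1": {"acknowledge": timedelta(minutes=5),
           "detect": timedelta(minutes=10),
           "notify_customers": timedelta(minutes=30),
           "note": "temporary workaround immediately, fix at top priority"},
    "P2": {"acknowledge": timedelta(minutes=30),
           "fix_by": "same business day"},
    "P3": {"fix_by": "end of next business day"},
    "P4": {"fix_by": None},  # backlog, no hard deadline
}

def ack_deadline(priority: str):
    """Return the acknowledgment window for a priority, if one is defined."""
    return SLA.get(priority, {}).get("acknowledge")

print(ack_deadline("P1"))  # 0:05:00
```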
Rule 3: keep priorities in shape
Priorities tend to drift — everything becomes P1 again.
How to do it:
Limit P1s. For example, no more than 8 P1 products across the company. To add a new one, remove an old one.
Quarterly review. Refresh the top-10 list, check what actually hit revenue and customers.
Upgrade filter. Raise priority only with a business case: money at risk, customer impact, regulatory.
Public board. Table by domain: list of products, P-level, SLA. Visible to everyone.
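The P1 cap is easy to enforce automatically as well: count the P1 rows on the public board and fail the check when the limit is exceeded. A sketch that assumes the board is exported as a CSV with a priority column; the file name and columns are hypothetical:

```python
import csv
import sys

MAX_P1 = 8  # company-wide cap from the rule above

def count_p1(board_csv: str) -> int:
    # Assumed CSV columns: product, domain, priority, sla
    with open(board_csv, newline="") as f:
        return sum(1 for row in csv.DictReader(f) if row["priority"].strip().upper() == "P1")

if __name__ == "__main__":
    p1_count = count_p1("data_product_board.csv")
    if p1_count > MAX_P1:
        print(f"{p1_count} P1 products, cap is {MAX_P1}: demote one before adding another.")
        sys.exit(1)
    print(f"{p1_count}/{MAX_P1} P1 slots used.")
```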
Don’t test everything — or you’ll drown in noise
At Google and Monzo, early on, they tested every table and column. That meant hundreds of alerts, most of them irrelevant.
A shift in strategy helped: test the sources that everything else depends on. The noise dropped, reliability grew.
Rule 1: tie every test to an action
A test without a clear next step is garbage. Only add checks where you know in advance who reacts and what they do.
Example:
Yesterday’s orders didn’t arrive by 08:15 → the owner restarts the job; if that fails, they switch to the backup source and update the status (a code sketch follows this list).
Bad test:
Column price has nulls. If there’s no described fix, that alert is just noise.
A test is useful:
If, over a quarter, it led to at least N fixes and had almost no false positives. Otherwise, rewrite or remove it.
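To show what “tied to an action” can look like in code, here is a minimal sketch of the orders example above, with hypothetical job and channel names; the point is that the alert message carries the response plan itself:

```python
from datetime import date, datetime, timedelta

# Freshness check for yesterday's orders with the response plan baked into the alert.
DEADLINE = (8, 15)  # orders for yesterday must be loaded by 08:15

def orders_freshness_alert(latest_order_date: date, now: datetime) -> str | None:
    """Return an actionable alert message, or None if everything is on time."""
    yesterday = now.date() - timedelta(days=1)
    past_deadline = (now.hour, now.minute) >= DEADLINE
    if past_deadline and latest_order_date < yesterday:
        return (
            f"[P1] orders mart is stale: latest day loaded is {latest_order_date}, "
            f"expected {yesterday} by 08:15.\n"
            "Action: restart load_orders; if it fails again, switch to the backup "
            "source and post the status in #orders-data."
        )
    return None

# Example trigger: it is 08:30 and the mart still only has data for two days ago.
print(orders_freshness_alert(date(2024, 3, 4), datetime(2024, 3, 6, 8, 30)))
```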
Rule 2: test high-impact nodes
Don’t scatter across hundreds of tables. Focus on influence nodes — points many downstream artifacts depend on: product catalogs, customer directories, payment loads, key joins.
Make a shortlist of ~20 top nodes and strengthen tests there: formats, completeness, duplicates, schema stability, boundary values.
A bug in the product catalog breaks prices, stock, and margins across ten reports. One solid test there beats a hundred checks further downstream.
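For a node like the product catalog, “strengthening tests” can mean a handful of cheap structural checks in one place. A minimal pandas sketch, assuming a catalog table with hypothetical product_id, price, and category columns:

```python
import pandas as pd

# Hypothetical product catalog: the columns many downstream marts join on.
catalog = pd.DataFrame({
    "product_id": [1, 2, 3],
    "price": [9.99, 19.99, 4.50],
    "category": ["plan_a", "plan_b", "plan_a"],
})

EXPECTED_COLUMNS = {"product_id", "price", "category"}

def catalog_issues(df: pd.DataFrame) -> list[str]:
    issues = []
    # Schema stability: downstream joins break silently if a column disappears.
    missing_cols = EXPECTED_COLUMNS - set(df.columns)
    if missing_cols:
        issues.append(f"missing columns: {sorted(missing_cols)}")
        return issues
    # Completeness and duplicates on the join key.
    if df["product_id"].isna().any():
        issues.append("null product_id values")
    if df["product_id"].duplicated().any():
        issues.append("duplicate product_id values")
    # Boundary values: a zero or negative price corrupts margins everywhere downstream.
    if (df["price"] <= 0).any():
        issues.append("non-positive prices")
    return issues

print(catalog_issues(catalog))  # [] when the catalog is healthy
```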
Rule 3: keep tests clean and lean
Tests multiply, age, and start spamming. The team drowns in alerts and stops reacting.
Before adding a test, write a 3-line passport:
Goal (what it catches) → Owner (who fixes) → Action (what to do when triggered). If one of them is missing, don’t add the test (a code sketch of such a passport follows the lifecycle item below).
Alert budget:
Set a cap, e.g., max 5 alerts per day per team. If it’s exceeded, pause new tests and clean up the noisy ones.
Monthly review of value vs. noise:
Total triggers.
Useful (led to fixes).
False/duplicates.
0 useful in 90 days → delete/redo.
30% or more false positives → rewrite (adjust the threshold, window, or condition).
Duplicates → merge.
Exception calendar:
Plan for vendor downtime, maintenance windows, holidays. Adjust freshness thresholds during those times to avoid planned “false” alerts.
Lifecycle:
Draft → Active → Verified → Deprecated → Archived. Mark as Deprecated after 90 days of no value, Archive once replaced.
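A passport plus a lifecycle state is small enough to keep as structured metadata next to each test. A minimal sketch that mirrors the passport fields, the review thresholds, and the lifecycle above; everything else is an assumption:

```python
from dataclasses import dataclass
from enum import Enum

class TestLifecycle(Enum):
    DRAFT = "draft"
    ACTIVE = "active"
    VERIFIED = "verified"
    DEPRECATED = "deprecated"   # 90 days with no useful triggers
    ARCHIVED = "archived"       # replaced by a better check

@dataclass
class TestPassport:
    goal: str        # what the test catches
    owner: str       # who fixes it when it fires
    action: str      # what to do when it triggers
    state: TestLifecycle = TestLifecycle.DRAFT

def monthly_review(passport: TestPassport, useful_triggers_90d: int, false_rate: float) -> str:
    """Value-vs-noise review encoding the thresholds above."""
    if useful_triggers_90d == 0:
        passport.state = TestLifecycle.DEPRECATED
        return "deprecate: delete or redo"
    if false_rate >= 0.30:
        return "rewrite (threshold, window, condition)"
    return "keep"

# A test is only added once all three passport lines can be filled in.
orders_freshness = TestPassport(
    goal="Yesterday's orders loaded by 08:15",
    owner="#orders-data",
    action="Restart load_orders; fall back to the backup source if it fails again",
)
print(monthly_review(orders_freshness, useful_triggers_90d=3, false_rate=0.0))  # keep
```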
Example:
Test “Unusually few Monday orders” had 12 triggers, 10 of them false (holidays). After adding a holiday calendar and a waiting window until 10:00, it produced 3 triggers, 3 fixes, and 0 false positives. Usefulness up, noise gone.
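The fix in that example is mechanical: skip known quiet days and don’t evaluate the threshold before the waiting window closes. A sketch with a hypothetical holiday list and threshold:

```python
from datetime import date, datetime

HOLIDAYS = {date(2024, 12, 25), date(2025, 1, 1)}  # exception calendar (illustrative)
WAIT_UNTIL_HOUR = 10                                # don't judge Monday volume before 10:00

def low_monday_orders_alert(order_count: int, now: datetime, threshold: int = 100) -> bool:
    """Fire only on real anomalies: not on holidays, not before the waiting window closes."""
    if now.date() in HOLIDAYS:
        return False          # planned quiet day, not an incident
    if now.hour < WAIT_UNTIL_HOUR:
        return False          # orders are still arriving; re-check after 10:00
    return now.weekday() == 0 and order_count < threshold  # Monday and unusually low

print(low_monday_orders_alert(order_count=40, now=datetime(2024, 12, 23, 11, 0)))  # True
```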
Moral:
Fixing downstream errors is like bailing water; fixing source errors is like patching the leak.
Turn risk into trust
Shalion is an e-commerce analytics platform: clients see dashboards and exports they base decisions on. The product is the real-time data.
When data is the product, customer trust is the main KPI. Errors are inevitable: pipelines fail, schemas change, lags grow. The danger isn’t the error itself, but silence, blurred ownership, and a flood of empty alerts.
Here are Shalion’s working rules: how to make alerts actionable, protect team focus, agree on realistic reliability, and measure not just “what works” but “what’s under control.” This way, risk turns into trust.
Rule 1: every alert must lead to action
Each notification is a mini-plan, not a “We’re doomed” scream. An alert should include: what broke, where, who owns it, who’s affected, what to do (links to instructions, dependency chain). Agree on response times: acknowledgment in 5 minutes, escalation in 10. Then alerts get fixed, not just read.
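In practice this means the alert payload itself carries the plan. A minimal sketch of what such a payload could look like; the field names, channel, and runbook link are hypothetical:

```python
# An actionable alert is a mini-plan: what broke, where, who owns it,
# who is affected, and what to do next. All values here are illustrative.
alert = {
    "what": "freshness SLA missed on fct_attribution (no data since 02:00)",
    "where": "pipeline: load_ad_spend -> stg_ad_spend -> fct_attribution",
    "owner": "growth-data (#growth-data-alerts)",
    "affected": ["attribution dashboard", "weekly spend export"],
    "action": "Runbook: https://example.com/runbooks/attribution-freshness",
    "sla": {"acknowledge_min": 5, "escalate_min": 10},
}

def format_alert(a: dict) -> str:
    """Render the payload as the message the on-call person actually reads."""
    lines = [f"[{a['owner']}] {a['what']}",
             f"Where: {a['where']}",
             f"Affected: {', '.join(a['affected'])}",
             f"Do this: {a['action']}",
             f"Ack within {a['sla']['acknowledge_min']} min, escalate after {a['sla']['escalate_min']} min."]
    return "\n".join(lines)

print(format_alert(alert))
```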
Rule 2: protect team attention
Don’t spam 150 “freshness dropped” alerts when one pipeline fails. Send one root-cause alert listing all the impacted marts and the owner. Non-critical issues can go to dedicated low-traffic channels or weekly reviews.
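Deduplication like this needs nothing more than lineage: group the downstream failures by the upstream root that actually broke and send one message per root. A sketch with hypothetical mart and pipeline names:

```python
from collections import defaultdict

# Dozens of "freshness dropped" alerts collapse into one message per failed root.
# Each tuple is (impacted mart, root pipeline it ultimately depends on).
failing = [
    ("marts.attribution_daily", "load_ad_spend"),
    ("marts.campaign_roi", "load_ad_spend"),
    ("marts.cltv", "load_ad_spend"),
]

OWNERS = {"load_ad_spend": "#growth-data-alerts"}  # root pipeline -> owning channel

def root_cause_alerts(failures: list[tuple[str, str]]) -> list[str]:
    by_root: dict[str, list[str]] = defaultdict(list)
    for mart, root in failures:
        by_root[root].append(mart)
    return [
        f"{root} failed (owner {OWNERS.get(root, 'unknown')}); "
        f"impacted marts: {', '.join(sorted(marts))}"
        for root, marts in by_root.items()
    ]

for message in root_cause_alerts(failing):
    print(message)  # one alert instead of one per mart
```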
Rule 3: agree on reality in advance
Not everything needs 99.9%. For customer-facing data — yes. For internal marts — often 95% is enough. That’s normal engineering economics: expensive reliability doesn’t pay off where risk is low. Agree with clients on this before incidents.
At Shalion they don’t pretend errors never happen. They find them fast, fix them, and openly tell clients. That way, threats to trust become sources of trust. Hats off.
Rule 4: measure not just “how it works” but “what’s covered”
A green dashboard can hide red sources. If you only look at checks passing in the data mart, you miss how many critical objects aren’t monitored at all — failures will slip through and hit trust.
Track two metrics:
Coverage — share of critical objects you actually monitor.
Quality/SLA — share of checks passing on those objects.
How to read them:
Low Coverage, high Quality → an illusion of stability. First expand coverage to the critical upstream (connectors, key joins), otherwise you’ll miss the next failure.
High Coverage, low Quality → everything is visible, but works poorly. Fix the root causes: align with the source, optimize the pipeline, reduce MTTR, and strengthen tests where things actually break.
Both high → capture the practices and keep the level up.
If Coverage = 60% with Quality = 99%, that’s dangerous comfort: 40% of the chain is invisible. Raise Coverage to 90%, then improve Quality.
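Both numbers fall out of a simple inventory: the list of critical objects, which of them are monitored, and how their checks did over the period. A minimal sketch in Python; the object names and counts are made up for illustration:

```python
# Critical objects (from the product cards) vs. what is actually monitored.
critical_objects = {"raw.orders", "stg_payments", "fct_attribution", "dim_customers", "marts.cltv"}
monitored = {"fct_attribution", "dim_customers", "marts.cltv"}

# Check results on monitored objects over the period: object -> (passed, total).
check_results = {"fct_attribution": (58, 60), "dim_customers": (30, 30), "marts.cltv": (29, 30)}

coverage = len(monitored & critical_objects) / len(critical_objects)
passed = sum(p for p, _ in check_results.values())
total = sum(t for _, t in check_results.values())
quality = passed / total

print(f"Coverage: {coverage:.0%}, Quality: {quality:.0%}")
# Coverage 60% with Quality ~98% is exactly the "dangerous comfort" case above:
# expand monitoring to raw.orders and stg_payments before polishing the checks.
```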
If all checked data looks fine, it just means you didn’t check the broken part.
This guide isn’t tied to specific tools; it’s about principles. They work for a five-person team and a large data department alike, at any level of data maturity.
Save it, share it with colleagues.