When Databricks anomaly detection fires, what does the system table actually say?
The email said orders_daily completeness Unhealthy. Nine
minutes later the on-call ran three lines of SQL against
system.data_quality_monitoring.table_results and saw
completeness.daily_row_count.last_value = 412
against min_predicted_value = 9400.
The gap was always there — two struct levels deep, in a row
the email pretended did not exist.
table_resultscompleteness.daily_row_count.last_value vs min_predicted_valueDetection latency: 7 min (completeness) · 12 min (freshness)
- Platform
- Databricks on AWS, Unity-Catalog enabled, Lakehouse Monitoring on; Data Quality Monitoring UI alerts (Beta) opted in
- Features used
- Schema-level anomaly detection (Public Preview, Feb 2026); DQM UI alerts (Beta, Mar 2026)
- Bench tables
main.bench.orders_daily(completeness, 14-day seed);main.bench.events_hourly(freshness, hourly commits)- System surface
system.data_quality_monitoring.table_results· alerts configured in the DQM UI, routed to email + Slack incoming webhook- Cost basis
- Not applicable; this is a correctness & latency probe, not a billing one
The email said Unhealthy. The row said 412 against 9,400.
03:07 UTC. A Slack card on the phone said
orders_daily · completeness · Unhealthy.
Nine minutes of staring at that card produced exactly one new
piece of information: it was still Unhealthy. The on-call
SSH’d into a notebook, ran three lines against
system.data_quality_monitoring.table_results, and
the answer was on screen in eighteen seconds:
completeness.daily_row_count.last_value = 412
against a min_predicted_value of
9,400. The email had been carrying nothing the
whole nine minutes.
Most Databricks anomaly-detection posts on the practitioner web stop at the dashboard screenshot — a green-then-red sparkline with the word Unhealthy overlaid in a coral pill. None of them carry the gap. The gap was always there, two struct levels deep, in the row the email pretends does not exist. This note is that row: the corrected struct path, the on-call query with the column-name landmine the docs hide, and a measured detection latency on a controlled-trial workspace.
| Signal | Where it lands | Carries the gap? | Triage minutes |
|---|---|---|---|
| Email subject & one-line body | Inbox | No — names the monitor, not the metric | 12 |
| Slack incoming-webhook card | #data-ops | Partial — metric name, not values | 9 |
table_results row (SQL) |
System schema, queryable from any cluster | Yes — last_value vs min_predicted_value in one row |
2 |
| DQM UI dashboard | Workspace, behind two clicks | Yes — but a laptop-only surface | 5 |
The screenshot is what goes in the post-mortem. The row is what goes in the runbook. One is a story; the other is a contract.
The rest of this note is the contract. The bench that made the break unambiguous, the row the system table wrote, the query that pulls it (with the one column name that has to be right or nothing returns), and the UI alert that fires on the same write but never quotes the row.
Before that query returned 412, the break had to be unambiguous.
Before the three-line query from §1 could return anything
worth waking somebody for, the bench had to make the breakage
impossible to read any other way. Two tables on a trial
workspace, both governed by Lakehouse Monitoring with
schema-level anomaly detection (Public Preview,
Feb 2026) turned on. orders_daily seeded
with 14 days of synthetic rows, ~10,000 per calendar day, low
variance. events_hourly seeded with hourly
commits over the same window. The monitors:
| Table | Monitor type | What it watches | Break injected |
|---|---|---|---|
main.bench.orders_daily |
Completeness | Daily row count vs 14-day seed band | Insert 412 rows for day D (vs ~10,000 expected); commit |
main.bench.events_hourly |
Freshness | Minutes since last commit; alert if > 90 min | Stop the hourly producer at T0; wait |
Neither break is subtle. The completeness break is a row-count plunge of more than an order of magnitude versus the seed band; the freshness break is a hard stop. That is deliberate. The bench is not testing the detector’s sensitivity — that belongs in a separate note. It is making detection so easy that any latency measured is a lower bound on what to expect in the wild.
First break: the row-count plunge. The seed has ~10,000 rows for every prior day; we insert 412 for today and commit. The completeness monitor is the only thing watching, and its band sits between 9,400 and 10,800.
-- 14-day seed already exists with ~10,000 rows/day. Today: inject a plunge. INSERT INTO main.bench.orders_daily SELECT * FROM bench.synth.orders_template WHERE order_date = current_date() LIMIT 412; -- expected ~10,000, injecting ~ 412 COMMIT; -- Note the wall-clock to the second. This is T_break.
Wall-clock at COMMIT — recorded to the
second — is Tbreak. The detector now
owes us an Unhealthy row.
Second break: the freshness producer. The hourly workflow has
a 90-minute threshold; pausing it puts a hard stop on
last_value inside freshness.commit_freshness
and lets wall-clock do the rest.
# Workflow `events_hourly_producer` runs every 60 min and COMMITs one # batch into main.bench.events_hourly. We pause it; the threshold is 90 min. databricks jobs cancel-all-runs --job-id $EVENTS_HOURLY_JOB_ID databricks jobs reset --job-id $EVENTS_HOURLY_JOB_ID \ --new-settings '{"schedule":{"pause_status":"PAUSED"}}' # Record wall-clock. This is T_break for freshness.
Two Tbreak values on the wall. Whatever
Lakehouse Monitoring writes into table_results
from here on is the response to one of these two commits.
Back to the row. The gap is two struct levels deep.
Back to the row from §1 — the one the on-call
pulled at 03:16 UTC after nine minutes of an unhelpful Slack
card. Here is what was inside it. The shape catches a lot of
readers out: table_results has four top-level
struct columns —
freshness, completeness,
downstream_impact, root_cause_analysis
— and the actual / expected numbers live nested
inside the first two. total_row_count and
daily_row_count are not top-level columns; they
are measures inside completeness. The
discriminator is not which struct is populated — it is
the nested status field on each measure
(Healthy, Unhealthy, or
Unknown).
The completeness row from the bench, captured at first appearance. JSON serialisation is the Databricks default; field order is preserved for readability:
{
"catalog_name": "main",
"schema_name": "bench",
"table_name": "orders_daily",
"completeness": {
"daily_row_count": {
"status": "Unhealthy",
"last_value": 412,
"min_predicted_value": 9400,
"max_predicted_value": 10800
},
"total_row_count": {
"status": "Healthy"
}
},
"downstream_impact": {
"impact_level": 2,
"num_downstream_tables": 4,
"num_queries_on_affected_tables": 31
},
"root_cause_analysis": {
"upstream_jobs": [
{ "job_id": "812944",
"run_page_url": "https://<workspace>/jobs/812944/runs/47" }
]
},
"event_time": "2026-05-18T03:07:00Z"
}
The number that justifies the 03:00 wake-up is
completeness.daily_row_count.last_value = 412
against min_predicted_value = 9400. That is the
whole post in two struct levels. The
14-day seed window concept lives in the prose
and the docs, not as a column — there is no
seed_window_days field on the row.
And the freshness row from the same run.
commit_freshness is an array of structs
(one entry per measured commit), and the “147 minutes
since last commit” quoted in the email body is
derived —
current_timestamp() - last_value. The 90-minute
threshold from the producer’s configuration shows up as
the model’s predicted_value for the next
commit, not a literal threshold column.
{
"catalog_name": "main",
"schema_name": "bench",
"table_name": "events_hourly",
"freshness": {
"commit_freshness": [
{
"status": "Unhealthy",
"error_code": "FRESHNESS_EXCEEDED",
"last_value": "2026-05-18T00:45:00Z",
"predicted_value": "2026-05-18T01:45:00Z"
}
]
},
"event_time": "2026-05-18T03:12:00Z"
}
At event_time = 03:12 UTC, with
last_value = 00:45 UTC, the gap is 147 minutes.
That number is one subtraction the on-call does in the SQL
editor; the row itself just carries the timestamp.
- The nested
statuson each measure — not the top-level struct — is the discriminator. Every measure (daily_row_count,total_row_count,commit_freshness,percent_null) carries its ownHealthy/Unhealthy/Unknown. TheUnknownvalue is the surprise: it fires when the seed window has not filled yet, which can read like a false alarm on freshly-monitored tables. <struct>.<measure>.last_value+min_predicted_value/max_predicted_value— the actual-vs-expected pair, two struct levels deep. For freshness thelast_valueis a timestamp and the “minutes since commit” gap is derived in SQL; for completeness it is an integer row count and the band is a literal pair.downstream_impact.impact_level— an integer 0–4 scale that combines the dependent-table count with whether anything downstream is itself Unhealthy. The on-call uses it to decide whether to wake a second person; the counts (num_downstream_tables,num_queries_on_affected_tables) are there for the post-mortem.root_cause_analysis.upstream_jobs[0].run_page_url— the field the on-call clicks at 03:00. Unity Catalog lineage resolves the most-likely upstream job and writes its run page URL directly into the row. One link, no hunting.
The email body is “a monitor became unhealthy.” The row is “orders_daily.completeness.daily_row_count.last_valueis412against a min predicted of9,400.” The first sentence wakes you up. The second tells you what to do. — Field report, §3
The query from the dek. And the column name the docs hide.
The three-line query from the dek — the one that turned
nine minutes of staring at Slack into eighteen seconds of
having the gap on screen — is below. There is one place
the copy-paste breaks for new readers, and it is the column
name: system.data_quality_monitoring.table_results
uses catalog_name and schema_name,
not table_catalog / table_schema
like several other system tables. A query with the wrong
names returns zero rows with no error. At 03:00 this is the
kind of difference that costs a quarter of an hour.
-- Replace catalog/schema or drop the predicates to widen the search. SELECT table_name, completeness, freshness, downstream_impact, root_cause_analysis, event_time FROM system.data_quality_monitoring.table_results WHERE catalog_name = '<your_catalog>' AND schema_name = '<your_schema>' AND (completeness.daily_row_count.status = 'Unhealthy' OR completeness.total_row_count.status = 'Unhealthy' OR exists(freshness.commit_freshness, x -> x.status = 'Unhealthy')) ORDER BY event_time DESC LIMIT 10;
That returns the ten most recent rows with at least one Unhealthy measure in completeness or freshness. The three struct columns come back as nested values; in the SQL editor each one is one expand-arrow away from the gap.
For the two check types on the bench, the on-call wants two sharper queries — the ones that turn the nested row into a single number per output column. The completeness gap is a straight projection of two scalar fields:
SELECT table_name, completeness.daily_row_count.last_value AS actual_rows, completeness.daily_row_count.min_predicted_value AS expected_min, completeness.daily_row_count.max_predicted_value AS expected_max, completeness.daily_row_count.status AS status FROM system.data_quality_monitoring.table_results WHERE completeness.daily_row_count.status = 'Unhealthy';
That returned a single row: orders_daily,
actual_rows = 412,
expected_min = 9,400,
expected_max = 10,800,
status = Unhealthy. The gap is a subtraction in
the reader’s head; the row carries the operands.
Freshness is harder, and this is the part the docs gloss over.
freshness.commit_freshness is an array, so the
last-value timestamp comes out of element zero, and the
“147 minutes since commit” quoted in the email
body has to be computed in SQL with
timestampdiff. Array indexing inside a struct
path; not pretty, but it is the only way:
SELECT table_name, freshness.commit_freshness[0].last_value AS last_commit_ts, timestampdiff(MINUTE, freshness.commit_freshness[0].last_value, current_timestamp()) AS minutes_since_commit, freshness.commit_freshness[0].status AS status FROM system.data_quality_monitoring.table_results WHERE freshness.commit_freshness[0].status = 'Unhealthy';
Run on the bench at 03:16 UTC, this returned
events_hourly,
last_commit_ts = 00:45 UTC,
minutes_since_commit = 151,
status = Unhealthy. (Detection itself fired four
minutes earlier at event_time = 03:12 UTC with
147 minutes of gap; the on-call query just runs whenever the
on-call runs it.) Total wall-clock from opening the SQL
editor to having both gaps on screen: 18 seconds.
- Default access is account admins only.
system.data_quality_monitoringis one of the system schemas that has to be enabled per metastore, and reads are restricted to account admins by default. The on-call needs an explicitUSE SCHEMA+SELECTgrant on the schema (or membership in a group that has them) before the query above returns anything. Quietly the most common reason a runbook query “returns no rows” on a healthy workspace.
- The model never produces a number. Every value above —
last_value,min_predicted_value,minutes_since_commit,event_time— comes from thetable_resultsrow as written by the monitor, or is derived from one of its scalar fields in SQL. No LLM, no inference, no “here’s what it probably says.” - A human ratifies everything. The on-call reads the row, decides whether the gap is real or seasonal, and either pages or stands down. The alert opens the conversation; it does not close it.
The row from §1 also surfaces a UI alert — but only if the wiring is right.
The same row that surfaced the gap in §1 also fires the
UI alert. They share the write: one append to
table_results, two surfaces. The UI surface is
configured outside SQL. As of the Beta release in
March 2026 there is no Terraform resource for the alert
routing yet (the monitor itself does have one). The path is
short: Catalog Explorer → the table →
Quality tab → Create alert. Three things to set,
and one new affordance in Catalog Explorer worth knowing about.
- Condition. “Status becomes Unhealthy” on the selected monitor. The single-monitor case is the only one that maps cleanly to one alert.
- Threshold. Optional. For the freshness monitor on the bench, the alert fires when the model’s
predicted_valueis exceeded by the observed commit cadence; there is no separate threshold knob in the UI for the row count case — the band is the seed-band edge. - Routing. Email recipients and/or a Slack incoming-webhook URL. Both are first-class; both fire on the same row append to
table_results.
Wired correctly, the alert payload that lands in email or
Slack is the Databricks default summary — monitor name,
table name, status, timestamp, and a deep link back to the
DQM tab. It does not include the last_value
or min_predicted_value values; for those, the
on-call goes to the row. The post above is, in effect, the
case for treating the row as the alert and the email as the
doorbell.
There is one new piece of UI worth flagging: in the March 2026
Public-Preview update, Catalog Explorer started badging tables
in the tree with a health indicator (a small coloured dot)
whenever any nested measure is Unhealthy or Unknown.
It reads off the same status fields. It is the
first surface where a browsing on-call sees the breakage
without opening the Quality tab.
The UI alert is one click to create and one screen to read. The row in the system table is what makes the on-call useful. Both ship from the same write; only one of them carries the gap.
- The
percent_nullmeasure (Beta). A third nested measure that lives inside the completeness struct and watches the per-column null rate against a learned baseline. Mechanics are a separate note; the struct path slots into the query in §4 with one moreORon the WHERE. - Joining
root_cause_analysis.upstream_jobs. The full pattern for resolving the upstream-job link to the actual run record and the lineage edges that produced it — including what the row carries when Unity Catalog lineage is partial. - Catalog Explorer health badges, end to end. Public Preview as of 13 March 2026; what the colours mean, what they read off, and whether they cache (they do).
system.data_quality_monitoring
was never pulled through. A Crosshire audit ships the grant,
the on-call query, the two struct-path queries (completeness
and freshness, with the array-indexed timestamp gap), and the
Slack/email wiring for your top-ten production tables —
with a human reviewing every threshold before it ships. You
keep the queries, the alert configuration, and the runbook.
- Databricks docs · Lakehouse Monitoring — monitor types, seed-window mechanics, and the schema-level anomaly detection Preview overview.
- Databricks docs · Data Quality Monitoring (Beta) — UI walkthrough, alert creation, and supported routing destinations.
- Databricks docs ·
system.data_quality_monitoring.table_results— column-level reference for the row this note dissects. - Databricks docs · System tables, governance and grants — access model the on-call query above depends on.
- Databricks docs · Unity Catalog data lineage — the source the counts in
downstream_impactare derived from; joinsystem.access.table_lineageto recover the actual table list.
The bench numbers (412 against 9,400–10,800; 147 minutes
since commit; detection latencies of 7 and 12 minutes) come from
a Crosshire trial workspace on Databricks AWS with Lakehouse
Monitoring schema-level anomaly detection (Public Preview,
Feb 2026) enabled. The struct paths, column names, and
on-call queries are from the May 2026 Databricks docs for
system.data_quality_monitoring.table_results and
reproducible against any workspace with the schema enabled and
the grants pulled through. — Crosshire
The Databricks alert wakes you. The row, two struct levels deep, tells you what to do.
The Slack card said orders_daily completeness Unhealthy.
Nine minutes later, one query into
system.data_quality_monitoring.table_results returned
daily_row_count.last_value = 412
against min_predicted_value = 9400.
The gap was always there — just nested two struct levels
inside a row most practitioner posts never show.
completeness.daily_row_countlast_value vs min_predicted_value, two struct levels deepDetection: 7 min (completeness) · 12 min (freshness)
1Two bench tables on a trial workspace: orders_daily (completeness, 14-day seed) and events_hourly (hourly commits, freshness watched). Both governed by Lakehouse Monitoring with schema-level anomaly detection (Public Preview, Feb 2026).
2Injected two unambiguous breaks: a row-count plunge into orders_daily (412 rows vs ~10,000 expected), and a hard pause of the events_hourly producer. Recorded Tbreak to the second.
3Polled system.data_quality_monitoring.table_results — with catalog_name, not table_catalog — until the nested status on each measure flipped to Unhealthy. Detection: 7 min (completeness), 12 min (freshness). The gap came out of one nested struct path; the email never carried it.
The email is a doorbell. It does not carry the gap.
Lakehouse Monitoring alerts (Beta, March 2026) land an
email and a Slack card the moment any nested measure inside
the row flips to Unhealthy. Neither payload
carries the offending number. The on-call gets the table name,
the monitor name, and a timestamp. The decision — page
a second person or stand down — needs the gap, and the
gap is somewhere else.
The gap is two struct levels deep. That is why nobody shows it.
Every Unhealthy state writes a row into
system.data_quality_monitoring.table_results.
The row has four top-level struct columns —
freshness, completeness,
downstream_impact,
root_cause_analysis — and the
actual-vs-predicted pair lives nested inside the
first two. completeness.daily_row_count.last_value
and completeness.daily_row_count.min_predicted_value
are not column names you will guess: they are two struct
levels deep, with the nested status field as the
discriminator (Healthy / Unhealthy /
Unknown). The dashboard screenshot is easier to
post than the path; that is the entire reason practitioner
posts skip it.
One query into the runbook. And one column name that has to be right.
Paste this into the runbook beside the alert. The one trap is
the column name: this system table uses
catalog_name and schema_name, not
the table_catalog / table_schema
that several other system tables expose. Wrong names return
zero rows with no error — the kind of thing that costs
a quarter of an hour at 03:00.
SELECT table_name, completeness.daily_row_count.last_value AS actual_rows, completeness.daily_row_count.min_predicted_value AS expected_min, completeness.daily_row_count.status AS completeness_status, freshness.commit_freshness[0].last_value AS last_commit_ts, freshness.commit_freshness[0].status AS freshness_status, event_time FROM system.data_quality_monitoring.table_results WHERE catalog_name = '<your_catalog>' AND schema_name = '<your_schema>' ORDER BY event_time DESC LIMIT 10;
On the bench, this returned both rows in eighteen seconds.
orders_daily with actual_rows = 412,
expected_min = 9,400; events_hourly
with last_commit_ts = 00:45 UTC. The DQM UI
returns the same data; it costs two clicks and a browser tab.
The runbook stops needing one.
One caveat the docs underplay: the
system.data_quality_monitoring schema is
admin-restricted by default. The on-call needs an explicit
grant on the schema before the query above returns anything.
Pull that through once, when the runbook lands.
- The full JSON of both rows. Completeness and freshness, field by field, with the array-indexed
commit_freshness[0]path that makes the “147 minutes since commit” gap atimestampdiffcall rather than a column read. root_cause_analysis.upstream_jobs[0].run_page_url. The field the on-call clicks at 03:00 — Unity Catalog lineage writing the most-likely upstream job link directly into the row.- The UI wiring, end to end. Catalog Explorer → Quality tab → Create alert → email + Slack routing — with the three settings that matter, the March 2026 health-indicator badges in Catalog Explorer, and the two pieces there is no Terraform for yet.