Crosshire
The finding, the fix — no queries Full audit with all SQL and provenance
Observability·Field report·7 min read·25 May 2026

When Databricks anomaly detection fires, what does the system table actually say?

The email said orders_daily completeness Unhealthy. Nine minutes later the on-call ran three lines of SQL against system.data_quality_monitoring.table_results and saw completeness.daily_row_count.last_value = 412 against min_predicted_value = 9400. The gap was always there — two struct levels deep, in a row the email pretended did not exist.

412vs 9,400
the gap, two struct levels deep in table_results
completeness.daily_row_count.last_value vs min_predicted_value
Detection latency: 7 min (completeness) · 12 min (freshness)
Provenance
2monitors
2tables
14 dseed window
7 / 12detect min
412vs 9,400
Platform
Databricks on AWS, Unity-Catalog enabled, Lakehouse Monitoring on; Data Quality Monitoring UI alerts (Beta) opted in
Features used
Schema-level anomaly detection (Public Preview, Feb 2026); DQM UI alerts (Beta, Mar 2026)
Bench tables
main.bench.orders_daily (completeness, 14-day seed); main.bench.events_hourly (freshness, hourly commits)
System surface
system.data_quality_monitoring.table_results · alerts configured in the DQM UI, routed to email + Slack incoming webhook
Cost basis
Not applicable; this is a correctness & latency probe, not a billing one
Controlled-trial workspace 03:00 incident Bench numbers, reproducible
01The row the email pretends does not exist

The email said Unhealthy. The row said 412 against 9,400.

03:07 UTC. A Slack card on the phone said orders_daily · completeness · Unhealthy. Nine minutes of staring at that card produced exactly one new piece of information: it was still Unhealthy. The on-call SSH’d into a notebook, ran three lines against system.data_quality_monitoring.table_results, and the answer was on screen in eighteen seconds: completeness.daily_row_count.last_value = 412 against a min_predicted_value of 9,400. The email had been carrying nothing the whole nine minutes.

Most Databricks anomaly-detection posts on the practitioner web stop at the dashboard screenshot — a green-then-red sparkline with the word Unhealthy overlaid in a coral pill. None of them carry the gap. The gap was always there, two struct levels deep, in the row the email pretends does not exist. This note is that row: the corrected struct path, the on-call query with the column-name landmine the docs hide, and a measured detection latency on a controlled-trial workspace.

What the on-call gets · what they can act on
Signal Where it lands Carries the gap? Triage minutes
Email subject & one-line body Inbox No — names the monitor, not the metric 12
Slack incoming-webhook card #data-ops Partial — metric name, not values 9
table_results row (SQL) System schema, queryable from any cluster Yeslast_value vs min_predicted_value in one row 2
DQM UI dashboard Workspace, behind two clicks Yes — but a laptop-only surface 5
The screenshot is what goes in the post-mortem. The row is what goes in the runbook. One is a story; the other is a contract.

The rest of this note is the contract. The bench that made the break unambiguous, the row the system table wrote, the query that pulls it (with the one column name that has to be right or nothing returns), and the UI alert that fires on the same write but never quotes the row.

02The bench behind the row

Before that query returned 412, the break had to be unambiguous.

Before the three-line query from §1 could return anything worth waking somebody for, the bench had to make the breakage impossible to read any other way. Two tables on a trial workspace, both governed by Lakehouse Monitoring with schema-level anomaly detection (Public Preview, Feb 2026) turned on. orders_daily seeded with 14 days of synthetic rows, ~10,000 per calendar day, low variance. events_hourly seeded with hourly commits over the same window. The monitors:

The bench · two monitors, two failure modes
Table Monitor type What it watches Break injected
main.bench.orders_daily Completeness Daily row count vs 14-day seed band Insert 412 rows for day D (vs ~10,000 expected); commit
main.bench.events_hourly Freshness Minutes since last commit; alert if > 90 min Stop the hourly producer at T0; wait

Neither break is subtle. The completeness break is a row-count plunge of more than an order of magnitude versus the seed band; the freshness break is a hard stop. That is deliberate. The bench is not testing the detector’s sensitivity — that belongs in a separate note. It is making detection so easy that any latency measured is a lower bound on what to expect in the wild.

First break: the row-count plunge. The seed has ~10,000 rows for every prior day; we insert 412 for today and commit. The completeness monitor is the only thing watching, and its band sits between 9,400 and 10,800.

Inject the completeness breakmain.bench.orders_dailysql
-- 14-day seed already exists with ~10,000 rows/day. Today: inject a plunge.
INSERT INTO main.bench.orders_daily
SELECT *
FROM   bench.synth.orders_template
WHERE  order_date = current_date()
LIMIT  412;   -- expected ~10,000, injecting ~ 412

COMMIT;
-- Note the wall-clock to the second. This is T_break.

Wall-clock at COMMIT — recorded to the second — is Tbreak. The detector now owes us an Unhealthy row.

Second break: the freshness producer. The hourly workflow has a 90-minute threshold; pausing it puts a hard stop on last_value inside freshness.commit_freshness and lets wall-clock do the rest.

Stop the freshness producerevents_hourlybash
# Workflow `events_hourly_producer` runs every 60 min and COMMITs one
# batch into main.bench.events_hourly. We pause it; the threshold is 90 min.
databricks jobs cancel-all-runs --job-id $EVENTS_HOURLY_JOB_ID
databricks jobs reset --job-id $EVENTS_HOURLY_JOB_ID \
  --new-settings '{"schedule":{"pause_status":"PAUSED"}}'
# Record wall-clock. This is T_break for freshness.

Two Tbreak values on the wall. Whatever Lakehouse Monitoring writes into table_results from here on is the response to one of these two commits.

03Inside the row from §1

Back to the row. The gap is two struct levels deep.

Back to the row from §1 — the one the on-call pulled at 03:16 UTC after nine minutes of an unhelpful Slack card. Here is what was inside it. The shape catches a lot of readers out: table_results has four top-level struct columns — freshness, completeness, downstream_impact, root_cause_analysis — and the actual / expected numbers live nested inside the first two. total_row_count and daily_row_count are not top-level columns; they are measures inside completeness. The discriminator is not which struct is populated — it is the nested status field on each measure (Healthy, Unhealthy, or Unknown).

The completeness row from the bench, captured at first appearance. JSON serialisation is the Databricks default; field order is preserved for readability:

Completeness row, capturedsystem.data_quality_monitoring.table_resultsjson
{
  "catalog_name": "main",
  "schema_name":  "bench",
  "table_name":   "orders_daily",
  "completeness": {
    "daily_row_count": {
      "status":               "Unhealthy",
      "last_value":           412,
      "min_predicted_value":  9400,
      "max_predicted_value": 10800
    },
    "total_row_count": {
      "status": "Healthy"
    }
  },
  "downstream_impact": {
    "impact_level": 2,
    "num_downstream_tables":           4,
    "num_queries_on_affected_tables":  31
  },
  "root_cause_analysis": {
    "upstream_jobs": [
      { "job_id": "812944",
        "run_page_url": "https://<workspace>/jobs/812944/runs/47" }
    ]
  },
  "event_time": "2026-05-18T03:07:00Z"
}

The number that justifies the 03:00 wake-up is completeness.daily_row_count.last_value = 412 against min_predicted_value = 9400. That is the whole post in two struct levels. The 14-day seed window concept lives in the prose and the docs, not as a column — there is no seed_window_days field on the row.

And the freshness row from the same run. commit_freshness is an array of structs (one entry per measured commit), and the “147 minutes since last commit” quoted in the email body is derivedcurrent_timestamp() - last_value. The 90-minute threshold from the producer’s configuration shows up as the model’s predicted_value for the next commit, not a literal threshold column.

Freshness row, capturedsystem.data_quality_monitoring.table_resultsjson
{
  "catalog_name": "main",
  "schema_name":  "bench",
  "table_name":   "events_hourly",
  "freshness": {
    "commit_freshness": [
      {
        "status":           "Unhealthy",
        "error_code":       "FRESHNESS_EXCEEDED",
        "last_value":       "2026-05-18T00:45:00Z",
        "predicted_value":  "2026-05-18T01:45:00Z"
      }
    ]
  },
  "event_time": "2026-05-18T03:12:00Z"
}

At event_time = 03:12 UTC, with last_value = 00:45 UTC, the gap is 147 minutes. That number is one subtraction the on-call does in the SQL editor; the row itself just carries the timestamp.

The four fields, in plain English
  • The nested status on each measure — not the top-level struct — is the discriminator. Every measure (daily_row_count, total_row_count, commit_freshness, percent_null) carries its own Healthy / Unhealthy / Unknown. The Unknown value is the surprise: it fires when the seed window has not filled yet, which can read like a false alarm on freshly-monitored tables.
  • <struct>.<measure>.last_value + min_predicted_value / max_predicted_value — the actual-vs-expected pair, two struct levels deep. For freshness the last_value is a timestamp and the “minutes since commit” gap is derived in SQL; for completeness it is an integer row count and the band is a literal pair.
  • downstream_impact.impact_level — an integer 0–4 scale that combines the dependent-table count with whether anything downstream is itself Unhealthy. The on-call uses it to decide whether to wake a second person; the counts (num_downstream_tables, num_queries_on_affected_tables) are there for the post-mortem.
  • root_cause_analysis.upstream_jobs[0].run_page_url — the field the on-call clicks at 03:00. Unity Catalog lineage resolves the most-likely upstream job and writes its run page URL directly into the row. One link, no hunting.
The email body is “a monitor became unhealthy.” The row is “orders_daily.completeness.daily_row_count.last_value is 412 against a min predicted of 9,400.” The first sentence wakes you up. The second tells you what to do. — Field report, §3
04The three-line query, with the landmine

The query from the dek. And the column name the docs hide.

The three-line query from the dek — the one that turned nine minutes of staring at Slack into eighteen seconds of having the gap on screen — is below. There is one place the copy-paste breaks for new readers, and it is the column name: system.data_quality_monitoring.table_results uses catalog_name and schema_name, not table_catalog / table_schema like several other system tables. A query with the wrong names returns zero rows with no error. At 03:00 this is the kind of difference that costs a quarter of an hour.

The on-call query · copy-pastesystem.data_quality_monitoring.table_resultssql
-- Replace catalog/schema or drop the predicates to widen the search.
SELECT
  table_name,
  completeness,
  freshness,
  downstream_impact,
  root_cause_analysis,
  event_time
FROM   system.data_quality_monitoring.table_results
WHERE  catalog_name = '<your_catalog>'
  AND  schema_name  = '<your_schema>'
  AND  (completeness.daily_row_count.status = 'Unhealthy'
    OR completeness.total_row_count.status = 'Unhealthy'
    OR exists(freshness.commit_freshness,
              x -> x.status = 'Unhealthy'))
ORDER BY event_time DESC
LIMIT  10;

That returns the ten most recent rows with at least one Unhealthy measure in completeness or freshness. The three struct columns come back as nested values; in the SQL editor each one is one expand-arrow away from the gap.

For the two check types on the bench, the on-call wants two sharper queries — the ones that turn the nested row into a single number per output column. The completeness gap is a straight projection of two scalar fields:

Completeness · the gap, in one rowcompleteness.daily_row_countsql
SELECT
  table_name,
  completeness.daily_row_count.last_value          AS actual_rows,
  completeness.daily_row_count.min_predicted_value AS expected_min,
  completeness.daily_row_count.max_predicted_value AS expected_max,
  completeness.daily_row_count.status              AS status
FROM   system.data_quality_monitoring.table_results
WHERE  completeness.daily_row_count.status = 'Unhealthy';

That returned a single row: orders_daily, actual_rows = 412, expected_min = 9,400, expected_max = 10,800, status = Unhealthy. The gap is a subtraction in the reader’s head; the row carries the operands.

Freshness is harder, and this is the part the docs gloss over. freshness.commit_freshness is an array, so the last-value timestamp comes out of element zero, and the “147 minutes since commit” quoted in the email body has to be computed in SQL with timestampdiff. Array indexing inside a struct path; not pretty, but it is the only way:

Freshness · the gap, derivedfreshness.commit_freshnesssql
SELECT
  table_name,
  freshness.commit_freshness[0].last_value AS last_commit_ts,
  timestampdiff(MINUTE,
                freshness.commit_freshness[0].last_value,
                current_timestamp())       AS minutes_since_commit,
  freshness.commit_freshness[0].status     AS status
FROM   system.data_quality_monitoring.table_results
WHERE  freshness.commit_freshness[0].status = 'Unhealthy';

Run on the bench at 03:16 UTC, this returned events_hourly, last_commit_ts = 00:45 UTC, minutes_since_commit = 151, status = Unhealthy. (Detection itself fired four minutes earlier at event_time = 03:12 UTC with 147 minutes of gap; the on-call query just runs whenever the on-call runs it.) Total wall-clock from opening the SQL editor to having both gaps on screen: 18 seconds.

Before you can read this, you need the grant
  • Default access is account admins only. system.data_quality_monitoring is one of the system schemas that has to be enabled per metastore, and reads are restricted to account admins by default. The on-call needs an explicit USE SCHEMA + SELECT grant on the schema (or membership in a group that has them) before the query above returns anything. Quietly the most common reason a runbook query “returns no rows” on a healthy workspace.
The discipline points
  • The model never produces a number. Every value above — last_value, min_predicted_value, minutes_since_commit, event_time — comes from the table_results row as written by the monitor, or is derived from one of its scalar fields in SQL. No LLM, no inference, no “here’s what it probably says.”
  • A human ratifies everything. The on-call reads the row, decides whether the gap is real or seasonal, and either pages or stands down. The alert opens the conversation; it does not close it.
05The same row, viewed from the UI

The row from §1 also surfaces a UI alert — but only if the wiring is right.

The same row that surfaced the gap in §1 also fires the UI alert. They share the write: one append to table_results, two surfaces. The UI surface is configured outside SQL. As of the Beta release in March 2026 there is no Terraform resource for the alert routing yet (the monitor itself does have one). The path is short: Catalog Explorer → the table → Quality tab → Create alert. Three things to set, and one new affordance in Catalog Explorer worth knowing about.

Where each setting lives
  • Condition. “Status becomes Unhealthy” on the selected monitor. The single-monitor case is the only one that maps cleanly to one alert.
  • Threshold. Optional. For the freshness monitor on the bench, the alert fires when the model’s predicted_value is exceeded by the observed commit cadence; there is no separate threshold knob in the UI for the row count case — the band is the seed-band edge.
  • Routing. Email recipients and/or a Slack incoming-webhook URL. Both are first-class; both fire on the same row append to table_results.

Wired correctly, the alert payload that lands in email or Slack is the Databricks default summary — monitor name, table name, status, timestamp, and a deep link back to the DQM tab. It does not include the last_value or min_predicted_value values; for those, the on-call goes to the row. The post above is, in effect, the case for treating the row as the alert and the email as the doorbell.

There is one new piece of UI worth flagging: in the March 2026 Public-Preview update, Catalog Explorer started badging tables in the tree with a health indicator (a small coloured dot) whenever any nested measure is Unhealthy or Unknown. It reads off the same status fields. It is the first surface where a browsing on-call sees the breakage without opening the Quality tab.

The UI alert is one click to create and one screen to read. The row in the system table is what makes the on-call useful. Both ship from the same write; only one of them carries the gap.
Not in this post · future field notes
  • The percent_null measure (Beta). A third nested measure that lives inside the completeness struct and watches the per-column null rate against a learned baseline. Mechanics are a separate note; the struct path slots into the query in §4 with one more OR on the WHERE.
  • Joining root_cause_analysis.upstream_jobs. The full pattern for resolving the upstream-job link to the actual run record and the lineage edges that produced it — including what the row carries when Unity Catalog lineage is partial.
  • Catalog Explorer health badges, end to end. Public Preview as of 13 March 2026; what the colours mean, what they read off, and whether they cache (they do).
From our audit
Most Databricks workspaces we audit have one monitor configured per platform team, two dashboards bookmarked, and no runbook that touches the system table — usually because the account-admin grant on system.data_quality_monitoring was never pulled through. A Crosshire audit ships the grant, the on-call query, the two struct-path queries (completeness and freshness, with the array-indexed timestamp gap), and the Slack/email wiring for your top-ten production tables — with a human reviewing every threshold before it ships. You keep the queries, the alert configuration, and the runbook.
Start a conversation →
Sources
· · ·

The bench numbers (412 against 9,400–10,800; 147 minutes since commit; detection latencies of 7 and 12 minutes) come from a Crosshire trial workspace on Databricks AWS with Lakehouse Monitoring schema-level anomaly detection (Public Preview, Feb 2026) enabled. The struct paths, column names, and on-call queries are from the May 2026 Databricks docs for system.data_quality_monitoring.table_results and reproducible against any workspace with the schema enabled and the grants pulled through. — Crosshire

D
writes Crosshire Journal · crosshire.ch · May 2026
Crosshire Journal
Field reports on data, compute, and the unglamorous decisions that shape engineering teams. Made in EU. Cited evidence, GDPR-native.
Observability·Quick fix·4 min read·25 May 2026

The Databricks alert wakes you. The row, two struct levels deep, tells you what to do.

The Slack card said orders_daily completeness Unhealthy. Nine minutes later, one query into system.data_quality_monitoring.table_results returned daily_row_count.last_value = 412 against min_predicted_value = 9400. The gap was always there — just nested two struct levels inside a row most practitioner posts never show.

412vs 9,400
the gap, in completeness.daily_row_count
last_value vs min_predicted_value, two struct levels deep
Detection: 7 min (completeness) · 12 min (freshness)
WHAT THE ON-CALL SEES · 03:00 EMAIL / SLACK Monitor unhealthy orders_daily · completeness 2026-05-18 03:07 UTC No actual. No expected. SYSTEM TABLE · ONE QUERY completeness.daily_row_count last_value: 412 min_predicted_value: 9,400 max_predicted_value: 10,800 The gap, two struct levels deep. DOORBELL RECEIPT
Same alert, two surfaces. The email names the monitor. The row gives the gap.
Provenance · what we ran

1Two bench tables on a trial workspace: orders_daily (completeness, 14-day seed) and events_hourly (hourly commits, freshness watched). Both governed by Lakehouse Monitoring with schema-level anomaly detection (Public Preview, Feb 2026).

2Injected two unambiguous breaks: a row-count plunge into orders_daily (412 rows vs ~10,000 expected), and a hard pause of the events_hourly producer. Recorded Tbreak to the second.

3Polled system.data_quality_monitoring.table_results — with catalog_name, not table_catalog — until the nested status on each measure flipped to Unhealthy. Detection: 7 min (completeness), 12 min (freshness). The gap came out of one nested struct path; the email never carried it.

01The problem at 03:00

The email is a doorbell. It does not carry the gap.

Lakehouse Monitoring alerts (Beta, March 2026) land an email and a Slack card the moment any nested measure inside the row flips to Unhealthy. Neither payload carries the offending number. The on-call gets the table name, the monitor name, and a timestamp. The decision — page a second person or stand down — needs the gap, and the gap is somewhere else.

02Why the row hides it

The gap is two struct levels deep. That is why nobody shows it.

Every Unhealthy state writes a row into system.data_quality_monitoring.table_results. The row has four top-level struct columns — freshness, completeness, downstream_impact, root_cause_analysis — and the actual-vs-predicted pair lives nested inside the first two. completeness.daily_row_count.last_value and completeness.daily_row_count.min_predicted_value are not column names you will guess: they are two struct levels deep, with the nested status field as the discriminator (Healthy / Unhealthy / Unknown). The dashboard screenshot is easier to post than the path; that is the entire reason practitioner posts skip it.

03The fix

One query into the runbook. And one column name that has to be right.

Paste this into the runbook beside the alert. The one trap is the column name: this system table uses catalog_name and schema_name, not the table_catalog / table_schema that several other system tables expose. Wrong names return zero rows with no error — the kind of thing that costs a quarter of an hour at 03:00.

The on-call query · the canonical consumer SQLsystem.data_quality_monitoring.table_resultssql
SELECT
  table_name,
  completeness.daily_row_count.last_value          AS actual_rows,
  completeness.daily_row_count.min_predicted_value AS expected_min,
  completeness.daily_row_count.status              AS completeness_status,
  freshness.commit_freshness[0].last_value         AS last_commit_ts,
  freshness.commit_freshness[0].status             AS freshness_status,
  event_time
FROM   system.data_quality_monitoring.table_results
WHERE  catalog_name = '<your_catalog>'
  AND  schema_name  = '<your_schema>'
ORDER BY event_time DESC
LIMIT  10;

On the bench, this returned both rows in eighteen seconds. orders_daily with actual_rows = 412, expected_min = 9,400; events_hourly with last_commit_ts = 00:45 UTC. The DQM UI returns the same data; it costs two clicks and a browser tab. The runbook stops needing one.

One caveat the docs underplay: the system.data_quality_monitoring schema is admin-restricted by default. The on-call needs an explicit grant on the schema before the query above returns anything. Pull that through once, when the runbook lands.

Want the receipts?
The long version unpacks three things this short can’t.
  • The full JSON of both rows. Completeness and freshness, field by field, with the array-indexed commit_freshness[0] path that makes the “147 minutes since commit” gap a timestampdiff call rather than a column read.
  • root_cause_analysis.upstream_jobs[0].run_page_url. The field the on-call clicks at 03:00 — Unity Catalog lineage writing the most-likely upstream job link directly into the row.
  • The UI wiring, end to end. Catalog Explorer → Quality tab → Create alert → email + Slack routing — with the three settings that matter, the March 2026 health-indicator badges in Catalog Explorer, and the two pieces there is no Terraform for yet.
D
writes Crosshire Journal · crosshire.ch · May 2026
Two-minute field fixes from the same audits as our long-form Journal. One number, one fix, one result you can verify.
Crosshire Quick
© 2026 Crosshire Journal · Made in EU Privacy Terms Cookies License Imprint Coffee