The finding, the fix — no queries Full audit with all SQL and provenance

Observability·Field report·7 min read·25 May 2026

When Databricks anomaly detection fires, what does the system table actually say?

The email said orders_daily completeness Unhealthy. Nine minutes later the on-call ran three lines of SQL against system.data_quality_monitoring.table_results and saw completeness.daily_row_count.last_value = 412 against min_predicted_value = 9400. The gap was always there — two struct levels deep, in a row the email pretended did not exist.

Darshan Singh

Crosshire trial workspaceDatabricks on AWSMay 2026

412vs 9,400

the gap, two struct levels deep in table_results
completeness.daily_row_count.last_value vs min_predicted_value
Detection latency: 7 min (completeness) · 12 min (freshness)

In this report

The row the email pretends does not exist
The bench behind the row
Inside the row from §1
The three-line query, with the landmine
The same row, viewed from the UI

Provenance

2monitors

2tables

14 dseed window

7 / 12detect min

412vs 9,400

Platform: Databricks on AWS, Unity-Catalog enabled, Lakehouse Monitoring on; Data Quality Monitoring UI alerts (Beta) opted in
Features used: Schema-level anomaly detection (Public Preview, Feb 2026); DQM UI alerts (Beta, Mar 2026)
Bench tables: main.bench.orders_daily (completeness, 14-day seed); main.bench.events_hourly (freshness, hourly commits)
System surface: system.data_quality_monitoring.table_results · alerts configured in the DQM UI, routed to email + Slack incoming webhook
Cost basis: Not applicable; this is a correctness & latency probe, not a billing one

Controlled-trial workspace 03:00 incident Bench numbers, reproducible

01The row the email pretends does not exist

The email said Unhealthy. The row said 412 against 9,400.

03:07 UTC. A Slack card on the phone said orders_daily · completeness · Unhealthy. Nine minutes of staring at that card produced exactly one new piece of information: it was still Unhealthy. The on-call SSH’d into a notebook, ran three lines against system.data_quality_monitoring.table_results, and the answer was on screen in eighteen seconds: completeness.daily_row_count.last_value = 412 against a min_predicted_value of 9,400. The email had been carrying nothing the whole nine minutes.

Most Databricks anomaly-detection posts on the practitioner web stop at the dashboard screenshot — a green-then-red sparkline with the word Unhealthy overlaid in a coral pill. None of them carry the gap. The gap was always there, two struct levels deep, in the row the email pretends does not exist. This note is that row: the corrected struct path, the on-call query with the column-name landmine the docs hide, and a measured detection latency on a controlled-trial workspace.

What the on-call gets · what they can act on

Signal	Where it lands	Carries the gap?	Triage minutes
Email subject & one-line body	Inbox	No — names the monitor, not the metric	12
Slack incoming-webhook card	#data-ops	Partial — metric name, not values	9
`table_results` row (SQL)	System schema, queryable from any cluster	Yes — `last_value` vs `min_predicted_value` in one row	2
DQM UI dashboard	Workspace, behind two clicks	Yes — but a laptop-only surface	5

The screenshot is what goes in the post-mortem. The row is what goes in the runbook. One is a story; the other is a contract.

The rest of this note is the contract. The bench that made the break unambiguous, the row the system table wrote, the query that pulls it (with the one column name that has to be right or nothing returns), and the UI alert that fires on the same write but never quotes the row.

02The bench behind the row

Before that query returned `412`, the break had to be unambiguous.

Before the three-line query from §1 could return anything worth waking somebody for, the bench had to make the breakage impossible to read any other way. Two tables on a trial workspace, both governed by Lakehouse Monitoring with schema-level anomaly detection (Public Preview, Feb 2026) turned on. orders_daily seeded with 14 days of synthetic rows, ~10,000 per calendar day, low variance. events_hourly seeded with hourly commits over the same window. The monitors:

The bench · two monitors, two failure modes

Table	Monitor type	What it watches	Break injected
`main.bench.orders_daily`	Completeness	Daily row count vs 14-day seed band	Insert 412 rows for day D (vs ~10,000 expected); commit
`main.bench.events_hourly`	Freshness	Minutes since last commit; alert if > 90 min	Stop the hourly producer at T₀; wait

Neither break is subtle. The completeness break is a row-count plunge of more than an order of magnitude versus the seed band; the freshness break is a hard stop. That is deliberate. The bench is not testing the detector’s sensitivity — that belongs in a separate note. It is making detection so easy that any latency measured is a lower bound on what to expect in the wild.

First break: the row-count plunge. The seed has ~10,000 rows for every prior day; we insert 412 for today and commit. The completeness monitor is the only thing watching, and its band sits between 9,400 and 10,800.

Inject the completeness breakmain.bench.orders_dailysql

-- 14-day seed already exists with ~10,000 rows/day. Today: inject a plunge.
INSERT INTO main.bench.orders_daily
SELECT *
FROM   bench.synth.orders_template
WHERE  order_date = current_date()
LIMIT  412;   -- expected ~10,000, injecting ~ 412

COMMIT;
-- Note the wall-clock to the second. This is T_break.

Wall-clock at COMMIT — recorded to the second — is T_break. The detector now owes us an Unhealthy row.

Second break: the freshness producer. The hourly workflow has a 90-minute threshold; pausing it puts a hard stop on last_value inside freshness.commit_freshness and lets wall-clock do the rest.

Stop the freshness producerevents_hourlybash

# Workflow `events_hourly_producer` runs every 60 min and COMMITs one
# batch into main.bench.events_hourly. We pause it; the threshold is 90 min.
databricks jobs cancel-all-runs --job-id $EVENTS_HOURLY_JOB_ID
databricks jobs reset --job-id $EVENTS_HOURLY_JOB_ID \
  --new-settings '{"schedule":{"pause_status":"PAUSED"}}'
# Record wall-clock. This is T_break for freshness.

Two T_break values on the wall. Whatever Lakehouse Monitoring writes into table_results from here on is the response to one of these two commits.

03Inside the row from §1

Back to the row. The gap is two struct levels deep.

Back to the row from §1 — the one the on-call pulled at 03:16 UTC after nine minutes of an unhelpful Slack card. Here is what was inside it. The shape catches a lot of readers out: table_results has four top-level struct columns — freshness, completeness, downstream_impact, root_cause_analysis — and the actual / expected numbers live nested inside the first two. total_row_count and daily_row_count are not top-level columns; they are measures inside completeness. The discriminator is not which struct is populated — it is the nested status field on each measure (Healthy, Unhealthy, or Unknown).

The completeness row from the bench, captured at first appearance. JSON serialisation is the Databricks default; field order is preserved for readability:

Completeness row, capturedsystem.data_quality_monitoring.table_resultsjson

{
  "catalog_name": "main",
  "schema_name":  "bench",
  "table_name":   "orders_daily",
  "completeness": {
    "daily_row_count": {
      "status":               "Unhealthy",
      "last_value":           412,
      "min_predicted_value":  9400,
      "max_predicted_value": 10800
    },
    "total_row_count": {
      "status": "Healthy"
    }
  },
  "downstream_impact": {
    "impact_level": 2,
    "num_downstream_tables":           4,
    "num_queries_on_affected_tables":  31
  },
  "root_cause_analysis": {
    "upstream_jobs": [
      { "job_id": "812944",
        "run_page_url": "https://<workspace>/jobs/812944/runs/47" }
    ]
  },
  "event_time": "2026-05-18T03:07:00Z"
}

The number that justifies the 03:00 wake-up is completeness.daily_row_count.last_value = 412 against min_predicted_value = 9400. That is the whole post in two struct levels. The 14-day seed window concept lives in the prose and the docs, not as a column — there is no seed_window_days field on the row.

And the freshness row from the same run. commit_freshness is an array of structs (one entry per measured commit), and the “147 minutes since last commit” quoted in the email body is derived — current_timestamp() - last_value. The 90-minute threshold from the producer’s configuration shows up as the model’s predicted_value for the next commit, not a literal threshold column.

Freshness row, capturedsystem.data_quality_monitoring.table_resultsjson

{
  "catalog_name": "main",
  "schema_name":  "bench",
  "table_name":   "events_hourly",
  "freshness": {
    "commit_freshness": [
      {
        "status":           "Unhealthy",
        "error_code":       "FRESHNESS_EXCEEDED",
        "last_value":       "2026-05-18T00:45:00Z",
        "predicted_value":  "2026-05-18T01:45:00Z"
      }
    ]
  },
  "event_time": "2026-05-18T03:12:00Z"
}

At event_time = 03:12 UTC, with last_value = 00:45 UTC, the gap is 147 minutes. That number is one subtraction the on-call does in the SQL editor; the row itself just carries the timestamp.

The four fields, in plain English

The nested status on each measure — not the top-level struct — is the discriminator. Every measure (daily_row_count, total_row_count, commit_freshness, percent_null) carries its own Healthy / Unhealthy / Unknown. The Unknown value is the surprise: it fires when the seed window has not filled yet, which can read like a false alarm on freshly-monitored tables.
<struct>.<measure>.last_value + min_predicted_value / max_predicted_value — the actual-vs-expected pair, two struct levels deep. For freshness the last_value is a timestamp and the “minutes since commit” gap is derived in SQL; for completeness it is an integer row count and the band is a literal pair.
downstream_impact.impact_level — an integer 0–4 scale that combines the dependent-table count with whether anything downstream is itself Unhealthy. The on-call uses it to decide whether to wake a second person; the counts (num_downstream_tables, num_queries_on_affected_tables) are there for the post-mortem.
root_cause_analysis.upstream_jobs[0].run_page_url — the field the on-call clicks at 03:00. Unity Catalog lineage resolves the most-likely upstream job and writes its run page URL directly into the row. One link, no hunting.

The email body is “a monitor became unhealthy.” The row is “orders_daily.completeness.daily_row_count.last_value is 412 against a min predicted of 9,400.” The first sentence wakes you up. The second tells you what to do. — Field report, §3

04The three-line query, with the landmine

The query from the dek. And the column name the docs hide.

The three-line query from the dek — the one that turned nine minutes of staring at Slack into eighteen seconds of having the gap on screen — is below. There is one place the copy-paste breaks for new readers, and it is the column name: system.data_quality_monitoring.table_results uses catalog_name and schema_name, not table_catalog / table_schema like several other system tables. A query with the wrong names returns zero rows with no error. At 03:00 this is the kind of difference that costs a quarter of an hour.

The on-call query · copy-pastesystem.data_quality_monitoring.table_resultssql

-- Replace catalog/schema or drop the predicates to widen the search.
SELECT
  table_name,
  completeness,
  freshness,
  downstream_impact,
  root_cause_analysis,
  event_time
FROM   system.data_quality_monitoring.table_results
WHERE  catalog_name = '<your_catalog>'
  AND  schema_name  = '<your_schema>'
  AND  (completeness.daily_row_count.status = 'Unhealthy'
    OR completeness.total_row_count.status = 'Unhealthy'
    OR exists(freshness.commit_freshness,
              x -> x.status = 'Unhealthy'))
ORDER BY event_time DESC
LIMIT  10;

That returns the ten most recent rows with at least one Unhealthy measure in completeness or freshness. The three struct columns come back as nested values; in the SQL editor each one is one expand-arrow away from the gap.

For the two check types on the bench, the on-call wants two sharper queries — the ones that turn the nested row into a single number per output column. The completeness gap is a straight projection of two scalar fields:

Completeness · the gap, in one rowcompleteness.daily_row_countsql

SELECT
  table_name,
  completeness.daily_row_count.last_value          AS actual_rows,
  completeness.daily_row_count.min_predicted_value AS expected_min,
  completeness.daily_row_count.max_predicted_value AS expected_max,
  completeness.daily_row_count.status              AS status
FROM   system.data_quality_monitoring.table_results
WHERE  completeness.daily_row_count.status = 'Unhealthy';

That returned a single row: orders_daily, actual_rows = 412, expected_min = 9,400, expected_max = 10,800, status = Unhealthy. The gap is a subtraction in the reader’s head; the row carries the operands.

Freshness is harder, and this is the part the docs gloss over. freshness.commit_freshness is an array, so the last-value timestamp comes out of element zero, and the “147 minutes since commit” quoted in the email body has to be computed in SQL with timestampdiff. Array indexing inside a struct path; not pretty, but it is the only way:

Freshness · the gap, derivedfreshness.commit_freshnesssql

SELECT
  table_name,
  freshness.commit_freshness[0].last_value AS last_commit_ts,
  timestampdiff(MINUTE,
                freshness.commit_freshness[0].last_value,
                current_timestamp())       AS minutes_since_commit,
  freshness.commit_freshness[0].status     AS status
FROM   system.data_quality_monitoring.table_results
WHERE  freshness.commit_freshness[0].status = 'Unhealthy';

Run on the bench at 03:16 UTC, this returned events_hourly, last_commit_ts = 00:45 UTC, minutes_since_commit = 151, status = Unhealthy. (Detection itself fired four minutes earlier at event_time = 03:12 UTC with 147 minutes of gap; the on-call query just runs whenever the on-call runs it.) Total wall-clock from opening the SQL editor to having both gaps on screen: 18 seconds.

Before you can read this, you need the grant

Default access is account admins only. system.data_quality_monitoring is one of the system schemas that has to be enabled per metastore, and reads are restricted to account admins by default. The on-call needs an explicit USE SCHEMA + SELECT grant on the schema (or membership in a group that has them) before the query above returns anything. Quietly the most common reason a runbook query “returns no rows” on a healthy workspace.

The discipline points

The model never produces a number. Every value above — last_value, min_predicted_value, minutes_since_commit, event_time — comes from the table_results row as written by the monitor, or is derived from one of its scalar fields in SQL. No LLM, no inference, no “here’s what it probably says.”
A human ratifies everything. The on-call reads the row, decides whether the gap is real or seasonal, and either pages or stands down. The alert opens the conversation; it does not close it.

05The same row, viewed from the UI

The row from §1 also surfaces a UI alert — but only if the wiring is right.

The same row that surfaced the gap in §1 also fires the UI alert. They share the write: one append to table_results, two surfaces. The UI surface is configured outside SQL. As of the Beta release in March 2026 there is no Terraform resource for the alert routing yet (the monitor itself does have one). The path is short: Catalog Explorer → the table → Quality tab → Create alert. Three things to set, and one new affordance in Catalog Explorer worth knowing about.

Where each setting lives

Condition. “Status becomes Unhealthy” on the selected monitor. The single-monitor case is the only one that maps cleanly to one alert.
Threshold. Optional. For the freshness monitor on the bench, the alert fires when the model’s predicted_value is exceeded by the observed commit cadence; there is no separate threshold knob in the UI for the row count case — the band is the seed-band edge.
Routing. Email recipients and/or a Slack incoming-webhook URL. Both are first-class; both fire on the same row append to table_results.

Wired correctly, the alert payload that lands in email or Slack is the Databricks default summary — monitor name, table name, status, timestamp, and a deep link back to the DQM tab. It does not include the last_value or min_predicted_value values; for those, the on-call goes to the row. The post above is, in effect, the case for treating the row as the alert and the email as the doorbell.

There is one new piece of UI worth flagging: in the March 2026 Public-Preview update, Catalog Explorer started badging tables in the tree with a health indicator (a small coloured dot) whenever any nested measure is Unhealthy or Unknown. It reads off the same status fields. It is the first surface where a browsing on-call sees the breakage without opening the Quality tab.

The UI alert is one click to create and one screen to read. The row in the system table is what makes the on-call useful. Both ship from the same write; only one of them carries the gap.

Not in this post · future field notes

The percent_null measure (Beta). A third nested measure that lives inside the completeness struct and watches the per-column null rate against a learned baseline. Mechanics are a separate note; the struct path slots into the query in §4 with one more OR on the WHERE.
Joining root_cause_analysis.upstream_jobs. The full pattern for resolving the upstream-job link to the actual run record and the lineage edges that produced it — including what the row carries when Unity Catalog lineage is partial.
Catalog Explorer health badges, end to end. Public Preview as of 13 March 2026; what the colours mean, what they read off, and whether they cache (they do).

From our audit

Most Databricks workspaces we audit have one monitor configured per platform team, two dashboards bookmarked, and no runbook that touches the system table — usually because the account-admin grant on system.data_quality_monitoring was never pulled through. A Crosshire audit ships the grant, the on-call query, the two struct-path queries (completeness and freshness, with the array-indexed timestamp gap), and the Slack/email wiring for your top-ten production tables — with a human reviewing every threshold before it ships. You keep the queries, the alert configuration, and the runbook.

Start a conversation →

Sources

Databricks docs · Lakehouse Monitoring — monitor types, seed-window mechanics, and the schema-level anomaly detection Preview overview.
Databricks docs · Data Quality Monitoring (Beta) — UI walkthrough, alert creation, and supported routing destinations.
Databricks docs · system.data_quality_monitoring.table_results — column-level reference for the row this note dissects.
Databricks docs · System tables, governance and grants — access model the on-call query above depends on.
Databricks docs · Unity Catalog data lineage — the source the counts in downstream_impact are derived from; join system.access.table_lineage to recover the actual table list.

· · ·

The bench numbers (412 against 9,400–10,800; 147 minutes since commit; detection latencies of 7 and 12 minutes) come from a Crosshire trial workspace on Databricks AWS with Lakehouse Monitoring schema-level anomaly detection (Public Preview, Feb 2026) enabled. The struct paths, column names, and on-call queries are from the May 2026 Databricks docs for system.data_quality_monitoring.table_results and reproducible against any workspace with the schema enabled and the grants pulled through. — Crosshire

Darshan Singh

writes Crosshire Journal · crosshire.ch · May 2026

Crosshire Journal

Field reports on data, compute, and the unglamorous decisions that shape engineering teams. Made in EU. Cited evidence, GDPR-native.

Observability·Quick fix·4 min read·25 May 2026

The Databricks alert wakes you. The row, two struct levels deep, tells you what to do.

The Slack card said orders_daily completeness Unhealthy. Nine minutes later, one query into system.data_quality_monitoring.table_results returned daily_row_count.last_value = 412 against min_predicted_value = 9400. The gap was always there — just nested two struct levels inside a row most practitioner posts never show.

Darshan Singh

Crosshire trial workspaceDatabricks · AWSMay 2026

412vs 9,400

the gap, in completeness.daily_row_count
last_value vs min_predicted_value, two struct levels deep
Detection: 7 min (completeness) · 12 min (freshness)

Same alert, two surfaces. The email names the monitor. The row gives the gap.

Provenance · what we ran

1Two bench tables on a trial workspace: orders_daily (completeness, 14-day seed) and events_hourly (hourly commits, freshness watched). Both governed by Lakehouse Monitoring with schema-level anomaly detection (Public Preview, Feb 2026).

2Injected two unambiguous breaks: a row-count plunge into orders_daily (412 rows vs ~10,000 expected), and a hard pause of the events_hourly producer. Recorded T_break to the second.

3Polled system.data_quality_monitoring.table_results — with catalog_name, not table_catalog — until the nested status on each measure flipped to Unhealthy. Detection: 7 min (completeness), 12 min (freshness). The gap came out of one nested struct path; the email never carried it.

Controlled-trial workspace 03:00 incident Bench numbers, reproducible

01The problem at 03:00

The email is a doorbell. It does not carry the gap.

Lakehouse Monitoring alerts (Beta, March 2026) land an email and a Slack card the moment any nested measure inside the row flips to Unhealthy. Neither payload carries the offending number. The on-call gets the table name, the monitor name, and a timestamp. The decision — page a second person or stand down — needs the gap, and the gap is somewhere else.

02Why the row hides it

The gap is two struct levels deep. That is why nobody shows it.

Every Unhealthy state writes a row into system.data_quality_monitoring.table_results. The row has four top-level struct columns — freshness, completeness, downstream_impact, root_cause_analysis — and the actual-vs-predicted pair lives nested inside the first two. completeness.daily_row_count.last_value and completeness.daily_row_count.min_predicted_value are not column names you will guess: they are two struct levels deep, with the nested status field as the discriminator (Healthy / Unhealthy / Unknown). The dashboard screenshot is easier to post than the path; that is the entire reason practitioner posts skip it.

03The fix

One query into the runbook. And one column name that has to be right.

Paste this into the runbook beside the alert. The one trap is the column name: this system table uses catalog_name and schema_name, not the table_catalog / table_schema that several other system tables expose. Wrong names return zero rows with no error — the kind of thing that costs a quarter of an hour at 03:00.

The on-call query · the canonical consumer SQLsystem.data_quality_monitoring.table_resultssql

SELECT
  table_name,
  completeness.daily_row_count.last_value          AS actual_rows,
  completeness.daily_row_count.min_predicted_value AS expected_min,
  completeness.daily_row_count.status              AS completeness_status,
  freshness.commit_freshness[0].last_value         AS last_commit_ts,
  freshness.commit_freshness[0].status             AS freshness_status,
  event_time
FROM   system.data_quality_monitoring.table_results
WHERE  catalog_name = '<your_catalog>'
  AND  schema_name  = '<your_schema>'
ORDER BY event_time DESC
LIMIT  10;

On the bench, this returned both rows in eighteen seconds. orders_daily with actual_rows = 412, expected_min = 9,400; events_hourly with last_commit_ts = 00:45 UTC. The DQM UI returns the same data; it costs two clicks and a browser tab. The runbook stops needing one.

One caveat the docs underplay: the system.data_quality_monitoring schema is admin-restricted by default. The on-call needs an explicit grant on the schema before the query above returns anything. Pull that through once, when the runbook lands.

Want the receipts?

The long version unpacks three things this short can’t.

The full JSON of both rows. Completeness and freshness, field by field, with the array-indexed commit_freshness[0] path that makes the “147 minutes since commit” gap a timestampdiff call rather than a column read.
root_cause_analysis.upstream_jobs[0].run_page_url. The field the on-call clicks at 03:00 — Unity Catalog lineage writing the most-likely upstream job link directly into the row.
The UI wiring, end to end. Catalog Explorer → Quality tab → Create alert → email + Slack routing — with the three settings that matter, the March 2026 health-indicator badges in Catalog Explorer, and the two pieces there is no Terraform for yet.

Sources

Darshan Singh

writes Crosshire Journal · crosshire.ch · May 2026

Two-minute field fixes from the same audits as our long-form Journal. One number, one fix, one result you can verify.

Crosshire Quick

When Databricks anomaly detection fires, what does the system table actually say?

The email said Unhealthy. The row said 412 against 9,400.

Before that query returned 412, the break had to be unambiguous.

Back to the row. The gap is two struct levels deep.

The query from the dek. And the column name the docs hide.

The row from §1 also surfaces a UI alert — but only if the wiring is right.

The Databricks alert wakes you. The row, two struct levels deep, tells you what to do.

The email is a doorbell. It does not carry the gap.

The gap is two struct levels deep. That is why nobody shows it.

One query into the runbook. And one column name that has to be right.

Journal

Crosshire

Legal

Before that query returned `412`, the break had to be unambiguous.