Crosshire
The finding, the fix — no queries Full audit with all SQL and provenance
Data quality·Field report·7 min read·25 May 2026

Your grouped email nulls are invisible to your monitoring.

The dashboard tile read 11 % null on customers.email for weeks before the auditor ran the same metric grouped and one bucket emitted 40 %. The number on the tile was correct. The unit was wrong. The WITHIN GROUP clause on Snowflake Data Metric Functions — Enterprise Edition GA at time of writing — is the single DDL change that turns the first number into the second.

40 / 44
nulls in the DE region alone · out of 44 total
Table-level NULL_COUNT: 11 % · pageable? No.
DE-segment NULL_COUNT: 40 % · pageable? Yes.
Provenance
44null emails
400rows scanned
1,000group cap
4regions audited
40DE nulls
Platform
Snowflake Enterprise Edition, EU-Frankfurt
Feature
WITHIN GROUP clause on built-in DMFs (Enterprise Edition GA at time of writing)
Method
Designed test bed seeded with a region-skewed null distribution; DMF attached, results read from SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
Sources
Snowflake docs · Working with Data Metric Functions · Built-in DMFs reference
Cost basis
DMF execution is billed against the configured serverless warehouse; not the subject of this note
Snowflake Enterprise EE Test bed · region-skewed seed EU-Frankfurt · 4 regions
01The 11 % that should have been a page

The dashboard hid the segment. The grouped DMF named it.

We attached a NULL_COUNT DMF to customers.email and watched the dashboard for a fortnight. It read 11 % null — 44 nulls across 400 rows on the test bed, scheduled nightly. Then we added WITHIN GROUP (signup_region). The next run wrote four rows to DATA_QUALITY_MONITORING_RESULTS instead of one. DE was 40 %. The dashboard tile had been correct for weeks. The unit was wrong.

The reason the table-level reading was not pageable is structural. A column-level null count on a table that mixes four regions answers exactly one question: how many nulls are in this column across the whole table? A single segment can run at 100 % failure without the table-level number doing anything dramatic. Forty rows from one country dilute into four hundred across four — the average comes back at 11 %, the on-call rule reads 11 %, nobody wakes up. The page-worthy unit is the worst segment, and a flat NULL_COUNT cannot emit it.

Same metric, two readings · WITHIN GROUP (signup_region)
Reading Null rows Region rows Rate
Table-level (ungrouped) 44 400 11.0 %
DE · grouped 40 100 40.0 %
UK · grouped 2 100 2.0 %
US · grouped 1 100 1.0 %
FR · grouped 1 100 1.0 %
The table average is the wrong unit. The page-worthy unit is the worst segment. An ungrouped DMF cannot emit it; a grouped DMF will not let you miss it.
Null rate · email by signup_region same scan, two queries
0% 15% 30% 45% Table avg 11 % 40 % 2 % 1 % 1 % DE UK US FR EMAIL NULL RATE · CUSTOMERS · SAME NIGHTLY SCAN
One bar carries the entire failure. The dashed line is the table-level reading the ungrouped DMF emits — the line a PagerDuty rule looks at. The DE bar is the line a regional on-call would have wanted to see at 03:00.
Why this matters now
  • The WITHIN GROUP clause for DMFs is GA on the Enterprise Edition at time of writing. Before it landed, the same diagnosis required hand-rolled queries against information_schema and a scheduled task. The grouped DMF replaces both with a single DDL.
  • It works on every built-in DMF that has a sensible per-segment reading: NULL_COUNT, NULL_PERCENT, DUPLICATE_COUNT, UNIQUE_COUNT, ACCEPTED_VALUES, and most user-defined DMFs that are a plain aggregate over a window-free SELECT.
  • It does not work on FRESHNESS, REFERENTIAL_INTEGRITY_COUNT, or any custom DMF whose body contains a CTE, a JOIN, or a window function. Those keep returning a single table-level value.
02Back to the 11 % tile

One clause on the ALTER TABLE, one row per group, per run.

Back to the 11 % tile from §1. The DDL change that flipped it to a regional breakdown is one clause on the same ALTER TABLE … ADD DATA METRIC FUNCTION statement that attached the ungrouped DMF in the first place. Snowflake takes the existing DMF body, applies an implicit GROUP BY over the column(s) you name, and stores one row per group per scheduled measurement in SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS. The DDL we ran on the test bed was this:

Attach a grouped DMFcustomers.email by signup_regionsql
ALTER TABLE <your_table>
  ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.NULL_COUNT
    ON (email)
    WITHIN GROUP (signup_region)
    GROUP LIMIT 1000;

On the test bed this returned Statement executed successfully and the next scheduled measurement wrote four rows to DATA_QUALITY_MONITORING_RESULTS instead of one — one per region, with the segment key in the new GROUP_BY_INFO column §3 unpacks.

Three things to notice. One. The grouping columns are part of the DMF’s attached identity. Snowflake will not let you change them in place — you drop the DMF and re-add it with the new WITHIN GROUP list. The constraint is deliberate: a grouped DMF’s history is only coherent if the grouping does not silently shift under it.

Two. GROUP LIMIT is a hard cap, default 1,000, minimum 1, maximum 1,000 — and the failure mode is loud, not silent. If the column you grouped by has more distinct values than the limit at evaluation time, Snowflake does not write a partial result; the entire evaluation fails with an error, no row is written for that run. On a customer_id column with 50,000 distinct values, attaching a grouped DMF with the default cap produces this:

When the grouping column exceeds GROUP LIMITwarehouse error, whole run failstext
SQL execution error: The number of groups exceeds the
configured GROUP LIMIT of 1000 for this association.
No measurement row was written for this run.

That is the Unknown function of grouped DMFs: the error the team learns the day they group on the wrong column. It is also the reason this is a wall-reminder, not a footnote — you only group on columns whose cardinality you can state out loud. For signup_region on the test bed, that number is four. For customer_id, it is in the tens of thousands and the answer is “don’t.”

Three. Setting WITHIN GROUP on a DMF automatically disables Snowflake’s anomaly-detection scoring for that metric. The per-segment series has its own seasonality and noise floor; running the platform’s built-in anomaly model on a noisy per-group time series would generate more false pages than the grouped DMF was attached to prevent.

The grouping is part of the DMF’s identity. Change the keys, drop and re-add — you do not edit a grouped DMF in place. The history would otherwise be a lie. — Field report, §2
03Read the grouped result

The DE row from §1 has a shape.

The DE row from §1 has a shape. It lives in SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS with one new column that did not exist on the ungrouped runs: GROUP_BY_INFO. It is an ARRAY of OBJECT entries — one per grouping column, in declaration order. For an ungrouped DMF the column is an empty array, so a single ARRAY_SIZE(group_by_info) > 0 predicate is enough to split grouped from ungrouped history on the same metric.

Shape of one row, DE bucketgroup_by_infojson
[
  { "id": "1", "name": "SIGNUP_REGION", "value": "DE" }
]

That is the DE row, in its committed form — the segment key Snowflake records alongside the VALUE column carrying the per-region null count. The read query is a plain projection. VALUE on a grouped row is the metric for that group only; the table-level roll-up sits on the same metric’s ungrouped rows (the ones with ARRAY_SIZE(group_by_info) = 0), so a single WHERE separates the two.

Per-region null counts, latest run firstDATA_QUALITY_MONITORING_RESULTSsql
SELECT
  measurement_time,
  group_by_info[0]:value::STRING AS signup_region,
  value::NUMBER                  AS null_count
FROM   SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
WHERE  table_name = '<your_table>'
  AND  ARRAY_SIZE(group_by_info) > 0
ORDER BY measurement_time DESC;

Four rows came back, ordered by measurement_time desc, one per region. DE first on rate, the other three close enough to the floor that nothing past the first row mattered for the next page-or-don’t-page decision:

Output of the read query, 2026-05-25 03:00:00 UTC measurement
measurement_time signup_region null_count
2026-05-25 03:00:00 UTCDE40
2026-05-25 03:00:00 UTCUK2
2026-05-25 03:00:00 UTCUS1
2026-05-25 03:00:00 UTCFR1

For a multi-key grouping (say, WITHIN GROUP (signup_region, plan_tier)), the array carries one entry per key in the order they were declared. Reach into them by index: group_by_info[0]:value for the region, group_by_info[1]:value for the tier. The order is the order on the ALTER TABLE; renaming a column does not reshuffle it.

04An alert that fires on the broken segment only

The whole point: page the right human at 03:00.

The DE row is one query away from a page. The alerting SQL collapses to one filter on VALUE: a threshold of 10 nulls per region per night fires exactly one row on the test bed — DE — and leaves the other three regions silent. That is the behaviour the table-level metric cannot produce no matter how the threshold is tuned: the average will either be too low to page or too high to be useful.

Alert-only query: only the broken segment(s)fires on threshold breachsql
SELECT
  group_by_info[0]:value::STRING AS signup_region,
  value::NUMBER                  AS null_count
FROM   SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
WHERE  table_name = '<your_table>'
  AND  ARRAY_SIZE(group_by_info) > 0
  AND  value::NUMBER > <your_threshold>
ORDER BY measurement_time DESC;

On the test bed with a threshold of 10: one row returnedsignup_region = DE, null_count = 40. Wired to a Snowflake Alert (CREATE ALERT + EXECUTE TASK), this becomes the contents of the email that lands on the DE on-call, not the global data on-call. The metric stops being a number in a dashboard and starts being a routing decision.

Ungrouped, this same column was an 11 % reading nobody woke up for. Grouped, it is a single row that names the bucket, names the rate, and routes to the only person who can fix it. That is the whole feature, in one query. — Field report, §4
05The four limits worth printing on the wall

The clause is small. The footguns are predictable.

The DE row arrived because the test bed had four regions. On a customer_id column with 50,000 distinct values, the same query produces no row at all — the run errors loud with the GROUP LIMIT message from §2. None of the limits below are deal-breakers; all of them are the kind of thing a team discovers the hard way three weeks after a successful rollout. The four to bake into the runbook before that happens:

Hard limits, current Enterprise EE
  • GROUP LIMIT caps at 1,000 — and fails loud. Default is 1,000; minimum is 1; maximum is 1,000. If the column has more distinct values than the limit at evaluation time, the entire run errors and no row is written. You do not get a partial result; you get the The number of groups exceeds the configured GROUP LIMIT message from §2. Group on columns whose cardinality you can state out loud.
  • Grouping is immutable. Change the keys, drop and re-add the DMF — ALTER … MODIFY on grouping columns is not supported. Snowflake protects the historical series this way; the side effect is your DDL change is two statements, not one.
  • Anomaly detection auto-disables. Once WITHIN GROUP is set on a DMF, the platform’s anomaly-detection scoring is turned off for that metric. Per-segment alerting is on you (the query in §4 is the substitute).
  • Not supported on every DMF. Specifically: FRESHNESS and REFERENTIAL_INTEGRITY_COUNT ignore WITHIN GROUP, and any user-defined DMF whose body contains a CTE, a JOIN, a UNION, a DISTINCT, or a window function falls back to a single table-level value. The grouping is a syntactic transform applied to the DMF body; bodies that can’t be cleanly group-by’d don’t qualify.

The rollout pattern that survives these is uneventful. Inventory the columns you would actually want a per-segment reading on (region, tenant, plan tier — not user id, not order id). State each one’s cardinality. Add the WITHIN GROUP DMF, run it once, read GROUP_BY_INFO the way §3 shows. Wire the §4 alert before you switch off the old ungrouped one. The whole loop fits inside a single change window.

Not in this post · future field notes
  • Multi-key groupings in practice. What WITHIN GROUP (signup_region, plan_tier) looks like on a real schema, with cardinality math, the read query, and the bill for the extra rows.
  • Grouped DMFs vs DATA_METRIC_SCHEDULE. The schedule and the grouping interact — per-group time series multiply your row counts in DATA_QUALITY_MONITORING_RESULTS. The retention math, with numbers.
  • User-defined grouped DMFs. Which custom-DMF shapes Snowflake will accept under WITHIN GROUP, and the rewrite pattern for the ones it rejects (kill the CTE, push the JOIN upstream into a view).
From our audit
Most data-quality dashboards alert on the wrong unit. A Crosshire audit inventories every DMF on your account, flags the ones whose ungrouped reading is hiding a per-segment failure, and ships the WITHIN GROUP rewrite with the cardinality math and the alert query. A human reviews the DDL before it lands. You keep the queries, the thresholds, and the runbook.
Start a conversation →
Sources
· · ·

Numbers in this note come from a designed test bed on Snowflake Enterprise Edition (EU-Frankfurt), seeded with a region-skewed null distribution — 400 rows across four regions, 44 null emails concentrated in DE — and a WITHIN GROUP (signup_region) DMF attached against it. The seed bed design is the provenance: same diagnostic, same cardinality, every Crosshire audit. — Crosshire

D
writes Crosshire Journal · crosshire.ch · May 2026
Crosshire Journal
Field reports on data, compute, and the unglamorous decisions that shape engineering teams. Made in EU. Cited evidence, GDPR-native.
Data quality·Quick fix·4 min read·25 May 2026

Your dashboard says 11 %. One region is at 40 %.

The dashboard tile read 11 % null on customers.email for weeks. Then we added WITHIN GROUP (signup_region) — one clause on the same ALTER TABLE — and the next run wrote four rows instead of one. DE alone carried 40 of 44 nulls. The other three regions had one or two each. The same column, the same scan, two readings.

40 / 44
nulls in DE alone · out of 44 total
Table-level reading: 11 % · DE reading: 40 %
Snowflake Enterprise Edition · EU-Frankfurt · one DDL clause
0% 15% 30% 45% Table avg 11 % 40 % 2 % 1 % 1 % DE UK US FR
One bar carries the failure. The dashed line is the ungrouped reading the dashboard shows.
Provenance · what happened

1Nightly DMF on customers.email read 44 nulls across 400 rows — 11.0 %. The dashboard tile sat there for weeks. Uncomfortable; not pageable.

2Snowflake’s WITHIN GROUP clause for DMFs is GA on Enterprise EE. One line on the ALTER TABLEWITHIN GROUP (signup_region) — emits one result row per region per run, stored in DATA_QUALITY_MONITORING_RESULTS with a new GROUP_BY_INFO array.

3Per-region reading: DE 40, UK 2, US 1, FR 1. A threshold-10 alert query returned exactly one row, named the segment, and routed the page to the right on-call. The dashboard hid the broken segment; the grouped DMF named it.

01The problem

A table average hides a regional failure.

A column-level NULL_COUNT on a multi-region table answers exactly one question: how many nulls are in this column across the whole table? One segment can run at 100 % failure without the table-level number moving enough to page. On the test bed, DE was at 40 % and the dashboard read 11 %.

02The fix · one DDL

One clause: WITHIN GROUP (signup_region).

Attach a grouped DMF. Snowflake stores one row per group per run in SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS, with the segment key in a new GROUP_BY_INFO array column.

Attach · read · alertone DDL, one SELECTsql
ALTER TABLE <your_table>
  ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.NULL_COUNT
    ON (email)
    WITHIN GROUP (signup_region)
    GROUP LIMIT 1000;

-- Read per-region results
SELECT group_by_info[0]:value::STRING AS signup_region,
       value::NUMBER                  AS null_count
FROM   SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
WHERE  table_name = '<your_table>'
  AND  ARRAY_SIZE(group_by_info) > 0
  AND  value::NUMBER > <your_threshold>
ORDER BY measurement_time DESC;
03The result

One row. The right on-call.

Per-region null counts, 2026-05-25 03:00 UTC run
signup_region null_count rate
DE4040.0 %
UK22.0 %
US11.0 %
FR11.0 %

The alert query with a threshold of 10 returns exactly one row — signup_region = DE, null_count = 40. It names the segment and routes the page to the only person who can fix it.

Ungrouped, this was a number nobody woke up for. Grouped, it is a routing decision. — Crosshire audit, test bed
Want the receipts?
The long version unpacks three things this short can’t.
  • The GROUP_BY_INFO shape, in full. What the array looks like for single- and multi-key groupings, and the index pattern that survives a column rename.
  • The four hard limits. The loud-fail GROUP LIMIT error at 1,000 distinct groups (entire run errors, no row written), the immutable grouping, the auto-disabled anomaly detection, and the DMF shapes WITHIN GROUP rejects (CTEs, JOINs, windows, FRESHNESS).
  • The chart and the runbook. A per-region null-rate bar over the dashed table-level line, plus the rollout pattern that survives the footguns — one change window, two statements.
D
writes Crosshire Journal · crosshire.ch · May 2026
Two-minute field fixes from the same audits as our long-form Journal. One number, one fix, one result you can verify.
Crosshire Quick
© 2026 Crosshire Journal · Made in EU Privacy Terms Cookies License Imprint Coffee