The scorecard — who found what, who was wrong Full sweep, every contradiction, the adjudication SQL

Tech·Agentic analytics·Part 3 of 5·10 min read·20 Jun 2026

Three analysts, three blind spots: the open-ended sweep.

The prompt: find every insight in this data — FinOps, performance, ops, whatever else. The result: 63.55 GB of disk spill, one query responsible for 26.75 GB, and I had written “clean” against both of those in my manual notes. A naive agent with one sentence of context falsified two of my own assertions. Then skilled-Opus made a units error that naive didn’t. No analyst dominated. The union beat everyone.

Darshan Singh

Crosshire auditSnowflake · ACCOUNT_USAGEJune 2026

63.55 GB

total disk spill the human analyst missed
26.75 GB · single worst query · 47,420 queries total
found by naive-Opus · open-ended sweep · $0.76

Open-ended sweep · three analysts · same 4-day export

Analyst	Notable finds	Confirmed errors
Human	Domain context, ground-truth idle burn	Two false “clean” claims: queuing, disk spill
Skilled-Opus	Attribution, cost-per-pattern, storage duplication, account parameters	Cache column misread as percentage (units bug)
Naive-Opus	Queuing crisis, disk spill, compile ratio, four more governance params	Missed the headline idle-burn number entirely

In this report

The open-ended sweep and what each analyst found
The queuing crisis — real numbers, real spread
The cache units bug — more docs, worse answer
The human’s false clean claims
Contradictions are the product

Series · Skills are the floor, not the ceiling

The $15 lab — the anchor finding and the 2×2 design
Skills vs model — the directed question and three lawssoon
Three analysts, three blind spots — the open-ended sweep
Variance and budget death — traces as refereesoon
The operating model — prompt, tiers, economicssoon

Provenance

63.55GB disk spill

26.75GB, worst query

61.9%cache avg

79.3%ACCOUNTADMIN

$0.76sweep cost

Account: A seeded Crosshire trial Snowflake account, not a customer
Window: 16–20 May 2026, ~4 days of metered data (47,420 queries, 39 warehouses)
Sweep run: Trace 234345 · skilled-Opus · 31 tool calls · 9.6 min · $0.7595
Sources: QUERY_HISTORY · WAREHOUSE_METERING_HISTORY · QUERY_ATTRIBUTION_HISTORY · ACCOUNT_PARAMETERS
Cost basis: $3.00/credit Snowflake list rate, disclosed as an estimate

Seeded trial account Every number from SQL Method-only public release

01The sweep, and what each analyst found

I predicted skilled-Opus would be a superset. It wasn’t close.

Part 2 of this series tested a directed question — “why is COMPUTE_WH expensive?” — where there is a known right answer. This round is the harder test: an open-ended sweep with no single correct scope. Coverage is the game.

My prediction: skilled-Opus — frontier model plus five playbooks plus an anomaly-discovery checklist — would produce a superset of everything worth finding, and naive-Opus would be a noisier subset. I would spend the round grading skilled’s report against my manual notes and call it done.

The skilled sweep (trace 234345, 31 tool calls, 9.6 minutes, $0.7595) surfaced the idle-burn attribution, a single query pattern responsible for 84.9% of attributed compute on COMPUTE_WH (the v_cpu_heavy family), 145.8 GB of active storage held in duplicate table names across 11 names, the security posture (zero MFA on both users, ACCOUNTADMIN as the running role for 79.3% of all queries), and several non-default account parameters with governance implications — cross-region AI processing enabled, session-scoped temp objects disabled.

Naive-Opus finished its sweep with substantial overlap with skilled on those findings. The majority of naive’s findings, however, were things skilled never looked at. They included the two items that falsified my own notes, which I will come to in §4. They also included a queuing crisis and a compile-time anomaly that skilled skipped entirely, because neither appears in my playbooks as a first-pass check.

And naive-Opus missed the headline idle-burn number — the finding every skilled run leads with, because the playbook’s first query computes it. Without the playbook, naive did not look at QUERY_ATTRIBUTION_HISTORY. The most important number in the account was invisible to it.

Neither report dominated. Skilled and naive did not produce better and worse versions of the same analysis. They explored different regions of the same territory, because one was guided by a checklist of known places and the other by whatever the data suggested.

02The queuing crisis

96.1% of SELECT queries queued. Waits from 4.6 to 115 seconds.

The naive sweep flagged a queuing problem on the stress-test warehouses. Adjudication against QUERY_HISTORY confirmed it — and the real numbers are more varied than the draft summary suggested. The adaptive warehouses (QTM2, QTM4) had 96.1% of their SELECT queries queued; QTM10 was at 94.1%. The classic stress warehouses (MC1/MC10_AUTO/MC4_AUTO) sat at 84.3% queued; MC4_PRE at 41.2%; MC10_PRE at 0%. Average queue waits ranged from 4.6 seconds to 115.6 seconds depending on warehouse.

Queuing on STRESS warehouses · SELECT queries % queued · open-ended sweep

Five warehouse families, one sweep. Adaptive warehouses absorbed the highest queuing; the pre-provisioned MC10_PRE had none. Average wait times ranged from 4.6 s (low-queue end) to 115.6 s (high-queue end). Source: QUERY_HISTORY WHERE QUERY_TYPE=‘SELECT’, stress warehouses only.

The compile-time ratio on COMPUTE_WH is a separate data point from the same sweep: 49.9% of total elapsed time on that warehouse was compilation rather than execution, against a 5.8% account-wide baseline. Half the warehouse’s time was planning, not running. Neither of these showed up in the skilled sweep.

03The cache units bug

I had written “clean” for checks I had never run.

While adjudicating the agent disagreements, I compared the naive findings against my own manual analysis — done carefully, before any agent ran. My report had a “checks that came back clean” section. Two entries: query queuing and disk spill.

Both were wrong. The queuing crisis is real — verified at 84–96% queued on the stress warehouses, with the exact per-warehouse spread above. The disk spill is real to the decimal: 63.55 GB total local spill across the window, the single worst query responsible for 26.75 GB on its own.

I had written “clean” for those entries because I had, in honesty, never properly run the checks. I knew the shape of what to look for but had not executed the queries. The naive agent, with one sentence of context and no playbook, queried those columns and found what I had asserted was not there.

Three analysts · open-ended sweep · final scorecard

Analyst	Unique strengths	Confirmed errors
Human (me)	Domain context, the original ground truth for idle burn	Two false “clean” claims — queuing and disk spill
Skilled-Opus	Attribution, cost-per-pattern (`v_cpu_heavy` = 84.9%), storage duplication, parameter sweep	Cache column misread as percent rather than fraction
Naive-Opus	Queuing (84–96%), disk spill (63.55 GB / 26.75 GB), compile ratio (49.9%), four more governance params	Missed idle-burn attribution; misread a resource-monitor field

Every analyst had blind spots. Every analyst made at least one outright error. And the union of all three, after a four-query adjudication pass, beat every individual — including the human.

05Contradictions are the product

The disagreements weren’t noise around the signal. They were the signal.

The instinct when two agent runs disagree is to treat the discrepancy as a reliability problem — the system is inconsistent, pick the better report, hide the gap. The data says the opposite. Every contradiction in this experiment, when adjudicated, exposed something real: a units bug, a documentation gap, a human’s unexamined assumption.

Each contradiction also came with a built-in test: run the SQL and see who is right. The adjudication is cheap. In my case, four queries settled all the disagreements. Shipping either report alone means shipping its blind spots unexamined. Running both and diffing them surfaces the errors in both — including the human’s.

The operating consequence, which Part 5 builds into a process: on open-ended work, run both configurations, diff the findings, adjudicate every conflict deterministically, and publish the union. The skilled agent finds what its playbook points at. The naive agent follows the data wherever it goes. Neither covers the full territory. Both together, with a human reading the diff, do.

Disk spill — total and worst single queryACCOUNT_USAGEsql

SELECT
  SUM(BYTES_SPILLED_TO_LOCAL_STORAGE)  AS total_spill_bytes,
  MAX(BYTES_SPILLED_TO_LOCAL_STORAGE)  AS max_single_query_bytes,
  SUM(BYTES_SPILLED_TO_REMOTE_STORAGE) AS remote_spill_bytes
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE START_TIME >= DATEADD('day', -30, CURRENT_TIMESTAMP());

On this account the query returned 63.55 GB local spill (max single query 26.75 GB) and 0.11 GB remote. Any result above a few GB on a warehouse smaller than Large is worth investigating: spill means the warehouse ran out of memory and had to write to disk, which compounds into slow queries and inflated metering on a warehouse already under pressure.

Cache hit rate — actual fraction vs percentage mistakeACCOUNT_USAGEsql

-- PERCENTAGE_SCANNED_FROM_CACHE is a fraction (0.0–1.0), not a percentage.
-- Multiply by 100 to display as percent.
SELECT
  ROUND(AVG(PERCENTAGE_SCANNED_FROM_CACHE) * 100, 1) AS avg_cache_pct,
  COUNT_IF(PERCENTAGE_SCANNED_FROM_CACHE > 0.9)       AS queries_above_90pct,
  COUNT(*)                                               AS scanning_queries
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE BYTES_SCANNED > 0
  AND   START_TIME >= DATEADD('day', -30, CURRENT_TIMESTAMP());

The comment in that query is the fix for the units bug: state the scale explicitly. When this column appears in the skilled agent’s schema documentation without a unit note, it reads as a percentage because the column name says “percentage.” The naive agent sampled the values (0.67, 0.91, 0.43) and inferred the correct scale. The lesson is in Part 5’s prompt rules.

What this round does not settle

Stability. One sweep is one sample. Run the same naive configuration tomorrow and you may get different emphasis, different regions of the territory explored. Part 4 is the round where that assumption failed — instructively.
Completion. Both agents finish on open-ended work in this round. They do not always. Part 4 is also where one frontier-model run stopped without composing a report, and how the trace turned that into a lesson rather than a loss.

From our audit

A Crosshire audit runs both configurations — guided and open-ended — diffs the findings, and adjudicates every disagreement with SQL before anything lands in the report. The round described here cost $0.76 in agent time and caught 63.55 GB of disk spill my own notes called clean. If your account has numbers you trust but haven’t queried, the same method finds out.

Start a conversation →

Sources & further reading

Crosshire Journal · The $15 lab: what four AI agents found in one Snowflake account — the anchor finding and the 2×2 design this series is built on
Crosshire Journal · Your Snowflake login count is probably lying to you — the zero-MFA and ACCOUNTADMIN posture this account also surfaced
Crosshire Journal · Adaptive vs classic warehouses — the adaptive queuing behaviour the sweep measured at 94–96%
Snowflake docs · QUERY_HISTORY — ACCOUNT_USAGE — BYTES_SPILLED_TO_LOCAL_STORAGE, PERCENTAGE_SCANNED_FROM_CACHE, and the rest of the columns used here
Snowflake docs · QUERY_ATTRIBUTION_HISTORY — the attributed-compute column that separates idle from work, and that naive-Opus never queried

· · ·

Numbers in this note come from a seeded Crosshire trial Snowflake account — not a customer — snapshot 16–20 May 2026, approximately four days of metered data, re-derived against the same DuckDB the agents queried. Credit-to-dollar figures use the $3.00/credit list rate as a disclosed estimate. A model read every number; a human ratified every one before it shipped. — Crosshire

Darshan Singh

writes Crosshire Journal · crosshire.ch · June 2026

Crosshire Journal

Field reports on data, compute, and the unglamorous decisions that shape engineering teams. Made in EU. Cited evidence, GDPR-native.

Tech·Agentic analytics·Part 3 of 5·2 min read·20 Jun 2026

63.55 GB of spill. I wrote “clean.”

Three analysts swept the same Snowflake export. No analyst dominated. The union of all three — after four adjudication queries — beat everyone, including the human who ran the manual analysis first.

Darshan Singh

Crosshire auditSnowflake · ACCOUNT_USAGEJune 2026

63.55 GB

total disk spill the human analyst missed
26.75 GB · single worst query · 47,420 queries
found by naive-Opus · $0.76 sweep

Queuing rates across stress warehouses. Adaptive (QTM) warehouses hit 94–96%; the pre-provisioned MC10_PRE had none. Source: QUERY_HISTORY, SELECT queries only.

Provenance · what happened

1Both skilled-Opus and naive-Opus swept the same 4-day ACCOUNT_USAGE export open-endedly. Coverage overlapped substantially, but each explored different regions of the same territory.

2The sweeps disagreed on cache: skilled read the column as a percentage, naive sampled values and got it right. Four adjudication queries settled every conflict.

3Checking the naive findings against my own manual analysis: 63.55 GB of disk spill and a queuing crisis I had both marked “clean.” I had never run those queries. The agent had.

Seeded trial account Every number from SQL Method-only public release

01The problem

Two clean claims. Both wrong.

Before any agent ran, I did a manual analysis of the same export. My notes had a “clean” section. Disk spill was in it. Query queuing was in it. Naive-Opus, with one sentence of context, queried those columns and found 63.55 GB of local spill (max single query 26.75 GB) and SELECT queuing rates of 84–96% on the stress warehouses. Neither of those is clean.

02Why

Skilled and naive explore different territory.

Skilled-Opus followed its playbook: attribution, cost-per-pattern, storage, governance parameters. It surfaced the v_cpu_heavy family at 84.9% of attributed compute, 145.8 GB in duplicate table names, and the zero-MFA security posture. It missed the queuing and the spill because neither is a first-pass playbook check.

Naive followed the data wherever it went, but missed the idle-burn attribution entirely because it never queried QUERY_ATTRIBUTION_HISTORY without a playbook to prompt it.

03The fix

Run both. Diff them. Adjudicate every conflict.

The union of all three analysts, after four SQL adjudication queries, beat every individual. The contradictions were not reliability noise. Each one came with a built-in test (run the query) and a built-in fix (document the unit, add the check, correct the record). Shipping either report alone ships its blind spots unexamined.

Want the full sweep?

The long version adds three things this short can’t.

The cache units bug in detail. Why more documentation produced the wrong answer, and the prompt fix.
The queuing chart with all five warehouse families. Real spread from 0% to 96.1%, not a summary range.
Two runnable queries. Disk spill and cache hit rate — copy-paste them on your own account today.

Sources

Darshan Singh

writes Crosshire Journal · crosshire.ch · June 2026

Two-minute field fixes from the same audits as our long-form Journal. One number, one fix, one result you can verify.

Crosshire Quick

Three analysts, three blind spots: the open-ended sweep.

I predicted skilled-Opus would be a superset. It wasn’t close.

96.1% of SELECT queries queued. Waits from 4.6 to 115 seconds.

More documentation. Worse answer.

I had written “clean” for checks I had never run.

The disagreements weren’t noise around the signal. They were the signal.

63.55 GB of spill. I wrote “clean.”

Two clean claims. Both wrong.

Skilled and naive explore different territory.

Run both. Diff them. Adjudicate every conflict.

Journal

Crosshire

Legal