Crosshire
The scorecard — who found what, who was wrong Full sweep, every contradiction, the adjudication SQL
Tech·Agentic analytics·Part 3 of 5·10 min read·20 Jun 2026

Three analysts, three blind spots: the open-ended sweep.

The prompt: find every insight in this data — FinOps, performance, ops, whatever else. The result: 63.55 GB of disk spill, one query responsible for 26.75 GB, and I had written “clean” against both of those in my manual notes. A naive agent with one sentence of context falsified two of my own assertions. Then skilled-Opus made a units error that naive didn’t. No analyst dominated. The union beat everyone.

63.55 GB
total disk spill the human analyst missed
26.75 GB · single worst query · 47,420 queries total
found by naive-Opus · open-ended sweep · $0.76
Open-ended sweep · three analysts · same 4-day export
Analyst Notable finds Confirmed errors
Human Domain context, ground-truth idle burn Two false “clean” claims: queuing, disk spill
Skilled-Opus Attribution, cost-per-pattern, storage duplication, account parameters Cache column misread as percentage (units bug)
Naive-Opus Queuing crisis, disk spill, compile ratio, four more governance params Missed the headline idle-burn number entirely
In this report
  1. The open-ended sweep and what each analyst found
  2. The queuing crisis — real numbers, real spread
  3. The cache units bug — more docs, worse answer
  4. The human’s false clean claims
  5. Contradictions are the product
Series · Skills are the floor, not the ceiling
  1. The $15 lab — the anchor finding and the 2×2 design
  2. Skills vs model — the directed question and three lawssoon
  3. Three analysts, three blind spots — the open-ended sweep
  4. Variance and budget death — traces as refereesoon
  5. The operating model — prompt, tiers, economicssoon
Provenance
63.55GB disk spill
26.75GB, worst query
61.9%cache avg
79.3%ACCOUNTADMIN
$0.76sweep cost
Account
A seeded Crosshire trial Snowflake account, not a customer
Window
16–20 May 2026, ~4 days of metered data (47,420 queries, 39 warehouses)
Sweep run
Trace 234345 · skilled-Opus · 31 tool calls · 9.6 min · $0.7595
Sources
QUERY_HISTORY · WAREHOUSE_METERING_HISTORY · QUERY_ATTRIBUTION_HISTORY · ACCOUNT_PARAMETERS
Cost basis
$3.00/credit Snowflake list rate, disclosed as an estimate
Seeded trial account Every number from SQL Method-only public release
01The sweep, and what each analyst found

I predicted skilled-Opus would be a superset. It wasn’t close.

Part 2 of this series tested a directed question — “why is COMPUTE_WH expensive?” — where there is a known right answer. This round is the harder test: an open-ended sweep with no single correct scope. Coverage is the game.

My prediction: skilled-Opus — frontier model plus five playbooks plus an anomaly-discovery checklist — would produce a superset of everything worth finding, and naive-Opus would be a noisier subset. I would spend the round grading skilled’s report against my manual notes and call it done.

The skilled sweep (trace 234345, 31 tool calls, 9.6 minutes, $0.7595) surfaced the idle-burn attribution, a single query pattern responsible for 84.9% of attributed compute on COMPUTE_WH (the v_cpu_heavy family), 145.8 GB of active storage held in duplicate table names across 11 names, the security posture (zero MFA on both users, ACCOUNTADMIN as the running role for 79.3% of all queries), and several non-default account parameters with governance implications — cross-region AI processing enabled, session-scoped temp objects disabled.

Naive-Opus finished its sweep with substantial overlap with skilled on those findings. The majority of naive’s findings, however, were things skilled never looked at. They included the two items that falsified my own notes, which I will come to in §4. They also included a queuing crisis and a compile-time anomaly that skilled skipped entirely, because neither appears in my playbooks as a first-pass check.

And naive-Opus missed the headline idle-burn number — the finding every skilled run leads with, because the playbook’s first query computes it. Without the playbook, naive did not look at QUERY_ATTRIBUTION_HISTORY. The most important number in the account was invisible to it.

Neither report dominated. Skilled and naive did not produce better and worse versions of the same analysis. They explored different regions of the same territory, because one was guided by a checklist of known places and the other by whatever the data suggested.
02The queuing crisis

96.1% of SELECT queries queued. Waits from 4.6 to 115 seconds.

The naive sweep flagged a queuing problem on the stress-test warehouses. Adjudication against QUERY_HISTORY confirmed it — and the real numbers are more varied than the draft summary suggested. The adaptive warehouses (QTM2, QTM4) had 96.1% of their SELECT queries queued; QTM10 was at 94.1%. The classic stress warehouses (MC1/MC10_AUTO/MC4_AUTO) sat at 84.3% queued; MC4_PRE at 41.2%; MC10_PRE at 0%. Average queue waits ranged from 4.6 seconds to 115.6 seconds depending on warehouse.

Queuing on STRESS warehouses · SELECT queries % queued · open-ended sweep
SELECT QUERIES QUEUED · STRESS WAREHOUSES 0% 50% 75% 100% 96.1% QTM2/QTM4 adaptive 94.1% QTM10 adaptive 84.3% MC classic MC1/MC10/MC4 41.2% MC4_PRE 0% MC10_PRE
Five warehouse families, one sweep. Adaptive warehouses absorbed the highest queuing; the pre-provisioned MC10_PRE had none. Average wait times ranged from 4.6 s (low-queue end) to 115.6 s (high-queue end). Source: QUERY_HISTORY WHERE QUERY_TYPE=‘SELECT’, stress warehouses only.

The compile-time ratio on COMPUTE_WH is a separate data point from the same sweep: 49.9% of total elapsed time on that warehouse was compilation rather than execution, against a 5.8% account-wide baseline. Half the warehouse’s time was planning, not running. Neither of these showed up in the skilled sweep.

03The cache units bug

More documentation. Worse answer.

The two sweeps disagreed outright on cache. Skilled: average cache hit roughly 0.6%, near zero. Naive: several warehouses at 99%-plus cache. Both cannot be correct.

Adjudication took one query. The PERCENTAGE_SCANNED_FROM_CACHE column in this export is a fraction (0.0–1.0), not a percentage. A sample row returns 0.6787 — meaning 67.9% of data scanned from cache, not 0.6787%. Skilled-Opus read the column as a percentage. Naive-Opus, with no schema documentation to anchor on, sampled the values and read them correctly.

The verified numbers: average cache fraction 0.6194 (61.9%) across 2,458 queries that scanned any data; 1,366 queries had a cache fraction above 0.9 (above 90%). That is a well-cached account, not a near-zero one.

The configuration with more documentation made the units error, because partial documentation licenses assumptions. My schema notes recorded time units in milliseconds and said nothing about the cache column. Naive, which assumes nothing because it was told nothing, sampled the values and got it right.

This is now a prompt rule in Part 5: document the unit of every numeric column explicitly, or leave the column undocumented rather than risk a silent assumption. Partial documentation is not always safer than no documentation.

Two rules this finding did not change
The model never produces a number. The cache fraction 0.6194 came from avg(PERCENTAGE_SCANNED_FROM_CACHE) run against the DuckDB. The model reported the query result; it did not calculate the figure itself.
A human ratifies everything. The units bug was caught by running a second query — a human checking the disagreement, not an agent flagging its own error. The contradiction created the test; the human ran it.
04The human’s false clean claims

I had written “clean” for checks I had never run.

While adjudicating the agent disagreements, I compared the naive findings against my own manual analysis — done carefully, before any agent ran. My report had a “checks that came back clean” section. Two entries: query queuing and disk spill.

Both were wrong. The queuing crisis is real — verified at 84–96% queued on the stress warehouses, with the exact per-warehouse spread above. The disk spill is real to the decimal: 63.55 GB total local spill across the window, the single worst query responsible for 26.75 GB on its own.

I had written “clean” for those entries because I had, in honesty, never properly run the checks. I knew the shape of what to look for but had not executed the queries. The naive agent, with one sentence of context and no playbook, queried those columns and found what I had asserted was not there.

Three analysts · open-ended sweep · final scorecard
Analyst Unique strengths Confirmed errors
Human (me) Domain context, the original ground truth for idle burn Two false “clean” claims — queuing and disk spill
Skilled-Opus Attribution, cost-per-pattern (v_cpu_heavy = 84.9%), storage duplication, parameter sweep Cache column misread as percent rather than fraction
Naive-Opus Queuing (84–96%), disk spill (63.55 GB / 26.75 GB), compile ratio (49.9%), four more governance params Missed idle-burn attribution; misread a resource-monitor field

Every analyst had blind spots. Every analyst made at least one outright error. And the union of all three, after a four-query adjudication pass, beat every individual — including the human.

05Contradictions are the product

The disagreements weren’t noise around the signal. They were the signal.

The instinct when two agent runs disagree is to treat the discrepancy as a reliability problem — the system is inconsistent, pick the better report, hide the gap. The data says the opposite. Every contradiction in this experiment, when adjudicated, exposed something real: a units bug, a documentation gap, a human’s unexamined assumption.

Each contradiction also came with a built-in test: run the SQL and see who is right. The adjudication is cheap. In my case, four queries settled all the disagreements. Shipping either report alone means shipping its blind spots unexamined. Running both and diffing them surfaces the errors in both — including the human’s.

The operating consequence, which Part 5 builds into a process: on open-ended work, run both configurations, diff the findings, adjudicate every conflict deterministically, and publish the union. The skilled agent finds what its playbook points at. The naive agent follows the data wherever it goes. Neither covers the full territory. Both together, with a human reading the diff, do.

Disk spill — total and worst single queryACCOUNT_USAGEsql
SELECT
  SUM(BYTES_SPILLED_TO_LOCAL_STORAGE)  AS total_spill_bytes,
  MAX(BYTES_SPILLED_TO_LOCAL_STORAGE)  AS max_single_query_bytes,
  SUM(BYTES_SPILLED_TO_REMOTE_STORAGE) AS remote_spill_bytes
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE START_TIME >= DATEADD('day', -30, CURRENT_TIMESTAMP());

On this account the query returned 63.55 GB local spill (max single query 26.75 GB) and 0.11 GB remote. Any result above a few GB on a warehouse smaller than Large is worth investigating: spill means the warehouse ran out of memory and had to write to disk, which compounds into slow queries and inflated metering on a warehouse already under pressure.

Cache hit rate — actual fraction vs percentage mistakeACCOUNT_USAGEsql
-- PERCENTAGE_SCANNED_FROM_CACHE is a fraction (0.0–1.0), not a percentage.
-- Multiply by 100 to display as percent.
SELECT
  ROUND(AVG(PERCENTAGE_SCANNED_FROM_CACHE) * 100, 1) AS avg_cache_pct,
  COUNT_IF(PERCENTAGE_SCANNED_FROM_CACHE > 0.9)       AS queries_above_90pct,
  COUNT(*)                                               AS scanning_queries
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE BYTES_SCANNED > 0
  AND   START_TIME >= DATEADD('day', -30, CURRENT_TIMESTAMP());

The comment in that query is the fix for the units bug: state the scale explicitly. When this column appears in the skilled agent’s schema documentation without a unit note, it reads as a percentage because the column name says “percentage.” The naive agent sampled the values (0.67, 0.91, 0.43) and inferred the correct scale. The lesson is in Part 5’s prompt rules.

What this round does not settle
  • Stability. One sweep is one sample. Run the same naive configuration tomorrow and you may get different emphasis, different regions of the territory explored. Part 4 is the round where that assumption failed — instructively.
  • Completion. Both agents finish on open-ended work in this round. They do not always. Part 4 is also where one frontier-model run stopped without composing a report, and how the trace turned that into a lesson rather than a loss.
From our audit
A Crosshire audit runs both configurations — guided and open-ended — diffs the findings, and adjudicates every disagreement with SQL before anything lands in the report. The round described here cost $0.76 in agent time and caught 63.55 GB of disk spill my own notes called clean. If your account has numbers you trust but haven’t queried, the same method finds out.
Start a conversation →
Sources & further reading
· · ·

Numbers in this note come from a seeded Crosshire trial Snowflake account — not a customer — snapshot 16–20 May 2026, approximately four days of metered data, re-derived against the same DuckDB the agents queried. Credit-to-dollar figures use the $3.00/credit list rate as a disclosed estimate. A model read every number; a human ratified every one before it shipped. — Crosshire

D
writes Crosshire Journal · crosshire.ch · June 2026
Crosshire Journal
Field reports on data, compute, and the unglamorous decisions that shape engineering teams. Made in EU. Cited evidence, GDPR-native.
Tech·Agentic analytics·Part 3 of 5·2 min read·20 Jun 2026

63.55 GB of spill. I wrote “clean.”

Three analysts swept the same Snowflake export. No analyst dominated. The union of all three — after four adjudication queries — beat everyone, including the human who ran the manual analysis first.

63.55 GB
total disk spill the human analyst missed
26.75 GB · single worst query · 47,420 queries
found by naive-Opus · $0.76 sweep
SELECT QUERIES QUEUED · STRESS WAREHOUSES 96.1% QTM2/4 94.1% QTM10 84.3% MC classic 41.2% MC4_PRE 0% MC10_PRE
Queuing rates across stress warehouses. Adaptive (QTM) warehouses hit 94–96%; the pre-provisioned MC10_PRE had none. Source: QUERY_HISTORY, SELECT queries only.
Provenance · what happened

1Both skilled-Opus and naive-Opus swept the same 4-day ACCOUNT_USAGE export open-endedly. Coverage overlapped substantially, but each explored different regions of the same territory.

2The sweeps disagreed on cache: skilled read the column as a percentage, naive sampled values and got it right. Four adjudication queries settled every conflict.

3Checking the naive findings against my own manual analysis: 63.55 GB of disk spill and a queuing crisis I had both marked “clean.” I had never run those queries. The agent had.

01The problem

Two clean claims. Both wrong.

Before any agent ran, I did a manual analysis of the same export. My notes had a “clean” section. Disk spill was in it. Query queuing was in it. Naive-Opus, with one sentence of context, queried those columns and found 63.55 GB of local spill (max single query 26.75 GB) and SELECT queuing rates of 84–96% on the stress warehouses. Neither of those is clean.

02Why

Skilled and naive explore different territory.

Skilled-Opus followed its playbook: attribution, cost-per-pattern, storage, governance parameters. It surfaced the v_cpu_heavy family at 84.9% of attributed compute, 145.8 GB in duplicate table names, and the zero-MFA security posture. It missed the queuing and the spill because neither is a first-pass playbook check.

Naive followed the data wherever it went, but missed the idle-burn attribution entirely because it never queried QUERY_ATTRIBUTION_HISTORY without a playbook to prompt it.

03The fix

Run both. Diff them. Adjudicate every conflict.

The union of all three analysts, after four SQL adjudication queries, beat every individual. The contradictions were not reliability noise. Each one came with a built-in test (run the query) and a built-in fix (document the unit, add the check, correct the record). Shipping either report alone ships its blind spots unexamined.

Want the full sweep?
The long version adds three things this short can’t.
  • The cache units bug in detail. Why more documentation produced the wrong answer, and the prompt fix.
  • The queuing chart with all five warehouse families. Real spread from 0% to 96.1%, not a summary range.
  • Two runnable queries. Disk spill and cache hit rate — copy-paste them on your own account today.
D
writes Crosshire Journal · crosshire.ch · June 2026
Two-minute field fixes from the same audits as our long-form Journal. One number, one fix, one result you can verify.
Crosshire Quick
© 2026 Crosshire Journal · Made in EU Privacy Terms Cookies License Imprint Coffee