
Tech · Agentic analytics · Part 2 of 5 · 9 min read · 20 Jun 2026

Skills vs model capability: the 2×2, measured.

One directed question — “Why is COMPUTE_WH so expensive?” — sent to four agent configurations. Ground truth first: 43.86 credits metered, 98.92% idle, 0.4751 attributed to query work. The cheapest correct answer cost $0.19. The confident wrong answer cost $0.55 and contained zero queries against the attribution column. Three laws fall out of the traces — and the third one is the one that keeps you up at night.

Darshan Singh

Crosshire audit · Snowflake · ACCOUNT_USAGE · June 2026

$0.19

skilled-Sonnet · correct answer · cheapest run
15 tool calls · 248 s · 2 errors
naive-Sonnet: $0.55, confident, wrong

The 2×2 · directed question · same account, same turn budget

Configuration	Cost	Tool calls	Duration	Errors	Result
Skilled-Sonnet (trace 213316)	$0.1885	15	248 s	2	Correct — cheapest, fastest
Skilled (frontier) (trace 212343)	$0.4934	26	402 s	13	Correct — culprit vague
Naive-Opus (trace 212903)	$0.6992	24	343 s	8	Correct & deepest root cause
Naive-Sonnet (trace 214051)	$0.5504	29	471 s	6	Confident — wrong

In this report

The question and the ground truth
The grid, cell by cell
Three laws from four traces
The confident wrong answer
What the grid leaves open

Series · Skills are the floor, not the ceiling

The $15 lab: the anchor finding and the 2×2 design
Skills vs model: the directed question and three laws
Union beats everyone: the open-ended sweep (soon)
Variance and budget death: traces as referee (soon)
The operating model: prompt, tiers, economics (soon)

Provenance

43.86 credits, 4 days

98.92% idle

$0.19 cheapest correct

4 traces compared

0 attribution queries, naive-Sonnet

Account: A seeded Crosshire trial Snowflake account, not a customer
Window: 16–20 May 2026, ~4 days of metered data (47,420 queries, 39 warehouses)
Question: “Why is COMPUTE_WH so expensive?” — directed, four configurations, same turn budget
Cheap tier: Sonnet — there are no Haiku traces. The cheap cell is Sonnet throughout this series.
Cost basis: $3.00/credit Snowflake list rate, disclosed as an estimate

Seeded trial account · Every number from SQL or trace JSONL · Method-only public release

01 The question and the ground truth

43.86 credits, 98.92% idle. The agents didn’t know that yet.

Before any agent ran, the ground truth was established by hand against the DuckDB. COMPUTE_WH metered 43.86 credits over four days — 53.2% of the account’s warehouse credits. Of those, 0.4751 were attributed to actual query work by QUERY_ATTRIBUTION_HISTORY. The remainder — 98.92% idle — was a warehouse billing for being awake. The mechanism, established in Part 1: a notebook container service with AUTO_SUSPEND_SECS = 0 polling around the clock, 37,399 inter-query gaps on COMPUTE_WH, only 120 of them (0.32%) exceeding the 60-second auto-suspend window. The warehouse never slept.

That is the target. Four configurations were then asked the same question: “Why is COMPUTE_WH so expensive?” Skilled agents get the database schema, unit conventions, and five investigative playbooks. Naive agents get one sentence. The cheap tier is Sonnet; the frontier tier is Opus — except for one trace (212343) where the model was set to “default” and the exact model version was not recorded. That run is labeled skilled (frontier) in the table and throughout this note; calling it Opus would assert something the trace does not confirm.

02 The grid, cell by cell

Four cells. Three correct answers. One fluent wrong one.

Skilled-Sonnet (trace 213316): $0.1885, 15 tool calls, 248 seconds, 2 errors. Correct. The playbook’s opening move — compare metered credits to attributed credits, then identify the query types keeping the warehouse awake — is exactly the causal model needed. The cheap model didn’t need to derive how Snowflake billing works from first principles. It needed to execute a competent investigation of a well-understood domain, and it did. Minor unit wobbles in savings estimates appear in the trace, consistent with the model executing playbook steps rather than reasoning from scratch. The production consequence: for recurring, well-understood question types, this cell is the answer. Fifteen tool calls. Under $0.20.

Skilled (frontier) (trace 212343): $0.4934, 26 tool calls, 402 seconds, 13 errors. Correct: the idle burn identified, the polling behaviour surfaced, but the culprit stayed vague. The skilled run’s playbook ends at “polling operations.” The frontier model executing that playbook stops where the playbook stops. It found the right answer more cheaply than the naive frontier run ($0.4934 vs $0.6992), but not faster (402 s vs 343 s) and not more deeply. At 2.62× the cost of skilled-Sonnet ($0.4934 / $0.1885) and with 13 errors vs 2, this is the cell where you are paying for model capability and getting playbook execution.

Naive-Opus (trace 212903): $0.6992, 24 tool calls, 343 seconds, 8 errors. Correct, and the deepest root cause of any run. With no playbook to constrain its path, the frontier model did what no skilled run did: it pulled the most-repeated query texts and read them. There sat SYSTEM$NOTEBOOK_CONTAINER_PARAMS — the notebook-container heartbeat, one level deeper than my playbook’s vocabulary. It then proved the mechanism statistically: of 37,399 inter-query gaps, only 120 (0.32%) exceeded the 60-second suspend window. My playbook does not contain that gap analysis, because I had not thought to write it before this run. The unconstrained frontier run is how the playbooks grow.
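A minimal sketch of that gap analysis, for readers who want to repeat it on their own query history. It assumes a gap is the time between consecutive query start times on one warehouse, which is a simplification; the function name `gap_profile` and the toy timestamps are illustrative and do not come from the trace.

```python
from datetime import datetime, timedelta

def gap_profile(start_times, suspend_window_s=60.0):
    """Return (gap_count, gaps_over_window, share_over_window) for one warehouse."""
    ordered = sorted(start_times)
    # Gap = seconds between one query start and the next (assumed definition).
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    over = sum(1 for g in gaps if g > suspend_window_s)
    share = over / len(gaps) if gaps else 0.0
    return len(gaps), over, share

# Toy data, not from the audit: a heartbeat every 20 s, then one 5-minute pause.
t0 = datetime(2026, 5, 16)
toy = [t0 + timedelta(seconds=20 * i) for i in range(10)]
toy.append(toy[-1] + timedelta(minutes=5))

print(gap_profile(toy))

# The share reported in this note, recomputed from its two counts.
print(f"{120 / 37399:.2%}")
```

If almost no gap exceeds the suspend window, the warehouse rarely gets the chance to suspend, which is the pattern the run found on COMPUTE_WH.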

Naive-Sonnet (trace 214051): $0.5504, 29 tool calls, 471 seconds, 6 errors. Wrong. This is the cell to examine.

Figure: 2×2 · cost vs correctness · 4 traces · same question · same account

Cost is not a proxy for correctness. Naive-Sonnet at $0.55 cost more than skilled-Sonnet at $0.19 and was wrong. Naive-Opus at $0.70 went deepest. The two skilled runs converged on the correct answer; their cost gap is the price of frontier model capability when the playbook already carries the causal model.
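The same comparison can be recomputed from the table, as a rough sketch. The figures are copied from the 2×2 table in this note; the dictionary keys and the choice of the cheapest correct run as baseline are illustrative assumptions, not part of the audit rig.

```python
# Figures copied from the 2×2 table in this note.
runs = [
    {"cell": "skilled-Sonnet", "cost": 0.1885, "calls": 15, "secs": 248, "correct": True},
    {"cell": "skilled (frontier)", "cost": 0.4934, "calls": 26, "secs": 402, "correct": True},
    {"cell": "naive-Opus", "cost": 0.6992, "calls": 24, "secs": 343, "correct": True},
    {"cell": "naive-Sonnet", "cost": 0.5504, "calls": 29, "secs": 471, "correct": False},
]

# Baseline: the cheapest run that reached the correct answer.
baseline = min((r for r in runs if r["correct"]), key=lambda r: r["cost"])

# List every run by cost, with its multiple of the baseline and its verdict.
for r in sorted(runs, key=lambda r: r["cost"]):
    ratio = r["cost"] / baseline["cost"]
    verdict = "correct" if r["correct"] else "wrong"
    print(f'{r["cell"]:<20} ${r["cost"]:.4f}  {ratio:.2f}x  {verdict}')
```

Sorting by cost puts the wrong run in the middle of the list, which is the point: nothing about the price tag separates it from the correct ones.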

03 Three laws from four traces

What the playbook buys, what the model buys, and where they stop.

Law 1: Skills substitute for model capability on known questions. Skilled-Sonnet delivered essentially the same answer as the frontier skilled run at 38% of the cost ($0.1885 / $0.4934 = 38.2%). The playbook’s first move is the causal model: compare metered to attributed credits; a large gap means idle burn; then identify what keeps the warehouse awake. The cheap model does not need to derive that from domain knowledge it may or may not have internalized. It executes the investigation competently, including arithmetic the playbook never spells out. For recurring, known question types, the production implication is direct: you do not pay frontier rates. You pay Sonnet rates.

Law 2: Model capability substitutes for skills on unknown questions. Naive-Opus had no map so it read the territory. It pulled query texts and found the notebook heartbeat function. It ran a gap distribution the playbook had not imagined. It discovered the mechanism one level deeper than the playbook’s vocabulary. The uncomfortable symmetry: both skilled runs — cheap model and frontier model alike — stopped at exactly the same depth. The depth where the playbook stops. Skills encode the known frontier; they cap discovery there, because you cannot write playbooks beyond what you already know. The unconstrained frontier run is not a control group. It is how the playbooks get updated.

Law 3: Cheap-and-naive produces confident wrong analysis with no formatting tell. This is the one that matters for deployment decisions, so the next section covers it fully.

04 The confident wrong answer

Facts, no causal framework, no formatting tell.

Naive-Sonnet found real facts. The overwhelming majority of queries on COMPUTE_WH were repeats. Metadata operations dominated. It even speculated “likely from notebooks or BI tools.” It brushed against the right answer. Then it assembled those facts into a wrong causal model: each query incurs a minimum credit cost; volume is the main driver; large scans are secondary. Its recommendations — query caching, WHERE clauses — would have consumed an engineering sprint while the warehouse kept burning.

The damning detail is in the trace. Naive-Sonnet ran zero queries against the credit-attribution column. It never compared metered credits to attributed credits. It never computed the 98.92% idle figure. It invented a billing model where query volume costs money, because it lacked the domain knowledge to arrange the facts it found — and nobody had supplied it.

Note what this failure is not. It is not hallucination. It is not incompetence at SQL. Every number in its report is real. The failure is facts without a causal framework, delivered with the same fluent confidence and the same formatting as the correct answers in the other three cells. There is no hedging tell. There is no low-confidence qualifier. The only way to catch it is to know the domain well enough to check the claim — or to read the trace.

The trace is the only witness that doesn’t lie. Naive-Sonnet’s trace shows 29 tool calls and not one of them touches CREDITS_ATTRIBUTED_COMPUTE. Every dispute about what an agent actually did reduces to this: grep the JSONL.
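A sketch of what "grep the JSONL" can look like in practice. The trace schema is not described in this note, so the code assumes only that each line is one JSON record and searches the whole record for the identifier; the function name `calls_mentioning` and the sample records are hypothetical.

```python
import json

def calls_mentioning(lines, needle):
    """Return (hits, total): JSONL records that mention `needle`, case-insensitively."""
    hits = 0
    total = 0
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        total += 1
        record = json.loads(raw)  # raises on a corrupt line rather than skipping it
        if needle.lower() in json.dumps(record).lower():
            hits += 1
    return hits, total

# Hypothetical records; the real trace fields may be named differently.
sample = [
    '{"tool": "sql", "input": "SELECT COUNT(*) FROM query_history"}',
    '{"tool": "sql", "input": "SELECT SUM(credits_attributed_compute) FROM query_attribution_history"}',
]

print(calls_mentioning(sample, "CREDITS_ATTRIBUTED_COMPUTE"))
```

The function accepts any iterable of lines, so an open file handle for a trace works the same way as the list above. A count of zero hits is the signal described here: the run never touched the attribution column. Note that it counts records, which may or may not map one-to-one to tool calls in a given trace format.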

Two disciplines this experiment does not relax

The model never produces a number. Every credit figure, every percentage, every gap count in this note comes from a SQL query against the DuckDB — or from the trace JSONL itself. The model reads numbers; it does not compute the ones it reports. When the naive-Sonnet analysis says “queries cost credits,” that claim is wrong — not because the model invented a number, but because it assembled real numbers into the wrong model. The discipline catches the second kind of error only if a person checks the reasoning, not just the numbers.

A human ratifies everything. The 98.92% idle figure is not in the note because an agent asserted it. It is there because a query against QUERY_ATTRIBUTION_HISTORY returned 0.4751 attributed credits, and 1 − (0.4751 / 43.86) = 0.9892. The agent read that number; a person checked the arithmetic and confirmed the denominator. That step is not optional. Law 3 exists precisely because it is skippable.
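The check above is small enough to rerun in a few lines. This is only the arithmetic, using the two credit figures quoted in this note; the variable names are illustrative.

```python
metered = 43.86      # credits metered on COMPUTE_WH (figure from this note)
attributed = 0.4751  # credits attributed to query work (figure from this note)

# Idle share: the part of metered credits not attributed to any query.
idle_fraction = 1 - attributed / metered
idle_credits = metered - attributed

print(f"idle fraction: {idle_fraction:.4f}")
print(f"idle credits:  {idle_credits:.4f}")
```

Confirming the denominator means confirming that `metered` covers the same warehouse and the same window as `attributed`; the code cannot check that for you.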

A frame that made this concrete for several colleagues: the playbook is a senior engineer’s knowledge, written down. Skilled-Sonnet is a junior analyst with an excellent runbook — fast, cheap, reliable inside the runbook’s borders. Naive-Opus is a brilliant outside consultant with no context — expensive, slower, occasionally finds the thing nobody internal could. Naive-Sonnet is a junior analyst with no runbook working a domain they do not understand — and the analysis looks exactly like the correct one. Nobody staffs the last configuration knowingly. The grid says: with agents, you can deploy it by accident, and the per-run cost will tempt you to.

05 What the grid leaves open

A directed question is the easy case. Two harder ones remained.

The directed-question setup is a controlled test — there is a known answer to converge on, and “correct” has a crisp definition. Real audits rarely work this way. Two harder questions followed this grid and each one changed what I thought I knew.

The first: what happens on a fully open-ended sweep — “find everything” — where there is no single answer and coverage matters more than depth? The second: how stable are these results run to run? On a directed question, the agents converged. On an open sweep, they did not. And on the stability question, I was wrong about something I had written down in my own analysis. Both of those are in Part 3.

One open verification item from this grid: the trace labeled 212343 ran with --model default. What model “default” resolved to at runtime is not recorded in the JSONL. The cost ($0.4934 for 26 calls, 402 seconds) is consistent with a frontier-tier model, but the exact model ID is unknown. That cell is labeled “skilled (frontier)” throughout this series rather than “skilled-Opus,” and the distinction matters when the series reaches the economics in Part 5.

Not in this post

The open-ended sweep results. Covered in Part 3: which cells dominated, which findings appeared in only one, and what the union of all analysts — human included — got wrong.
Run-to-run variance. The directed question is stable. The open sweep is not. Part 4 covers what changed between two naive-Opus sweeps and why the trace settled the disagreement.
The full economics. Skilled-Sonnet at $0.19 per question, skilled-Opus audit at $1.77, the idle burn run-rate in dollars. Part 5.

From our audit

Naive-Sonnet’s $0.55 run produced a confident, well-formatted wrong answer. The only way to catch it was reading the trace — or knowing the domain. The Crosshire audit method supplies the domain knowledge in the playbook and the human check in the sign-off. If you want to know what a skilled agent finds in your account, the same rig runs on your own export.


Sources & further reading

Crosshire Journal · The $15 lab: what four AI agents found in one Snowflake account — Part 1: the lab design, the anchor finding, the 2×2 setup
Crosshire Journal · Your Snowflake login count is probably lying to you — the zero-MFA and DATABRICKS_READER signals this account also raised
Crosshire Journal · Adaptive vs classic warehouses — how the two billing models differ; the adaptive warehouses in this account appear in Part 3
Snowflake docs · QUERY_ATTRIBUTION_HISTORY — the table naive-Sonnet never queried; the column that proves idle burn
Snowflake docs · WAREHOUSE_METERING_HISTORY — metered credits per warehouse; the denominator in the idle-pct calculation

· · ·

Numbers in this note come from a seeded Crosshire trial Snowflake account — not a customer — snapshot 16–20 May 2026, approximately four days of metered data, re-derived against the same DuckDB the agents queried. Trace figures (cost, tool calls, duration, error counts) come from the nine on-disk JSONL files. Credit-to-dollar figures use the $3.00/credit Snowflake list rate, disclosed as an estimate. The model read every number; a human ratified every one before it shipped. — Crosshire

Darshan Singh

writes Crosshire Journal · crosshire.ch · June 2026

Crosshire Journal

Field reports on data, compute, and the unglamorous decisions that shape engineering teams. Made in EU. Cited evidence, GDPR-native.

Tech · Agentic analytics · Part 2 of 5 · 2 min read · 20 Jun 2026

$0.19 correct. $0.55 confident. Wrong.

Four agents, one Snowflake question, same account. The cheapest correct answer cost $0.19. The confident wrong answer cost $0.55 — and its trace contains zero queries against the attribution column.

Darshan Singh

Crosshire audit · Snowflake · ACCOUNT_USAGE · June 2026

$0.19

skilled-Sonnet · correct, fastest
naive-Sonnet: $0.55 · confident · wrong
0 attribution queries in the naive-Sonnet trace

Cost is not a proxy for correctness. Naive-Sonnet cost more than skilled-Sonnet and was wrong. Naive-Opus cost the most and went deepest.

Provenance · what happened

1. Ground truth established by hand against the DuckDB: COMPUTE_WH metered 43.86 credits over four days, with 0.4751 attributed to query work by QUERY_ATTRIBUTION_HISTORY. That is 98.92% idle.

2. Four configurations asked the same directed question: “Why is COMPUTE_WH so expensive?” Skilled agents carried the database schema and five investigative playbooks; naive agents received one sentence and no schema. Three of four found idle burn.

3. Skilled-Sonnet answered correctly for $0.19 in 15 tool calls. Naive-Sonnet cost $0.55 and produced a fluent, well-formatted answer, yet its trace shows zero queries against CREDITS_ATTRIBUTED_COMPUTE. It never computed the idle figure.

Seeded trial account · Every number from SQL or trace JSONL · Method-only public release

01 The problem

One question. Four answers.

The question: “Why is COMPUTE_WH so expensive?” Ground truth: 43.86 credits metered, 98.92% idle. Three of four agents found idle burn. Naive-Sonnet wrote a confident, well-formatted answer built on the wrong model — query volume drives cost, not idle time. Its recommendations would have sent engineers chasing cache hit rates while the warehouse kept billing for being awake.

02 Why it matters

No formatting tell. No hedging tell.

Naive-Sonnet’s output carries the same fluent confidence and the same formatting as skilled-Sonnet’s. Cheap-and-naive is the configuration you can deploy by accident, because it looks correct and it is cheap.

03 Check your own account

Metered vs attributed. One query.

The query naive-Sonnet never ran. Run this on your own account — any warehouse with a high pct_idle is billing for being awake, not working.

Idle vs working, per warehouse · ACCOUNT_USAGE · sql

-- Metered credits per warehouse over the last 30 days.
-- CREDITS_USED includes cloud services; CREDITS_USED_COMPUTE is the compute-only column.
WITH metered AS (
  SELECT WAREHOUSE_NAME, SUM(CREDITS_USED) AS credits
  FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
  WHERE START_TIME >= DATEADD('day', -30, CURRENT_TIMESTAMP())
  GROUP BY 1
),
-- Credits attributed to query execution; warehouse idle time is excluded.
attributed AS (
  SELECT WAREHOUSE_NAME, SUM(CREDITS_ATTRIBUTED_COMPUTE) AS credits
  FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_ATTRIBUTION_HISTORY
  WHERE START_TIME >= DATEADD('day', -30, CURRENT_TIMESTAMP())
  GROUP BY 1
)
-- pct_idle = share of metered credits not attributed to any query.
SELECT m.WAREHOUSE_NAME,
       ROUND(m.credits, 2) AS metered_credits,
       ROUND(COALESCE(a.credits, 0), 2) AS query_credits,
       ROUND(100 * (1 - COALESCE(a.credits, 0) / NULLIF(m.credits, 0)), 1) AS pct_idle
FROM metered m
LEFT JOIN attributed a USING (WAREHOUSE_NAME)
ORDER BY metered_credits DESC;

Want the full traces?

The long version adds three things this short can’t.

All four cells, line by line. What each configuration queried and why skilled-frontier hit 13 errors vs skilled-Sonnet’s 2.
Three laws, grounded in data. Where playbook and model substitute for each other — and where they stop.
The gap analysis naive-Opus discovered. 37,399 gaps, 120 exceeding 60 s (0.32%) — the finding no playbook run reached.

Sources

Snowflake docs · QUERY_ATTRIBUTION_HISTORY — the table naive-Sonnet never queried; CREDITS_ATTRIBUTED_COMPUTE is the column that proves idle burn
Crosshire Journal · The $15 lab: what four AI agents found in one Snowflake account — Part 1: the lab design, the anchor finding, and the 2×2 setup this experiment runs on

Darshan Singh

writes Crosshire Journal · crosshire.ch · June 2026

Two-minute field fixes from the same audits as our long-form Journal. One number, one fix, one result you can verify.

Crosshire Quick
