Crosshire
Tech·Agentic analytics·Part 2 of 5·9 min read·20 Jun 2026

Skills vs model capability: the 2×2, measured.

One directed question — “Why is COMPUTE_WH so expensive?” — sent to four agent configurations. Ground truth first: 43.86 credits metered, 98.92% idle, 0.4751 attributed to query work. The cheapest correct answer cost $0.19. The confident wrong answer cost $0.55 and contained zero queries against the attribution column. Three laws fall out of the traces — and the third one is the one that keeps you up at night.

$0.19
skilled-Sonnet · correct answer · cheapest run
15 tool calls · 248 s · 2 errors
naive-Sonnet: $0.55, confident, wrong
The 2×2 · directed question · same account, same turn budget
Configuration | Trace | Cost | Tool calls | Duration | Errors | Result
Skilled-Sonnet | 213316 | $0.1885 | 15 | 248 s | 2 | Correct; cheapest, fastest
Skilled (frontier) | 212343 | $0.4934 | 26 | 402 s | 13 | Correct; culprit vague
Naive-Opus | 212903 | $0.6992 | 24 | 343 s | 8 | Correct; deepest root cause
Naive-Sonnet | 214051 | $0.5504 | 29 | 471 s | 6 | Confident; wrong
In this report
  1. The question and the ground truth
  2. The grid, cell by cell
  3. Three laws from four traces
  4. The confident wrong answer
  5. What the grid leaves open
Series · Skills are the floor, not the ceiling
  1. The $15 lab — the anchor finding and the 2×2 design
  2. Skills vs model — the directed question and three laws
  3. Union beats everyone: the open-ended sweep (soon)
  4. Variance and budget death: traces as referee (soon)
  5. The operating model: prompt, tiers, economics (soon)
Provenance
43.86 credits, 4 days
98.92% idle
$0.19 cheapest correct
4 traces compared
0 attribution queries, naive-Sonnet
Account
A seeded Crosshire trial Snowflake account, not a customer
Window
16–20 May 2026, ~4 days of metered data (47,420 queries, 39 warehouses)
Question
“Why is COMPUTE_WH so expensive?” — directed, four configurations, same turn budget
Cheap tier
Sonnet — there are no Haiku traces. The cheap cell is Sonnet throughout this series.
Cost basis
$3.00/credit Snowflake list rate, disclosed as an estimate
Seeded trial account · Every number from SQL or trace JSONL · Method-only public release
01 The question and the ground truth

43.86 credits, 98.92% idle. The agents didn’t know that yet.

Before any agent ran, the ground truth was established by hand against the DuckDB. COMPUTE_WH metered 43.86 credits over four days — 53.2% of the account’s warehouse credits. Of those, 0.4751 were attributed to actual query work by QUERY_ATTRIBUTION_HISTORY. The remainder — 98.92% idle — was a warehouse billing for being awake. The mechanism, established in Part 1: a notebook container service with AUTO_SUSPEND_SECS = 0 polling around the clock, 37,399 inter-query gaps on COMPUTE_WH, only 120 of them (0.32%) exceeding the 60-second auto-suspend window. The warehouse never slept.

That is the target. Four configurations were then asked the same question: “Why is COMPUTE_WH so expensive?” Skilled agents get the database schema, unit conventions, and five investigative playbooks. Naive agents get one sentence. The cheap tier is Sonnet; the frontier tier is Opus — except for one trace (212343) where the model was set to “default” and the exact model version was not recorded. That run is labeled skilled (frontier) in the table and throughout this note; calling it Opus would assert something the trace does not confirm.

02 The grid, cell by cell

Four cells. Three correct answers. One fluent wrong one.

Skilled-Sonnet (trace 213316): $0.1885, 15 tool calls, 248 seconds, 2 errors. Correct. The playbook’s opening move — compare metered credits to attributed credits, then identify the query types keeping the warehouse awake — is exactly the causal model needed. The cheap model didn’t need to derive how Snowflake billing works from first principles. It needed to execute a competent investigation of a well-understood domain, and it did. Minor unit wobbles in savings estimates appear in the trace, consistent with the model executing playbook steps rather than reasoning from scratch. The production consequence: for recurring, well-understood question types, this cell is the answer. Fifteen tool calls. Under $0.20.

Skilled (frontier) (trace 212343): $0.4934, 26 tool calls, 402 seconds, 13 errors. Correct: the idle burn was identified and the polling behaviour surfaced, but the culprit stayed vague. The skilled run’s playbook ends at “polling operations.” The frontier model executing that playbook stops where the playbook stops. It found the right answer more cheaply than the naive frontier run ($0.4934 vs $0.6992), but neither faster (402 s vs 343 s) nor more deeply. At 2.62× the cost of skilled-Sonnet ($0.4934 / $0.1885) and with 13 errors vs 2, this is the cell where you are paying for model capability and getting playbook execution.

Naive-Opus (trace 212903): $0.6992, 24 tool calls, 343 seconds, 8 errors. Correct, and the deepest root cause of any run. With no playbook to constrain its path, the frontier model did what no skilled run did: it pulled the most-repeated query texts and read them. There sat SYSTEM$NOTEBOOK_CONTAINER_PARAMS — the notebook-container heartbeat, one level deeper than my playbook’s vocabulary. It then proved the mechanism statistically: of 37,399 inter-query gaps, only 120 (0.32%) exceeded the 60-second suspend window. My playbook does not contain that gap analysis, because I had not thought to write it before this run. The unconstrained frontier run is how the playbooks grow.
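The gap analysis described above can be sketched in a few lines. This is a minimal illustration, not the query naive-Opus actually ran: the function name `gap_stats` is hypothetical, and the definition of a gap (time from the latest query end seen so far to the next query start, with overlapping queries producing no gap) is an assumption, since the trace's exact definition is not reproduced here.

```python
from datetime import datetime, timedelta


def gap_stats(intervals, suspend_window_s=60.0):
    """Count inter-query gaps and how many exceed the auto-suspend window.

    intervals: iterable of (start, end) datetimes, one pair per query.
    Assumption: a gap runs from the latest end seen so far to the next
    start; overlapping or back-to-back queries produce no gap.
    Returns (total_gaps, long_gaps, share_of_long_gaps).
    """
    ordered = sorted(intervals, key=lambda iv: iv[0])
    gaps = []
    latest_end = None
    for start, end in ordered:
        if latest_end is not None and start > latest_end:
            gaps.append((start - latest_end).total_seconds())
        latest_end = end if latest_end is None else max(latest_end, end)
    long_gaps = sum(1 for g in gaps if g > suspend_window_s)
    share = long_gaps / len(gaps) if gaps else 0.0
    return len(gaps), long_gaps, share


# Synthetic example: one-second queries starting at 0 s, 10 s, 20 s and 200 s.
t0 = datetime(2026, 5, 16)
demo = [(t0 + timedelta(seconds=s), t0 + timedelta(seconds=s + 1))
        for s in (0, 10, 20, 200)]
print(gap_stats(demo))  # 3 gaps, 1 of them longer than 60 s

# The share reported in this note: 120 of 37,399 gaps.
print(f"{120 / 37399:.2%}")  # 0.32%
```

A warehouse only suspends when a gap outlasts the window, so a small share of long gaps is what "never slept" looks like in the data.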

Naive-Sonnet (trace 214051): $0.5504, 29 tool calls, 471 seconds, 6 errors. Wrong. This is the cell to examine.

[Figure: 2×2 grid, cost vs correctness. Axes: preparation (skilled, naive) by model tier (frontier, Sonnet as the cheap tier). 4 traces, same question, same account. Skilled (frontier): $0.49, 26 calls, 402 s, 13 errors, correct with the culprit vague (trace 212343). Skilled-Sonnet: $0.19, 15 calls, 248 s, 2 errors, correct, cheapest and fastest (trace 213316). Naive-Opus: $0.70, 24 calls, 343 s, deepest root cause (trace 212903). Naive-Sonnet: $0.55, confident and wrong (trace 214051).]
Cost is not a proxy for correctness. Naive-Sonnet at $0.55 cost more than skilled-Sonnet at $0.19 and was wrong. Naive-Opus at $0.70 went deepest. The two skilled runs converged on the correct answer; their cost gap is the price of frontier model capability when the playbook already carries the causal model.
03 Three laws from four traces

What the playbook buys, what the model buys, and where they stop.

Law 1: Skills substitute for model capability on known questions. Skilled-Sonnet delivered essentially the same answer as the frontier skilled run at 38% of the cost ($0.1885 / $0.4934 = 38.2%). The playbook’s first move is the causal model: compare metered to attributed credits; a large gap means idle burn; then identify what keeps the warehouse awake. The cheap model does not need to derive that from domain knowledge it may or may not have internalized. It executes the investigation competently, including arithmetic the playbook never spells out. For recurring, known question types, the production implication is direct: you do not pay frontier rates. You pay Sonnet rates.
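The ratios quoted in this note can be re-derived from the trace figures in the results table. The sketch below is only a convenience for checking that arithmetic; the dictionary layout and the `cost_ratio` helper are illustrative, not part of the audit rig.

```python
# Trace figures copied from the results table in this note.
runs = {
    "skilled-sonnet":   {"cost": 0.1885, "calls": 15, "seconds": 248, "errors": 2},
    "skilled-frontier": {"cost": 0.4934, "calls": 26, "seconds": 402, "errors": 13},
    "naive-opus":       {"cost": 0.6992, "calls": 24, "seconds": 343, "errors": 8},
    "naive-sonnet":     {"cost": 0.5504, "calls": 29, "seconds": 471, "errors": 6},
}


def cost_ratio(a, b):
    """Cost of run a divided by cost of run b."""
    return runs[a]["cost"] / runs[b]["cost"]


print(f"{cost_ratio('skilled-sonnet', 'skilled-frontier'):.1%}")   # 38.2%
print(f"{cost_ratio('skilled-frontier', 'skilled-sonnet'):.2f}x")  # 2.62x

# Cost per tool call, a rough way to compare how each run spent its budget.
for name, run in runs.items():
    print(f"{name}: ${run['cost'] / run['calls']:.4f} per call")
```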

Law 2: Model capability substitutes for skills on unknown questions. Naive-Opus had no map so it read the territory. It pulled query texts and found the notebook heartbeat function. It ran a gap distribution the playbook had not imagined. It discovered the mechanism one level deeper than the playbook’s vocabulary. The uncomfortable symmetry: both skilled runs — cheap model and frontier model alike — stopped at exactly the same depth. The depth where the playbook stops. Skills encode the known frontier; they cap discovery there, because you cannot write playbooks beyond what you already know. The unconstrained frontier run is not a control group. It is how the playbooks get updated.

Law 3: Cheap-and-naive produces confident wrong analysis with no formatting tell. This is the one that matters for deployment decisions, so the next section covers it fully.

04 The confident wrong answer

Facts, no causal framework, no formatting tell.

Naive-Sonnet found real facts. The overwhelming majority of queries on COMPUTE_WH were repeats. Metadata operations dominated. It even speculated “likely from notebooks or BI tools.” It brushed against the right answer. Then it assembled those facts into a wrong causal model: each query incurs a minimum credit cost; volume is the main driver; large scans are secondary. Its recommendations — query caching, WHERE clauses — would have consumed an engineering sprint while the warehouse kept burning.

The damning detail is in the trace. Naive-Sonnet ran zero queries against the credit-attribution column. It never compared metered credits to attributed credits. It never computed the 98.92% idle figure. It invented a billing model where query volume costs money, because it lacked the domain knowledge to arrange the facts it found — and nobody had supplied it.

Note what this failure is not. It is not hallucination. It is not incompetence at SQL. Every number in its report is real. The failure is facts without a causal framework, delivered with the same fluent confidence and the same formatting as the correct answers in the other three cells. There is no hedging tell. There is no low-confidence qualifier. The only way to catch it is to know the domain well enough to check the claim — or to read the trace.

The trace is the only witness that doesn’t lie. Naive-Sonnet’s trace shows 29 tool calls and not one of them touches CREDITS_ATTRIBUTED_COMPUTE. Every dispute about what an agent actually did reduces to this: grep the JSONL.
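“Grep the JSONL” can be as simple as the sketch below. The trace schema is not documented in this note, so the function is deliberately schema-agnostic: it searches the raw text of each record rather than assuming field names. The function name `lines_mentioning` and the command-line usage are illustrative assumptions.

```python
import json
import sys
from pathlib import Path


def lines_mentioning(trace_path, needle="CREDITS_ATTRIBUTED_COMPUTE"):
    """Return (line_number, record) for each JSONL record that mentions needle.

    The match is a case-insensitive search over the raw line, so it makes
    no assumption about which field holds the SQL text. record is the
    parsed JSON object, or None if the line is not valid JSON.
    """
    hits = []
    with Path(trace_path).open(encoding="utf-8") as fh:
        for lineno, raw in enumerate(fh, start=1):
            raw = raw.strip()
            if not raw or needle.lower() not in raw.lower():
                continue
            try:
                record = json.loads(raw)
            except json.JSONDecodeError:
                record = None
            hits.append((lineno, record))
    return hits


if __name__ == "__main__":
    # Usage: python grep_trace.py <trace.jsonl> [<trace.jsonl> ...]
    for path in sys.argv[1:]:
        hits = lines_mentioning(path)
        print(f"{path}: {len(hits)} record(s) mention the attribution column")
```

A hit is only a candidate: the column name could appear in the model's prose rather than in a query, so matching records still need to be read. Zero hits is the stronger signal, because a run that never names the column cannot have queried it.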
Two disciplines this experiment does not relax
The model never produces a number. Every credit figure, every percentage, every gap count in this note comes from a SQL query against the DuckDB — or from the trace JSONL itself. The model reads numbers; it does not compute the ones it reports. When the naive-Sonnet analysis says “queries cost credits,” that claim is wrong — not because the model invented a number, but because it assembled real numbers into the wrong model. The discipline catches the second kind of error only if a person checks the reasoning, not just the numbers.
A human ratifies everything. The 98.92% idle figure is not in the note because an agent asserted it. It is there because a query against QUERY_ATTRIBUTION_HISTORY returned 0.4751 attributed credits, and 1 − (0.4751 / 43.86) = 0.9892. The agent read that number; a person checked the arithmetic and confirmed the denominator. That step is not optional. Law 3 exists precisely because it is skippable.
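The ratification step for the idle figure is small enough to write down. This is a sketch of the check, assuming the two inputs come straight from SQL results; the names `idle_fraction` and `ratify` are hypothetical and the tolerance is a choice, not a rule from the audit method.

```python
def idle_fraction(metered, attributed):
    """Share of metered credits not attributed to query work."""
    if metered <= 0:
        raise ValueError("metered credits must be positive")
    return 1.0 - attributed / metered


def ratify(reported, metered, attributed, places=4):
    """Recompute the idle fraction and compare it with the reported figure.

    Returns (matches, recomputed). A match means the reported value agrees
    with the recomputed one once both are rounded to `places` decimals.
    """
    recomputed = idle_fraction(metered, attributed)
    tolerance = 0.5 * 10 ** -places
    return abs(recomputed - reported) < tolerance, recomputed


ok, value = ratify(0.9892, metered=43.86, attributed=0.4751)
print(ok, round(value, 4))  # True 0.9892
```

The function only confirms the arithmetic. Confirming that 43.86 is the right denominator still takes a person who knows which table and window it came from.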

A frame that made this concrete for several colleagues: the playbook is a senior engineer’s knowledge, written down. Skilled-Sonnet is a junior analyst with an excellent runbook — fast, cheap, reliable inside the runbook’s borders. Naive-Opus is a brilliant outside consultant with no context — expensive, slower, occasionally finds the thing nobody internal could. Naive-Sonnet is a junior analyst with no runbook working a domain they do not understand — and the analysis looks exactly like the correct one. Nobody staffs the last configuration knowingly. The grid says: with agents, you can deploy it by accident, and the per-run cost will tempt you to.

05 What the grid leaves open

A directed question is the easy case. Two harder ones remained.

The directed-question setup is a controlled test — there is a known answer to converge on, and “correct” has a crisp definition. Real audits rarely work this way. Two harder questions followed this grid and each one changed what I thought I knew.

The first: what happens on a fully open-ended sweep (“find everything”), where there is no single answer and coverage matters more than depth? The second: how stable are these results run to run? On a directed question, the agents converged. On an open sweep, they did not. And on the stability question, I was wrong about something I had written down in my own analysis. The sweep is in Part 3; the stability question is in Part 4.

One open verification item from this grid: the trace labeled 212343 ran with --model default. What model “default” resolved to at runtime is not recorded in the JSONL. The cost ($0.4934 for 26 calls, 402 seconds) is consistent with a frontier-tier model, but the exact model ID is unknown. That cell is labeled “skilled (frontier)” throughout this series rather than “skilled-Opus,” and the distinction matters when the series reaches the economics in Part 5.

Not in this post
  • The open-ended sweep results. Covered in Part 3: which cells dominated, which findings appeared in only one, and what the union of all analysts — human included — got wrong.
  • Run-to-run variance. The directed question is stable. The open sweep is not. Part 4 covers what changed between two naive-Opus sweeps and why the trace settled the disagreement.
  • The full economics. Skilled-Sonnet at $0.19 per question, skilled-Opus audit at $1.77, the idle burn run-rate in dollars. Part 5.
From our audit
Naive-Sonnet’s $0.55 run produced a confident, well-formatted wrong answer. The only way to catch it was reading the trace — or knowing the domain. The Crosshire audit method supplies the domain knowledge in the playbook and the human check in the sign-off. If you want to know what a skilled agent finds in your account, the same rig runs on your own export.

Numbers in this note come from a seeded Crosshire trial Snowflake account — not a customer — snapshot 16–20 May 2026, approximately four days of metered data, re-derived against the same DuckDB the agents queried. Trace figures (cost, tool calls, duration, error counts) come from the nine on-disk JSONL files. Credit-to-dollar figures use the $3.00/credit Snowflake list rate, disclosed as an estimate. The model read every number; a human ratified every one before it shipped. — Crosshire

D
writes Crosshire Journal · crosshire.ch · June 2026
Crosshire Journal
Field reports on data, compute, and the unglamorous decisions that shape engineering teams. Made in EU. Cited evidence, GDPR-native.
Tech·Agentic analytics·Part 2 of 5·2 min read·20 Jun 2026

$0.19 correct. $0.55 confident. Wrong.

Four agents, one Snowflake question, same account. The cheapest correct answer cost $0.19. The confident wrong answer cost $0.55 — and its trace contains zero queries against the attribution column.

$0.19
skilled-Sonnet · correct, fastest
naive-Sonnet: $0.55 · confident · wrong
0 attribution queries in the naive-Sonnet trace
[Chart: cost per run, directed question, same account. Skilled-Sonnet $0.19 (correct); Skilled (frontier) $0.49 (correct); Naive-Sonnet $0.55 (wrong); Naive-Opus $0.70 (deepest).]
Cost is not a proxy for correctness. Naive-Sonnet cost more than skilled-Sonnet and was wrong. Naive-Opus cost the most and went deepest.
Provenance · what happened

1. Ground truth established by hand against the DuckDB: COMPUTE_WH metered 43.86 credits over four days, of which 0.4751 were attributed to query work by QUERY_ATTRIBUTION_HISTORY. That is 98.92% idle.

2. Four configurations asked the same directed question: “Why is COMPUTE_WH so expensive?” Skilled agents carried the database schema and five investigative playbooks; naive agents received one sentence and no schema. Three of four found idle burn.

3. Skilled-Sonnet answered correctly for $0.19 in 15 tool calls. Naive-Sonnet cost $0.55 and produced a fluent, well-formatted answer, yet its trace shows zero queries against CREDITS_ATTRIBUTED_COMPUTE. It never computed the idle figure.

01 The problem

One question. Four answers.

The question: “Why is COMPUTE_WH so expensive?” Ground truth: 43.86 credits metered, 98.92% idle. Three of four agents found idle burn. Naive-Sonnet wrote a confident, well-formatted answer built on the wrong model: that query volume, rather than idle time, drives cost. Its recommendations would have sent engineers chasing cache hit rates while the warehouse kept billing for being awake.

02 Why it matters

No formatting tell. No hedging tell.

Naive-Sonnet’s output reads identically to skilled-Sonnet’s. Cheap-and-naive is the configuration you can accidentally deploy because it looks correct and it is cheap.

03 Check your own account

Metered vs attributed. One query.

The query naive-Sonnet never ran. Run this on your own account — any warehouse with a high pct_idle is billing for being awake, not working.

Idle vs working, per warehouse (ACCOUNT_USAGE, SQL)
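-- Metered credits per warehouse over the last 30 days.
-- CREDITS_USED includes cloud-services credits; CREDITS_USED_COMPUTE is the compute-only column.
WITH metered AS (
  SELECT WAREHOUSE_NAME, SUM(CREDITS_USED) AS credits
  FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
  WHERE START_TIME >= DATEADD('day', -30, CURRENT_TIMESTAMP())
  GROUP BY 1
),
-- Credits attributed to query execution over the same window.
attributed AS (
  SELECT WAREHOUSE_NAME, SUM(CREDITS_ATTRIBUTED_COMPUTE) AS credits
  FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_ATTRIBUTION_HISTORY
  WHERE START_TIME >= DATEADD('day', -30, CURRENT_TIMESTAMP())
  GROUP BY 1
)
-- pct_idle: share of metered credits not attributed to any query.
-- Warehouses with no attributed rows count as 100% idle.
SELECT m.WAREHOUSE_NAME,
       ROUND(m.credits, 2) AS metered_credits,
       ROUND(COALESCE(a.credits, 0), 2) AS query_credits,
       ROUND(100 * (1 - COALESCE(a.credits, 0) / NULLIF(m.credits, 0)), 1) AS pct_idle
FROM metered m
LEFT JOIN attributed a USING (WAREHOUSE_NAME)
ORDER BY metered_credits DESC;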
Want the full traces?
The long version adds three things this short can’t.
  • All four cells, line by line. What each configuration queried and why skilled-frontier hit 13 errors vs skilled-Sonnet’s 2.
  • Three laws, grounded in data. Where playbook and model substitute for each other — and where they stop.
  • The gap analysis naive-Opus discovered. 37,399 gaps, 120 exceeding 60 s (0.32%) — the finding no playbook run reached.
Two-minute field fixes from the same audits as our long-form Journal. One number, one fix, one result you can verify.
Crosshire Quick
© 2026 Crosshire Journal · Made in EU