Skills vs model capability: the 2×2, measured.
One directed question — “Why is COMPUTE_WH so expensive?” — sent to four agent configurations. Ground truth first: 43.86 credits metered, 98.92% idle, 0.4751 attributed to query work. The cheapest correct answer cost $0.19. The confident wrong answer cost $0.55 and contained zero queries against the attribution column. Three laws fall out of the traces — and the third one is the one that keeps you up at night.
| Configuration | Cost | Tool calls | Duration | Errors | Result |
|---|---|---|---|---|---|
| Skilled-Sonnet (trace 213316) | $0.1885 | 15 | 248 s | 2 | Correct — cheapest, fastest |
| Skilled (frontier) (trace 212343) | $0.4934 | 26 | 402 s | 13 | Correct — culprit vague |
| Naive-Opus (trace 212903) | $0.6992 | 24 | 343 s | 8 | Correct & deepest root cause |
| Naive-Sonnet (trace 214051) | $0.5504 | 29 | 471 s | 6 | Confident — wrong |
- The question and the ground truth
- The grid, cell by cell
- Three laws from four traces
- The confident wrong answer
- What the grid leaves open
- The $15 lab: the anchor finding and the 2×2 design
- Skills vs model: the directed question and three laws
- Union beats everyone: the open-ended sweep (soon)
- Variance and budget death: traces as referee (soon)
- The operating model: prompt, tiers, economics (soon)
- Account: a seeded Crosshire trial Snowflake account, not a customer
- Window: 16–20 May 2026, ~4 days of metered data (47,420 queries, 39 warehouses)
- Question: “Why is COMPUTE_WH so expensive?” Directed, four configurations, same turn budget
- Cheap tier: Sonnet. There are no Haiku traces; the cheap cell is Sonnet throughout this series.
- Cost basis: $3.00/credit Snowflake list rate, disclosed as an estimate
43.86 credits, 98.92% idle. The agents didn’t know that yet.
Before any agent ran, the ground truth was established by hand against
the DuckDB. COMPUTE_WH metered 43.86 credits over four
days — 53.2% of the account’s warehouse credits. Of those,
0.4751 were attributed to actual query work by
QUERY_ATTRIBUTION_HISTORY. The remainder —
98.92% idle — was a warehouse billing for being
awake. The mechanism, established in Part 1: a notebook container service
with AUTO_SUSPEND_SECS = 0 polling around the clock, 37,399
inter-query gaps on COMPUTE_WH, only 120 of them (0.32%) exceeding the
60-second auto-suspend window. The warehouse never slept.
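The two headline percentages can be re-derived from the figures quoted above. This is a minimal sketch that only repeats the arithmetic; the constant and function names are illustrative and not taken from the lab code.

```python
# Figures quoted in the text; nothing here is newly measured.
METERED_CREDITS = 43.86       # COMPUTE_WH, from WAREHOUSE_METERING_HISTORY
ATTRIBUTED_CREDITS = 0.4751   # COMPUTE_WH, from QUERY_ATTRIBUTION_HISTORY
TOTAL_GAPS = 37_399           # inter-query gaps on COMPUTE_WH
LONG_GAPS = 120               # gaps longer than the 60-second suspend window


def idle_fraction(metered: float, attributed: float) -> float:
    """Share of metered credits that no query accounts for."""
    if metered <= 0:
        raise ValueError("metered credits must be positive")
    return 1.0 - attributed / metered


idle = idle_fraction(METERED_CREDITS, ATTRIBUTED_CREDITS)
long_gap_share = LONG_GAPS / TOTAL_GAPS

print(f"idle share:     {idle:.2%}")
print(f"long-gap share: {long_gap_share:.2%}")
```

Both values round to the figures in the text: 98.92% idle, and 0.32% of gaps long enough for the warehouse to suspend.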
That is the target. Four configurations were then asked the same question: “Why is COMPUTE_WH so expensive?” Skilled agents get the database schema, unit conventions, and five investigative playbooks. Naive agents get one sentence. The cheap tier is Sonnet; the frontier tier is Opus — except for one trace (212343) where the model was set to “default” and the exact model version was not recorded. That run is labeled skilled (frontier) in the table and throughout this note; calling it Opus would assert something the trace does not confirm.
Four cells. Three correct answers. One fluent wrong one.
Skilled-Sonnet (trace 213316): $0.1885, 15 tool calls, 248 seconds, 2 errors. Correct. The playbook’s opening move — compare metered credits to attributed credits, then identify the query types keeping the warehouse awake — is exactly the causal model needed. The cheap model didn’t need to derive how Snowflake billing works from first principles. It needed to execute a competent investigation of a well-understood domain, and it did. Minor unit wobbles in savings estimates appear in the trace, consistent with the model executing playbook steps rather than reasoning from scratch. The production consequence: for recurring, well-understood question types, this cell is the answer. Fifteen tool calls. Under $0.20.
Skilled (frontier) (trace 212343): $0.4934, 26 tool calls, 402 seconds, 13 errors. Correct: the idle burn identified, the polling behaviour surfaced, but the culprit stayed vague. The skilled run’s playbook ends at “polling operations.” The frontier model executing that playbook stops where the playbook stops. It found the right answer more cheaply than the naive frontier run ($0.4934 vs $0.6992), but not faster (402 s vs 343 s) and not more deeply. At 2.62× the cost of skilled-Sonnet ($0.4934 / $0.1885) and with 13 errors vs 2, this is the cell where you are paying for model capability and getting playbook execution.
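The cost ratios quoted for these cells can be checked directly against the results table. A small sketch, with the four rows copied from that table; the `Run` class and the dictionary keys are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    label: str
    cost_usd: float
    tool_calls: int
    seconds: int
    errors: int


# Rows copied from the results table.
RUNS = {
    "skilled-sonnet": Run("Skilled-Sonnet", 0.1885, 15, 248, 2),
    "skilled-frontier": Run("Skilled (frontier)", 0.4934, 26, 402, 13),
    "naive-opus": Run("Naive-Opus", 0.6992, 24, 343, 8),
    "naive-sonnet": Run("Naive-Sonnet", 0.5504, 29, 471, 6),
}

baseline = RUNS["skilled-sonnet"]
for run in RUNS.values():
    cost_ratio = run.cost_usd / baseline.cost_usd
    time_ratio = run.seconds / baseline.seconds
    print(f"{run.label:<20} cost {cost_ratio:4.2f}x  time {time_ratio:4.2f}x")
```

Using skilled-Sonnet as the baseline makes every other cell read as a multiple of the cheapest correct answer.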
Naive-Opus (trace 212903): $0.6992, 24 tool calls, 343 seconds, 8 errors.
Correct, and the deepest root cause of any run. With no playbook to constrain its
path, the frontier model did what no skilled run did: it pulled the most-repeated
query texts and read them. There sat
SYSTEM$NOTEBOOK_CONTAINER_PARAMS — the notebook-container
heartbeat, one level deeper than my playbook’s vocabulary. It then proved
the mechanism statistically: of 37,399 inter-query gaps, only 120 (0.32%) exceeded
the 60-second suspend window. My playbook does not contain that gap analysis,
because I had not thought to write it before this run. The unconstrained frontier
run is how the playbooks grow.
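The gap analysis described above can be sketched in a few lines. This is not the query naive-Opus ran, which is not reproduced in this note. It assumes a gap runs from the latest query end seen so far to the next query start, treats overlapping queries as a zero-length gap, and uses synthetic timestamps.

```python
from datetime import datetime, timedelta

SUSPEND_WINDOW = timedelta(seconds=60)  # auto-suspend window quoted in the text


def inter_query_gaps(intervals):
    """Return idle gaps between consecutive queries on one warehouse.

    `intervals` is an iterable of (start, end) datetimes.
    Assumption: a gap runs from the latest end seen so far to the next start;
    overlapping queries contribute a zero-length gap.
    """
    ordered = sorted(intervals, key=lambda iv: iv[0])
    gaps = []
    latest_end = None
    for start, end in ordered:
        if latest_end is not None:
            gaps.append(max(start - latest_end, timedelta(0)))
        latest_end = end if latest_end is None else max(latest_end, end)
    return gaps


def share_exceeding(gaps, window=SUSPEND_WINDOW):
    """Fraction of gaps strictly longer than the suspend window."""
    if not gaps:
        return 0.0
    return sum(1 for g in gaps if g > window) / len(gaps)


# Synthetic data: a one-second heartbeat every 20 s, then one 5-minute pause.
t0 = datetime(2026, 5, 16, 0, 0, 0)
queries = [
    (t0 + timedelta(seconds=20 * i), t0 + timedelta(seconds=20 * i + 1))
    for i in range(10)
]
late_start = t0 + timedelta(seconds=20 * 9 + 1 + 300)
queries.append((late_start, late_start + timedelta(seconds=1)))

gaps = inter_query_gaps(queries)
print(len(gaps), share_exceeding(gaps))
```

A warehouse can only suspend during a gap longer than the window, so a small share here means it is almost never allowed to sleep.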
Naive-Sonnet (trace 214051): $0.5504, 29 tool calls, 471 seconds, 6 errors. Wrong. This is the cell to examine.
What the playbook buys, what the model buys, and where they stop.
Law 1: Skills substitute for model capability on known questions. Skilled-Sonnet delivered essentially the same answer as the frontier skilled run at 38% of the cost ($0.1885 / $0.4934 = 38.2%). The playbook’s first move is the causal model: compare metered to attributed credits; a large gap means idle burn; then identify what keeps the warehouse awake. The cheap model does not need to derive that from domain knowledge it may or may not have internalized. It executes the investigation competently, including arithmetic the playbook never spells out. For recurring, known question types, the production implication is direct: you do not pay frontier rates. You pay Sonnet rates.
Law 2: Model capability substitutes for skills on unknown questions. Naive-Opus had no map, so it read the territory. It pulled query texts and found the notebook heartbeat function. It ran a gap distribution the playbook had not imagined. It discovered the mechanism one level deeper than the playbook’s vocabulary. The uncomfortable symmetry: both skilled runs, cheap model and frontier model alike, stopped at exactly the same depth. The depth where the playbook stops. Skills encode the known frontier; they cap discovery there, because you cannot write playbooks beyond what you already know. The unconstrained frontier run is not a control group. It is how the playbooks get updated.
Law 3: Cheap-and-naive produces confident wrong analysis with no formatting tell. This is the one that matters for deployment decisions, so the next section covers it fully.
Facts, no causal framework, no formatting tell.
Naive-Sonnet found real facts. The overwhelming majority of queries on COMPUTE_WH were repeats. Metadata operations dominated. It even speculated “likely from notebooks or BI tools.” It brushed against the right answer. Then it assembled those facts into a wrong causal model: each query incurs a minimum credit cost; volume is the main driver; large scans are secondary. Its recommendations — query caching, WHERE clauses — would have consumed an engineering sprint while the warehouse kept burning.
The damning detail is in the trace. Naive-Sonnet ran zero queries against the credit-attribution column. It never compared metered credits to attributed credits. It never computed the 98.92% idle figure. It invented a billing model where query volume costs money, because it lacked the domain knowledge to arrange the facts it found — and nobody had supplied it.
Note what this failure is not. It is not hallucination. It is not incompetence at SQL. Every number in its report is real. The failure is facts without a causal framework, delivered with the same fluent confidence and the same formatting as the correct answers in the other three cells. There is no hedging tell. There is no low-confidence qualifier. The only way to catch it is to know the domain well enough to check the claim — or to read the trace.
The trace is the only witness that doesn’t lie. Naive-Sonnet’s trace
shows 29 tool calls and not one of them touches
CREDITS_ATTRIBUTED_COMPUTE. Every dispute about what an agent
actually did reduces to this: grep the JSONL.
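One way to run that check without knowing the trace layout is to search every record as text. A sketch only: the JSONL schema is not shown in this note, so the `type`, `sql`, and `text` fields in the sample records are hypothetical, and the function deliberately avoids depending on them.

```python
import json

COLUMN = "CREDITS_ATTRIBUTED_COMPUTE"


def count_mentions(lines, needle=COLUMN):
    """Count JSONL records that mention `needle`, case-insensitively.

    Each record is parsed and re-serialised, so the search covers every field
    without assuming where tool-call arguments live in the schema.
    """
    hits = 0
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed lines are skipped, not counted
        if needle.lower() in json.dumps(record).lower():
            hits += 1
    return hits


# Hypothetical records for illustration only.
sample = [
    '{"type": "tool_call", "sql": "SELECT COUNT(*) FROM query_history"}',
    '{"type": "tool_call", "sql": "SELECT SUM(credits_attributed_compute) '
    'FROM query_attribution_history"}',
    '{"type": "message", "text": "Query volume is the main driver."}',
    "",
]
print(count_mentions(sample))
```

Because the function accepts any iterable of lines, an open file handle for a trace works the same way as the in-memory sample. A count of zero is the signature described above.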
QUERY_ATTRIBUTION_HISTORY returned 0.4751 attributed credits, and 1 − (0.4751 / 43.86) = 0.9892. The agent read that number; a person checked the arithmetic and confirmed the denominator. That step is not optional. Law 3 exists precisely because it is skippable.
A frame that made this concrete for several colleagues: the playbook is a senior engineer’s knowledge, written down. Skilled-Sonnet is a junior analyst with an excellent runbook: fast, cheap, reliable inside the runbook’s borders. Naive-Opus is a brilliant outside consultant with no context: expensive, slower, occasionally finds the thing nobody internal could. Naive-Sonnet is a junior analyst with no runbook working a domain they do not understand, and the analysis looks exactly like the correct one. Nobody staffs the last configuration knowingly. The grid says: with agents, you can deploy it by accident, and the per-run cost will tempt you to.
A directed question is the easy case. Two harder ones remained.
The directed-question setup is a controlled test — there is a known answer to converge on, and “correct” has a crisp definition. Real audits rarely work this way. Two harder questions followed this grid and each one changed what I thought I knew.
The first: what happens on a fully open-ended sweep — “find everything” — where there is no single answer and coverage matters more than depth? The second: how stable are these results run to run? On a directed question, the agents converged. On an open sweep, they did not. And on the stability question, I was wrong about something I had written down in my own analysis. Both of those are in Part 3.
One open verification item from this grid: the trace labeled 212343 ran with
--model default. What model “default” resolved to at
runtime is not recorded in the JSONL. The cost ($0.4934 for 26 calls, 402 seconds)
is consistent with a frontier-tier model, but the exact model ID is unknown. That
cell is labeled “skilled (frontier)” throughout this series rather than
“skilled-Opus,” and the distinction matters when the series reaches the
economics in Part 5.
- The open-ended sweep results. Covered in Part 3: which cells dominated, which findings appeared in only one, and what the union of all analysts — human included — got wrong.
- Run-to-run variance. The directed question is stable. The open sweep is not. Part 4 covers what changed between two naive-Opus sweeps and why the trace settled the disagreement.
- The full economics. Skilled-Sonnet at $0.19 per question, skilled-Opus audit at $1.77, the idle burn run-rate in dollars. Part 5.
- Crosshire Journal · The $15 lab: what four AI agents found in one Snowflake account — Part 1: the lab design, the anchor finding, the 2×2 setup
- Crosshire Journal · Your Snowflake login count is probably lying to you — the zero-MFA and DATABRICKS_READER signals this account also raised
- Crosshire Journal · Adaptive vs classic warehouses — how the two billing models differ; the adaptive warehouses in this account appear in Part 3
- Snowflake docs · QUERY_ATTRIBUTION_HISTORY — the table naive-Sonnet never queried; the column that proves idle burn
- Snowflake docs · WAREHOUSE_METERING_HISTORY — metered credits per warehouse; the denominator in the idle-pct calculation
Numbers in this note come from a seeded Crosshire trial Snowflake account — not a customer — snapshot 16–20 May 2026, approximately four days of metered data, re-derived against the same DuckDB the agents queried. Trace figures (cost, tool calls, duration, error counts) come from the nine on-disk JSONL files. Credit-to-dollar figures use the $3.00/credit Snowflake list rate, disclosed as an estimate. The model read every number; a human ratified every one before it shipped. — Crosshire
$0.19 correct. $0.55 confident. Wrong.
Four agents, one Snowflake question, same account. The cheapest correct answer cost $0.19. The confident wrong answer cost $0.55 — and its trace contains zero queries against the attribution column.
1. Ground truth established by hand against the DuckDB: COMPUTE_WH metered 43.86 credits over four days, 0.4751 attributed to query work by QUERY_ATTRIBUTION_HISTORY, so 98.92% idle.
2. Four configurations asked the same directed question: “Why is COMPUTE_WH so expensive?” Skilled agents carried the database schema and five investigative playbooks; naive agents received one sentence and no schema. Three of four found idle burn.
3. Skilled-Sonnet answered correctly for $0.19 in 15 tool calls. Naive-Sonnet cost $0.55 and produced a fluent, well-formatted answer, yet its trace shows zero queries against CREDITS_ATTRIBUTED_COMPUTE. It never computed the idle figure.
One question. Four answers.
The question: “Why is COMPUTE_WH so expensive?” Ground truth: 43.86 credits metered, 98.92% idle. Three of four agents found idle burn. Naive-Sonnet wrote a confident, well-formatted answer built on the wrong model — query volume drives cost, not idle time. Its recommendations would have sent engineers chasing cache hit rates while the warehouse kept billing for being awake.
No formatting tell. No hedging tell.
Naive-Sonnet’s output reads identically to skilled-Sonnet’s. Cheap-and-naive is the configuration you can accidentally deploy because it looks correct and it is cheap.
Metered vs attributed. One query.
The query naive-Sonnet never ran. Run this on your own account — any warehouse
with a high pct_idle is billing for being awake, not working.
    -- Credits billed per warehouse over the last 30 days.
    WITH metered AS (
      SELECT WAREHOUSE_NAME, SUM(CREDITS_USED) AS credits
      FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
      WHERE START_TIME >= DATEADD('day', -30, CURRENT_TIMESTAMP())
      GROUP BY 1
    ),
    -- Credits attributed to actual query execution over the same window.
    attributed AS (
      SELECT WAREHOUSE_NAME, SUM(CREDITS_ATTRIBUTED_COMPUTE) AS credits
      FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_ATTRIBUTION_HISTORY
      WHERE START_TIME >= DATEADD('day', -30, CURRENT_TIMESTAMP())
      GROUP BY 1
    )
    -- pct_idle is the share of billed credits that no query accounts for.
    SELECT
      m.WAREHOUSE_NAME,
      ROUND(m.credits, 2) AS metered_credits,
      ROUND(COALESCE(a.credits, 0), 2) AS query_credits,
      ROUND(100 * (1 - COALESCE(a.credits, 0) / NULLIF(m.credits, 0)), 1) AS pct_idle
    FROM metered m
    LEFT JOIN attributed a USING (WAREHOUSE_NAME)
    ORDER BY metered_credits DESC;
- All four cells, line by line. What each configuration queried and why skilled-frontier hit 13 errors vs skilled-Sonnet’s 2.
- Three laws, grounded in data. Where playbook and model substitute for each other — and where they stop.
- The gap analysis naive-Opus discovered. 37,399 gaps, 120 exceeding 60 s (0.32%) — the finding no playbook run reached.
- Snowflake docs · QUERY_ATTRIBUTION_HISTORY: the table naive-Sonnet never queried; CREDITS_ATTRIBUTED_COMPUTE is the column that proves idle burn
- Crosshire Journal · The $15 lab: what four AI agents found in one Snowflake account. Part 1: the lab design, the anchor finding, and the 2×2 setup this experiment runs on