The $15 lab: what four AI agents found in one Snowflake account.
A notebook I forgot to close burned 43.86 credits on one
warehouse in four days — 98.9% idle. An AI agent found
it for $1.77 and named the function responsible, a
notebook-container heartbeat called
SYSTEM$NOTEBOOK_CONTAINER_PARAMS. That finding is the
least interesting thing here. The lab around it cost under $15, and
what it measured contradicts both loud opinions about agentic
analysis: “you need a framework” and “just prompt a frontier
model.”
98.9% idle · COMPUTE_WH · 4 days
43.86 credits metered, 0.47 of real query work
| COMPUTE_WH, 4 days | Credits |
|---|---|
| Metered (billed) | 43.86 |
| Attributed to query work | 0.47 |
| Idle (unattributed) — 98.9% | 43.39 |
- A notebook nobody closed, found for $1.77
- The finding, with receipts
- Why the cheapness is the method
- The four components
- The experimental variable, and the 2×2
- The $15 lab: the anchor finding and the 2×2 design
- Skills vs model: the directed question and three laws (soon)
- Union beats everyone: the open-ended sweep (soon)
- Variance and budget death: traces as referee (soon)
- The operating model: prompt, tiers, economics (soon)
- Account: a seeded Crosshire trial Snowflake account, not a customer
- Window: 16–20 May 2026, ~4 days of metered data (47,420 queries, 39 warehouses)
- Lab: 17 ACCOUNT_USAGE CSVs → one local DuckDB → four agents on the Claude Agent SDK
- Sources: WAREHOUSE_METERING_HISTORY · QUERY_ATTRIBUTION_HISTORY · QUERY_HISTORY · SERVICES · nine on-disk trace JSONLs
- Cost basis: $3.00/credit Snowflake list rate, disclosed as an estimate
An agent found it for $1.77. I never did.
A notebook I forgot to close burned 43.86 credits on
COMPUTE_WH in four days — 98.9% idle,
0.47 credits of actual query work. A skilled frontier agent found it,
named the function doing the polling
(SYSTEM$NOTEBOOK_CONTAINER_PARAMS, a notebook-container
heartbeat), and wrote it up for $1.77 in a 14-minute
run. I had been staring at the same account for a week and had not
seen it.
That discovery is the least interesting thing in this series. The lab around it is. The same investigation ran through four agent configurations — two model tiers, two levels of preparation — with every tool call traced and every claim adjudicated against SQL. The result contradicted both loud opinions in this space and converged on a third one neither camp emphasises: what matters is who carries the causal model — the model, or the playbook.
A warehouse bills for being awake, not for working.
The invoice shows one number per warehouse. It does not show you how much of that number was a warehouse doing nothing. The table at the top of this report splits COMPUTE_WH over the four-day window: 43.86 credits metered, 0.47 attributed to query work, the rest — 98.9% — idle.
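The split in that table is plain arithmetic, and it is worth seeing how little it takes. The sketch below recomputes it from the two figures quoted above; the function name `idle_split` is mine, not something from the lab's code, and the dollar line uses the $3.00/credit list rate that this note discloses as an estimate.

```python
LIST_RATE_USD = 3.00  # disclosed estimate from this note, not a contract rate


def idle_split(metered: float, attributed: float) -> tuple[float, float]:
    """Return (idle_credits, pct_idle) for one warehouse.

    metered    : credits billed for the warehouse being awake
    attributed : credits attributed to actual query work
    """
    if metered <= 0:
        raise ValueError("metered credits must be positive")
    idle = metered - attributed
    return round(idle, 2), round(100 * idle / metered, 1)


# COMPUTE_WH over the four-day window, figures from the table above.
idle_credits, pct_idle = idle_split(metered=43.86, attributed=0.47)
idle_usd = round(idle_credits * LIST_RATE_USD, 2)

print(idle_credits, pct_idle, idle_usd)  # 43.39 98.9 130.17
```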
COMPUTE_WH is an X-Small with AUTO_SUSPEND = 60s, and
it was the only warehouse still awake when the snapshot was taken.
The reason is one object: a notebook container service,
DARSHANSINGHCROSSHIRE_SERVICE_1, configured with
AUTO_SUSPEND_SECS = 0 — it never suspends. For
four days it polled around the clock: 3,982 calls
to SYSTEM$NOTEBOOK_CONTAINER_PARAMS and
18,683 file-poll operations
(GET_FILES plus LIST_FILES), at roughly
300–400 queries an hour, every hour, overnight included.
That cadence quietly defeats the 60-second auto-suspend. Of the 37,399 inter-query gaps on COMPUTE_WH, only 120 (0.32%) were longer than 60 seconds. The notebook polled faster than the warehouse could fall asleep, so it billed 53 hours awake for 0.47 credits of real work. That one warehouse is 53.2% of the account’s warehouse credits, about half the account’s compute, spent on a heartbeat. Priced at the $3.00/credit list rate (a disclosed estimate), that idle burn compounds for as long as the notebook is left running; the recurring waste is the subject of Part 5.
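For readers who want to run the same gap check on their own query history, here is a minimal sketch. It assumes a gap is measured from the latest query end seen so far to the next query start on the same warehouse. That definition is my assumption, the function name `count_long_gaps` is hypothetical, and the four toy intervals are invented purely to exercise the code.

```python
from datetime import datetime, timedelta


def count_long_gaps(intervals, threshold=timedelta(seconds=60)):
    """Return (total_gaps, gaps_longer_than_threshold) for one warehouse.

    `intervals` is an iterable of (start, end) datetimes. Tracking the
    latest end seen so far means overlapping queries do not create
    phantom gaps.
    """
    total = long_gaps = 0
    busy_until = None
    for start, end in sorted(intervals):
        if busy_until is not None:
            total += 1
            if start - busy_until > threshold:
                long_gaps += 1
        busy_until = end if busy_until is None else max(busy_until, end)
    return total, long_gaps


# Toy data (invented): three short gaps of polling, then one long pause.
t0 = datetime(2026, 5, 16)
sec = lambda n: t0 + timedelta(seconds=n)
toy = [(sec(0), sec(2)), (sec(10), sec(11)), (sec(20), sec(21)), (sec(200), sec(201))]
print(count_long_gaps(toy))  # (3, 1)

# The share quoted in the text, recomputed from its own two counts.
print(round(100 * 120 / 37_399, 2))  # 0.32
```

A warehouse with a 60-second auto-suspend can only fall asleep inside one of the long gaps, which is why the count of gaps over the threshold is the number to look at.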
Every number in that table is real, and the warehouse was never broken. It did exactly what an always-on container tells it to do. The bill is one number; idle is not a line on it.
If reproducing it costs $15, people reproduce it.
The interesting claim is not that an agent found an idle warehouse. It is that the whole experiment — one account export, four agent configurations, full traces — ran for under $15 on my own Snowflake bill, with no warehouse spun up for any of the agent’s curiosity. The nine runs I logged sum to $8.43; the rest were cheap directed runs I did not keep. That is the entire cost of the lab.
The cheapness is the contribution I actually care about. A finding you cannot reproduce is an anecdote. If reproducing an experiment costs $15 and an afternoon, people reproduce it, argue with it, and improve it. That is the bar a field note should clear: not “trust me”, but “here is the cheap rig, run it yourself.”
One export, one DuckDB, two agents, every event traced.
One export. Seventeen CSVs from Snowflake’s
ACCOUNT_USAGE schema — query and warehouse
metering, query attribution, access history, storage, logins,
users, grants, resource monitors, account parameters. The snapshot:
47,420 queries, 39 warehouses, about four days of a
trial account, for one-time cents. Pointing an agent at the live
views instead would spin a warehouse on every curious query;
exporting once makes curiosity free.
One local DuckDB. A Python script loads the CSVs,
parses the timestamps, and flattens the access-history JSON into a
lineage table. From there every agent question costs zero Snowflake
credits, forever. One hygiene step is worth stealing: the export
runs query texts through AST-based literal redaction (with
sqlglot) so the value in every literal is replaced
with a placeholder. Metadata about a query is not the same as the
data inside it, and the export should enforce that line.
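The flattening step mentioned above could look something like the sketch below. It is not the lab's script: the function name and the output row shape (query id, direction, object domain, object name) are my assumptions about what a "lineage table" might hold. It relies on the ACCESS_HISTORY columns BASE_OBJECTS_ACCESSED and OBJECTS_MODIFIED being JSON arrays of objects carrying `objectName` and `objectDomain`, and the sample values are invented.

```python
import json


def flatten_access_row(query_id, base_objects_accessed, objects_modified):
    """Turn one ACCESS_HISTORY record into flat (query, direction, domain, name) rows.

    Both JSON arguments are the raw strings as they appear in the CSV export;
    empty strings and None are treated as empty arrays.
    """
    rows = []
    for direction, raw in (("read", base_objects_accessed), ("write", objects_modified)):
        for obj in json.loads(raw or "[]"):
            name = obj.get("objectName")
            if name:  # skip entries that carry no object name
                rows.append((query_id, direction, obj.get("objectDomain"), name))
    return rows


# Invented sample record: one table read, one table written.
sample = flatten_access_row(
    "q-001",
    '[{"objectName": "DB.RAW.ORDERS", "objectDomain": "Table"}]',
    '[{"objectName": "DB.MART.DAILY_ORDERS", "objectDomain": "Table"}]',
)
for row in sample:
    print(row)
```

Rows in this shape load straight into a local table, so "which queries read X and wrote Y" becomes an ordinary self-join instead of JSON parsing at question time.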
Two agents on the Claude Agent SDK. The SDK is the engine inside Claude Code exposed as a library: it gives you the full agentic loop (tool execution, context management, file access) and you supply a system prompt and options. The entire “framework” here is two roughly 80-line Python files. The framework question settles in one paragraph: for an “agent investigates database” task, the loop you would build with an orchestration framework is the loop the SDK already gives you, minus a dependency tree. Deterministic glue stays ordinary code; the agentic behaviour comes from the vendor harness; nothing in between needs a framework.
Full tracing. A thin wrapper around the SDK’s message stream writes every event to a JSONL file — the exact prompt, the system prompt, every tool call with the full SQL, every result, every error, the model’s thinking, the final cost — one object per line, flushed immediately so an interrupted run loses nothing. The trace is for epistemics, not debugging. Every dispute in this series — did the agent cheat, did a prompt edit take effect, did it ever read the attribution column — was settled by grepping a trace. Agent claims without traces are vibes.
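The tracing idea is small enough to sketch without the SDK. The version below wraps any iterable of events and appends one JSON object per line, flushing after each write. `TraceWriter`, `traced`, and the event dictionaries are hypothetical names and shapes for illustration; they are not the SDK's message types, and the real wrapper would serialise whatever the SDK's message stream yields.

```python
import json
import time


class TraceWriter:
    """Append-only JSONL trace: one object per line, flushed immediately."""

    def __init__(self, path):
        self._fh = open(path, "a", encoding="utf-8")

    def write(self, kind, payload):
        record = {"ts": time.time(), "kind": kind, "payload": payload}
        # default=str keeps the trace writable even for non-JSON values.
        self._fh.write(json.dumps(record, default=str) + "\n")
        self._fh.flush()  # an interrupted run keeps every line written so far

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self._fh.close()


def traced(events, writer):
    """Yield events unchanged while logging each one to the trace."""
    for event in events:
        writer.write("event", event)
        yield event
```

Usage is a single wrap around the stream: `for event in traced(stream, writer): ...`. Because every line is a complete JSON object, settling a dispute later is a grep, or a few lines of `json.loads` over the file.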
Two levels of preparation, two model tiers, one grid.
The two agents differ in exactly one dimension.
Skilled gets the database schema in its system
prompt — table descriptions, join keys, unit conventions,
known gotchas — plus five SKILL.md playbooks that
encode how I would investigate: cost-and-idle analysis, impact
analysis, query antipatterns, credit-spike triage, and an
open-ended anomaly-discovery checklist.
Naive gets one sentence (“here is a Snowflake account export
to analyze”), no schema, no playbooks, and (after a
contamination incident that is the subject of Part 4) an
isolated temp directory so it cannot read the skills off disk.
Cross that one variable with two model tiers — Opus (the frontier model) and Sonnet (the cheap one) — and you get a 2×2: four configurations, same database, same question, same turn budget, full traces. The cheap tier here is Sonnet, not the smallest model; the point was to compare a frontier model against a genuinely capable cheaper one, not against a toy.
| | Opus · frontier | Sonnet · cheap |
|---|---|---|
| Skilled · schema + 5 playbooks | The expected winner | Does the playbook rescue the cheap model? |
| Naive · one sentence | Can the frontier model compensate for no knowledge? | The expected loser |
The hypothesis I would have bet on: skilled-Opus wins everything, naive-Sonnet loses everything, and the diagonals — does the playbook rescue the cheap model, does the frontier model compensate for missing knowledge — are the only interesting cells. That hypothesis was half right. The wrong half is where every useful lesson in this series came from, and it starts with a cheap, naive agent that produced a confident, fluent, completely wrong answer.
That is the Crosshire audit method in a petri dish: a cheap, reproducible rig, every claim traceable to a query, and a person standing between the model’s prose and the published number. The published Snowflake login-count note asserts that discipline; this series is the experiment behind it. The agent also flagged the account’s zero-MFA posture and its adaptive-versus-classic warehouse billing — both covered in depth in those notes, so this series sends you there rather than re-treading them.
- The 2×2 on a directed question. Skilled and naive, Opus and Sonnet, all asked “why is COMPUTE_WH expensive?” — with the traces side by side.
- The three laws that fell out of it. What the playbook buys, what the frontier model buys, and where the two stop substituting for each other.
- The cell that was confident and wrong. The cheap, naive configuration that wrote a fluent answer with the wrong number — and why it is the most expensive cell to trust.
- Crosshire Journal · Your Snowflake login count is probably lying to you — the zero-MFA and DATABRICKS_READER material this account also surfaced
- Crosshire Journal · Crosshire Audit: your warehouse, with receipts — the in-browser diagnostic this method feeds
- Crosshire Journal · Adaptive vs classic warehouses — how the two billing models differ, in depth
- Claude Agent SDK · overview — the agentic loop the two ~80-line agents are built on
- Snowflake docs · ACCOUNT_USAGE reference —
WAREHOUSE_METERING_HISTORY, QUERY_ATTRIBUTION_HISTORY, and the rest of the export
Numbers in this note come from a seeded Crosshire trial Snowflake account — not a customer — snapshot 16–20 May 2026, about four days of metered data, re-derived against the same DuckDB the agents queried. Credit-to-dollar figures use the $3.00/credit list rate as a disclosed estimate. The model read every number; a human ratified every one before it shipped. — Crosshire
A notebook nobody closed. Found for $1.77.
An AI agent read one Snowflake account export and found a notebook left running — 98.9% of one warehouse’s credits, burned idle — for $1.77. Here is the finding and the query that proves it.
1. Exported one Snowflake account to CSV (17 ACCOUNT_USAGE files, 47,420 queries, 39 warehouses, ~4 days) for one-time cents.
2. Loaded the CSVs into a local DuckDB and pointed four AI agents at it. Every agent question cost zero Snowflake credits; the whole lab ran under $15.
3. The skilled frontier agent found the idle notebook for $1.77 and named the function keeping it awake: SYSTEM$NOTEBOOK_CONTAINER_PARAMS.
A notebook left running, 98.9% idle.
COMPUTE_WH metered 43.86 credits over four days. Only 0.47 were attributed to query work; the other 98.9% — 43.39 credits — was idle, about half the account’s warehouse credits. The cause was a notebook container set to never auto-suspend, polling around the clock so the 60-second auto-suspend almost never fired.
The bill is one number. Idle isn’t a line on it.
The invoice shows total credits per warehouse. It does not split them into “working” and “awake but doing nothing”. A warehouse bills for being awake, not for working. The split only appears when you put metered credits next to attributed credits, which the query below does for every warehouse you have.
Metered vs attributed, one query.
Run this on your own account. Any warehouse with a high
pct_idle is billing for being awake; a service with
AUTO_SUSPEND set too high, or set to zero so it never suspends, is the usual cause.
    -- Credits billed per warehouse over the last 30 days.
    WITH metered AS (
      SELECT WAREHOUSE_NAME, SUM(CREDITS_USED) AS credits
      FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
      WHERE START_TIME >= DATEADD('day', -30, CURRENT_TIMESTAMP())
      GROUP BY 1
    ),
    -- Credits attributed to actual query work over the same window.
    attributed AS (
      SELECT WAREHOUSE_NAME, SUM(CREDITS_ATTRIBUTED_COMPUTE) AS credits
      FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_ATTRIBUTION_HISTORY
      WHERE START_TIME >= DATEADD('day', -30, CURRENT_TIMESTAMP())
      GROUP BY 1
    )
    -- pct_idle is the share of metered credits with no query work behind it.
    SELECT
      m.WAREHOUSE_NAME,
      ROUND(m.credits, 2) AS metered_credits,
      ROUND(COALESCE(a.credits, 0), 2) AS query_credits,
      ROUND(100 * (1 - COALESCE(a.credits, 0) / NULLIF(m.credits, 0)), 1) AS pct_idle
    FROM metered m
    LEFT JOIN attributed a USING (WAREHOUSE_NAME)
    ORDER BY metered_credits DESC;
- The 2×2 grid and its three laws. Skilled and naive, Opus and Sonnet, on the same account — what the playbook versus the frontier model each buys.
- The cell that was confident and wrong. A cheap, naive run that wrote a fluent answer with the wrong number.
- The operating model. The $15 rig, full tracing, and the two rules that turn it into a process.