Prompts as engineering, tiers as economics.
The resource monitor showed 0.82 credits used. The account had actually spent ~82. The discrepancy was not a bug — the monitor was created three days into the window and only counts from birth. One timestamp the agent did not check. One rule added to the prompt. That is the operating model in miniature: every failure becomes a rule, every rule compresses the next cycle. The engineered prompt that followed ran skilled-Opus through 125 tool calls, 14 minutes, $1.77 — completed, cache-units bug absent, every formerly-missed dimension covered.
43.39 idle credits × ~7.5 windows/month × $3.00/credit
$976.28 — at list rate, disclosed as an estimate
| Item | Cost | Trace |
|---|---|---|
| Playbook question — skilled-Sonnet | $0.19 | 213316 |
| Full audit — skilled-Opus, 125 calls, 14 min | $1.77 | 234349 |
| Exploration sweep — naive-Opus, open-ended | $0.76 | 234345 |
| Exploration sweep — naive-Opus, full audit prompt | $3.97 | 235655 |
| 9 on-disk traces, entire experiment | $8.43 | all |
| Idle-burn waste found — run-rate | ~$980/mo | see §B |
- The engineered prompt — rules and the failures behind them
- The TRIAL_GUARD anecdote — adjudication as standing discipline
- Three tiers — deterministic, skilled, naive
- The ratchet — how the floor rises
- The economics — run-rate arithmetic, full table
- The $15 lab — the anchor finding and the 2×2 design
- Skills vs model — the directed question and three laws
- Union beats everyone — the open-ended sweep
- Variance and budget death — traces as referee
- The operating model — prompt, tiers, economics
- Account: a seeded Crosshire trial Snowflake account, not a customer
- Window: 16–20 May 2026, ~4 days of metered data; monitor TRIAL_GUARD created 20 May 13:55
- Traces: 9 JSONL files from the Claude Agent SDK; costs from trace metadata, not estimated
- Cost basis: $3.00/credit Snowflake list rate, disclosed as an estimate; run-rate uses ~7.5 four-day windows/month
Each rule in the prompt traces to a failure.
The four earlier parts of this series produced a catalog of things that went wrong: a cache column whose units the agent misread, a security dimension one configuration never examined, budget death that wiped a run’s own findings, a monitor reporting a number that was technically correct and completely misleading. Every one of those became a line in the audit prompt used for the final runs.
Six mandated dimensions. Cost, including total-versus-attributed credits for every warehouse. Performance, including queuing, disk spill, compile ratios, cache hit rates, and partition pruning. Storage. Security and non-default account parameters. Config-versus-behavior contradictions. Cross-cut comparisons across warehouses and time windows. The dimensions name where to look; they do not name what to find. An agent that only confirms known issues is not running a discovery test — it is running a checklist. These dimensions keep it a discovery test.
Unit verification, stated before use. The cache hit rate column is a fraction from 0 to 1, not a percentage from 0 to 100. An earlier run read it as a percentage and reported a figure that looked plausible and was wrong by a factor of 100. The prompt now reads: “Before using ANY numeric column, verify its units and scale by sampling min/max/avg. State the unit you concluded. Never assume milliseconds vs seconds, or fraction vs percent.” In the final skilled-Opus run, the trace shows it sampling the cache column’s range before computing anything from it. The bug did not recur.
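The unit rule can be sketched in a few lines of Python. This is an illustration only, not the check the agent ran: the function name `describe_scale` and the sample values are hypothetical, and the range test is a heuristic (a percent column whose values all happen to fall below 1 would still be misread, which is why the prompt asks the agent to state its conclusion for a human to review).

```python
from statistics import mean

def describe_scale(values):
    """Sample min/max/avg and guess whether a column is a fraction or a percent."""
    lo, hi, avg = min(values), max(values), mean(values)
    if lo >= 0.0 and hi <= 1.0:
        unit = "fraction (0-1)"
    elif lo >= 0.0 and hi <= 100.0:
        unit = "percent (0-100)"
    else:
        unit = "unknown"
    return {"min": lo, "max": hi, "avg": avg, "unit": unit}

# Hypothetical cache-hit samples, not values from the audited account.
summary = describe_scale([0.0, 0.12, 0.87, 1.0])
print(summary["unit"])
```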
Independent verification of every headline number. One query is an assertion. Two independent queries reaching the same number is evidence. The prompt requires a second, independent check on any finding before it is written up.
Baselines read first. If tables named baseline_* exist, read them before running a single discovery query. Treat their contents as known; report only deviations. This is the ratchet’s read-side: findings that survived previous adjudication are not rediscovered at full exploration cost every cycle. Known findings get checked, not hunted.
Append findings immediately. Each finding goes to disk the moment it is confirmed, not held until the end of the run. The failure that made this rule necessary is documented in Part 4: a run that exceeded its turn budget mid-investigation, never reached the write step, and took its own findings with it when it stopped. Immediate append means an interrupted run loses at most the current in-flight hypothesis.
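A minimal sketch of immediate append, assuming a JSONL findings file. The file name, the field names, and the two example records are hypothetical; the article does not describe how the agent actually wrote its findings. Each call opens, writes one line, flushes, and syncs, so a run that dies between findings leaves every earlier line readable.

```python
import json
import os
import tempfile

def append_finding(path, finding):
    """Write one finding as a JSON line and force it to disk before returning."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(finding) + "\n")
        fh.flush()
        os.fsync(fh.fileno())

def load_findings(path):
    """Read back every complete line, e.g. after an interrupted run."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]

# Hypothetical records, written one at a time as they are confirmed.
demo_path = os.path.join(tempfile.mkdtemp(), "findings.jsonl")
append_finding(demo_path, {"title": "idle burn on COMPUTE_WH", "label": "confirmed in data"})
append_finding(demo_path, {"title": "monitor vs spend mismatch", "label": "hypothesis"})
recovered = load_findings(demo_path)
```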
Label every finding. “Confirmed in data” or “hypothesis — needs human context.” Describe behavior, never attribute intent. A 12-finding cap, ranked by materiality, plus a list of every check that came back clean. The clean-check list is what makes “nothing found” an inspectable claim. Without it, the absence of findings is unfalsifiable.
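The labeling and cap rules translate into a small data structure. The sketch below is one possible shape, not the report format the lab used: the `Finding` class, the `materiality` ranking key, and `build_report` are all assumptions made for illustration.

```python
from dataclasses import dataclass

# The two labels named in the prompt (hyphenated here for plain ASCII).
LABELS = ("confirmed in data", "hypothesis - needs human context")
MAX_FINDINGS = 12

@dataclass(frozen=True)
class Finding:
    title: str
    label: str
    materiality: float  # hypothetical ranking key, e.g. estimated monthly dollars

    def __post_init__(self):
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")

def build_report(findings, clean_checks):
    """Keep the top findings by materiality and list every check that came back clean."""
    ranked = sorted(findings, key=lambda f: f.materiality, reverse=True)
    return {"findings": ranked[:MAX_FINDINGS], "clean_checks": sorted(clean_checks)}
```

Rejecting unknown labels at construction time keeps intent-attributing labels out of the report by default.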
The skilled-Opus run under this prompt: 125 tool calls, 14.2 minutes, $1.77, completed. The trace shows it executing cross-cut comparisons it had previously skipped. It covered every formerly-missed dimension.
The monitor said 0.82. The account spent ~82.
TRIAL_GUARD, the account-level resource monitor, showed 0.82 credits used when the snapshot was taken. The account had actually spent roughly 82 credits over the window on warehouse metering alone. The agent flagged this correctly as a contradiction: config and observed behavior were disconnected.
The label on that finding was “hypothesis — needs human context.” One query resolved it: TRIAL_GUARD was created on 20 May 13:55, near the end of the data window. Resource monitors only count credits from their creation date forward. The monitor had been awake for less than a day while the account had been spending for four. The 0.82 figure was technically accurate and completely useless as a governance control.
The monitor was not broken. It was doing what monitors do — counting from birth. The governance failure was creating it three days after the spending started.
One adjudicated hypothesis. One timestamp checked. One rule added to the prompt: “When config and behavior contradict each other, check the object’s creation date before concluding malfunction.” This is what the operating cycle looks like at the level of a single finding: a contradiction the agent flagged correctly, but could not explain on its own, became a permanent guardrail. One cycle. One rule. The floor went up.
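The creation-date rule reduces to one comparison: how much of the data window did the object exist for? The sketch below assumes window boundaries of 16 May 00:00 to 21 May 00:00, which the article does not state; only the 20 May 13:55 creation timestamp comes from the account. The function name `monitor_coverage` is hypothetical.

```python
from datetime import datetime

def monitor_coverage(window_start, window_end, monitor_created):
    """Return the fraction of the window during which the monitor existed."""
    total = (window_end - window_start).total_seconds()
    if total <= 0:
        raise ValueError("window_end must be after window_start")
    covered_from = max(window_start, monitor_created)
    covered = max((window_end - covered_from).total_seconds(), 0.0)
    return covered / total

# Assumed window boundaries; only the creation timestamp is from the article.
window_start = datetime(2026, 5, 16, 0, 0)
window_end = datetime(2026, 5, 21, 0, 0)
created = datetime(2026, 5, 20, 13, 55)

coverage = monitor_coverage(window_start, window_end, created)
# Anything below full coverage means the monitor's figure cannot be
# compared directly with spend measured over the whole window.
needs_creation_date_check = coverage < 1.0
```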
Deterministic, skilled, naive — each has a job.
The lab produced three operating tiers. They are not stages of maturity — you run all three in parallel, because they answer different questions.
Tier 1: deterministic baselines. Free. Every finding that survives adjudication crystallizes into plain SQL with numeric alert thresholds: baseline_idle_burn, baseline_queuing, baseline_spill. No LLM. These run on every data refresh and flag deviations automatically. Known problems get caught continuously because catching them no longer requires intelligence. The “how often should agents run” debate dissolves for known issues: detection is continuous and costs zero regardless.
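A Tier-1 baseline is a query plus a number. The sketch below shows the shape using Python's built-in sqlite3 as a stand-in for the local database; the table `warehouse_credits`, its rows, and the 50% threshold are all hypothetical, and only the name `baseline_idle_burn` comes from the text above.

```python
import sqlite3

# Each baseline pairs plain SQL with a numeric alert threshold (values assumed).
BASELINES = {
    "baseline_idle_burn": {
        "sql": (
            "SELECT warehouse_name, 100.0 * (metered - attributed) / metered "
            "FROM warehouse_credits WHERE metered > 0"
        ),
        "threshold": 50.0,  # alert when more than 50% of credits are idle
    },
}

def run_baselines(conn, baselines):
    """Run every baseline query and return the rows that exceed their threshold."""
    alerts = []
    for name, spec in baselines.items():
        for entity, value in conn.execute(spec["sql"]):
            if value > spec["threshold"]:
                alerts.append((name, entity, round(value, 1)))
    return alerts

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE warehouse_credits (warehouse_name TEXT, metered REAL, attributed REAL)")
conn.executemany(
    "INSERT INTO warehouse_credits VALUES (?, ?, ?)",
    [("WH_A", 10.0, 9.0), ("WH_B", 10.0, 2.0)],  # hypothetical data
)
alerts = run_baselines(conn, BASELINES)
```

No model is involved at any point, which is what makes the check free to run on every refresh.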
Tier 2: skilled agent, cheap model. ~$0.19 per question. Playbook questions on demand, plus Tier-1 alerts that need investigation. The skills carry the causal model; the cheap model executes it. The verified cost for a directed question on this account, skilled-Sonnet, was $0.19 (15 tool calls, 248 seconds, 2 errors). Verify its arithmetic — unit confusion is its known failure mode and it has no playbook telling it to check. For known questions with a playbook, it matches the frontier model at a fraction of the cost.
Tier 3: naive exploration, frontier model. ~$0.76–3.97 per run. Periodic and fully unanchored. The temptation is to feed it the known-findings list so it does not rediscover what you already know. Resist it. Telling the explorer what is known re-installs the ceiling you are paying to escape; it anchors the search to your map instead of the data. Let it run blind. Deduplicate its output against the known-findings ledger afterward, in code or one cheap-model pass. Rediscovery costs tokens. Anchoring costs discoveries. Tokens are the cheaper resource.
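Post-hoc deduplication in code can be as simple as a normalized key lookup. The sketch assumes each finding carries an `object` and a `kind` field, which is an invented schema; exact-key matching is crude and will miss paraphrased duplicates, which is where the cheap-model pass mentioned above would come in.

```python
import re

def fingerprint(finding):
    """Normalize a finding to a comparable (object, kind) key."""
    obj = finding["object"].strip().upper()
    kind = re.sub(r"[^a-z0-9]+", "_", finding["kind"].strip().lower()).strip("_")
    return (obj, kind)

def split_novel(candidates, ledger):
    """Separate explorer output into novel findings and rediscoveries."""
    known = {fingerprint(f) for f in ledger}
    novel, rediscovered = [], []
    for f in candidates:
        (rediscovered if fingerprint(f) in known else novel).append(f)
    return novel, rediscovered

# Hypothetical ledger and explorer output.
ledger = [{"object": "COMPUTE_WH", "kind": "idle burn"}]
candidates = [
    {"object": "compute_wh ", "kind": "Idle-Burn"},
    {"object": "TRIAL_GUARD", "kind": "monitor created late"},
]
novel, rediscovered = split_novel(candidates, ledger)
```

The explorer never sees `ledger`; the comparison happens only after the run has finished.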
Cadence for Tier 3 is discovery-rate-driven, not calendar-driven. While runs keep yielding accepted novel findings, maintain the frequency. As the rate decays, stretch the interval. Reset on change events — new workloads, incidents, releases — because your coverage was of the old system. Tier-3 runs need 2–3× the turn budget of a directed run, plus mandatory checkpoint writes. Part 4 explains what happens when you omit the budget headroom.
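One way to make the cadence rule concrete is shown below. Every number in it is an assumption chosen for illustration (a 7-day base interval, a 56-day ceiling, doubling on decay, and a keep-rate of one accepted novel finding per two runs); the article prescribes the policy, not these values.

```python
def next_interval_days(current_days, accepted_novel, runs, change_event=False,
                       base_days=7, max_days=56, keep_rate=0.5):
    """Pick the next Tier-3 interval from the recent discovery rate."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    if change_event:
        # New workload, incident, or release: prior coverage is of the old system.
        return base_days
    rate = accepted_novel / runs
    if rate >= keep_rate:
        return current_days  # still yielding accepted findings, keep the frequency
    return min(current_days * 2, max_days)  # rate has decayed, stretch the interval
```

For example, `next_interval_days(7, 1, 4)` stretches a weekly sweep to 14 days, while passing `change_event=True` resets any interval to the base.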
Discovery becomes baseline. The floor rises.
The three tiers are connected by one mechanism. Tier 3 discovers something new. A human accepts it as real. It becomes a Tier-1 baseline with an alert threshold, and, where it has a reusable investigation method, a Tier-2 skill. The floor of competence rises continuously. The unanchored scout — Tier 3 — keeps it from ever becoming a fence.
Skills are the floor of competence, never the ceiling of curiosity.
Supporting structure makes the ratchet durable. Traces on everything: not for debugging, but as the ground truth for disputes — including disputes with yourself. An earlier claim in this series was settled by grepping a trace that recorded the prompt verbatim; the trace showed which version of the prompt had actually run. Hooks for hard guardrails: read-only database access enforced at the tool level, trace-directory location outside the working directory so the naive agent cannot read past outputs. Contradiction adjudication as a standing step: when two agents disagree, or when an agent disagrees with a prior human claim, the disagreement is the product. It means one of you is wrong, and SQL is the referee.
The TRIAL_GUARD error above is the ratchet in its smallest form. One hypothesis, one timestamp, one rule. In a running account with weekly sweeps, that mechanism compounds: the baseline library grows, the skill set deepens, and the Tier-3 explorer keeps surfacing what the library has not yet seen.
$1.77 to find it. ~$980 a month to leave it.
The idle-burn run-rate arithmetic is straightforward, so show it rather than assert it. COMPUTE_WH ran 43.39 idle credits over the four-day window. A month has roughly 7.5 such windows (30 days / 4 days). At the $3.00/credit list rate — a disclosed estimate, not a contract figure — that is:
43.39 × 7.5 × $3.00 = $976.28 per month
COMPUTE_WH idle credits · ledger §B · $3/credit estimate
That 43.39 is not an assertion either. It is metered credits minus attributed credits, per warehouse — the same query that opened the series, runnable on your own account in one statement:
WITH metered AS (
  -- Credits billed per warehouse, idle time included
  SELECT WAREHOUSE_NAME, SUM(CREDITS_USED) AS credits
  FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
  WHERE START_TIME >= DATEADD('day', -30, CURRENT_TIMESTAMP())
  GROUP BY 1
),
attributed AS (
  -- Credits attributed to individual queries
  SELECT WAREHOUSE_NAME, SUM(CREDITS_ATTRIBUTED_COMPUTE) AS credits
  FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_ATTRIBUTION_HISTORY
  WHERE START_TIME >= DATEADD('day', -30, CURRENT_TIMESTAMP())
  GROUP BY 1
)
SELECT m.WAREHOUSE_NAME,
       ROUND(m.credits, 2) AS metered_credits,
       -- Share of billed credits that no query accounts for
       ROUND(100 * (1 - COALESCE(a.credits, 0) / NULLIF(m.credits, 0)), 1) AS pct_idle
FROM metered m
LEFT JOIN attributed a USING (WAREHOUSE_NAME)
ORDER BY metered_credits DESC;
Call it roughly $980 a month. The cost of finding it: $1.77. The ratio is about 550:1. One finding, found in the first full run, repaid the entire experiment many times over. The interesting question is not whether it was worth running the agent. It is what else is sitting in accounts where this arithmetic has never been done.
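The arithmetic above, reproduced so it can be rerun with your own inputs. All four inputs are the figures quoted in this article; the variable names are mine, and `Decimal` is used only so the rounding to cents is exact rather than subject to floating-point drift.

```python
from decimal import Decimal, ROUND_HALF_UP

idle_credits = Decimal("43.39")  # COMPUTE_WH idle credits over the window
window_days = Decimal("4")       # length of the metered window
list_rate = Decimal("3.00")      # dollars per credit, a disclosed estimate
audit_cost = Decimal("1.77")     # cost of the skilled-Opus full audit

windows_per_month = Decimal("30") / window_days
monthly_run_rate = (idle_credits * windows_per_month * list_rate).quantize(
    Decimal("0.01"), rounding=ROUND_HALF_UP
)
payback_ratio = monthly_run_rate / audit_cost

print(windows_per_month, monthly_run_rate, round(payback_ratio))
```

Swapping in a contract rate for `list_rate` is the single change most accounts would need.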
The nine traces on disk sum to $8.43. That is the entire lab: four agent configurations, two models, directed questions, open-ended sweeps, the full audit prompt. The series reported “under $15” as an informal record; $8.43 is what the trace files show. Both are true of different denominators: $8.43 covers what was logged; under $15 covers the full run including cheap directed queries not kept as files. Neither figure is a model estimate; both come from trace metadata or the author’s direct record.
The binding constraint in this entire exercise was not tokens. It was human attention: reviewing reports, adjudicating contradictions, ratifying numbers. The finding caps, the clean-check lists, the hypothesis labels, the post-hoc deduplication — every structural feature of the prompt exists to spend that attention wisely, not to constrain the agent.
| Run | Config | Calls | Duration | Cost |
|---|---|---|---|---|
| 213316 | Directed question, skilled-Sonnet | 15 | 248 s | $0.19 |
| 234345 | Open-ended sweep, naive-Opus | 31 | 575 s | $0.76 |
| 234349 | Full audit, skilled-Opus — engineered prompt | 125 | 852 s | $1.77 |
| 235655 | Full audit, naive-Opus | 88 | 1,526 s | $3.97 |
| All 9 traces | Entire lab | — | — | $8.43 |
| Idle-burn waste found | Estimated monthly run-rate | | | ~$980/mo |
One paragraph to close the series. Playbooks carry the causal model; models carry the execution; frontier runs grow the playbooks; traces referee all of it. A cheap model with a good playbook matches the frontier model on known questions at a fraction of the cost. A frontier model with no playbook discovers what no playbook contains — including your own mistakes — but needs budget headroom and checkpoints to survive open-ended work. A cheap model with no playbook is the trap: fluent, confident, wrong, and priced to tempt you. No single analyst, human included, beat the cross-examined union of several. Build the ratchet, keep one unanchored scout running, and treat every contradiction as the system handing you your next improvement, with evidence attached.
- Export your ACCOUNT_USAGE views to CSV: query history, warehouse metering, query attribution, storage, logins, users, resource monitors, account parameters. A few days’ window suffices. Redact query-text literals on the way out.
- Load into DuckDB with a small Python script. Every agent question is now free: no warehouse spins for the agent’s curiosity.
- Build two Claude Agent SDK agents: one with the schema and investigation playbooks as skills; one with a single sentence, in an isolated directory, traces written outside the working directory, global plugins off. Ask both for everything, with the engineered prompt. Diff. Adjudicate every contradiction with SQL. Watch which of your own “clean” checks fail.
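For the redaction step in the first item, a regex pass is one rough starting point. This sketch is an assumption about how redaction could be done, not the script used in the lab. It replaces single-quoted strings and plain numeric literals, and it deliberately ignores comments, dollar-quoted strings, and exponent notation such as `1e5`, so treat it as a first filter and review a sample of the output by hand.

```python
import re

# Single-quoted SQL strings, with '' as the escaped quote.
STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")
# Standalone integers and decimals; digits inside identifiers are left alone.
NUMBER_LITERAL = re.compile(r"\b\d+(?:\.\d+)?\b")

def redact_literals(sql):
    """Replace string and numeric literals in query text with placeholders."""
    sql = STRING_LITERAL.sub("'?'", sql)
    return NUMBER_LITERAL.sub("?", sql)

# Hypothetical query text.
example = "SELECT * FROM orders o1 WHERE email = 'a@b.com' AND amount > 100.50 AND note = 'it''s'"
redacted = redact_literals(example)
print(redacted)
```

Strings are replaced before numbers so that digits inside a quoted value never survive as a partial match.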
- Crosshire Journal · The $15 lab: what four AI agents found in one Snowflake account — Part 1 of this series: the anchor finding, the 2×2 design, and how the lab was built
- Crosshire Journal · Your Snowflake login count is probably lying to you — the zero-MFA and login-inflation findings this account also surfaced, covered in depth
- Crosshire Journal · Crosshire Audit: your warehouse, with receipts — the in-browser diagnostic this method feeds; the published numeric source of truth behind the findings
- Snowflake docs · RESOURCE_MONITORS view: CREATED column; why a monitor’s “used” figure only counts from its creation date
- Snowflake docs · WAREHOUSE_METERING_HISTORY: CREDITS_USED vs CREDITS_ATTRIBUTED_COMPUTE in QUERY_ATTRIBUTION_HISTORY, the split that makes idle visible
Numbers in this note come from a seeded Crosshire trial Snowflake account — not a customer — snapshot 16–20 May 2026, about four days of metered data. Trace costs are from Claude Agent SDK usage metadata recorded in the nine on-disk JSONL files. Credit-to-dollar figures use the $3.00/credit Snowflake list rate, disclosed as an estimate. The run-rate arithmetic uses ~7.5 four-day windows per month. A model never produced a number in this series; a human ratified every one before it shipped. — Crosshire
The resource monitor said 0.82. The account spent ~82.
TRIAL_GUARD was created three days into the data window and only counts from birth. The engineered prompt that caught this ran skilled-Opus through 125 tool calls for $1.77. The idle waste it confirmed runs at ~$980 a month.
43.39 idle credits × 7.5 windows × $3.00/credit
found for $1.77 in one agent run
1. TRIAL_GUARD showed 0.82 credits used. The account had spent ~82. The agent labeled it a hypothesis. One timestamp resolved it: the monitor was created 20 May 13:55, near the end of the window.
2. That hypothesis became a prompt rule: “check creation date before concluding malfunction.” Each of the six other rules in the engineered prompt traces to a different past failure: a unit misread, a budget-death run, a missed dimension. Prompts as engineering.
3. Skilled-Opus ran the full audit under that prompt: 125 tool calls, 14.2 minutes, $1.77. Every formerly-missed dimension covered. Idle waste confirmed at ~$980/mo run-rate. The cache-units bug did not recur.
A monitor that only counts from birth.
Resource monitors in Snowflake track credit usage from the date they are created, not from the start of the billing period. TRIAL_GUARD was created on 20 May 13:55, near the end of a four-day window in which the account had spent roughly 82 credits on warehouse metering. The monitor reported 0.82 credits used. Both numbers were accurate. Only the metered figure described the actual spend; the monitor’s figure said nothing useful about it.
Every failure becomes a rule. Every rule costs less next time.
The cache-column unit error in an earlier run — the agent read a fraction (0–1) as a percentage, reporting a figure wrong by a factor of 100 — became one line in the engineered prompt: verify units before use, state the unit you concluded. That is prompts as engineering: a failure catalog compressed into standing instructions. The skilled-Opus run under the engineered prompt cost $1.77 and covered every formerly-missed dimension; the naive-Opus run under the same prompt cost $3.97 with no playbook carrying the causal model. The $0.19 skilled-Sonnet directed run matched the frontier model on known questions. Tier economics fall out of that arithmetic: the playbook determines cost at least as much as the model does.
$1.77 to find it. ~$980 a month to ignore it.
COMPUTE_WH ran 43.39 idle credits in four days. Multiply by 7.5 windows per month, then by the $3.00/credit list rate (a disclosed estimate): $976.28 per month. The full lab — nine traced runs, two agents, two model tiers — cost $8.43 in total. One finding paid for the experiment many times over. The query that proves the idle split is in Part 1.
- The engineered prompt, rule by rule — every line traces to a past failure.
- Three tiers, verified costs — $0 baselines, $0.19/question Sonnet, $0.76–$3.97 Opus runs.
- The ratchet — how Tier-3 finds become Tier-1 checks and Tier-2 skills.
- Crosshire Journal · The $15 lab: what four AI agents found in one Snowflake account — Part 1: anchor finding, metered-vs-attributed query, 2×2 design
- Snowflake docs · RESOURCE_MONITORS: CREATED column and credit-counting behavior