Budget death & contaminations — the short version Full trace forensics, three contaminations, the referee

Tech·Agentic analytics·Part 4 of 5·10 min read·20 Jun 2026

67 tool calls. No report.

The naive-Opus agent ran 67 tool calls — trace 231251 — and never composed its report, killed by the SDK’s turn limit before the write-up turn arrived. A re-run of the open-ended sweep lost its own best finding. Three contaminations had been quietly invalidating the control group since the first run. Every dispute settled the same way: grep the trace.

Darshan Singh

Crosshire auditSnowflake · ACCOUNT_USAGEJune 2026

67 → 0

67 tool calls · trace 231251 · no result event
budget death via error_max_turns — never composed its report
completed sibling: trace 235655 · 88 calls · 25.4 min · $3.97

Two naive-Opus audit runs · same database · same question

Run (trace ID)	Tool calls	Duration	Cost	Outcome
231251 — interrupted	67	—	—	No result event. Budget death.
235655 — completed	88	25.4 min	$3.97	Full report composed

In this report

Exploration is a sample, not a census
Budget death — 67 calls, no report
Three contaminations
The trace as referee
Two discipline points that don’t flex

Series · Skills are the floor, not the ceiling

The $15 lab — the anchor finding and the 2×2 design
Skills vs model — the directed question and three lawssoon
Union beats everyone — the open-ended sweepsoon
Variance and budget death — traces as referee
The operating model — prompt, tiers, economicssoon

Provenance

67calls, no report

88calls, completed

$3.97completed run

3contaminations

$8.439 runs on disk

Account: A seeded Crosshire trial/demo Snowflake account, not a customer
Window: 16–20 May 2026, ~4 days of metered data (47,420 queries, 39 warehouses)
Key artifacts: Trace 231251 (interrupted, 67 calls, no result event) · trace 235655 (completed, 88 calls, 25.4 min, $3.97)
Lab total: 9 runs on disk, $8.43 total; “under $15” includes unlisted cheap directed runs not kept
Cost basis: $3.00/credit Snowflake list rate, disclosed as an estimate

Seeded trial account Numbers from traces Method-only public release

01Exploration is a sample

A re-run lost its own best finding.

After the open-ended sweep documented in Part 3, I re-ran the same naive-Opus configuration against the same database. The re-run came back faster. It made fewer tool calls. It produced a tidy report.

It had also lost the queuing crisis — the finding from the previous run that falsified my “clean” claim about the account. Not contradicted: simply never looked at. The agent looked elsewhere, found other things, and wrote those up instead. The re-run also introduced a fresh factual error: inverted thresholds on a resource monitor, notify and suspend swapped.

Same agent. Same data. Hours apart. Materially different coverage and a new mistake. The second run’s report read fine. The lesson is structural, not incidental: an exploration run is a sample, not a census. Unconstrained search is exactly that, and what it finds varies by run. Two rules fell out of this and became permanent process:

Two rules from the variance failure

A discovery counts once it recurs across runs or is verified deterministically. One exploration hit is evidence worth pursuing; it is not a confirmed finding.

Coverage you care about cannot be left to the agent’s mood. It must be either mandated in the prompt or crystallized into deterministic SQL that runs every time — so a finding, once accepted, can never be un-found on the next run.

Neither rule is special to Snowflake or to this particular agent. They apply to any unconstrained search over a large space with a variable budget. The trace made the failure visible; a re-run that agreed with the first would have hidden it permanently.

Note: the second sweep’s trace may not be among the nine on-disk files. The account above is the author’s recounted observation. Trace 231251 is the surviving artifact from the interrupted audit run described in the next section — a separate event. The argument stands on the recollection; the interrupted run is the traceable artifact.

02Budget death

67 calls. No result event. Vaporized.

For the full audit run I wrote a demanding prompt — the engineered version whose rules are the subject of Part 5. Skilled-Opus completed it: 125 tool calls, 14.2 minutes, $1.77, full report composed (trace 234349). Naive-Opus, given the same prompt and the same turn budget, hit the SDK’s turn limit — error_max_turns — before reaching the write-up turn.

That interrupted run is trace 231251: 67 tool calls, no result event in the JSONL. The console showed nothing. The output directory showed nothing. The interrupted run — whatever it had done — never presented it, because presentation was scheduled for a turn that never arrived. A separate completed naive-Opus audit (trace 235655) ran 88 calls, 25.4 minutes, $3.97 and did produce a report; that is a different run on a different day, not a recovery of 231251. Keep these two straight. The budget-death story belongs to 231251.

Two naive-Opus audit runs · tool calls · outcome interrupted vs completed

Darker bar = interrupted run (67 calls, no result event, trace 231251). Lighter bar = completed run (88 calls, 25.4 min, $3.97, trace 235655). These are two separate runs, not a before-and-after of the same session.

The mechanism is mundane, which is why it matters. The naive agent spends turns on things skills make free: discovering the schema, learning the data’s shape, fighting the environment. On Windows, multiline scripts piped through the shell are their own problem — both agents burned double-digit error counts on quoting until one of them, unprompted, wrote itself a small query-helper script. I now ship that helper in the project setup. The naive agent also batches fewer tool calls per turn. Under a fixed budget, knowledge doesn’t just make agents faster — past a certain task size, it decides whether they finish at all.

Two fixes, both one-liners. Exploration runs get 2–3× the turn budget of playbook runs. And the prompt orders the agent to append each finding to a file the moment it is confirmed — never batch the write-up for the end. When the interrupted run (231251) died, its checkpoint file survived in the working directory. Recovered, it contained a unit-calibration section, the baseline check, and — irony fully intended — the idle-burn computation that this same configuration had missed in every previous directed run. The work was not lost. It was just never presented, because presentation was scheduled for a turn that never arrived.

03Three contaminations

All three happened. All three will happen to you.

These are qualitative failures. No new numbers: the contaminations are structural and the fixes are one-liners each. I am listing them because every one of them happened to me before I noticed, which means they are the failures most likely to happen to you quietly.

The control group could read the answer key. The first “naive” agent ran in the project directory, where the skill playbooks live as ordinary markdown files and the agent has filesystem tools. It does not need the skills loaded into its system prompt to cat them; it only needs to be curious enough to look. The fix is simple: naive runs in an isolated temp directory containing only the database. If the skill files are not in the directory tree, the agent cannot reach them regardless of curiosity.

A user-level plugin leaked through the sandbox. Even after isolation, one naive run loaded a generic data-engineering skill from my user-level plugin cache. Project isolation does not block user-scope plugins; they inject into every agent on the machine. Two saving graces: the leaked skill contained zero Snowflake billing knowledge, and that run got the causal model wrong anyway — which if anything strengthened the result. But “my control group was accidentally stronger than designed and still failed” is a sentence you only get to write if you noticed. Fix: disable global plugins for control runs, and know they exist at all before assuming project isolation is complete.

Earlier traces sat readable in the working directory. When a cheap skilled run matched the expensive one suspiciously well, I checked whether it had simply read the previous run’s output. Trace forensics said no — every tool call accounted for, none touching any trace file. It had earned its answer. But nothing had prevented cheating except the agent not happening to look. “My experiment was valid by luck” is not methodology. Fix: traces write outside the agents’ working directory. The durable version of this fix is a PreToolUse hook that denies any tool call whose path touches the trace directory — prompts request behavior; hooks prevent it.

Three contaminations · what happened · fix

Contamination	What happened	Fix
Skills readable from cwd	Naive agent ran in project dir; could `cat` skill files from disk	Isolated temp dir, database only
User-level plugin leaked	Project isolation did not block user-scope plugin cache injection	Disable global plugins for control runs
Traces readable in cwd	Previous runs’ full answers sat in the working directory	Write traces outside cwd; `PreToolUse` hook for durable prevention

04The trace as referee

Every dispute settled the same way.

What turned each of these failures from a lurking doubt into a settled fact was the same artifact: the trace. One JSON line per event — the exact prompt, every SQL statement, every result, every error, the final cost. Written to disk immediately so an interrupted run loses nothing except the turns it did not reach.

The traces’ finest hour was a dispute with me on the losing side. I edited the agent’s default prompt to the new audit prompt, ran it, and got back a shallow report. I was certain the new prompt had run. The trace’s run_start event — which records the prompt verbatim — showed the old one-liner. A scan of all events found zero fragments of the new text anywhere in the file. The edit had landed in the wrong copy of the file. Thirty seconds of grep settled what could have been an hour of arguing with a confusing result, or worse, a wrong conclusion drawn from a run that was never asked to do the thing I thought I had asked.

The trace is ground truth for what the agent was actually asked, what it actually did, and what it actually cost. In an experiment, all three will be disputed — including by you. Agent results without traces are anecdotes.

The same artifact settled the contamination forensics. The question “did the cheap skilled run read the previous trace?” was not answerable from the report alone — the report looked correct. The trace showed every tool call accounted for, none touching the trace directory. That answer took under a minute to get. Without the trace, the ambiguity would have been permanent.

This is the argument for tracing everything, and it is not primarily about debugging. Debugging is a side benefit. The real argument is epistemic: an agentic investigation is not a deterministic computation. You cannot reconstruct what happened from the output alone. The trace is the only record of what was asked and what was found, and in a controlled experiment it is the only way to tell “the agent earned this answer” from “the agent read it from a previous run.”

05Two discipline points

The rules the lab does not relax.

The failures in this part are methodological. Budget death is a configuration gap. Contamination is a setup gap. Variance is a misunderstanding of what unconstrained exploration produces. Each has a fix that is a one-liner in configuration or a one-sentence addition to the prompt. None of them requires a different model or a different framework.

What they do require is the discipline that runs through this entire series. Stated plainly:

Two rules this lab does not relax

The model never produces a number. Every call count, every duration, every cost in this note comes from the trace JSONL — not from the agent’s prose summary of its own run. The model reads numbers and writes prose around them; it does not measure itself.

A human ratifies everything. No finding in this series ships because an agent asserted it. Each one is checked against the artifact that produced it before it appears here. The traces are the artifacts. The judgment of what is true stays with a person.

Part 5 is the payoff: the engineered audit prompt whose rules each map to a failure catalogued in this series — and what happened when it ran.

Next in the series · Part 5

The engineered audit prompt. Each rule traces to a specific failure in this series. Skilled-Opus under it: 125 tool calls, 14.2 min, $1.77, completed (trace 234349).
The three-tier operating model. Deterministic baselines (free) → skilled + cheap model (~$0.19/question) → naive frontier exploration. The discovery ratchet: Tier 3 finds, human accepts, becomes Tier 1 baseline.
The economics. $0.19 questions, $1.77 audits, and the idle-burn run-rate found by the same agent that failed here.

From our audit

Budget death, contaminated controls, and findings that evaporate on re-run are not edge cases — they are what running agentic experiments looks like before someone puts process around them. The Crosshire audit method is that process: one export, every claim sourced to a trace event, a human signing off each number before it lands. If you want to run this kind of investigation on your own Snowflake account, the same rig applies.

Start a conversation →

Sources & further reading

Crosshire Journal · The $15 lab: what four AI agents found in one Snowflake account — the series anchor: 2×2 design, the four configurations, the idle-burn finding that started it all
Crosshire Journal · Your Snowflake login count is probably lying to you — the zero-MFA posture the same account surfaced, audited separately
Crosshire Journal · Adaptive vs classic warehouses — how the two billing models differ; relevant to the STRESS warehouse setup in this account
Claude Agent SDK · overview — the agentic loop the agents are built on; error_max_turns is the turn-budget mechanism that produced trace 231251
Snowflake docs · ACCOUNT_USAGE reference — the source views: WAREHOUSE_METERING_HISTORY, QUERY_HISTORY, SERVICES, QUERY_ATTRIBUTION_HISTORY

· · ·

Numbers in this note come from a seeded Crosshire trial/demo Snowflake account — not a customer — snapshot 16–20 May 2026, ~4 days of metered data. Trace figures (tool call counts, durations, costs) are read directly from the nine on-disk JSONL files; the $8.43 total is the sum across those nine runs. Credit-to-dollar figures use the $3.00/credit list rate as a disclosed estimate. — Crosshire

Darshan Singh

writes Crosshire Journal · crosshire.ch · June 2026

Crosshire Journal

Field reports on data, compute, and the unglamorous decisions that shape engineering teams. Made in EU. Cited evidence, GDPR-native.

Tech·Agentic analytics·Part 4 of 5·2 min read·20 Jun 2026

67 tool calls. No report.

The naive-Opus agent made 67 tool calls and produced no report — killed by error_max_turns before the write-up turn arrived. A re-run lost its own best finding. Three contaminations silently invalidated the control group. Every dispute settled with one grep against the trace.

Darshan Singh

Crosshire auditSnowflake · ACCOUNT_USAGEJune 2026

67 → 0

tool calls, no result event · trace 231251
budget death via error_max_turns
completed sibling: 88 calls · 25.4 min · $3.97 (trace 235655)

Darker bar = interrupted (67 calls, no result, trace 231251). Lighter = completed sibling (88 calls, $3.97, trace 235655). Two separate runs, same database.

What happened · three failures, one referee

1A naive-Opus re-run of the open-ended sweep lost its own best finding from the previous run — never looked for the queuing crisis, added a fresh factual error instead. Exploration is a sample, not a census.

2The full audit run hit error_max_turns after 67 tool calls (trace 231251): no result event, no report. A separate completed run (trace 235655) ran 88 calls, 25.4 min, $3.97. Knowledge decides whether the agent finishes.

3Three contaminations: skill files readable from cwd; a user plugin leaked through project isolation; previous traces in the working dir. Every dispute settled by grepping the trace.

Seeded trial account Traces as referee Method-only public release

01The problem

Budget death and lost findings.

The interrupted naive-Opus run (trace 231251) made 67 tool calls and hit the SDK’s turn limit — the work never presented, because the write-up turn never arrived. A separate re-run of the open-ended sweep lost the queuing finding entirely: same agent, same data, fewer calls, a fresh error, the best finding gone.

02Why it matters

Knowledge decides if the agent finishes.

Naive agents spend turns on things skills make free: schema discovery, environment fights, learning the data’s shape. Under a fixed turn budget, knowledge is not just speed — past a certain task size it decides whether the agent finishes at all. Fixes: 2–3× turn budget for exploration runs; prompt rule to append each finding immediately, never batch the write-up for the end.

03The fix

Traces are the referee.

Three contaminations invalidated the control group: naive ran in the project dir (could read skill files); a user-level plugin leaked through project isolation; previous runs’ full traces sat in the working directory. Fixes: isolated temp dir, global plugins off, traces written outside cwd (a PreToolUse hook is the durable version). Every dispute — including one where the author was wrong about which prompt had run — resolved in under a minute by grepping the run_start event. Agent results without traces are anecdotes.

Want the full forensics?

The long version adds three things this short can’t.

The contamination table. All three failures: what happened, why silent, the fix.
The variance argument. Two process rules from losing a finding on re-run.
The trace as epistemic tool. One dispute, thirty seconds of grep, settled.

Sources

Darshan Singh

writes Crosshire Journal · crosshire.ch · June 2026

Two-minute field fixes from the same audits as our long-form Journal. One number, one fix, one result you can verify.

Crosshire Quick

67 tool calls. No report.

A re-run lost its own best finding.

67 calls. No result event. Vaporized.

All three happened. All three will happen to you.

Every dispute settled the same way.

The rules the lab does not relax.

67 tool calls. No report.

Budget death and lost findings.

Knowledge decides if the agent finishes.

Traces are the referee.

Journal

Crosshire

Legal