67 tool calls. No report.
The naive-Opus agent ran 67 tool calls — trace 231251 — and never composed its report, killed by the SDK’s turn limit before the write-up turn arrived. A re-run of the open-ended sweep lost its own best finding. Three contaminations had been quietly invalidating the control group since the first run. Every dispute settled the same way: grep the trace.
Budget death via error_max_turns: never composed its report.
Completed sibling: trace 235655 · 88 calls · 25.4 min · $3.97
| Run (trace ID) | Tool calls | Duration | Cost | Outcome |
|---|---|---|---|---|
| 231251 — interrupted | 67 | — | — | No result event. Budget death. |
| 235655 — completed | 88 | 25.4 min | $3.97 | Full report composed |
- Exploration is a sample, not a census
- Budget death — 67 calls, no report
- Three contaminations
- The trace as referee
- Two discipline points that don’t flex
- The $15 lab: the anchor finding and the 2×2 design
- Skills vs model: the directed question and three laws (soon)
- Union beats everyone: the open-ended sweep (soon)
- Variance and budget death: traces as referee
- The operating model: prompt, tiers, economics (soon)
- Account: a seeded Crosshire trial/demo Snowflake account, not a customer
- Window: 16–20 May 2026, ~4 days of metered data (47,420 queries, 39 warehouses)
- Key artifacts: trace 231251 (interrupted, 67 calls, no result event) · trace 235655 (completed, 88 calls, 25.4 min, $3.97)
- Lab total: 9 runs on disk, $8.43 total; “under $15” includes unlisted cheap directed runs not kept
- Cost basis: $3.00/credit Snowflake list rate, disclosed as an estimate
A re-run lost its own best finding.
After the open-ended sweep documented in Part 3, I re-ran the same naive-Opus configuration against the same database. The re-run came back faster. It made fewer tool calls. It produced a tidy report.
It had also lost the queuing crisis — the finding from the previous run that falsified my “clean” claim about the account. Not contradicted: simply never looked at. The agent looked elsewhere, found other things, and wrote those up instead. The re-run also introduced a fresh factual error: inverted thresholds on a resource monitor, notify and suspend swapped.
Same agent. Same data. Hours apart. Materially different coverage and a new mistake. The second run’s report read fine. The lesson is structural, not incidental: an exploration run is a sample, not a census. An unconstrained search samples the space rather than covering it, so what it finds varies by run. Two rules fell out of this and became permanent process:
Neither rule is special to Snowflake or to this particular agent. They apply to any unconstrained search over a large space with a variable budget. The trace made the failure visible; a re-run that agreed with the first would have hidden it permanently.
Note: the second sweep’s trace may not be among the nine on-disk files. The account above is the author’s recounted observation. Trace 231251 is the surviving artifact from the interrupted audit run described in the next section — a separate event. The argument stands on the recollection; the interrupted run is the traceable artifact.
67 calls. No result event. Vaporized.
For the full audit run I wrote a demanding prompt — the engineered
version whose rules are the subject of Part 5. Skilled-Opus completed it:
125 tool calls, 14.2 minutes, $1.77, full report composed
(trace 234349). Naive-Opus, given the same prompt and the same turn budget,
hit the SDK’s turn limit — error_max_turns —
before reaching the write-up turn.
That interrupted run is trace 231251: 67 tool calls, no
result event in the JSONL. The console showed
nothing. The output directory showed nothing. Whatever the interrupted
run had done, it never presented its work, because presentation was
scheduled for a turn that never arrived. A separate completed naive-Opus
audit (trace 235655) ran 88 calls, 25.4 minutes, $3.97
and did produce a report; that is a different run on a different day, not
a recovery of 231251. Keep these two straight. The budget-death story
belongs to 231251.
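Telling a budget-killed run from a completed one can be done mechanically rather than by looking at the console. The following is a minimal sketch, assuming each trace is a JSONL file with one JSON object per line and that events carry a `type` key; the key name and the `tool_call` event name are assumptions for illustration, not the lab's confirmed schema.

```python
import json
from pathlib import Path


def summarize_trace(path):
    """Count tool calls and report whether a result event was ever written.

    Assumes one JSON object per line with a "type" key. The "tool_call"
    event name is hypothetical; "result" follows the article's wording.
    """
    tool_calls = 0
    has_result = False
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            # An interrupted run may leave a truncated final line; skip it.
            continue
        if not isinstance(event, dict):
            continue
        if event.get("type") == "tool_call":
            tool_calls += 1
        elif event.get("type") == "result":
            has_result = True
    return {"tool_calls": tool_calls, "has_result": has_result}
```

Under these assumptions, a run like 231251 would show a non-zero tool-call count with `has_result` false, which is the signature of budget death rather than a crash before any work was done.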
The mechanism is mundane, which is why it matters. The naive agent spends turns on things skills make free: discovering the schema, learning the data’s shape, fighting the environment. On Windows, multiline scripts piped through the shell are their own problem — both agents burned double-digit error counts on quoting until one of them, unprompted, wrote itself a small query-helper script. I now ship that helper in the project setup. The naive agent also batches fewer tool calls per turn. Under a fixed budget, knowledge doesn’t just make agents faster — past a certain task size, it decides whether they finish at all.
Two fixes, both one-liners. Exploration runs get 2–3× the turn budget of playbook runs. And the prompt orders the agent to append each finding to a file the moment it is confirmed — never batch the write-up for the end. When the interrupted run (231251) died, its checkpoint file survived in the working directory. Recovered, it contained a unit-calibration section, the baseline check, and — irony fully intended — the idle-burn computation that this same configuration had missed in every previous directed run. The work was not lost. It was just never presented, because presentation was scheduled for a turn that never arrived.
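Both fixes are small enough to sketch. The helper below is illustrative only: the function names, the markdown layout of the checkpoint file, and the default factor are assumptions, and in the lab the append rule is an instruction in the prompt rather than a library call.

```python
import math
import os
from datetime import datetime, timezone


def exploration_turns(playbook_turns, factor=2.0):
    """Scale a playbook turn budget for an exploration run (the article suggests 2-3x)."""
    return math.ceil(playbook_turns * factor)


def append_finding(checkpoint_path, title, body):
    """Append one confirmed finding to the checkpoint file and force it to disk.

    Opening in append mode means earlier findings are never rewritten, so a
    run killed mid-investigation leaves everything confirmed so far intact.
    """
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    entry = f"\n## {title}\nconfirmed {stamp}\n\n{body.strip()}\n"
    with open(checkpoint_path, "a", encoding="utf-8") as fh:
        fh.write(entry)
        fh.flush()
        os.fsync(fh.fileno())
```

The design point is that the write happens per finding, not per run: the report becomes a by-product of the investigation instead of a final step that a turn limit can cut off.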
All three happened. All three will happen to you.
These are qualitative failures. No new numbers: the contaminations are structural and the fixes are one-liners each. I am listing them because every one of them happened to me before I noticed, which means they are the failures most likely to happen to you quietly.
The control group could read the answer key. The first
“naive” agent ran in the project directory, where the skill
playbooks live as ordinary markdown files and the agent has filesystem
tools. It does not need the skills loaded into its system prompt
to cat them; it only needs to be curious enough to look.
The fix is simple: naive runs in an isolated temp directory containing
only the database. If the skill files are not in the directory tree, the
agent cannot reach them regardless of curiosity.
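A minimal sketch of that isolation step, assuming the database is a single local file that can be copied; the function names and the `naive-run-` prefix are hypothetical. It only controls what sits in the working directory tree, which is the property the fix above relies on.

```python
import shutil
import tempfile
from pathlib import Path


def make_isolated_workdir(database_file):
    """Create a fresh temp directory containing only a copy of the database."""
    src = Path(database_file)
    workdir = Path(tempfile.mkdtemp(prefix="naive-run-"))
    shutil.copy2(src, workdir / src.name)
    return workdir


def unexpected_files(workdir, allowed_names):
    """List files in the tree that are not on the allow-list.

    Run this before and after the agent: anything returned is either a
    leak into the control environment or something the agent wrote.
    """
    root = Path(workdir)
    return sorted(
        str(p.relative_to(root))
        for p in root.rglob("*")
        if p.is_file() and p.name not in allowed_names
    )
```

Checking the directory with `unexpected_files` before launch turns "the control group was isolated" from an assumption into something verified per run.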
A user-level plugin leaked through the sandbox. Even after isolation, one naive run loaded a generic data-engineering skill from my user-level plugin cache. Project isolation does not block user-scope plugins; they inject into every agent on the machine. Two saving graces: the leaked skill contained zero Snowflake billing knowledge, and that run got the causal model wrong anyway — which if anything strengthened the result. But “my control group was accidentally stronger than designed and still failed” is a sentence you only get to write if you noticed. Fix: disable global plugins for control runs, and know they exist at all before assuming project isolation is complete.
Earlier traces sat readable in the working directory.
When a cheap skilled run matched the expensive one suspiciously well, I
checked whether it had simply read the previous run’s output.
Trace forensics said no — every tool call accounted for, none
touching any trace file. It had earned its answer. But nothing had
prevented cheating except the agent not happening to look.
“My experiment was valid by luck” is not methodology. Fix:
traces write outside the agents’ working directory. The durable
version of this fix is a PreToolUse hook that denies any
tool call whose path touches the trace directory — prompts
request behavior; hooks prevent it.
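The core of such a hook is a path check. The sketch below shows only that predicate; how it is wired into a PreToolUse hook depends on the hook payload, which the article does not show, so that part is deliberately left out. The function name and parameters are hypothetical.

```python
from pathlib import Path


def touches_trace_dir(candidate, trace_dir, cwd="."):
    """Return True if candidate resolves to trace_dir or anything beneath it.

    Resolving both sides first defeats relative-path tricks such as
    "../traces/run.jsonl". Case-insensitive filesystems may need extra
    normalization; that is not handled here.
    """
    trace_root = Path(trace_dir).resolve()
    target = Path(candidate)
    if not target.is_absolute():
        target = Path(cwd) / target
    target = target.resolve()
    return target == trace_root or trace_root in target.parents
```

A hook built on this would deny the call whenever the predicate is true for any path argument, which makes the guarantee independent of what the agent happens to be curious about.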
| Contamination | What happened | Fix |
|---|---|---|
| Skills readable from cwd | Naive agent ran in project dir; could cat skill files from disk | Isolated temp dir, database only |
| User-level plugin leaked | Project isolation did not block user-scope plugin cache injection | Disable global plugins for control runs |
| Traces readable in cwd | Previous runs’ full answers sat in the working directory | Write traces outside cwd; PreToolUse hook for durable prevention |
Every dispute settled the same way.
What turned each of these failures from a lurking doubt into a settled fact was the same artifact: the trace. One JSON line per event — the exact prompt, every SQL statement, every result, every error, the final cost. Written to disk immediately so an interrupted run loses nothing except the turns it did not reach.
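A writer with those properties is short. This is a sketch under assumptions, not the lab's tracer: the class name, the `type` and `ts` keys, and the keyword-argument interface are all hypothetical. What it illustrates is the "one line per event, on disk immediately" discipline.

```python
import json
import os
import time


class TraceWriter:
    """Append-only JSONL trace: one event per line, synced after every write."""

    def __init__(self, path):
        self._fh = open(path, "a", encoding="utf-8")

    def event(self, event_type, **fields):
        """Write one event and push it to disk before returning."""
        record = {"type": event_type, "ts": time.time(), **fields}
        self._fh.write(json.dumps(record, ensure_ascii=False) + "\n")
        self._fh.flush()
        os.fsync(self._fh.fileno())

    def close(self):
        self._fh.close()
```

Because nothing is buffered until the end of the run, a process killed by a turn limit leaves a file that is complete up to its last event.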
The traces’ finest hour was a dispute with me on the losing side.
I edited the agent’s default prompt to the new audit prompt, ran
it, and got back a shallow report. I was certain the new prompt had run.
The trace’s run_start event — which records the
prompt verbatim — showed the old one-liner. A scan of all events
found zero fragments of the new text anywhere in the file. The edit had
landed in the wrong copy of the file. Thirty seconds of grep settled
what could have been an hour of arguing with a confusing result, or
worse, a wrong conclusion drawn from a run that was never asked to do
the thing I thought I had asked.
The trace is ground truth for what the agent was actually asked, what it actually did, and what it actually cost. In an experiment, all three will be disputed — including by you. Agent results without traces are anecdotes.
The same artifact settled the contamination forensics. The question “did the cheap skilled run read the previous trace?” was not answerable from the report alone — the report looked correct. The trace showed every tool call accounted for, none touching the trace directory. That answer took under a minute to get. Without the trace, the ambiguity would have been permanent.
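Both checks described above, the prompt dispute and the contamination question, reduce to the same search over the trace. A minimal sketch, assuming JSONL events with a `type` key; `run_start` is the event named in the article, while `tool_call` and the function names are assumptions. It parses each line instead of grepping raw text so that JSON escaping of quotes and newlines cannot hide a match.

```python
import json


def iter_strings(value):
    """Yield every string value nested anywhere inside a decoded JSON value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from iter_strings(item)


def find_events(trace_path, needle, event_type=None):
    """Return (line_number, event) pairs whose string values contain needle."""
    hits = []
    with open(trace_path, encoding="utf-8") as fh:
        for line_number, raw in enumerate(fh, start=1):
            raw = raw.strip()
            if not raw:
                continue
            try:
                event = json.loads(raw)
            except json.JSONDecodeError:
                continue  # tolerate a truncated last line
            if not isinstance(event, dict):
                continue
            if event_type is not None and event.get("type") != event_type:
                continue
            if any(needle in text for text in iter_strings(event)):
                hits.append((line_number, event))
    return hits
```

An empty result for a fragment of the new prompt across all events means the new prompt never ran; an empty result for the trace directory name among tool-call events means the run did not read a previous answer.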
This is the argument for tracing everything, and it is not primarily about debugging. Debugging is a side benefit. The real argument is epistemic: an agentic investigation is not a deterministic computation. You cannot reconstruct what happened from the output alone. The trace is the only record of what was asked and what was found, and in a controlled experiment it is the only way to tell “the agent earned this answer” from “the agent read it from a previous run.”
The rules the lab does not relax.
The failures in this part are methodological. Budget death is a configuration gap. Contamination is a setup gap. Variance is a misunderstanding of what unconstrained exploration produces. Each has a fix that is a one-liner in configuration or a one-sentence addition to the prompt. None of them requires a different model or a different framework.
What they do require is the discipline that runs through this entire series. Stated plainly:
Part 5 is the payoff: the engineered audit prompt whose rules each map to a failure catalogued in this series — and what happened when it ran.
- The engineered audit prompt. Each rule traces to a specific failure in this series. Skilled-Opus under it: 125 tool calls, 14.2 min, $1.77, completed (trace 234349).
- The three-tier operating model. Deterministic baselines (free) → skilled + cheap model (~$0.19/question) → naive frontier exploration. The discovery ratchet: Tier 3 finds, human accepts, becomes Tier 1 baseline.
- The economics. $0.19 questions, $1.77 audits, and the idle-burn run-rate found by the same agent that failed here.
- Crosshire Journal · The $15 lab: what four AI agents found in one Snowflake account — the series anchor: 2×2 design, the four configurations, the idle-burn finding that started it all
- Crosshire Journal · Your Snowflake login count is probably lying to you — the zero-MFA posture the same account surfaced, audited separately
- Crosshire Journal · Adaptive vs classic warehouses — how the two billing models differ; relevant to the STRESS warehouse setup in this account
- Claude Agent SDK · overview: the agentic loop the agents are built on; error_max_turns is the turn-budget mechanism that produced trace 231251
- Snowflake docs · ACCOUNT_USAGE reference: the source views WAREHOUSE_METERING_HISTORY, QUERY_HISTORY, SERVICES, QUERY_ATTRIBUTION_HISTORY
Numbers in this note come from a seeded Crosshire trial/demo Snowflake account — not a customer — snapshot 16–20 May 2026, ~4 days of metered data. Trace figures (tool call counts, durations, costs) are read directly from the nine on-disk JSONL files; the $8.43 total is the sum across those nine runs. Credit-to-dollar figures use the $3.00/credit list rate as a disclosed estimate. — Crosshire
1. A naive-Opus re-run of the open-ended sweep lost its own best finding from the previous run: it never looked for the queuing crisis and added a fresh factual error instead. Exploration is a sample, not a census.
2. The full audit run hit error_max_turns after 67 tool calls (trace 231251): no result event, no report. A separate completed run (trace 235655) ran 88 calls, 25.4 min, $3.97. Knowledge decides whether the agent finishes.
3. Three contaminations: skill files readable from cwd; a user plugin leaked through project isolation; previous traces in the working dir. Every dispute settled by grepping the trace.
Budget death and lost findings.
The interrupted naive-Opus run (trace 231251) made 67 tool calls and hit the SDK’s turn limit — the work never presented, because the write-up turn never arrived. A separate re-run of the open-ended sweep lost the queuing finding entirely: same agent, same data, fewer calls, a fresh error, the best finding gone.
Knowledge decides if the agent finishes.
Naive agents spend turns on things skills make free: schema discovery, environment fights, learning the data’s shape. Under a fixed turn budget, knowledge is not just speed — past a certain task size it decides whether the agent finishes at all. Fixes: 2–3× turn budget for exploration runs; prompt rule to append each finding immediately, never batch the write-up for the end.
Traces are the referee.
Three contaminations invalidated the control group: naive ran in the project
dir (could read skill files); a user-level plugin leaked through project
isolation; previous runs’ full traces sat in the working directory.
Fixes: isolated temp dir, global plugins off, traces written outside cwd
(a PreToolUse hook is the durable version). Every dispute
— including one where the author was wrong about which prompt had run
— resolved in under a minute by grepping the run_start
event. Agent results without traces are anecdotes.
- The contamination table. All three failures: what happened, why silent, the fix.
- The variance argument. Two process rules from losing a finding on re-run.
- The trace as epistemic tool. One dispute, thirty seconds of grep, settled.