Crosshire
Budget death & contaminations — the short version Full trace forensics, three contaminations, the referee
Tech·Agentic analytics·Part 4 of 5·10 min read·20 Jun 2026

67 tool calls. No report.

The naive-Opus agent ran 67 tool calls — trace 231251 — and never composed its report, killed by the SDK’s turn limit before the write-up turn arrived. A re-run of the open-ended sweep lost its own best finding. Three contaminations had been quietly invalidating the control group since the first run. Every dispute settled the same way: grep the trace.

67 → 0
67 tool calls · trace 231251 · no result event
budget death via error_max_turns — never composed its report
completed sibling: trace 235655 · 88 calls · 25.4 min · $3.97
Two naive-Opus audit runs · same database · same question
Run (trace ID) Tool calls Duration Cost Outcome
231251 — interrupted 67 No result event. Budget death.
235655 — completed 88 25.4 min $3.97 Full report composed
In this report
  1. Exploration is a sample, not a census
  2. Budget death — 67 calls, no report
  3. Three contaminations
  4. The trace as referee
  5. Two discipline points that don’t flex
Series · Skills are the floor, not the ceiling
  1. The $15 lab — the anchor finding and the 2×2 design
  2. Skills vs model — the directed question and three lawssoon
  3. Union beats everyone — the open-ended sweepsoon
  4. Variance and budget death — traces as referee
  5. The operating model — prompt, tiers, economicssoon
Provenance
67calls, no report
88calls, completed
$3.97completed run
3contaminations
$8.439 runs on disk
Account
A seeded Crosshire trial/demo Snowflake account, not a customer
Window
16–20 May 2026, ~4 days of metered data (47,420 queries, 39 warehouses)
Key artifacts
Trace 231251 (interrupted, 67 calls, no result event) · trace 235655 (completed, 88 calls, 25.4 min, $3.97)
Lab total
9 runs on disk, $8.43 total; “under $15” includes unlisted cheap directed runs not kept
Cost basis
$3.00/credit Snowflake list rate, disclosed as an estimate
Seeded trial account Numbers from traces Method-only public release
01Exploration is a sample

A re-run lost its own best finding.

After the open-ended sweep documented in Part 3, I re-ran the same naive-Opus configuration against the same database. The re-run came back faster. It made fewer tool calls. It produced a tidy report.

It had also lost the queuing crisis — the finding from the previous run that falsified my “clean” claim about the account. Not contradicted: simply never looked at. The agent looked elsewhere, found other things, and wrote those up instead. The re-run also introduced a fresh factual error: inverted thresholds on a resource monitor, notify and suspend swapped.

Same agent. Same data. Hours apart. Materially different coverage and a new mistake. The second run’s report read fine. The lesson is structural, not incidental: an exploration run is a sample, not a census. Unconstrained search is exactly that, and what it finds varies by run. Two rules fell out of this and became permanent process:

Two rules from the variance failure
A discovery counts once it recurs across runs or is verified deterministically. One exploration hit is evidence worth pursuing; it is not a confirmed finding.
Coverage you care about cannot be left to the agent’s mood. It must be either mandated in the prompt or crystallized into deterministic SQL that runs every time — so a finding, once accepted, can never be un-found on the next run.

Neither rule is special to Snowflake or to this particular agent. They apply to any unconstrained search over a large space with a variable budget. The trace made the failure visible; a re-run that agreed with the first would have hidden it permanently.

Note: the second sweep’s trace may not be among the nine on-disk files. The account above is the author’s recounted observation. Trace 231251 is the surviving artifact from the interrupted audit run described in the next section — a separate event. The argument stands on the recollection; the interrupted run is the traceable artifact.

02Budget death

67 calls. No result event. Vaporized.

For the full audit run I wrote a demanding prompt — the engineered version whose rules are the subject of Part 5. Skilled-Opus completed it: 125 tool calls, 14.2 minutes, $1.77, full report composed (trace 234349). Naive-Opus, given the same prompt and the same turn budget, hit the SDK’s turn limit — error_max_turns — before reaching the write-up turn.

That interrupted run is trace 231251: 67 tool calls, no result event in the JSONL. The console showed nothing. The output directory showed nothing. The interrupted run — whatever it had done — never presented it, because presentation was scheduled for a turn that never arrived. A separate completed naive-Opus audit (trace 235655) ran 88 calls, 25.4 minutes, $3.97 and did produce a report; that is a different run on a different day, not a recovery of 231251. Keep these two straight. The budget-death story belongs to 231251.

Two naive-Opus audit runs · tool calls · outcome interrupted vs completed
NAIVE-OPUS AUDIT RUNS · TOOL CALLS · OUTCOME · SAME DATABASE 67 TOOL CALLS NO RESULT 231251 · interrupted 88 TOOL CALLS COMPLETED $3.97 · 25.4 min 235655 · completed
Darker bar = interrupted run (67 calls, no result event, trace 231251). Lighter bar = completed run (88 calls, 25.4 min, $3.97, trace 235655). These are two separate runs, not a before-and-after of the same session.

The mechanism is mundane, which is why it matters. The naive agent spends turns on things skills make free: discovering the schema, learning the data’s shape, fighting the environment. On Windows, multiline scripts piped through the shell are their own problem — both agents burned double-digit error counts on quoting until one of them, unprompted, wrote itself a small query-helper script. I now ship that helper in the project setup. The naive agent also batches fewer tool calls per turn. Under a fixed budget, knowledge doesn’t just make agents faster — past a certain task size, it decides whether they finish at all.

Two fixes, both one-liners. Exploration runs get 2–3× the turn budget of playbook runs. And the prompt orders the agent to append each finding to a file the moment it is confirmed — never batch the write-up for the end. When the interrupted run (231251) died, its checkpoint file survived in the working directory. Recovered, it contained a unit-calibration section, the baseline check, and — irony fully intended — the idle-burn computation that this same configuration had missed in every previous directed run. The work was not lost. It was just never presented, because presentation was scheduled for a turn that never arrived.

03Three contaminations

All three happened. All three will happen to you.

These are qualitative failures. No new numbers: the contaminations are structural and the fixes are one-liners each. I am listing them because every one of them happened to me before I noticed, which means they are the failures most likely to happen to you quietly.

The control group could read the answer key. The first “naive” agent ran in the project directory, where the skill playbooks live as ordinary markdown files and the agent has filesystem tools. It does not need the skills loaded into its system prompt to cat them; it only needs to be curious enough to look. The fix is simple: naive runs in an isolated temp directory containing only the database. If the skill files are not in the directory tree, the agent cannot reach them regardless of curiosity.

A user-level plugin leaked through the sandbox. Even after isolation, one naive run loaded a generic data-engineering skill from my user-level plugin cache. Project isolation does not block user-scope plugins; they inject into every agent on the machine. Two saving graces: the leaked skill contained zero Snowflake billing knowledge, and that run got the causal model wrong anyway — which if anything strengthened the result. But “my control group was accidentally stronger than designed and still failed” is a sentence you only get to write if you noticed. Fix: disable global plugins for control runs, and know they exist at all before assuming project isolation is complete.

Earlier traces sat readable in the working directory. When a cheap skilled run matched the expensive one suspiciously well, I checked whether it had simply read the previous run’s output. Trace forensics said no — every tool call accounted for, none touching any trace file. It had earned its answer. But nothing had prevented cheating except the agent not happening to look. “My experiment was valid by luck” is not methodology. Fix: traces write outside the agents’ working directory. The durable version of this fix is a PreToolUse hook that denies any tool call whose path touches the trace directory — prompts request behavior; hooks prevent it.

Three contaminations · what happened · fix
Contamination What happened Fix
Skills readable from cwd Naive agent ran in project dir; could cat skill files from disk Isolated temp dir, database only
User-level plugin leaked Project isolation did not block user-scope plugin cache injection Disable global plugins for control runs
Traces readable in cwd Previous runs’ full answers sat in the working directory Write traces outside cwd; PreToolUse hook for durable prevention
04The trace as referee

Every dispute settled the same way.

What turned each of these failures from a lurking doubt into a settled fact was the same artifact: the trace. One JSON line per event — the exact prompt, every SQL statement, every result, every error, the final cost. Written to disk immediately so an interrupted run loses nothing except the turns it did not reach.

The traces’ finest hour was a dispute with me on the losing side. I edited the agent’s default prompt to the new audit prompt, ran it, and got back a shallow report. I was certain the new prompt had run. The trace’s run_start event — which records the prompt verbatim — showed the old one-liner. A scan of all events found zero fragments of the new text anywhere in the file. The edit had landed in the wrong copy of the file. Thirty seconds of grep settled what could have been an hour of arguing with a confusing result, or worse, a wrong conclusion drawn from a run that was never asked to do the thing I thought I had asked.

The trace is ground truth for what the agent was actually asked, what it actually did, and what it actually cost. In an experiment, all three will be disputed — including by you. Agent results without traces are anecdotes.

The same artifact settled the contamination forensics. The question “did the cheap skilled run read the previous trace?” was not answerable from the report alone — the report looked correct. The trace showed every tool call accounted for, none touching the trace directory. That answer took under a minute to get. Without the trace, the ambiguity would have been permanent.

This is the argument for tracing everything, and it is not primarily about debugging. Debugging is a side benefit. The real argument is epistemic: an agentic investigation is not a deterministic computation. You cannot reconstruct what happened from the output alone. The trace is the only record of what was asked and what was found, and in a controlled experiment it is the only way to tell “the agent earned this answer” from “the agent read it from a previous run.”

05Two discipline points

The rules the lab does not relax.

The failures in this part are methodological. Budget death is a configuration gap. Contamination is a setup gap. Variance is a misunderstanding of what unconstrained exploration produces. Each has a fix that is a one-liner in configuration or a one-sentence addition to the prompt. None of them requires a different model or a different framework.

What they do require is the discipline that runs through this entire series. Stated plainly:

Two rules this lab does not relax
The model never produces a number. Every call count, every duration, every cost in this note comes from the trace JSONL — not from the agent’s prose summary of its own run. The model reads numbers and writes prose around them; it does not measure itself.
A human ratifies everything. No finding in this series ships because an agent asserted it. Each one is checked against the artifact that produced it before it appears here. The traces are the artifacts. The judgment of what is true stays with a person.

Part 5 is the payoff: the engineered audit prompt whose rules each map to a failure catalogued in this series — and what happened when it ran.

Next in the series · Part 5
  • The engineered audit prompt. Each rule traces to a specific failure in this series. Skilled-Opus under it: 125 tool calls, 14.2 min, $1.77, completed (trace 234349).
  • The three-tier operating model. Deterministic baselines (free) → skilled + cheap model (~$0.19/question) → naive frontier exploration. The discovery ratchet: Tier 3 finds, human accepts, becomes Tier 1 baseline.
  • The economics. $0.19 questions, $1.77 audits, and the idle-burn run-rate found by the same agent that failed here.
From our audit
Budget death, contaminated controls, and findings that evaporate on re-run are not edge cases — they are what running agentic experiments looks like before someone puts process around them. The Crosshire audit method is that process: one export, every claim sourced to a trace event, a human signing off each number before it lands. If you want to run this kind of investigation on your own Snowflake account, the same rig applies.
Start a conversation →
Sources & further reading
· · ·

Numbers in this note come from a seeded Crosshire trial/demo Snowflake account — not a customer — snapshot 16–20 May 2026, ~4 days of metered data. Trace figures (tool call counts, durations, costs) are read directly from the nine on-disk JSONL files; the $8.43 total is the sum across those nine runs. Credit-to-dollar figures use the $3.00/credit list rate as a disclosed estimate. — Crosshire

D
writes Crosshire Journal · crosshire.ch · June 2026
Crosshire Journal
Field reports on data, compute, and the unglamorous decisions that shape engineering teams. Made in EU. Cited evidence, GDPR-native.
Tech·Agentic analytics·Part 4 of 5·2 min read·20 Jun 2026

67 tool calls. No report.

The naive-Opus agent made 67 tool calls and produced no report — killed by error_max_turns before the write-up turn arrived. A re-run lost its own best finding. Three contaminations silently invalidated the control group. Every dispute settled with one grep against the trace.

67 → 0
tool calls, no result event · trace 231251
budget death via error_max_turns
completed sibling: 88 calls · 25.4 min · $3.97 (trace 235655)
TWO NAIVE-OPUS AUDIT RUNS · TOOL CALLS · OUTCOME 67 NO RESULT 231251 interrupted 88 COMPLETED $3.97 · 25.4 min 235655 completed
Darker bar = interrupted (67 calls, no result, trace 231251). Lighter = completed sibling (88 calls, $3.97, trace 235655). Two separate runs, same database.
What happened · three failures, one referee

1A naive-Opus re-run of the open-ended sweep lost its own best finding from the previous run — never looked for the queuing crisis, added a fresh factual error instead. Exploration is a sample, not a census.

2The full audit run hit error_max_turns after 67 tool calls (trace 231251): no result event, no report. A separate completed run (trace 235655) ran 88 calls, 25.4 min, $3.97. Knowledge decides whether the agent finishes.

3Three contaminations: skill files readable from cwd; a user plugin leaked through project isolation; previous traces in the working dir. Every dispute settled by grepping the trace.

01The problem

Budget death and lost findings.

The interrupted naive-Opus run (trace 231251) made 67 tool calls and hit the SDK’s turn limit — the work never presented, because the write-up turn never arrived. A separate re-run of the open-ended sweep lost the queuing finding entirely: same agent, same data, fewer calls, a fresh error, the best finding gone.

02Why it matters

Knowledge decides if the agent finishes.

Naive agents spend turns on things skills make free: schema discovery, environment fights, learning the data’s shape. Under a fixed turn budget, knowledge is not just speed — past a certain task size it decides whether the agent finishes at all. Fixes: 2–3× turn budget for exploration runs; prompt rule to append each finding immediately, never batch the write-up for the end.

03The fix

Traces are the referee.

Three contaminations invalidated the control group: naive ran in the project dir (could read skill files); a user-level plugin leaked through project isolation; previous runs’ full traces sat in the working directory. Fixes: isolated temp dir, global plugins off, traces written outside cwd (a PreToolUse hook is the durable version). Every dispute — including one where the author was wrong about which prompt had run — resolved in under a minute by grepping the run_start event. Agent results without traces are anecdotes.

Want the full forensics?
The long version adds three things this short can’t.
  • The contamination table. All three failures: what happened, why silent, the fix.
  • The variance argument. Two process rules from losing a finding on re-run.
  • The trace as epistemic tool. One dispute, thirty seconds of grep, settled.
D
writes Crosshire Journal · crosshire.ch · June 2026
Two-minute field fixes from the same audits as our long-form Journal. One number, one fix, one result you can verify.
Crosshire Quick
© 2026 Crosshire Journal · Made in EU Privacy Terms Cookies License Imprint Coffee