Crosshire
The prompt payoff and the run-rate arithmetic
Full operating model: prompt, tiers, economics, ratchet
Tech·Agentic analytics·Part 5 of 5·10 min read·20 Jun 2026

Prompts as engineering, tiers as economics.

The resource monitor showed 0.82 credits used. The account had actually spent ~82. The discrepancy was not a bug — the monitor was created three days into the window and only counts from birth. One timestamp the agent did not check. One rule added to the prompt. That is the operating model in miniature: every failure becomes a rule, every rule compresses the next cycle. The engineered prompt that followed ran skilled-Opus through 125 tool calls, 14 minutes, $1.77 — completed, cache-units bug absent, every formerly-missed dimension covered.

$980/mo
idle-burn run-rate from one warehouse left awake
43.39 idle credits × ~7.5 windows/month × $3.00/credit
$976.28 — at list rate, disclosed as an estimate
Economics · verified traces · 16–20 May 2026
Item | Cost | Trace
Playbook question, skilled-Sonnet | $0.19 | 213316
Full audit, skilled-Opus, 125 calls, 14 min | $1.77 | 234349
Exploration sweep, naive-Opus, open-ended | $0.76 | 234345
Exploration sweep, naive-Opus, full audit prompt | $3.97 | 235655
9 on-disk traces, entire experiment | $8.43 | all
Idle-burn waste found, run-rate | ~$980/mo | see §B
In this report
  1. The engineered prompt — rules and the failures behind them
  2. The TRIAL_GUARD anecdote — adjudication as standing discipline
  3. Three tiers — deterministic, skilled, naive
  4. The ratchet — how the floor rises
  5. The economics — run-rate arithmetic, full table
Series · Skills are the floor, not the ceiling
  1. The $15 lab — the anchor finding and the 2×2 design
  2. Skills vs model — the directed question and three laws
  3. Union beats everyone — the open-ended sweep
  4. Variance and budget death — traces as referee
  5. The operating model — prompt, tiers, economics
Provenance
125 tool calls
14.2 min, $1.77
$980 idle run-rate
$8.43 whole lab
9 traces on disk
Account
A seeded Crosshire trial Snowflake account, not a customer
Window
16–20 May 2026, ~4 days of metered data; monitor TRIAL_GUARD created 20 May 13:55
Traces
9 JSONL files from the Claude Agent SDK; costs from trace metadata, not estimated
Cost basis
$3.00/credit Snowflake list rate, disclosed as an estimate; run-rate uses ~7.5 four-day windows/month
Seeded trial account · Every number from SQL or trace · Method-only public release
01 The engineered prompt

Each rule in the prompt traces to a failure.

The four earlier parts of this series produced a catalog of things that went wrong: a cache column whose units the agent misread, a security dimension one configuration never examined, budget death that wiped a run’s own findings, a monitor reporting a number that was technically correct and completely misleading. Every one of those became a line in the audit prompt used for the final runs.

Six mandated dimensions. Cost, including total-versus-attributed credits for every warehouse. Performance, including queuing, disk spill, compile ratios, cache hit rates, and partition pruning. Storage. Security and non-default account parameters. Config-versus-behavior contradictions. Cross-cut comparisons across warehouses and time windows. The dimensions name where to look; they do not name what to find. An agent that only confirms known issues is not running a discovery test — it is running a checklist. These dimensions keep it a discovery test.

Unit verification, stated before use. The cache hit rate column is a fraction from 0 to 1, not a percentage from 0 to 100. An earlier run read it as a percentage and reported a figure that looked plausible and was wrong by a factor of 100. The prompt now reads: “Before using ANY numeric column, verify its units and scale by sampling min/max/avg. State the unit you concluded. Never assume milliseconds vs seconds, or fraction vs percent.” In the final skilled-Opus run, the trace shows it sampling the cache column’s range before computing anything from it. The bug did not recur.
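A minimal Python sketch of that sampling step, under stated assumptions. The function names `infer_ratio_unit` and `as_percent` are illustrative and do not come from the audit prompt, and the range heuristic is an assumption: a percent column that happens to hold only values at or below 1 would be misread as a fraction, which is why the rule also asks for the concluded unit to be stated for review.

```python
from statistics import mean


def infer_ratio_unit(samples):
    """Sample min/max/avg of a ratio-like column and guess its unit.

    Heuristic (an assumption, not a guarantee): values confined to [0, 1]
    are treated as a fraction, values reaching above 1 but within [0, 100]
    as a percentage, anything else as unknown.
    """
    values = [v for v in samples if v is not None]
    if not values:
        raise ValueError("no non-null samples; unit cannot be verified")
    lo, hi, avg = min(values), max(values), mean(values)
    if lo < 0 or hi > 100:
        unit = "unknown"
    elif hi <= 1:
        unit = "fraction"
    else:
        unit = "percent"
    return {"min": lo, "max": hi, "avg": avg, "unit": unit}


def as_percent(value, unit):
    """Convert to a percentage only once the unit has been stated."""
    if unit == "fraction":
        return value * 100.0
    if unit == "percent":
        return value
    raise ValueError(f"refusing to convert a value with unit {unit!r}")


# Hypothetical samples standing in for a cache hit rate column.
profile = infer_ratio_unit([0.0, 0.42, None, 0.97])
print(profile["unit"], round(as_percent(0.42, profile["unit"]), 1))
```

The conversion refuses to run on an "unknown" unit, so a column with an unexpected range stops the calculation instead of producing a plausible wrong figure.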

Independent verification of every headline number. One query is an assertion. Two independent queries reaching the same number is evidence. The prompt requires a second, independent check on any finding before it is written up.

Baselines read first. If tables named baseline_* exist, read them before running a single discovery query. Treat their contents as known; report only deviations. This is the ratchet’s read-side: findings that survived previous adjudication are not rediscovered at full exploration cost every cycle. Known findings get checked, not hunted.

Append findings immediately. Each finding goes to disk the moment it is confirmed, not held until the end of the run. The failure that made this rule necessary is documented in Part 4: a run that exceeded its turn budget mid-investigation, never reached the write step, and took its own findings with it when it stopped. Immediate append means an interrupted run loses at most the current in-flight hypothesis.
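One way to implement immediate append is a JSON Lines file that is flushed and synced after every finding. This is a sketch, not the lab's actual writer: the file name `findings.jsonl`, the helper names, and the finding fields are all hypothetical.

```python
import json
import os
import tempfile


def append_finding(path, finding):
    """Append one confirmed finding as a JSON line and force it to disk."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(finding, sort_keys=True) + "\n")
        fh.flush()
        os.fsync(fh.fileno())  # survive an abrupt stop right after this call


def load_findings(path):
    """Read back whatever a possibly interrupted run managed to write."""
    if not os.path.exists(path):
        return []
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Demo: two findings are written, then the "run" stops before any final
# write step. Both are still recoverable from disk.
with tempfile.TemporaryDirectory() as tmp:
    findings_path = os.path.join(tmp, "findings.jsonl")
    append_finding(findings_path, {"id": 1, "title": "idle burn"})
    append_finding(findings_path, {"id": 2, "title": "late monitor"})
    recovered = load_findings(findings_path)
```

Because each finding is its own line, a run that dies mid-investigation leaves a file that is still valid up to the last completed write.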

Label every finding. “Confirmed in data” or “hypothesis — needs human context.” Describe behavior, never attribute intent. A 12-finding cap, ranked by materiality, plus a list of every check that came back clean. The clean-check list is what makes “nothing found” an inspectable claim. Without it, the absence of findings is unfalsifiable.
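The labeling, the cap, and the clean-check list can be enforced in code around the agent's output. The sketch below assumes a hypothetical `materiality` score (for example, estimated dollars per month) as the ranking key; the class and field names are illustrative.

```python
from dataclasses import dataclass, field

CONFIRMED = "confirmed in data"
HYPOTHESIS = "hypothesis: needs human context"
MAX_FINDINGS = 12  # the cap stated in the prompt


@dataclass
class Finding:
    title: str
    label: str
    materiality: float  # hypothetical ranking score, e.g. estimated $/month

    def __post_init__(self):
        if self.label not in (CONFIRMED, HYPOTHESIS):
            raise ValueError(f"unlabeled or mislabeled finding: {self.title!r}")


@dataclass
class Report:
    findings: list = field(default_factory=list)
    clean_checks: list = field(default_factory=list)  # checks that found nothing

    def top_findings(self):
        """Return at most MAX_FINDINGS findings, most material first."""
        ranked = sorted(self.findings, key=lambda f: f.materiality, reverse=True)
        return ranked[:MAX_FINDINGS]


report = Report(clean_checks=["disk spill", "login failures"])
for i in range(15):
    report.findings.append(Finding(f"finding {i}", CONFIRMED, materiality=float(i)))
top = report.top_findings()
```

Keeping `clean_checks` on the same report object means a reviewer sees what was examined and came back empty alongside what was found.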

The skilled-Opus run under this prompt: 125 tool calls, 14.2 minutes, $1.77, completed. The trace shows it executing cross-cut comparisons it had previously skipped. It covered every formerly-missed dimension.

02 Adjudication as discipline

The monitor said 0.82. The account spent ~82.

TRIAL_GUARD, the account-level resource monitor, showed 0.82 credits used when the snapshot was taken. The account had actually spent roughly 82 credits over the window on warehouse metering alone. The agent flagged this correctly as a contradiction: config and observed behavior were disconnected.

The label on that finding was “hypothesis — needs human context.” One query resolved it: TRIAL_GUARD was created on 20 May 13:55, near the end of the data window. Resource monitors only count credits from their creation date forward. The monitor had been awake for less than a day while the account had been spending for four. The 0.82 figure was technically accurate and completely useless as a governance control.

The monitor was not broken. It was doing what monitors do — counting from birth. The governance failure was creating it three days after the spending started.

One adjudicated hypothesis. One timestamp checked. One rule added to the prompt: “When config and behavior contradict each other, check the object’s creation date before concluding malfunction.” This is what the operating cycle looks like at the level of a single finding: a contradiction the agent flagged correctly, but could not resolve on its own, became a permanent guardrail. One cycle. One rule. The floor went up.
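The creation-date check reduces to comparing one timestamp against the data window. In the sketch below the monitor's creation time is the one reported above; the exact window boundaries are placeholders, since the report gives only "16–20 May, about four days", and the function names are illustrative.

```python
from datetime import datetime


def coverage_fraction(created_at, window_start, window_end):
    """Fraction of the data window during which the object existed."""
    total = (window_end - window_start).total_seconds()
    if total <= 0:
        raise ValueError("window_end must be after window_start")
    alive_from = min(max(created_at, window_start), window_end)
    return (window_end - alive_from).total_seconds() / total


def adjudicate(created_at, window_start, window_end):
    """Check creation date before concluding that a control malfunctioned."""
    frac = coverage_fraction(created_at, window_start, window_end)
    if frac < 1.0:
        return f"created mid-window: covers {frac:.0%} of it, totals not comparable"
    return "existed for the whole window: a mismatch needs another explanation"


# Creation time is from the article; the window boundaries are assumed
# placeholders chosen to span exactly four days.
created = datetime(2026, 5, 20, 13, 55)
start = datetime(2026, 5, 16, 18, 0)
end = datetime(2026, 5, 20, 18, 0)
verdict = adjudicate(created, start, end)
```

With these placeholder boundaries the monitor covers only a small slice of the window, so its total cannot be compared against the account's metered spend for the whole period.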

Two rules the series does not relax
A model never produces a number. Every credit figure, every percentage in this series came from a SQL query against the DuckDB export. The model read numbers and wrote around them; it did not compute the figures it reported. When the trace records “$1.77,” that cost came from the SDK’s usage metadata, not from the model’s own accounting.
A human ratifies everything. No finding in this series shipped because an agent asserted it. Each was checked against the query that produced it before it appeared here. The discrepancy ledger — ten places where the original draft differed from the database — is the record of that ratification. The model is a fast, tireless reader; judgment stays with a person.
03 Three tiers

Deterministic, skilled, naive — each has a job.

The lab produced three operating tiers. They are not stages of maturity — you run all three in parallel, because they answer different questions.

Tier 1: deterministic baselines. Free. Every finding that survives adjudication crystallizes into plain SQL with numeric alert thresholds: baseline_idle_burn, baseline_queuing, baseline_spill. No LLM. These run on every data refresh and flag deviations automatically. Known problems get caught continuously because catching them no longer requires intelligence. The “how often should agents run” debate dissolves for known issues: detection is continuous and costs zero regardless.
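A Tier-1 baseline is just a query plus a threshold. The sketch below uses Python's built-in `sqlite3` as a stand-in for the DuckDB export so that it runs with no extra dependencies. The table layout, the warehouse names, the credit values, and the 0.5 threshold are all hypothetical.

```python
import sqlite3

IDLE_FRACTION_THRESHOLD = 0.5  # hypothetical alert threshold

BASELINE_IDLE_BURN = """
SELECT m.warehouse,
       m.credits - COALESCE(a.credits, 0.0)       AS idle_credits,
       1.0 - COALESCE(a.credits, 0.0) / m.credits AS idle_fraction
FROM metered m
LEFT JOIN attributed a ON a.warehouse = m.warehouse
WHERE m.credits > 0
ORDER BY m.warehouse
"""


def idle_burn_alerts(conn, threshold=IDLE_FRACTION_THRESHOLD):
    """Run the baseline query and return warehouses above the threshold."""
    rows = conn.execute(BASELINE_IDLE_BURN).fetchall()
    return [name for name, _idle, frac in rows if frac > threshold]


# Hypothetical data: WH_A is mostly working, WH_B is mostly idle.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE metered (warehouse TEXT, credits REAL);
CREATE TABLE attributed (warehouse TEXT, credits REAL);
INSERT INTO metered VALUES ('WH_A', 10.0), ('WH_B', 20.0);
INSERT INTO attributed VALUES ('WH_A', 9.0), ('WH_B', 2.0);
""")
alerts = idle_burn_alerts(conn)
```

No model is involved at any point: the check can run on every data refresh and only its alerts need to reach a Tier-2 agent or a person.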

Tier 2: skilled agent, cheap model. ~$0.19 per question. Playbook questions on demand, plus Tier-1 alerts that need investigation. The skills carry the causal model; the cheap model executes it. The verified cost for a directed question on this account, skilled-Sonnet, was $0.19 (15 tool calls, 248 seconds, 2 errors). Verify its arithmetic: unit confusion is the cheap model’s known failure mode, and it will not check units unless the playbook tells it to. For known questions with a playbook, it matches the frontier model at a fraction of the cost.

Tier 3: naive exploration, frontier model. ~$0.76–$3.97 per run. Periodic and fully unanchored. The temptation is to feed it the known-findings list so it does not rediscover what you already know. Resist it. Telling the explorer what is known re-installs the ceiling you are paying to escape; it anchors the search to your map instead of the data. Let it run blind. Deduplicate its output against the known-findings ledger afterward, in code or one cheap-model pass. Rediscovery costs tokens. Anchoring costs discoveries. Tokens are the cheaper resource.
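The "in code" option for post-hoc deduplication can be as simple as comparing normalized keys. This sketch assumes each finding carries `dimension`, `object`, and `metric` fields; that schema is hypothetical, and findings that differ in substance but share those three fields would need the cheap-model pass instead.

```python
def fingerprint(finding):
    """Normalize the fields that identify a finding, ignoring wording."""
    return tuple(
        " ".join(str(finding[key]).lower().split())
        for key in ("dimension", "object", "metric")
    )


def novel_findings(new_findings, known_ledger):
    """Return findings from a blind run that the ledger does not already hold."""
    known = {fingerprint(f) for f in known_ledger}
    novel = []
    for f in new_findings:
        key = fingerprint(f)
        if key not in known:
            known.add(key)  # also collapses duplicates within the same run
            novel.append(f)
    return novel


ledger = [{"dimension": "cost", "object": "COMPUTE_WH", "metric": "idle credits"}]
blind_run = [
    {"dimension": "Cost", "object": "compute_wh", "metric": "Idle  credits"},
    {"dimension": "security", "object": "TRIAL_GUARD", "metric": "creation date"},
]
fresh = novel_findings(blind_run, ledger)
```

The explorer never sees `ledger`; the comparison happens only after its run has finished, which keeps the search unanchored.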

Cadence for Tier 3 is discovery-rate-driven, not calendar-driven. While runs keep yielding accepted novel findings, maintain the frequency. As the rate decays, stretch the interval. Reset on change events — new workloads, incidents, releases — because your coverage was of the old system. Tier-3 runs need 2–3× the turn budget of a directed run, plus mandatory checkpoint writes. Part 4 explains what happens when you omit the budget headroom.
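The cadence rule can be written down as a small scheduling function. Every default here (a 7-day base interval, a 56-day cap, doubling when a run yields nothing new) is a hypothetical choice for illustration; only the shape of the rule and the 2–3× budget range come from the text.

```python
def next_interval_days(current_days, accepted_novel, change_event,
                       base_days=7.0, max_days=56.0, stretch=2.0):
    """Pick the gap before the next exploration run.

    All defaults are hypothetical. The shape is what matters: reset on a
    change event, hold while runs still yield accepted novel findings,
    stretch (up to a cap) once they stop.
    """
    if change_event:
        return base_days
    if accepted_novel > 0:
        return current_days
    return min(current_days * stretch, max_days)


def exploration_turn_budget(directed_turns, multiplier=2.5):
    """Turn budget for an exploration run: 2-3x a directed run's budget."""
    if not 2.0 <= multiplier <= 3.0:
        raise ValueError("multiplier outside the 2-3x range given in the text")
    return int(directed_turns * multiplier)
```

For example, `next_interval_days(7.0, 0, False)` stretches a weekly sweep to two weeks after one empty run, while any change event drops it back to the base interval.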

[Figure: Three tiers, cost vs capability. Cost per run from verified traces, May 2026. Tier 1, deterministic baselines: $0 per run, no LLM. Tier 2, skilled + cheap model: $0.19 (skilled-Sonnet, trace 213316). Tier 3, naive exploration, frontier model: $0.76–$3.97 (traces 234345 and 235655).]
Tier-1 baselines cost nothing after initial authoring. Tier-2 directed questions run at $0.19. Tier-3 exploration ranges from $0.76 (open-ended sweep) to $3.97 (full audit prompt, naive-Opus) — both runs completed, verified against trace files on disk.
04 The ratchet

Discovery becomes baseline. The floor rises.

The three tiers are connected by one mechanism. Tier 3 discovers something new. A human accepts it as real. It becomes a Tier-1 baseline with an alert threshold, and, where it has a reusable investigation method, a Tier-2 skill. The floor of competence rises continuously. The unanchored scout — Tier 3 — keeps it from ever becoming a fence.

Skills are the floor of competence, never the ceiling of curiosity.

Supporting structure makes the ratchet durable. Traces on everything: not for debugging, but as the ground truth for disputes — including disputes with yourself. An earlier claim in this series was settled by grepping a trace that recorded the prompt verbatim; the trace showed which version of the prompt had actually run. Hooks for hard guardrails: read-only database access enforced at the tool level, trace-directory location outside the working directory so the naive agent cannot read past outputs. Contradiction adjudication as a standing step: when two agents disagree, or when an agent disagrees with a prior human claim, the disagreement is the product. It means one of you is wrong, and SQL is the referee.

The TRIAL_GUARD error above is the ratchet in its smallest form. One hypothesis, one timestamp, one rule. In a running account with weekly sweeps, that mechanism compounds: the baseline library grows, the skill set deepens, and the Tier-3 explorer keeps surfacing what the library has not yet seen.

05 The economics

$1.77 to find it. ~$980 a month to leave it.

The idle-burn run-rate arithmetic is straightforward, so show it rather than assert it. COMPUTE_WH ran 43.39 idle credits over the four-day window. A month has roughly 7.5 such windows (30 days / 4 days). At the $3.00/credit list rate — a disclosed estimate, not a contract figure — that is:

43.39 × 7.5 × $3.00 = $976.28 per month
COMPUTE_WH idle credits · ledger §B · $3/credit estimate
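The same arithmetic as a few lines of Python, using `Decimal` so the cents round the way a ledger would. The inputs are the figures quoted in this report; the variable names are illustrative, and the $3.00 rate remains a list-price estimate.

```python
from decimal import Decimal, ROUND_HALF_UP

IDLE_CREDITS = Decimal("43.39")   # COMPUTE_WH idle credits in the window
WINDOW_DAYS = Decimal("4")        # length of the data window
DAYS_PER_MONTH = Decimal("30")    # simplification used in the article
LIST_RATE = Decimal("3.00")       # $/credit, disclosed estimate
AUDIT_COST = Decimal("1.77")      # cost of the full skilled-Opus audit

windows_per_month = DAYS_PER_MONTH / WINDOW_DAYS  # 7.5
monthly_run_rate = (IDLE_CREDITS * windows_per_month * LIST_RATE).quantize(
    Decimal("0.01"), rounding=ROUND_HALF_UP
)
# Dollars of monthly waste found per dollar spent on the audit.
ratio = (monthly_run_rate / AUDIT_COST).quantize(Decimal("1"), rounding=ROUND_HALF_UP)

print(f"${monthly_run_rate}/month, about {ratio}:1 against the audit cost")
```

Changing `LIST_RATE` to a contract rate, or `WINDOW_DAYS` to the length of your own export, is the only adjustment needed to rerun the estimate.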

That 43.39 is not an assertion either. It is metered credits minus attributed credits, per warehouse — the same query that opened the series, runnable on your own account in one statement:

Idle vs working, per warehouse · ACCOUNT_USAGE · sql
-- Credits billed to each warehouse over the last 30 days.
WITH metered AS (
  SELECT WAREHOUSE_NAME, SUM(CREDITS_USED) AS credits
  FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
  WHERE START_TIME >= DATEADD('day', -30, CURRENT_TIMESTAMP())
  GROUP BY 1),
-- Credits attributed to queries that actually ran on each warehouse.
attributed AS (
  SELECT WAREHOUSE_NAME, SUM(CREDITS_ATTRIBUTED_COMPUTE) AS credits
  FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_ATTRIBUTION_HISTORY
  WHERE START_TIME >= DATEADD('day', -30, CURRENT_TIMESTAMP())
  GROUP BY 1)
-- Idle = metered minus attributed; NULLIF guards against divide-by-zero.
SELECT m.WAREHOUSE_NAME,
       ROUND(m.credits, 2) AS metered_credits,
       ROUND(m.credits - COALESCE(a.credits, 0), 2) AS idle_credits,
       ROUND(100 * (1 - COALESCE(a.credits, 0) / NULLIF(m.credits, 0)), 1) AS pct_idle
FROM metered m LEFT JOIN attributed a USING (WAREHOUSE_NAME)
ORDER BY metered_credits DESC;

Call it roughly $980 a month. The cost of finding it: $1.77. The ratio is about 550:1. One finding, found in the first full run, repaid the entire experiment many times over. The interesting question is not whether it was worth running the agent. It is what else is sitting in accounts where this arithmetic has never been done.

The nine traces on disk sum to $8.43. That is the entire lab: four agent configurations, two models, directed questions, open-ended sweeps, the full audit prompt. The series reported “under $15” as an informal record; $8.43 is what the trace files show. Both are true, over different denominators: $8.43 covers the runs kept as trace files; under $15 covers the whole lab, including cheap directed queries whose traces were not kept. Neither figure is a model estimate; both come from trace metadata or the author’s direct record.

The binding constraint in this entire exercise was not tokens. It was human attention: reviewing reports, adjudicating contradictions, ratifying numbers. The finding caps, the clean-check lists, the hypothesis labels, the post-hoc deduplication — every structural feature of the prompt exists to spend that attention wisely, not to constrain the agent.

Economics · full detail · verified traces
Run | Config | Calls | Duration | Cost
213316 | Directed question, skilled-Sonnet | 15 | 248 s | $0.19
234345 | Open-ended sweep, naive-Opus | 31 | 575 s | $0.76
234349 | Full audit, skilled-Opus, engineered prompt | 125 | 852 s | $1.77
235655 | Full audit, naive-Opus | 88 | 1,526 s | $3.97
All 9 traces | Entire lab | | | $8.43
Idle-burn waste found | Estimated monthly run-rate | | | ~$980/mo

One paragraph to close the series. Playbooks carry the causal model; models carry the execution; frontier runs grow the playbooks; traces referee all of it. A cheap model with a good playbook matches the frontier model on known questions at a fraction of the cost. A frontier model with no playbook discovers what no playbook contains — including your own mistakes — but needs budget headroom and checkpoints to survive open-ended work. A cheap model with no playbook is the trap: fluent, confident, wrong, and priced to tempt you. No single analyst, human included, beat the cross-examined union of several. Build the ratchet, keep one unanchored scout running, and treat every contradiction as the system handing you your next improvement, with evidence attached.

Reproduce it this week
  • Export your ACCOUNT_USAGE views to CSV — query history, warehouse metering, query attribution, storage, logins, users, resource monitors, account parameters. A few days’ window suffices. Redact query-text literals on the way out.
  • Load into DuckDB with a small Python script. Every agent question is now free — no warehouse spins for the agent’s curiosity.
  • Build two Claude Agent SDK agents: one with the schema and investigation playbooks as skills; one with a single sentence, in an isolated directory, traces written outside the working directory, global plugins off. Ask both for everything, with the engineered prompt. Diff. Adjudicate every contradiction with SQL. Watch which of your own “clean” checks fail.
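For the "redact query-text literals" step in the first bullet, a rough regex pass is often enough for a lab export. The sketch below is an assumption about how one might do it, not the method used in this series: it replaces single-quoted strings and bare numbers with placeholders and deliberately ignores harder cases such as dollar-quoted strings, comments, and quoted identifiers.

```python
import re

# Single-quoted SQL string, allowing '' as an escaped quote inside it.
_STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")
# Standalone integer or decimal; digits inside identifiers (t1, col_2) have
# no word boundary before them and are left alone.
_NUMBER_LITERAL = re.compile(r"\b\d+(?:\.\d+)?\b")


def redact_literals(query_text):
    """Replace string and numeric literals in SQL text with placeholders."""
    redacted = _STRING_LITERAL.sub("'?'", query_text)
    return _NUMBER_LITERAL.sub("?", redacted)


sample = "SELECT * FROM t1 WHERE name = 'O''Brien' AND age > 42"
print(redact_literals(sample))
```

Strings are redacted before numbers so that digits inside a string literal never survive as a separate match. Spot-check a sample of the output before sharing an export, since a regex is not a SQL parser.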
From our audit
This series is the public twin of how a Crosshire audit runs: one export, every finding sourced to the query that produced it, a human ratifying each number before it appears. If there is a warehouse on your account billing for being awake rather than for working, the same method finds it — and the same arithmetic tells you what leaving it running costs per month.

Numbers in this note come from a seeded Crosshire trial Snowflake account — not a customer — snapshot 16–20 May 2026, about four days of metered data. Trace costs are from Claude Agent SDK usage metadata recorded in the nine on-disk JSONL files. Credit-to-dollar figures use the $3.00/credit Snowflake list rate, disclosed as an estimate. The run-rate arithmetic uses ~7.5 four-day windows per month. A model never produced a number in this series; a human ratified every one before it shipped. — Crosshire

D
writes Crosshire Journal · crosshire.ch · June 2026
Crosshire Journal
Field reports on data, compute, and the unglamorous decisions that shape engineering teams. Made in EU. Cited evidence, GDPR-native.
Tech·Agentic analytics·Part 5 of 5·2 min read·20 Jun 2026

The resource monitor said 0.82. The account spent ~82.

TRIAL_GUARD was created three days into the data window and only counts from birth. The engineered prompt that caught this ran skilled-Opus through 125 tool calls for $1.77. The idle waste it confirmed runs at ~$980 a month.

$980/mo
idle-burn run-rate from one warehouse
43.39 idle credits × 7.5 windows × $3.00/credit
found for $1.77 in one agent run
[Figure: Cost to find vs cost to leave, monthly estimate. $1.77 to find it; ~$980/mo to leave it (43.39 idle credits × 7.5 windows × $3/credit, estimate).]
The $1.77 bar is literally one pixel. The $980/mo bar is the cost of inaction at the $3.00/credit list rate, disclosed as an estimate.
Provenance · what happened

1. TRIAL_GUARD showed 0.82 credits used. The account had spent ~82. The agent labeled it a hypothesis. One timestamp resolved it: the monitor was created 20 May 13:55, near the end of the window.

2. That hypothesis became a prompt rule: “check creation date before concluding malfunction.” Each of the six other rules in the engineered prompt traces to a different past failure: a unit misread, a budget-death run, a missed dimension. Prompts as engineering.

3. Skilled-Opus ran the full audit under that prompt: 125 tool calls, 14.2 minutes, $1.77. Every formerly-missed dimension covered. Idle waste confirmed at ~$980/mo run-rate. The cache-units bug did not recur.

01 The problem

A monitor that only counts from birth.

Resource monitors in Snowflake track credit usage from the date they are created — not from the start of the billing period. TRIAL_GUARD was created on 20 May 13:55, near the end of a four-day window where the account had spent roughly 82 credits on warehouse metering. The monitor reported 0.82 credits used. Both numbers were accurate. Neither told you anything useful about the actual spend.

02 Why it happened, and why it matters

Every failure becomes a rule. Every rule costs less next time.

The cache-column unit error in an earlier run — the agent read a fraction (0–1) as a percentage, reporting a figure wrong by a factor of 100 — became one line in the engineered prompt: verify units before use, state the unit you concluded. That is prompts as engineering: a failure catalog compressed into standing instructions. The skilled-Opus run under the engineered prompt cost $1.77 and covered every formerly-missed dimension; the naive-Opus run under the same prompt cost $3.97 with no playbook carrying the causal model. The $0.19 skilled-Sonnet directed run matched the frontier model on known questions. Tier economics fall out of that arithmetic: the playbook determines cost at least as much as the model does.

03 The arithmetic

$1.77 to find it. ~$980 a month to ignore it.

COMPUTE_WH ran 43.39 idle credits in four days. Multiply by 7.5 windows per month, then by the $3.00/credit list rate (a disclosed estimate): $976.28 per month. The full lab — nine traced runs, two agents, two model tiers — cost $8.43 in total. One finding paid for the experiment many times over. The query that proves the idle split is in Part 1.

Want the full operating model?
The long version adds three things this short can’t.
  • The engineered prompt, rule by rule — every line traces to a past failure.
  • Three tiers, verified costs — $0 baselines, $0.19/question Sonnet, $0.76–$3.97 Opus runs.
  • The ratchet — how Tier-3 finds become Tier-1 checks and Tier-2 skills.
Two-minute field fixes from the same audits as our long-form Journal. One number, one fix, one result you can verify.
Crosshire Quick