00 — Executive Summary

Every headline number below survived local reproduction, primary-source re-fetch, or explicit ESTIMATE arithmetic; sources are in each file's Verification ledger.

TL;DR

No honest 10x exists at zero quality loss today; current defensible stack math lands at ≈2.5x (≈2.4x code-heavy), or ≈5–6.2x only if the routing flip passes the harness.
Output + cache writes dominate dollars, even though cache reads dominate token volume; visible prose compression is useful but cannot touch thinking tokens.
Negative-cost wins are context architecture wins: tool search, context editing, observation masking, Edit-diffs, repo maps, and advisor escalation save tokens while improving quality.
Every serious number needs local validation because session mix, effort level, and tokenizer family change the economics materially.

The verdict on 10x

Defensible today: ≈2.5x (≈2.4x on code-heavy mixes). Defensible after validation on your task mix: ≈5–6.2x. A true 10x at provably equal quality does not exist yet. The paper path (Sonnet/Haiku main loops with frontier-model escalation, effort-tiered, context-edited, batch-staged) reaches ~10x in arithmetic, but its quality parity on the hardest tasks is unmeasured (T4). The binding constraints are structural: (1) thinking bills as output and no style layer touches it; only the effort parameter and moving off the frontier model reduce it; (2) cache reads have a floor, set by the context the agent genuinely needs; (3) quality risk concentrates exactly where the remaining multipliers are. Full math: 30-composed-stacks.md.

Where the money actually goes (measured, this environment)

The measured heavy Fable 5 session decomposes as: cache reads 32% / cache writes 29% / thinking ~20% / visible output ~17% / uncached input 2% (02). An independent session was output-heavier (output 44% / cache writes 34% / cache reads 21%), so use the point estimate as a profile, not a law. Three consequences the market hasn't priced in:

The optimization target is upside down. Folklore optimizes visible prose (17%); the remaining four-fifths is cache traffic + thinking. One visible-output token costs as much as 5 input tokens or 50 cache-read tokens ($50 vs $1/MTok).
Thinking is invisible and roughly half of output (54.8% in a max-effort main loop; 44.8% across a 25-agent fleet; both local). Claude Code transcripts redact it, so it must be inferred as output_tokens − count_tokens(visible).
Defaults already bank ~4–5x: caching alone measured −86.3% input-side ($71.59 paid vs $524.23 uncached-equivalent, this very session); MCP schemas defer by default; Edit-diffs are default. Most "10x easy wins" advice re-sells these defaults.
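The arithmetic behind these three consequences can be sketched in a few lines. This is a minimal illustration, not the report's own tooling: the function names are hypothetical, the shares and dollar figures are the ones quoted above, and the token counts in the thinking example are made up purely to show the subtraction.

```python
def ceiling_multiplier(share: float) -> float:
    """Best-case cost multiplier if a category holding `share` of the bill fell to zero."""
    if not 0.0 <= share < 1.0:
        raise ValueError("share must be in [0, 1)")
    return 1.0 / (1.0 - share)


def inferred_thinking_tokens(output_tokens: int, visible_tokens: int) -> int:
    """Thinking inferred as billed output minus the tokens counted in visible output."""
    return max(output_tokens - visible_tokens, 0)


def caching_savings(paid_usd: float, uncached_equivalent_usd: float) -> float:
    """Fraction of input-side spend avoided by caching."""
    return 1.0 - paid_usd / uncached_equivalent_usd


# Deleting ALL visible prose (17% of dollars) caps out near 1.2x.
print(f"visible-prose ceiling: {ceiling_multiplier(0.17):.2f}x")  # 1.20x

# Hypothetical counts: 1000 billed output tokens, 452 of them visible.
print("inferred thinking tokens:", inferred_thinking_tokens(1000, 452))  # 548

# The session figures quoted above: $71.59 paid vs $524.23 uncached-equivalent.
print(f"caching savings: {caching_savings(71.59, 524.23):.1%}")  # 86.3%
```

The ceiling is the useful part: even a perfect style layer that removed every visible token could not exceed about 1.2x on this profile, which is why the remaining levers have to act on cache traffic and thinking.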

The stack (what to actually run)

Day 1, riskless (≈1.06–1.3x): deduplicate the double-registered caveman hooks (−966 tok/session, −118/prompt); pin exploration subagents to model: haiku, effort: low (÷10 on code-heavy text, up to ÷13–14 on prose/markdown-heavy text; re-count the actual corpus); add two CLAUDE.md guard-lines (Edit-not-Write and no restatement: 89.3% and 91.4% per-instance, measured); never switch model/effort mid-session (cache is model-scoped; ≈9-turn break-even per switch).

With validation (≈2.5x): effort high→medium on routine work (T1: Opus 4.5 at medium matched Sonnet 4.5's best SWE-bench with 76% fewer output tokens — the single strongest sanctioned number; transfer to Fable 5 must be validated, and this is the only lever that reaches thinking); context editing / observation masking (vendor: 84% token cut, +29% performance; JetBrains T2: masking ≈ −50% cost at parity); register compression on visible prose only (caveman-ultra measured 58.5%, not the marketed 65–75%); route half the work Sonnet-main+advisor (T1: +2.7pp AND −11.9% cost; code-heavy routes get list-price ratios, not the prose tokenizer bonus); batch the offline 30% at 50% off.
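To see how a single partial lever moves the total, here is a minimal sketch using the batch item above (30% of the work at 50% off). The helper name is hypothetical, and the sketch isolates one lever only; it does not reproduce the composed ≈2.5x, whose full math lives in 30-composed-stacks.md.

```python
def lever_multiplier(share: float, cost_factor: float) -> float:
    """Cost multiplier from scaling `share` of the bill by `cost_factor`.

    share=0.30, cost_factor=0.50 means 30% of spend is billed at half price.
    """
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must be in [0, 1]")
    remaining = (1.0 - share) + share * cost_factor
    return 1.0 / remaining


# Batch the offline 30% at 50% off: 1 / (0.70 + 0.15) = 1 / 0.85
print(f"batch lever alone: {lever_multiplier(0.30, 0.50):.2f}x")  # 1.18x
```

A lever that applies to only part of the bill is worth far less than its headline discount, so each item in the stack needs its share of spend measured alongside its reduction.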

The negative-cost set (saves tokens AND improves output — adopt unconditionally): tool search/schema deferral (85% cut, accuracy 49%→74%), context editing, observation masking, Edit-diffs (aider: quality 20%→61%), advisor escalation, repo-maps/outlines instead of file dumps (−85/−92% local), effort max→high. Common thread: less junk in, less junk out — the input-architecture layer is where cost and quality align, and it beats every style trick.

Make it infrastructure (jackin'): bake the pack into every launched container — env defaults in the launch env assembly, [token_policy] in role manifests, model/effort flags via CapsuleConfig, CI linter failing when always-loaded context grows. Automatic beats disciplined; insertion points are mapped in 32.
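A CI linter of the kind described could look roughly like the sketch below. Everything here is an assumption for illustration: the function names, the idea of a stored baseline, and the use of character counts as a stand-in for token counts (a real check would re-count with the target model's tokenizer). It is not the jackin' implementation.

```python
from pathlib import Path
from typing import Iterable


def always_loaded_size(paths: Iterable[str]) -> int:
    """Total characters across the always-loaded context files that exist."""
    total = 0
    for p in paths:
        path = Path(p)
        if path.is_file():
            total += len(path.read_text(encoding="utf-8"))
    return total


def within_budget(current: int, baseline: int, tolerance: float = 0.0) -> bool:
    """True when `current` stays within baseline * (1 + tolerance)."""
    return current <= baseline * (1.0 + tolerance)


def lint(paths: Iterable[str], baseline: int, tolerance: float = 0.0) -> int:
    """Return a process exit code: 0 if the context did not grow, 1 if it did."""
    current = always_loaded_size(paths)
    if within_budget(current, baseline, tolerance):
        print(f"ok: always-loaded context is {current} chars (baseline {baseline})")
        return 0
    print(f"FAIL: always-loaded context grew to {current} chars (baseline {baseline})")
    return 1
```

In a pipeline this would be called as something like `sys.exit(lint(["CLAUDE.md"], baseline=stored_value))`, with the baseline committed to the repository so that any growth has to be an explicit, reviewed change.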

Corrections that reorder the field (the graveyard, abridged)

Full kill-tables live in files 10–19; the ones that change decisions:

"Caveman cuts ~75%" → 58.5% measured (token-level, Fable tokenizer). The 75% (now "65%" on the repo) is character-level folklore. And it only touches visible prose — in tool-heavy agent sessions, free-text was 1.4–1.5% of visible output (local); style compression's end-to-end ceiling there is ~0.4% of output tokens. In chat-heavy sessions it's real (~10% of dollars). Wenyan: 80.9% char cut collapses to 56.6% tokens — no advantage over ultra, higher risk.
"Editing CLAUDE.md mid-session busts the cache" → false; it's read once at session start. Eight real invalidators are enumerated in 13.
"Keepalive pingers save the cache" → solved problem: Claude Code main loop already writes 1h-TTL cache (320/320 calls observed) and "the cache is refreshed for no additional cost each time" it's read. Also: count_tokens does NOT warm the cache (documented).
"1M context costs a premium" → dead; flat per-token pricing across the window on Fable 5/Opus 4.8/Sonnet 4.6 (live pricing page). Quality, not price, is the long-context tax.
"YAML/TOON halve JSON" → minification is most of it: pretty→minified JSON −29%, →CSV −34% further; TOON ≈ CSV+4%; indent width and CSV-vs-TSV are token-identical. Biggest structured-data lever is the format spread: pretty XML→CSV = 2.45x.
"RouteLLM saves 85%" → MT-Bench-only (45% MMLU, 35% GSM8K); per-request gateway routing also breaks Claude's model-scoped cache.
"LLMLingua 20x in front of the API" → QUALITY-TRADE trap for coding: a cache-breaking proxy must beat ~5.5x compression to break even vs 0.1x reads; a 2026 RCT on Sonnet 4.5 found keep-20% compression increased cost 1.8%; code tolerates ~10% prompt reduction (T2).
"Compaction is free" → billed as a separate full-price iteration (~$1.98 per pass at the docs' own example scale), excluded from top-level usage fields.
"Mem0/semantic caches for agents" → evidence is FAQ/consumer workloads; files-only baselines beat Mem0 on LoCoMo; zero coding-agent evidence. Base64/gzip "compression" costs 2.7–4.3x MORE tokens (measured).
Tokenizer counts are not portable: Fable 5 = Opus 4.8 tokenizer ≠ Sonnet 4.6 = Haiku 4.5 (exact family equality, measured); the ~30% premium is an ASCII/English tax, but code/CJK can be near-neutral while SCREAMING_SNAKE is extreme. Cross-model budgets must re-count, and an open docs contradiction on prior-turn thinking retention (18 §TL;DR) is worth real money on long sessions.
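The ~0.4% ceiling in the caveman correction above can be reproduced under one reading of the numbers. The sketch assumes that visible output is the complement of the 54.8% thinking share from the max-effort main loop; the source does not state that this is how the figure was derived, so treat it as a plausibility check only.

```python
def style_ceiling(visible_share: float, free_text_share: float, compression: float) -> float:
    """Upper bound on output tokens saved by compressing only free-text prose."""
    return visible_share * free_text_share * compression


THINKING_SHARE = 0.548            # thinking as a share of output (max-effort main loop)
FREE_TEXT_RANGE = (0.014, 0.015)  # free text as a share of visible output
CAVEMAN_ULTRA = 0.585             # measured token-level compression

visible_share = 1.0 - THINKING_SHARE  # assumption: the rest of output is visible
low, high = (style_ceiling(visible_share, f, CAVEMAN_ULTRA) for f in FREE_TEXT_RANGE)
print(f"end-to-end ceiling: {low:.2%} to {high:.2%} of output tokens")  # 0.37% to 0.40%
```

Three small fractions multiply into a negligible one, which is why style compression only pays in chat-heavy sessions where free text is a large share of visible output.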

What we still don't know (highest-value open measurements)

Quality-vs-effort curve for Fable 5 (the 76% transfer); thinking-share by effort level; persona-vs-instruction durability over 100+ turns; caveat-drop rates in terse output registers; composed-stack quality at n=30 (31-validation-harness.md §7 closes all five).

— Read 30 for the math, 32 for the sequence, 31 to prove any of it on your own tasks.
