00 — Executive Summary
Every headline number below survived local reproduction, primary-source re-fetch, or explicit ESTIMATE arithmetic; sources are in each file's Verification ledger.
TL;DR
- No honest 10x exists at zero quality loss today; current defensible stack math lands at ≈2.5x (≈2.4x code-heavy), or ≈5–6.2x only if the routing flip passes the harness.
- Output + cache writes dominate dollars, even though cache reads dominate token volume; visible prose compression is useful but cannot touch thinking tokens.
- Negative-cost wins are context architecture wins: tool search, context editing, observation masking, Edit-diffs, repo maps, and advisor escalation save tokens while improving quality.
- Every serious number needs local validation because session mix, effort level, and tokenizer family change the economics materially.
The verdict on 10x
Defensible today: ≈2.5x (≈2.4x on code-heavy mixes). Defensible after validation on your task mix: ≈5–6.2x. A true 10x at provably equal quality does not exist yet. The paper path (Sonnet/Haiku main loops with frontier-model escalation, effort-tiered, context-edited, batch-staged) reaches ~10x in arithmetic but its quality parity on the hardest tasks is unmeasured (T4) — and the binding constraints are structural: (1) thinking bills as output and no style layer touches it — only the effort parameter and not-being-the-frontier-model move it; (2) a cache-read floor of context the agent genuinely needs; (3) quality risk concentrates exactly where the remaining multipliers are. Full math: 30-composed-stacks.md.
Where the money actually goes (measured, this environment)
The measured heavy Fable 5 session decomposes as: cache reads 32% / cache writes 29% / thinking ~20% / visible output ~17% / uncached input 2% (02). An independent session was output-heavier (output 44% / cache writes 34% / cache reads 21%), so use the point estimate as a profile, not a law. Three consequences the market hasn't priced in:
- The optimization target is upside down. Folklore optimizes visible prose (17%); the other four-fifths is cache traffic plus thinking (32% + 29% + ~20%). One visible-output token costs as much as 5 input tokens or 50 cache-read tokens ($50 vs $1/MTok).
- Thinking is invisible and majority-of-output (54.8% max-effort main loop; 44.8% across a 25-agent fleet; local). Claude Code transcripts redact it; it must be inferred as output_tokens − count_tokens(visible).
- Defaults already bank ~4–5x: caching alone measured −86.3% input-side ($71.59 paid vs $524.23 uncached-equivalent, this very session); MCP schemas defer by default; Edit-diffs are default. Most "10x easy wins" advice re-sells these defaults.
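The two pieces of arithmetic in this list, inferring redacted thinking and sizing the caching discount, can be sketched in a few lines of Python. This is a minimal sketch, not the measurement tooling behind the report: the function names are hypothetical, the $50 and $1 per MTok prices are the ones quoted above, and the $10 input price is only implied by the "1 output = 5 input" ratio.

```python
# Prices quoted in the text ($/MTok). INPUT is an assumption implied by
# "one output token = 5 input tokens"; it is not stated directly.
OUTPUT_PER_MTOK = 50.0
CACHE_READ_PER_MTOK = 1.0
INPUT_PER_MTOK = OUTPUT_PER_MTOK / 5


def inferred_thinking_tokens(output_tokens: int, visible_tokens: int) -> int:
    """Estimate redacted thinking as billed output minus counted visible output."""
    return max(output_tokens - visible_tokens, 0)


def thinking_share(output_tokens: int, visible_tokens: int) -> float:
    """Fraction of billed output that is inferred to be thinking."""
    if output_tokens <= 0:
        return 0.0
    return inferred_thinking_tokens(output_tokens, visible_tokens) / output_tokens


def caching_saving(paid_usd: float, uncached_equivalent_usd: float) -> float:
    """Fractional input-side saving from caching (0.0 = none, 1.0 = free)."""
    return 1.0 - paid_usd / uncached_equivalent_usd


# The session figures from the text: $71.59 paid vs $524.23 uncached-equivalent.
print(round(caching_saving(71.59, 524.23) * 100, 1))   # 86.3
# Illustrative token counts (not measured): 1000 billed, 452 visible.
print(inferred_thinking_tokens(1000, 452))             # 548
print(OUTPUT_PER_MTOK / CACHE_READ_PER_MTOK)           # 50.0
```

The visible count has to come from the tokenizer of the model that produced the output; a count from a different tokenizer family would skew the inferred thinking share.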
The stack (what to actually run)
Day 1, riskless (≈1.06–1.3x): dedup the double-registered caveman hooks (−966 tok/session,
−118/prompt); pin exploration subagents model: haiku, effort: low (÷10 on code-heavy text, up to
÷13–14 on prose/markdown-heavy text — re-count the actual corpus); two CLAUDE.md guard-lines (Edit-not-Write; no restatement —
89.3% and 91.4% per-instance, measured); never switch model/effort mid-session (cache is
model-scoped; ≈9-turn break-even per switch).
With validation (≈2.5x): effort high→medium on routine work (T1: Opus 4.5 at medium matched Sonnet 4.5's best SWE-bench with 76% fewer output tokens — the single strongest sanctioned number; transfer to Fable 5 must be validated, and this is the only lever that reaches thinking); context editing / observation masking (vendor: 84% token cut, +29% performance; JetBrains T2: masking ≈ −50% cost at parity); register compression on visible prose only (caveman-ultra measured 58.5%, not the marketed 65–75%); route half the work Sonnet-main+advisor (T1: +2.7pp AND −11.9% cost; code-heavy routes get list-price ratios, not the prose tokenizer bonus); batch the offline 30% at 50% off.
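How a single lever turns into a multiplier can be sketched as share-weighted arithmetic. The sketch below is not the composed-stack model from 30-composed-stacks.md: the function names are hypothetical, the dollar shares are the measured-session profile quoted earlier, and applying the 58.5% register cut to the whole visible-output share is a deliberate upper bound (the cut only touches visible prose).

```python
# Dollar shares of the measured heavy session, as quoted in this summary.
SHARES = {
    "cache_reads": 0.32,
    "cache_writes": 0.29,
    "thinking": 0.20,
    "visible_output": 0.17,
    "uncached_input": 0.02,
}


def stack_multiplier(shares: dict, cuts: dict) -> float:
    """Return baseline_cost / new_cost when component `name` is cut by cuts[name].

    Each cut is a fraction in [0, 1]; components without an entry are unchanged.
    """
    for name, cut in cuts.items():
        if name not in shares:
            raise KeyError(f"unknown component: {name}")
        if not 0.0 <= cut <= 1.0:
            raise ValueError(f"cut for {name} must be in [0, 1]")
    remaining = sum(s * (1.0 - cuts.get(name, 0.0)) for name, s in shares.items())
    return 1.0 / remaining


def batch_multiplier(offline_share: float, discount: float) -> float:
    """Multiplier from moving `offline_share` of spend to a batch tier at `discount` off."""
    return 1.0 / ((1.0 - offline_share) + offline_share * (1.0 - discount))


# "Batch the offline 30% at 50% off":
print(round(batch_multiplier(0.30, 0.50), 2))                          # 1.18
# Upper bound for register compression if all visible output were prose:
print(round(stack_multiplier(SHARES, {"visible_output": 0.585}), 2))   # 1.11
```

Levers that act on the same dollars do not multiply cleanly, so chaining these results is only valid when the cuts are expressed against disjoint components.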
The negative-cost set (saves tokens AND improves output — adopt unconditionally): tool search/schema deferral (85% cut, accuracy 49%→74%), context editing, observation masking, Edit-diffs (aider: quality 20%→61%), advisor escalation, repo-maps/outlines instead of file dumps (−85/−92% local), effort max→high. Common thread: less junk in, less junk out — the input-architecture layer is where cost and quality align, and it beats every style trick.
Make it infrastructure (jackin'): bake the pack into every launched container — env
defaults in the launch env assembly, [token_policy] in role manifests, model/effort flags via
CapsuleConfig, CI linter failing when always-loaded context grows. Automatic beats disciplined;
insertion points are mapped in 32.
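The CI linter idea can be sketched as a budget check over the always-loaded files. Everything here is a hypothetical illustration rather than jackin' code: the function names, the argument layout, and especially the 4-characters-per-token heuristic are assumptions. Because tokenizer counts are not portable across model families, the counter is injectable and the default should be replaced with a real count for the target model.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def approx_tokens(text: str) -> int:
    """Crude placeholder: about 4 characters per token, rounded up (assumption)."""
    return (len(text) + 3) // 4


def always_loaded_tokens(
    paths: Iterable[Path], count: Callable[[str], int] = approx_tokens
) -> int:
    """Sum the token count of every always-loaded file."""
    return sum(count(Path(p).read_text(encoding="utf-8")) for p in paths)


def within_budget(current: int, baseline: int, tolerance: float = 0.0) -> bool:
    """True when current does not exceed baseline by more than `tolerance`."""
    return current <= baseline * (1.0 + tolerance)


def main(argv: List[str]) -> int:
    """argv = [BASELINE_TOKENS, FILE, ...]; returns a process exit code."""
    baseline = int(argv[0])
    total = always_loaded_tokens(Path(a) for a in argv[1:])
    ok = within_budget(total, baseline)
    print(f"always-loaded context: {total} tokens (baseline {baseline}) -> "
          f"{'ok' if ok else 'GREW'}")
    return 0 if ok else 1
```

A CI step could call `sys.exit(main(sys.argv[1:]))` with the recorded baseline and the list of always-loaded files, so that growth fails the build instead of relying on reviewer discipline.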
Corrections that reorder the field (the graveyard, abridged)
Full kill-tables live in files 10–19; the ones that change decisions:
- "Caveman cuts ~75%" → 58.5% measured (token-level, Fable tokenizer). The 75% (now "65%" on the repo) is character-level folklore. And it only touches visible prose — in tool-heavy agent sessions, free-text was 1.4–1.5% of visible output (local); style compression's end-to-end ceiling there is ~0.4% of output tokens. In chat-heavy sessions it's real (~10% of dollars). Wenyan: 80.9% char cut collapses to 56.6% tokens — no advantage over ultra, higher risk.
- "Editing CLAUDE.md mid-session busts the cache" → false; it's read once at session start. Eight real invalidators are enumerated in 13.
- "Keepalive pingers save the cache" → solved problem: Claude Code main loop already writes 1h-TTL cache (320/320 calls observed) and "the cache is refreshed for no additional cost each time" it's read. Also: count_tokens does NOT warm the cache (documented).
- "1M context costs a premium" → dead; flat per-token pricing across the window on Fable 5/Opus 4.8/Sonnet 4.6 (live pricing page). Quality, not price, is the long-context tax.
- "YAML/TOON halve JSON" → minification is most of it: pretty→minified JSON −29%, →CSV −34% further; TOON ≈ CSV+4%; indent width and CSV-vs-TSV are token-identical. Biggest structured-data lever is the format spread: pretty XML→CSV = 2.45x.
- "RouteLLM saves 85%" → the 85% is MT-Bench-only (45% on MMLU, 35% on GSM8K); per-request gateway routing also breaks Claude's model-scoped cache.
- "LLMLingua 20x in front of the API" → QUALITY-TRADE trap for coding: a cache-breaking proxy must beat ~5.5x compression to break even vs 0.1x reads; a 2026 RCT on Sonnet 4.5 found keep-20% compression increased cost 1.8%; code tolerates ~10% prompt reduction (T2).
- "Compaction is free" → billed as a separate full-price iteration (~$1.98 per pass at the docs' own example scale), excluded from top-level usage fields.
- "Mem0/semantic caches for agents" → evidence is FAQ/consumer workloads; files-only baselines beat Mem0 on LoCoMo; zero coding-agent evidence.
- Base64/gzip "compression" costs 2.7–4.3x MORE tokens (measured).
- Tokenizer counts are not portable: Fable 5 = Opus 4.8 tokenizer ≠ Sonnet 4.6 = Haiku 4.5 (exact family equality, measured); the ~30% premium is an ASCII/English tax, but code/CJK can be near-neutral while SCREAMING_SNAKE is extreme. Cross-model budgets must re-count, and an open docs contradiction on prior-turn thinking retention (18 §TL;DR) is worth real money on long sessions.
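The ~0.4% end-to-end ceiling quoted for style compression in tool-heavy sessions can be reproduced with one multiplication. This is one reading of the arithmetic, not a derivation the source spells out: the function name is hypothetical, and using 1 − 54.8% as the visible share of output is an assumption borrowed from the max-effort thinking figure.

```python
def style_compression_ceiling(
    free_text_share_of_visible: float,
    compression: float,
    visible_share_of_output: float,
) -> float:
    """Fraction of all output tokens that prose compression can remove.

    Only free text is compressible, free text is a slice of visible output,
    and visible output is a slice of billed output (the rest is thinking).
    """
    return free_text_share_of_visible * compression * visible_share_of_output


# Tool-heavy session: ~1.5% free text, 58.5% measured compression,
# visible output assumed to be 1 - 0.548 of billed output.
ceiling = style_compression_ceiling(0.015, 0.585, 1 - 0.548)
print(round(ceiling * 100, 1))  # 0.4 (percent of output tokens)
```

Raising the free-text share is the only term that moves this materially, which is why the same lever is worth real money in chat-heavy sessions and almost nothing in tool-heavy ones.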
What we still don't know (highest-value open measurements)
Quality-vs-effort curve for Fable 5 (the 76% transfer); thinking-share by effort level; persona-vs-instruction durability over 100+ turns; caveat-drop rates in terse output registers; composed-stack quality at n=30 (31-validation-harness.md §7 closes all five).
— Read 30 for the math, 32 for the sequence, 31 to prove any of it on your own tasks.