Unbelievable

This file does the end-to-end dollar math. Savings are composed per token class, sequentially (multipliers on the class a technique actually touches) — never by multiplying headline percentages across different classes.

TL;DR

The defaults already bank the first ~4–5x. Prompt caching (−86.3% input-side, measured live), MCP schema deferral, Edit-tool diffs, and 1h-TTL main-loop caching are Claude Code defaults in 2026 — the baseline this dossier optimizes is already the optimized world of 2024-era folklore. Most "easy 10x" claims are unknowingly re-selling the defaults.
Conservative stack (T1/T2, zero quality risk, Claude-Code-today): 1.06x main-loop, up to ~1.3x on fan-out days. The honest number nobody markets: with good defaults, riskless config-level wins are small.
Aggressive stack (adds T3 + SDK tier + validation plans): ≈2.5x ($21.83 → ~$8.6/day on the modeled profile), led by effort-tiering, model routing at task boundaries, context editing, and register compression.
Unbelievable stack (everything defensible incl. BUILDABLE frontier): ≈5–6.1x at the modeled profile. A paper path to ~10x exists only by making Sonnet the main loop with frontier-model escalation; quality parity there is unproven (T4), so 10x is NOT defensible at zero quality loss today.
Binding constraints, in order: (1) frontier-model thinking output — no API lever except effort touches it; (2) the cache-read floor of context the agent genuinely uses; (3) quality risk of cheap-model main loops on the hardest tasks.

0. Baseline and composition rules

Working profile (01-economics-and-measurement.md §5, $22 variant = 6 × measured session):

Class	Symbol	Tokens/day	$/day
Uncached input	U	33k	$0.33
Cache writes	W	510k	$6.38
Cache reads	R	7.02M	$7.02
Output — thinking (55%)	T	89k	$4.46
Output — visible (45%)	V	73k	$3.65
Total			$21.83

Composition rules: each technique is a multiplier on one or two classes; multipliers are applied sequentially, in the order shown; cross-class couplings are carried explicitly (shorter outputs → slower transcript growth → smaller future R; context clears → extra W). Every multiplier cites its source file. Sensitivity: the $17/45%-thinking floor profile shifts totals by ~−22% but changes no ratio, because the same multipliers apply.
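The composition rules above can be sketched in a few lines of Python. This is a minimal illustration, assuming the per-class dollar figures from the working-profile table; the names `BASELINE`, `apply`, and `total` are hypothetical and not part of any dossier tooling.

```python
# Per-class baseline in $/day, copied from the working-profile table.
BASELINE = {"U": 0.33, "W": 6.38, "R": 7.02, "T": 4.46, "V": 3.65}


def apply(costs, multipliers):
    """Return a new cost dict with each listed class scaled by its multiplier.

    Classes a technique does not touch are carried over unchanged, so a
    headline percentage on one class never leaks into another.
    """
    return {cls: usd * multipliers.get(cls, 1.0) for cls, usd in costs.items()}


def total(costs):
    """Sum the per-class costs into a $/day figure."""
    return sum(costs.values())


# Example: a technique that touches only visible output (V x 0.80).
after = apply(BASELINE, {"V": 0.80})
print(f"baseline ${total(BASELINE):.2f} -> ${total(after):.2f}")
```

The rounded class values sum to $21.84, one cent above the table's $21.83 total, which is presumably a rounding effect of scaling unrounded per-class figures.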

What the baseline already includes (do not double-count as savings): prompt caching (13: measured −86.3% input-side vs uncached equivalent — $71.59 paid vs $524.23 on this very session), MCP deferral by default (12), Edit-tool default (15), 1h-TTL main-loop writes (13: 320/320 calls observed), prior-turn thinking billed per current docs (18).
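The −86.3% caching figure follows directly from the two dollar amounts quoted above. A short check, using only those two numbers (the variable names are illustrative):

```python
paid_usd = 71.59        # what the session cost with prompt caching
uncached_usd = 524.23   # the same session priced as if nothing were cached

reduction = 1 - paid_usd / uncached_usd
print(f"input-side reduction: {reduction:.1%}")
```

Because this saving is already inside the baseline, none of the stacks below may claim it again.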

1. Conservative stack — "do this tomorrow"

Constraints: T1/T2 evidence only, CLAUDE-CODE-TODAY only, zero quality risk (each item is NEGATIVE-COST or NEUTRAL with a trivial falsification check).

#	Technique (file)	Class effect	$/day
C1	Edit-over-Write guard + no-restatement rules in CLAUDE.md (15 §2–3; T1 local: 89.3% per avoided rewrite, 352 tok per avoided restatement)	V × 0.80	−$0.73
C2	Cache hygiene: never run `/model` or `/effort` mid-session; keep the MCP set stable; make no mid-session CLAUDE.md additions that change the cached prefix and force it to re-form (16 §cache; 13 §invalidators; T1: one avoided 150k bust ≈ $0.43)	W − bust	−$0.43
C3	Instruction-mass audit: keep root file lean (10 §5: −60% of a 2.7k-token file ≈ $0.05/session; this repo is already lean)	R × 0.99	−$0.07
C4	Effort `max → high` for users who pinned max (15 §1: max "prone to overthinking", T1 docs; high = default behavior)	T × 0.8 for max-users	(−$0.89 if applicable)
C5	Subagent exploration pinned to `model: haiku`/`sonnet` (16 §fan-out: 5-worker fan-out $2.75 → ~$0.20, T1 mechanics + advisor-pattern precedent)	fan-out days only	−$2 to −$4 those days

Main-loop total: −$1.23/day → 1.06x. With routine subagent fan-outs: up to ~1.3x. Failure modes of the composition: none interacting — items touch disjoint behavior. Validation: 31-validation-harness.md screening run (n=12) once; C1's falsifier is the failed-old_string retry rate, C2's is cache_read_input_tokens continuity across turns.
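The main-loop total and the fan-out range can be re-derived from the table rows. This is a sketch of the arithmetic only, assuming C1–C3 are the unconditional main-loop items and C5 adds $2 to $4 on fan-out days, as the table states.

```python
BASELINE_TOTAL = 21.83  # $/day, from the working-profile table

# Unconditional main-loop savings in $/day (rows C1-C3).
savings = {"C1": 0.73, "C2": 0.43, "C3": 0.07}

saved = sum(savings.values())
ratio = BASELINE_TOTAL / (BASELINE_TOTAL - saved)
print(f"saved ${saved:.2f}/day -> {ratio:.2f}x")

# Fan-out days add C5's saving on top of the main-loop items.
for c5 in (2.0, 4.0):
    fanout_ratio = BASELINE_TOTAL / (BASELINE_TOTAL - saved - c5)
    print(f"with C5 = ${c5:.0f}: {fanout_ratio:.2f}x")
```

The upper end of the C5 range is what produces the "up to ~1.3x" figure.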

The honest conservative headline: riskless knobs are small because the platform already turned the big ones. The conservative stack's real value is preventing regressions (a busted cache or a restored 50k MCP schema load silently costs more than C1–C5 save).
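One way to watch for the silent cache bust is to check `cache_read_input_tokens` continuity across turns, which is C2's falsifier. The sketch below assumes per-turn usage records have already been extracted into plain dicts; the helper `find_cache_busts`, the 50% drop threshold, and the sample numbers are all hypothetical.

```python
def find_cache_busts(turns, drop_fraction=0.5):
    """Return indices of turns whose cache reads collapsed versus the prior turn.

    `turns` is a list of dicts with a `cache_read_input_tokens` key. A turn is
    flagged when its cached-read count falls below `drop_fraction` of the
    previous turn's, which suggests the prefix was invalidated and re-written.
    """
    busts = []
    for i in range(1, len(turns)):
        prev = turns[i - 1].get("cache_read_input_tokens", 0)
        cur = turns[i].get("cache_read_input_tokens", 0)
        if prev > 0 and cur < prev * drop_fraction:
            busts.append(i)
    return busts


# Made-up illustrative session: the third turn loses its cached prefix.
session = [
    {"cache_read_input_tokens": 140_000, "cache_creation_input_tokens": 2_000},
    {"cache_read_input_tokens": 142_000, "cache_creation_input_tokens": 1_500},
    {"cache_read_input_tokens": 0, "cache_creation_input_tokens": 150_000},
    {"cache_read_input_tokens": 150_000, "cache_creation_input_tokens": 1_000},
]
print(find_cache_busts(session))
```

A flagged turn is a prompt to look for an invalidator (a model or effort switch, an MCP set change, a prefix edit), not proof of one.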

2. Aggressive stack — adds T3 + SDK tier, validation attached

Adds: effort tiering, model routing, context editing/masking, register compression, batch. Each item carries a validation plan; adopt sequentially, validating each (the multipliers below are mid-range of the cited evidence, not best-case).

Sequential composition from $21.83:

#	Technique (file)	Multiplier	Running $/day
—	baseline		$21.83
A1	Effort `high → medium` on the ~60% routine share of work (15 §1: T1 Opus 4.5 "76% fewer output tokens" at equal SWE-bench; transfer to Fable 5 unproven → validate). Net modeled: T×0.55, V×0.70	T 4.46→2.45, V 3.65→2.56	$18.74
A2	Caveman-ultra register on visible prose (10/15 §9: T1 local 58.5% cut; prose ≈ 30% of V in mixed sessions — local floor 1.4%, chat-heavy ceiling ~100%)	V × 0.825	$18.29
A3	Context editing / observation masking on long sessions (12/14/18: vendor 84% token cut +29% perf, T1 self-eval; JetBrains masking ≈ −50% cost at parity, T2). Modeled conservatively: R×0.65, W×1.10 (clears re-form prefix)	R 7.02→4.56, W 6.38→7.02	$16.47
A4	Route half the sessions to Sonnet-main + frontier advisor (16: T1 advisor = +2.7pp AND −11.9% cost vs Sonnet-alone; on prose/ASCII-heavy text the Fable→Sonnet cost ratio reaches ≈ ÷4.3, while code-heavy work uses the list-price ÷3.3 ratio)	half-day ÷4.3 prose / ÷3.3 code	$10.15 prose / $10.73 code
A5	Batch API for the offline 30% (overnight sweeps, docs jobs; 18: 50% off, stacks with caching)	30% × 0.5	$8.63 prose / $9.12 code

Total ≈ $8.6–9.1/day → ≈2.4–2.5x (range = prose-heavy vs code-heavy routing plus A2 prose-share and A3 clear-frequency sensitivity).
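The running totals in the table can be reproduced with per-class multipliers for A1–A3 and whole-bill factors for A4–A5. This is a sketch under one reading of the table: A4 is modeled as half the bill divided by the routing ratio, and A5 as 30% of the bill at half price. Function and variable names are illustrative.

```python
BASELINE_TOTAL = 21.83  # $/day, table total
costs = {"U": 0.33, "W": 6.38, "R": 7.02, "T": 4.46, "V": 3.65}  # $/day


def scale(costs, **mult):
    """Scale only the named classes; leave the rest untouched."""
    return {k: v * mult.get(k, 1.0) for k, v in costs.items()}


def total(costs):
    return sum(costs.values())


steps = [
    ("A1", {"T": 0.55, "V": 0.70}),
    ("A2", {"V": 1 - 0.585 * 0.30}),  # 58.5% cut on the ~30% prose share
    ("A3", {"R": 0.65, "W": 1.10}),   # fewer reads, extra prefix re-writes
]
for name, mult in steps:
    costs = scale(costs, **mult)
    print(name, round(total(costs), 2))

after_a3 = total(costs)
for label, divisor in (("prose", 4.3), ("code", 3.3)):
    a4 = after_a3 * (0.5 + 0.5 / divisor)  # half the sessions routed
    a5 = a4 * (1 - 0.30 * 0.5)             # 30% offline at 50% off
    print(label, round(a4, 2), round(a5, 2), f"{BASELINE_TOTAL / a5:.1f}x")
```

Keeping A1–A3 as class-level multipliers is what prevents the A2 register cut from being applied to thinking tokens it never touches.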

Composition failure modes (watch these, they are real):

A1 × A2 double-press terseness — effort-medium already produces terse output; adding the register can cross the token-complexity cliff on hard tasks (15 §10). Canaries C1–C6 of the harness are mandatory after stacking both.
A3 × caching — every clear invalidates the prefix at the clearing point (18); clear_at_least must exceed the re-write cost (the W×1.10 above models this; verify applied_edits vs cache_creation in usage).
A4 × caching — the prompt cache is model-scoped; route only at session/task boundaries (16: mid-session switch ≈ 9-turn break-even).
A5 — batch is for genuinely latency-tolerant work only; misrouting interactive work to batch costs wall-clock, not dollars.
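The A3 × caching trade-off can be framed as a break-even count of turns. The model below is an assumption, not the dossier's method: a clear saves cache reads on the removed tokens every later turn, but the tokens after the clearing point must be written to cache again instead of read. The rates are the ones implied by the baseline table; the function name and example sizes are hypothetical.

```python
READ_USD_PER_MTOK = 7.02 / 7.02   # implied by the baseline table: $1.00
WRITE_USD_PER_MTOK = 6.38 / 0.51  # implied by the baseline table: ~$12.51


def clear_break_even_turns(cleared_tokens, rewritten_tokens):
    """Turns needed before a context clear pays for its own cache re-write.

    cleared_tokens:   tokens removed from context by the clear
    rewritten_tokens: tokens after the clearing point that must be cached again
    """
    # One-off penalty: those tokens are written instead of read this turn.
    rewrite_penalty = rewritten_tokens * (WRITE_USD_PER_MTOK - READ_USD_PER_MTOK)
    # Recurring saving: the cleared tokens are no longer read on each turn.
    saving_per_turn = cleared_tokens * READ_USD_PER_MTOK
    return rewrite_penalty / saving_per_turn


# Hypothetical sizes: clear 40k tokens, re-write a 20k-token tail.
print(f"{clear_break_even_turns(40_000, 20_000):.1f} turns")
```

Under this model a clear that removes little but sits far from the end of the context takes many turns to pay back, which is the reason to set a minimum clear size.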

Validation: full harness confirmation run (n=30) on the composed stack as a unit, plus the effort-sweep experiment (15 §1) before and after A1 — it doubles as the missing thinking-fraction-by-effort measurement.

3. Unbelievable stack — chasing 10x

Adds BUILDABLE frontier items (20-frontier-ideas.md) and flips the main loop. Stated honestly: the quality side of U1 is the unproven hinge.

#	Addition	Mechanism	Multiplier (claimed basis)
U1	Sonnet-main everywhere + Fable/Opus advisor-escalation (16; T1 advisor pattern, T4 for parity-with-Fable on hardest tasks)	frontier model only where escalated (~20% of tokens)	remaining Fable-share ÷4.3
U2	Effort medium globally + low for mechanical subtasks (15)	thinking floor	T further ×0.8
U3	State-file session resume instead of transcript accumulation (14; T1 mechanics, savings ESTIMATE)	kills long-tail R growth	R × 0.8
U4	Session codebook + single-token anchors + identifier policy (20; T4/T1-local micro-measurements)	V, U margins	V × 0.95
U5	jackin' token-pack: all of the above baked into every launched container (20 §jackin; insertion points mapped: `launch.rs:590-717` env assembly, RoleManifest `[token_policy]`, CapsuleConfig → `build_agent_command`)	adherence → 100%, drift → 0	protects the multipliers
U6	Batch+cache stacking for all offline work (13: reads at 0.05x)	offline share	as A5, deeper

Composed with the same sequential method, starting from the Aggressive endpoint of $8.63 (prose-heavy case). U1 widens the routed share from half to ~all sessions, leaving Fable spend only on escalations: ≈ ×0.55 → $4.75. U2: ≈ ×0.93 → $4.42. U3 (R component): ≈ −$0.45 → $3.97. U4: ≈ −$0.10 → $3.87. U6: ≈ ×0.93 → ≈ $3.60/day, i.e. ≈6.1x against the $21.83 baseline (≈5.7x from the code-heavy endpoint of $9.12). U5 carries no multiplier of its own; it protects the others.
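The Unbelievable chain can be replayed from either Aggressive endpoint. This is a sketch of the stated arithmetic only; the function name is illustrative, and the factors are the approximate ones quoted in the text.

```python
BASELINE = 21.83  # $/day


def unbelievable(aggressive_endpoint):
    """Apply U1, U2, U3, U4, U6 in order to an Aggressive-stack endpoint."""
    x = aggressive_endpoint
    x *= 0.55  # U1: routing widened from half to ~all sessions
    x *= 0.93  # U2: effort medium globally, low for mechanical subtasks
    x -= 0.45  # U3: state-file resume trims the R component
    x -= 0.10  # U4: codebook, anchors, identifier policy
    x *= 0.93  # U6: batch + cache stacking for offline work
    return x


for label, start in (("prose-heavy", 8.63), ("code-heavy", 9.12)):
    end = unbelievable(start)
    print(f"{label}: ${end:.2f}/day -> {BASELINE / end:.1f}x")
```

Without rounding at each step the prose-heavy endpoint lands a cent below the quoted ≈$3.60; the ratios are unaffected at one decimal place.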

The paper path to 10x stretches U1 to a Haiku-drafting fleet with frontier verification and near-zero interactive frontier use (~$2.2/day). Every individual mechanism is shipped (T1); the composition's quality at parity is unmeasured (T4) — and 15 §10's token-complexity cliff plus 16's "agent teams ≈ 7x tokens" warn that cheap-model fleets can pay back their savings in retries and verification passes.

Verdict on 10x: not defensible at zero quality loss today. Defensible today: ≈2.5x with validation (Aggressive; ≈2.4x code-heavy), ≈5–6.1x if the routing flip passes your harness on your task mix. Binding constraint #1 is frontier-model thinking output ($4.46/day baseline, untouchable except by effort and by not being the frontier model); #2 is the cache-read floor of context the agent truly needs (R after A3/U3 ≈ $3.0/day is mostly useful context); #3 is quality risk concentration: every remaining big multiplier moves work off the frontier model.

4. The negative-cost set (save tokens AND improve output)

Explicitly identified across the dossier (the brief's "genuinely unbelievable" category):

Technique	Evidence	File
Tool search / MCP schema deferral	85% token cut, accuracy 49%→74% (T1 vendor)	12
Context editing	84% token cut, +29% performance (T1 vendor self-eval)	12/14/18
Observation masking	~50% cost cut at equal/better solve rate (T2)	12
Edit-tool diffs vs rewrites	89.3% cheaper (T1 local) AND quality up (aider 20%→61%, T1-dated)	15
Advisor-pattern escalation	−11.9% cost AND +2.7pp SWE-bench Multilingual (T1)	16
Effort max→high	cost down, overthinking down (T1 docs)	15
Repo map / outline instead of file dumps	−85.1/−92.5% local; mitigates context rot (T2 adjacent)	12
Pruning stale context generally	"Lost in the Middle"/context-rot literature: less can be more (T2)	12

Common thread: less junk in, less junk out — the input-architecture layer is where saving money and raising quality are the same action. Register compression is conspicuously not in this set (it is paid for in readability and caveat-risk), and that is the dossier's central correction to the operator's starting intuition.

Verification ledger

Number	Basis
Baseline class table	01 §5 ($22 variant), local Phase-0 measurement scaled
−86.3% caching, $71.59 vs $524.23; 320/320 1h-TTL; 1,310:1 read ratio	13, local session measurement + GH #24147
89.3% Edit-vs-Write; 352 tok restatement; 58.5% caveman-ultra	15/02, local count_tokens
76% fewer output tokens at medium effort (Opus 4.5, SWE-bench)	anthropic.com/news/claude-opus-4-5 via 15
84% / +29% / +39% context-management numbers	claude.com/blog/context-management via 12/14/18
85% tool-search cut, 49%→74%	anthropic.com/engineering/advanced-tool-use via 12
Advisor +2.7pp / −11.9%	16 (Claude Code docs/release notes)
Fable→Sonnet routing ratio	16/50: ≈3.3x list-price on code/CJK-heavy work; up to ≈4.3x on prose/ASCII-heavy text after tokenizer premium
Masking ≈ −50% at parity	arXiv 2508.21433 via 12
9-turn break-even on mid-session model switch	16, ESTIMATE from cache mechanics
All stack totals	ESTIMATE — sequential class arithmetic shown in tables above

30 — Composed Stacks: Conservative / Aggressive / Unbelievable
