
30 — Composed Stacks: Conservative / Aggressive / Unbelievable

This file does the end-to-end dollar math. Savings are composed per token class, sequentially (multipliers on the class a technique actually touches) — never by multiplying headline percentages across different classes.

TL;DR

  • The defaults already bank the first ~4–5x. Prompt caching (−86.3% input-side, measured live), MCP schema deferral, Edit-tool diffs, and 1h-TTL main-loop caching are Claude Code defaults in 2026 — the baseline this dossier optimizes is already the optimized world of 2024-era folklore. Most "easy 10x" claims are unknowingly re-selling the defaults.
  • Conservative stack (T1/T2, zero quality risk, Claude-Code-today): 1.06x main-loop, up to ~1.3x on fan-out days. The honest number nobody markets: with good defaults, riskless config-level wins are small.
  • Aggressive stack (adds T3 + SDK tier + validation plans): ≈2.5x ($21.83 → ~$8.6/day on the modeled profile), led by effort-tiering, model routing at task boundaries, context editing, and register compression.
  • Unbelievable stack (everything defensible incl. BUILDABLE frontier): ≈5–6.1x at the modeled profile. A paper path to ~10x exists only by making Sonnet the main loop with frontier-model escalation, and quality parity there is unproven (T4), so 10x is NOT defensible at zero quality loss today.
  • Binding constraints, in order: (1) frontier-model thinking output — no API lever except effort touches it; (2) the cache-read floor of context the agent genuinely uses; (3) quality risk of cheap-model main loops on the hardest tasks.

0. Baseline and composition rules

Working profile (01-economics-and-measurement.md §5, $22 variant = 6 × measured session):

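| Class | Symbol | Tokens/day | $/day |
|---|---|---|---|
| Uncached input | U | 33k | $0.33 |
| Cache writes | W | 510k | $6.38 |
| Cache reads | R | 7.02M | $7.02 |
| Output: thinking (55%) | T | 89k | $4.46 |
| Output: visible (45%) | V | 73k | $3.65 |
| Total | | | $21.83 |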

Composition rules: a technique is a multiplier on one or two classes; sequential application (order shown); cross-class couplings carried explicitly (shorter outputs → slower transcript growth → smaller future R; context clears → extra W). Every multiplier cites its source file. Sensitivity: the $17/45%-thinking floor profile shifts totals ~−22% but does not change any ratio (same multipliers).
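The per-class rule can be sketched in a few lines of Python. This is a minimal illustration rather than the dossier's actual tooling: `BASELINE` copies the $/day column of the working-profile table, while `apply` and `total` are hypothetical helper names introduced here.

```python
# Baseline $/day per token class, copied from the working-profile table.
BASELINE = {"U": 0.33, "W": 6.38, "R": 7.02, "T": 4.46, "V": 3.65}


def apply(costs, multipliers):
    """Return a new cost dict with each listed class scaled by its multiplier.

    Classes not named in `multipliers` pass through unchanged: a technique
    only discounts the class it actually touches.
    """
    unknown = set(multipliers) - set(costs)
    if unknown:
        raise KeyError(f"unknown token classes: {sorted(unknown)}")
    return {cls: cost * multipliers.get(cls, 1.0) for cls, cost in costs.items()}


def total(costs):
    """Sum the per-class $/day figures."""
    return sum(costs.values())


# Example: a 20% cut on visible output touches only V.
after = apply(BASELINE, {"V": 0.80})
saving = total(BASELINE) - total(after)

# Misreading the same 20% as a cut on the whole bill overstates it several-fold.
naive_saving = total(BASELINE) * 0.20
print(f"class-scoped saving: ${saving:.2f}/day, naive headline: ${naive_saving:.2f}/day")
```

The gap between the two printed numbers is the error the composition rule is meant to prevent. Note that the class column sums to $21.84, one cent off the table's $21.83 total because of rounding in the individual rows.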

What the baseline already includes (do not double-count as savings): prompt caching (13: measured −86.3% input-side vs uncached equivalent — $71.59 paid vs $524.23 on this very session), MCP deferral by default (12), Edit-tool default (15), 1h-TTL main-loop writes (13: 320/320 calls observed), prior-turn thinking billed per current docs (18).
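As a quick check on the caching figure, the reduction follows directly from the two dollar amounts quoted above. The variable names are illustrative only.

```python
paid = 71.59       # USD billed for the measured session (from the text)
uncached = 524.23  # USD the same session would cost with no caching (from the text)

reduction = 1 - paid / uncached
ratio = uncached / paid
print(f"input-side reduction: {reduction:.1%} (about {ratio:.1f}x on the input side only)")
```

Because this saving is already inside the baseline, none of the stacks below may claim it again.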

1. Conservative stack — "do this tomorrow"

Constraints: T1/T2 evidence only, CLAUDE-CODE-TODAY only, zero quality risk (each item is NEGATIVE-COST or NEUTRAL with a trivial falsification check).

| # | Technique (file) | Class effect | $/day |
|---|---|---|---|
| C1 | Edit-over-Write guard + no-restatement rules in CLAUDE.md (15 §2–3; T1 local: 89.3% per avoided rewrite, 352 tok per avoided restatement) | V × 0.80 | −$0.73 |
| C2 | Cache hygiene: never /model or /effort mid-session; stable MCP set; no mid-session CLAUDE.md additions to the prefix path that re-form it (16 §cache; 13 §invalidators; T1: one avoided 150k bust ≈ $0.43) | W − bust | −$0.43 |
| C3 | Instruction-mass audit: keep root file lean (10 §5: −60% of a 2.7k-token file ≈ $0.05/session; this repo is already lean) | R × 0.99 | −$0.07 |
| C4 | Effort max → high for users who pinned max (15 §1: max "prone to overthinking", T1 docs; high = default behavior) | T × 0.8 for max-users | (−$0.89 if applicable) |
| C5 | Subagent exploration pinned to model: haiku/sonnet (16 §fan-out: 5-worker fan-out $2.75 → ~$0.20, T1 mechanics + advisor-pattern precedent) | fan-out days only | −$2 to −$4 those days |

Main-loop total: −$1.23/day → 1.06x. With routine subagent fan-outs: up to ~1.3x. Failure modes of the composition: none interacting — items touch disjoint behavior. Validation: 31-validation-harness.md screening run (n=12) once; C1's falsifier is the failed-old_string retry rate, C2's is cache_read_input_tokens continuity across turns.
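The main-loop total and the fan-out ceiling can be reproduced from the table. This sketch assumes that C5's saving is subtracted from the same $21.83 baseline, which is how the ~1.3x figure appears to be derived. The variable names are illustrative.

```python
BASELINE_TOTAL = 21.83  # $/day, from the working-profile table

# Main-loop items that apply to every user (C4 and C5 are conditional).
savings = {"C1": 0.73, "C2": 0.43, "C3": 0.07}

main_loop_saving = sum(savings.values())
main_loop_ratio = BASELINE_TOTAL / (BASELINE_TOTAL - main_loop_saving)

# Fan-out days add C5's saving; the table gives a $2 to $4 range.
fanout_ratios = [
    BASELINE_TOTAL / (BASELINE_TOTAL - main_loop_saving - c5) for c5 in (2.0, 4.0)
]

print(f"main loop: -${main_loop_saving:.2f}/day -> {main_loop_ratio:.2f}x")
print("fan-out days: " + " to ".join(f"{r:.2f}x" for r in fanout_ratios))
```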

The honest conservative headline: riskless knobs are small because the platform already turned the big ones. The conservative stack's real value is preventing regressions: a busted cache or a restored 50k MCP schema load silently costs more than C1–C5 save.
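C2's falsifier, continuity of `cache_read_input_tokens` across turns, is easy to check mechanically. The sketch below assumes you have logged one usage dict per turn; the log shape, the function name `find_cache_busts`, and the 50% drop threshold are all assumptions made for illustration.

```python
def find_cache_busts(turns, drop_ratio=0.5):
    """Return indexes of turns whose cache reads collapsed versus the prior turn.

    `turns` is a list of per-turn usage dicts. `drop_ratio` is an illustrative
    threshold, not a documented value: a turn is flagged when its cache reads
    fall below that fraction of the previous turn's reads.
    """
    busts = []
    for i in range(1, len(turns)):
        prev_read = turns[i - 1].get("cache_read_input_tokens", 0)
        this_read = turns[i].get("cache_read_input_tokens", 0)
        if prev_read > 0 and this_read < prev_read * drop_ratio:
            busts.append(i)
    return busts


# Hypothetical log: turn 2 re-writes ~141k tokens instead of reading them.
usage_log = [
    {"cache_read_input_tokens": 148_000, "cache_creation_input_tokens": 2_000},
    {"cache_read_input_tokens": 150_000, "cache_creation_input_tokens": 1_500},
    {"cache_read_input_tokens": 12_000, "cache_creation_input_tokens": 141_000},
    {"cache_read_input_tokens": 153_000, "cache_creation_input_tokens": 1_200},
]
print(find_cache_busts(usage_log))
```

A flagged turn is a prompt to look for an invalidator (a model or effort switch, a changed MCP set, an edited prefix file), not proof of one.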

2. Aggressive stack — adds T3 + SDK tier, validation attached

Adds: effort tiering, model routing, context editing/masking, register compression, batch. Each item carries a validation plan; adopt sequentially, validating each (the multipliers below are mid-range of the cited evidence, not best-case).

Sequential composition from $21.83:

| # | Technique (file) | Multiplier | Running $/day |
|---|---|---|---|
| | baseline | | $21.83 |
| A1 | Effort high → medium on the ~60% routine share of work (15 §1: T1 Opus 4.5 "76% fewer output tokens" at equal SWE-bench; transfer to Fable 5 unproven → validate). Net modeled: T×0.55, V×0.70 | T 4.46→2.45, V 3.65→2.56 | $18.74 |
| A2 | Caveman-ultra register on visible prose (10/15 §9: T1 local 58.5% cut; prose ≈ 30% of V in mixed sessions; local floor 1.4%, chat-heavy ceiling ~100%) | V × 0.825 | $18.29 |
| A3 | Context editing / observation masking on long sessions (12/14/18: vendor 84% token cut +29% perf, T1 self-eval; JetBrains masking ≈ −50% cost at parity, T2). Modeled conservatively: R×0.65, W×1.10 (clears re-form prefix) | R 7.02→4.56, W 6.38→7.02 | $16.47 |
| A4 | Route half the sessions to Sonnet-main + frontier advisor (16: T1 advisor = +2.7pp AND −11.9% cost vs Sonnet-alone; prose/ASCII-heavy text can get Fable→Sonnet ≈ ÷4.3, but code-heavy work uses the list-price ÷3.3 ratio) | half-day ÷4.3 prose / ÷3.3 code | $10.15 prose / $10.73 code |
| A5 | Batch API for the offline 30% (overnight sweeps, docs jobs; 18: 50% off, stacks with caching) | 30% × 0.5 | $8.63 prose / $9.12 code |

Total ≈ $8.6–9.1/day → ≈2.4–2.5x (range = prose-heavy vs code-heavy routing plus A2 prose-share and A3 clear-frequency sensitivity).
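The running totals can be reproduced with a short script. This is a sketch under the table's own modeling choices: A1–A3 act on individual classes, while A4 and A5 are applied to the whole-day total. The helper names `scale` and `route_and_batch` are hypothetical. Results can differ from the table by a cent or two because the table rounds at each step.

```python
costs = {"U": 0.33, "W": 6.38, "R": 7.02, "T": 4.46, "V": 3.65}  # baseline $/day


def scale(costs, **multipliers):
    """Scale only the named classes; all other classes pass through unchanged."""
    return {k: v * multipliers.get(k, 1.0) for k, v in costs.items()}


def route_and_batch(day_total, price_ratio, routed_share=0.5, batch_share=0.3):
    """Apply A4 (routing) then A5 (batch discount) to a whole-day total."""
    routed = day_total * ((1 - routed_share) + routed_share / price_ratio)  # A4
    return routed * ((1 - batch_share) + batch_share * 0.5)  # A5


a1 = scale(costs, T=0.55, V=0.70)  # effort high -> medium on routine work
a2 = scale(a1, V=0.825)            # register compression on visible prose
a3 = scale(a2, R=0.65, W=1.10)     # context editing: fewer reads, extra writes

baseline = sum(costs.values())
after_a3 = sum(a3.values())
prose = route_and_batch(after_a3, 4.3)
code = route_and_batch(after_a3, 3.3)

print(f"after A3: ${after_a3:.2f}")
print(f"prose-heavy: ${prose:.2f}/day ({baseline / prose:.2f}x)")
print(f"code-heavy:  ${code:.2f}/day ({baseline / code:.2f}x)")
```

Changing `routed_share` or `batch_share` gives a quick sensitivity check before any multiplier is trusted on a different task mix.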

Composition failure modes (watch these, they are real):

  • A1 × A2 double-press terseness — effort-medium already produces terse output; adding the register can cross the token-complexity cliff on hard tasks (15 §10). Canaries C1–C6 of the harness are mandatory after stacking both.
  • A3 × caching — every clear invalidates the prefix at the clearing point (18); clear_at_least must exceed the re-write cost (the W×1.10 above models this; verify applied_edits vs cache_creation in usage).
  • A4 × caching — the prompt cache is model-scoped; route only at session/task boundaries (16: mid-session switch ≈ 9-turn break-even).
  • A5 — batch is for genuinely latency-tolerant work only; misrouting interactive work to batch costs wall-clock, not dollars.
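The A3 × caching trade-off can be framed as a break-even count. The sketch below uses a deliberately simplified cost model that is an assumption of this example, not the vendor's billing logic: a clear forces the prefix after the clearing point to be written instead of read once, and every later turn reads fewer tokens. The per-million rates are the rounded values implied by the baseline table, and `clear_break_even_turns` is a hypothetical name.

```python
import math

# Per-million-token rates implied by the baseline table ($/day / tokens/day).
WRITE_RATE = 12.5  # about $6.38 / 0.51M cache-write tokens
READ_RATE = 1.0    # $7.02 / 7.02M cache-read tokens


def clear_break_even_turns(cleared_tokens, rewritten_tokens):
    """Turns needed before a context clear pays for its own cache re-write.

    Simplified model (an assumption):
      - one-time penalty: tokens after the clearing point are written again
        instead of read, costing (WRITE_RATE - READ_RATE) per token;
      - recurring saving: each later turn reads `cleared_tokens` fewer tokens.
    """
    if cleared_tokens <= 0:
        raise ValueError("cleared_tokens must be positive")
    penalty = rewritten_tokens * (WRITE_RATE - READ_RATE)
    saving_per_turn = cleared_tokens * READ_RATE
    return math.ceil(penalty / saving_per_turn)


# Hypothetical clear: drop 50k tokens of stale tool output, re-write 100k.
print(clear_break_even_turns(50_000, 100_000))
```

Under this model a small clear placed early in a long prefix takes many turns to pay back, which is the intuition behind setting a minimum clear size.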

Validation: full harness confirmation run (n=30) on the composed stack as a unit, plus the effort-sweep experiment (15 §1) before and after A1 — it doubles as the missing thinking-fraction-by-effort measurement.

3. Unbelievable stack — chasing 10x

Adds BUILDABLE frontier items (20-frontier-ideas.md) and flips the main loop. Stated honestly: the quality side of U1 is the unproven hinge.

| # | Addition | Mechanism | Multiplier (claimed basis) |
|---|---|---|---|
| U1 | Sonnet-main everywhere + Fable/Opus advisor-escalation (16; T1 advisor pattern, T4 for parity-with-Fable on hardest tasks) | frontier model only where escalated (~20% of tokens) | remaining Fable-share ÷4.3 |
| U2 | Effort medium globally + low for mechanical subtasks (15) | thinking floor | T further ×0.8 |
| U3 | State-file session resume instead of transcript accumulation (14; T1 mechanics, savings ESTIMATE) | kills long-tail R growth | R × 0.8 |
| U4 | Session codebook + single-token anchors + identifier policy (20; T4/T1-local micro-measurements) | V, U margins | V × 0.95 |
| U5 | jackin' token-pack: all of the above baked into every launched container (20 §jackin; insertion points mapped: launch.rs:590-717 env assembly, RoleManifest [token_policy], CapsuleConfig → build_agent_command) | adherence → 100%, drift → 0 | protects the multipliers |
| U6 | Batch+cache stacking for all offline work (13: reads at 0.05x) | offline share | as A5, deeper |

Composed (same sequential method, from the corrected Aggressive endpoint $8.63, prose-heavy case): U1 widens the routed share from half to ~all sessions (remaining Fable spend only on escalations): ≈ ×0.55 → $4.75; U2 ≈ ×0.93 → $4.42; U3 (R component) ≈ −$0.45 → $3.97; U4 ≈ −$0.10 → $3.87; U6 ≈ ×0.93 → ≈ $3.60/day → 6.1x against the $21.83 baseline (code-heavy endpoint ≈5.7x).
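The chain above mixes multiplicative and subtractive steps, so it is worth replaying in order. The sketch below uses only the step values quoted in the paragraph; the `STEPS` encoding and the `compose` helper are illustrative. Small differences from the prose (for example $3.59 versus ≈$3.60) come from rounding at each step.

```python
BASELINE = 21.83  # $/day

# (label, kind, value): "mul" scales the running total, "sub" subtracts dollars.
STEPS = [
    ("U1 route ~all sessions", "mul", 0.55),
    ("U2 effort medium/low", "mul", 0.93),
    ("U3 state-file resume", "sub", 0.45),
    ("U4 codebook + anchors", "sub", 0.10),
    ("U6 batch + cache", "mul", 0.93),
]


def compose(start, steps):
    """Apply steps in order and return the list of running totals."""
    running, trail = start, []
    for _label, kind, value in steps:
        running = running * value if kind == "mul" else running - value
        trail.append(running)
    return trail


for name, start in (("prose-heavy", 8.63), ("code-heavy", 9.12)):
    end = compose(start, STEPS)[-1]
    print(f"{name}: ${end:.2f}/day -> {BASELINE / end:.1f}x")
```

Because U3 and U4 are fixed-dollar subtractions, their relative weight grows as the running total shrinks, so step order matters more here than in the Aggressive stack.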

The paper path to 10x stretches U1 to a Haiku-drafting fleet with frontier verification and near-zero interactive frontier use (~$2.2/day). Every individual mechanism is shipped (T1); the composition's quality at parity is unmeasured (T4) — and 15 §10's token-complexity cliff plus 16's "agent teams ≈ 7x tokens" warn that cheap-model fleets can pay back their savings in retries and verification passes.

Verdict on 10x: not defensible at zero quality loss today. Defensible today: ≈2.5x with validation (Aggressive; ≈2.4x code-heavy), and ≈5–6.1x if the routing flip passes your harness on your task mix. Binding constraint #1 is frontier-model thinking output ($4.46/day at baseline, untouchable except by effort and by not being the frontier model); #2 is the cache-read floor of context the agent truly needs (R after A3/U3 ≈ $3.0/day is mostly useful context); #3 is quality-risk concentration: every remaining big multiplier moves work off the frontier model.

4. The negative-cost set (save tokens AND improve output)

Explicitly identified across the dossier (the brief's "genuinely unbelievable" category):

| Technique | Evidence | File |
|---|---|---|
| Tool search / MCP schema deferral | 85% token cut, accuracy 49%→74% (T1 vendor) | 12 |
| Context editing | 84% token cut, +29% performance (T1 vendor self-eval) | 12/14/18 |
| Observation masking | ~50% cost cut at equal/better solve rate (T2) | 12 |
| Edit-tool diffs vs rewrites | 89.3% cheaper (T1 local) AND quality up (aider 20%→61%, T1-dated) | 15 |
| Advisor-pattern escalation | −11.9% cost AND +2.7pp SWE-bench Multilingual (T1) | 16 |
| Effort max→high | cost down, overthinking down (T1 docs) | 15 |
| Repo map / outline instead of file dumps | −85.1/−92.5% local; mitigates context rot (T2 adjacent) | 12 |
| Pruning stale context generally | "Lost in the Middle"/context-rot literature: less can be more (T2) | 12 |

Common thread: less junk in, less junk out — the input-architecture layer is where saving money and raising quality are the same action. Register compression is conspicuously not in this set (it is paid for in readability and caveat-risk), and that is the dossier's central correction to the operator's starting intuition.

Verification ledger

| Number | Basis |
|---|---|
| Baseline class table | 01 §5 ($22 variant), local Phase-0 measurement scaled |
| −86.3% caching, $71.59 vs $524.23; 320/320 1h-TTL; 1,310:1 read ratio | 13, local session measurement + GH #24147 |
| 89.3% Edit-vs-Write; 352 tok restatement; 58.5% caveman-ultra | 15/02, local count_tokens |
| 76% fewer output tokens at medium effort (Opus 4.5, SWE-bench) | anthropic.com/news/claude-opus-4-5 via 15 |
| 84% / +29% / +39% context-management numbers | claude.com/blog/context-management via 12/14/18 |
| 85% tool-search cut, 49%→74% | anthropic.com/engineering/advanced-tool-use via 12 |
| Advisor +2.7pp / −11.9% | 16 (Claude Code docs/release notes) |
| Fable→Sonnet routing ratio | 16/50: ≈3.3x list-price on code/CJK-heavy work; up to ≈4.3x on prose/ASCII-heavy text after tokenizer premium |
| Masking ≈ −50% at parity | arXiv 2508.21433 via 12 |
| 9-turn break-even on mid-session model switch | 16, ESTIMATE from cache mechanics |
| All stack totals | ESTIMATE: sequential class arithmetic shown in tables above |
