03 — Prior art and market scan

TL;DR

All four operator-named tools exist, verified from primary pages: caveman (71,891 stars) compresses visible prose; cavemem (521) stores pre-compressed memory; cavekit (1,014) is a spec-driven build loop; fff (8,351) is a resident file-search MCP with latency numbers (3–9 s -> sub-10 ms) but no published token percentage.
The market crowds onto the two smallest cost buckets of a real heavy session — visible prose (17% of dollars) and memory files (slice of the 2% uncached input) — while nothing ships for the big three: cache reads (32%), cache writes (29%), thinking (20%) (local measurement; see 01-economics-and-measurement.md).
The best-evidenced technique is negative-cost and already a Claude Code default: MCP tool deferral / Tool Search Tool — 85% tool-token cut with accuracy up 49% -> 74% (Anthropic); locally 1,420 -> ~60 tok (96%) on an 11-tool fleet.
Caching is the floor, not free money: on the modeled heavy day it already turns ~$84 into ~$22 (-74%, ESTIMATE from local profile + live multipliers) — the remaining bill is 61% cache reads+writes, and any technique that breaks prefix stability must beat ~5.5x compression just to break even (ESTIMATE, record 19).
Local kills (fresh, /tmp/ct.py, Fable-family tokenizer): TOON saves 41.2% vs minified JSON, not "60%" (34.3% of the headline is just minification); caveman's "~75%" replicates as 58.5% on tokens; the Fable-vs-Sonnet tokenizer gap is content-specific — prose/ASCII gets the premium, but code/CJK probes were near-neutral.

Scope and method

Scanned: the four operator-named tools, Claude Code's official cost levers, the plugin/MCP ecosystem, gateways, and research-grade prompt compression. Every external claim carries URL +. Local numbers use /tmp/ct.py (free count_tokens endpoint; the single-user-message wrapper adds a constant 7 tokens — percentages below are conservative by <1 point). Dollar arithmetic uses the modeled heavy-day profile from 01-economics-and-measurement.md: ~6 sessions/day, ~19 calls/session, 5.5k uncached input, 85k cache-write, 1.17M cache-read, 27k output (54.8% thinking), ~$22/day split 32% cache-read / 29% cache-write / 20% thinking / 17% visible output / 2% uncached input. Validation protocols are designed to run on the harness in 31-validation-harness.md; this machine's own overhead baseline is in 02-baseline-audit.md.

Market map

Bucket	Members (record #)
Industry standard (automatic or near-universal)	prompt caching (05), /clear+/compact hygiene (06), CLAUDE.md slimming (07), model tiering (08), effort control (09), MCP tool deferral (10)
Significant industry use	caveman (01), fff (04), TOON serialization (14), claude-mem (15), aider repo map (16), LiteLLM gateway (17), ccusage + /usage (18)
Proven tactics (vendor or peer-reviewed numbers, partial availability)	programmatic tool calling (11), context editing + memory tool (12), batch API + team discipline (13), LLMLingua family (19)
Engineer-verified facts and patterns	preprocessing/hooks filtering (20), fixed-overhead accounting (21)
Named-tool early adopters (low adoption, weak numbers)	cavemem (02), cavekit (03)

Local measurements run for this file (/tmp/ct.py, claude-fable-5)

Serialization — identical 10-row x 4-field uniform dataset, plus a nested-config control:

Sample	Tokens	vs pretty JSON	vs minified JSON
Pretty JSON (2-space)	484	—	—
Minified JSON	318	-34.3%	—
TOON	187	-61.4%	-41.2%
Nested config, pretty	272	—	—
Nested config, minified	194	-28.7%	—

An independent same-day run (sweep agent, different dataset, same shape) got -60.6%/-40.8%/-33.6% — agreement within ~1 point.

Tool-result filtering — synthetic 630-line cargo-test log (577 pass, 3 failure blocks), filtered with grep -B1 -A6 -E "FAILED|panicked|error\[" | head -100. Repetitive pass-lines are the favorable (and typical) case; the honest general claim stays "80–99%":

Sample	Lines	Tokens	Cut
Raw log	630	10,108	—
Filtered (all 3 failures fully preserved)	35	590	-94.2%

Cross-tokenizer, identical fixed text (7-tok wrapper included):

Sample	Fable 5	Sonnet 4.6	Haiku 4.5	Fable premium
English prose, 730 chars	224	157	157	+42.7% (+44.7% net of wrapper)
Rust code, 1,800 bytes (`crates/jackin-build-meta/src/lib.rs`)	777	558	558	+39.2% (+39.7% net)

Sonnet 4.6 and Haiku 4.5 return byte-identical counts — shared tokenizer confirmed. Phase-0 (same day, other samples) got +15% (code) / +38% (prose); a tighter probe gives prose +35%, code −3%, CJK −4% — the premium is prose/ASCII-specific, not code-wide. Treat the Rust row above as sample-specific.

Self-cost of the optimizer: the caveman plugin family's always-on skill listing (7 skills, names + descriptions as injected this session) = 940 tokens of prefix per session — ~0.5% of the modeled day in cache-read rent before it saves anything. Root AGENTS.md re-measured at 2,744 tokens (phase-0: 2,738; 0.2% drift).

The four operator-named tools

01. caveman (JuliusBrussee/caveman) — terse-register output compression: "why use many token when few do trick"; README headline "cuts ~75% of output tokens"; 71,891 stars (gh api)

Layer: output (visible prose only). Mechanism: skill-prompt register change — lite/full/ultra strip filler, wenyan variants answer in Classical Chinese; companions: caveman-compress (memory files, "~46% input tokens"), cavecrew subagents ("~60% fewer tokens than vanilla"), caveman-commit/review; sibling caveman-code claims "~2x fewer tokens than Codex".
Expected savings: visible output = 17% of modeled-day dollars ($3.74). Ultra locally cuts 58.5% of prose tokens; code/diffs pass through verbatim, so realistic whole-bill effect ~4–6%/day (ESTIMATE: 17% x 58.5% x prose share; independent mayhemcode estimate ~4%), hard ceiling ~10%.
Evidence tier: T3, partially replicated locally. README benchmark : 10 tasks, per-task mean 65% (range 22–87%), pooled 1214 -> 294 = 75.8% — the "~75%" headline is the pooled ratio, 65% the per-task mean, both off one table, against an "Answer concisely." baseline, reproduction scripts in benchmarks/. Local : ultra 58.5%; wenyan-full 56.6% tokens (80.9% chars); wenyan-ultra 74.5%.
Quality risk: QUALITY-TRADE at ultra/wenyan (lossy register, unreadable transcripts); ~NEUTRAL at lite/full. The README concedes the economics: "Caveman only affects output tokens — thinking/reasoning tokens untouched... cost savings a bonus", and cites arXiv 2604.00025 (repo-cited, unverified) claiming brevity improved accuracy 26 points. Degradation = dropped caveats/rationale; falsify by blind-grading paired answers.
Availability: CLAUDE-CODE-TODAY (plugin marketplace; installed in this environment). Effort to adopt: minutes.
Composability: stacks with caching/hygiene/deferral; does not touch thinking (54.8% of output tokens locally); its own skill listing costs 940 tok/session of prefix (above); wenyan modes meet an unfavorable CJK tokenizer (~1.47 chars/tok).
Validation protocol: 20 fixed tasks, caveman-ultra vs "answer concisely"; visible-output tokens from JSONL usage minus thinking; blind-grade completeness on a 5-point rubric; require <=0.5-point drop; report per-register deltas.

02. cavemem (JuliusBrussee/cavemem) — pre-compressed persistent memory over MCP; claims "~75% fewer prose tokens"; 521 stars (gh api)

Layer: input (memory/context injection). Mechanism: session event -> redact -> caveman-compress -> SQLite+FTS5; MCP retrieval (BM25 + vector blend, alpha 0.5), progressive disclosure (snippets first, bodies on demand); code/URLs/paths preserved verbatim.
Expected savings: unmeasurable from published data — no benchmark table, one worked example. On the modeled day, injection lands in the 29% cache-write + 32% cache-read lines; the offset (avoided re-exploration) is unquantified by anyone. Given local register data, expect ~55–60% on stored prose, not 75% (ESTIMATE).
Evidence tier: T4 verging T3-weak (README; no independent replication found; adds its own MCP schema overhead).
Quality risk: RISKY — lossy memory drops nuance; retrieval ranking unmeasured; degradation = confidently-wrong recalled "facts". Falsify by quizzing the store against original transcripts.
Availability: CLAUDE-CODE-TODAY (MCP; also Cursor, Gemini CLI, OpenCode, Codex). Effort to adopt: minutes-to-hours.
Composability: family stack ("cavekit orchestrates, caveman compresses output, cavemem compresses memory" — getcaveman.dev); competes with claude-mem (15); deferral (10) absorbs its schema cost.
Validation protocol: one-week A/B (memory on/off) on one project; ccusage day totals; injected-context tokens vs re-exploration tokens saved; separately ct.py-verify the ~75% claim on 50 real observations.

03. cavekit (JuliusBrussee/cavekit) — spec-driven autonomous build loop; caveman-encoded blueprints "~75% fewer tokens than prose"; 1,014 stars (gh api)

Layer: turn-structure / orchestration artifacts. Mechanism: NL -> compressed blueprint kits -> parallel build -> verification; durable repo-root SPEC.md means /clear or compaction does not force re-deriving intent. Token relevance is secondary to the workflow change.
Expected savings: no end-to-end number exists anywhere . Artifact compression touches a slice of the 2% uncached + write lines; the loop itself can multiply total calls — net sign unknown.
Evidence tier: T4 — token claims inherited from caveman encoding; no published cavekit-vs-vanilla measurement.
Quality risk: RISKY — autonomous iteration can spend far more than it saves; degradation = ballooning call counts per feature. Falsify with the protocol below.
Availability: CLAUDE-CODE-TODAY (plugin). Effort to adopt: hours-to-days (whole-workflow change).
Composability: designed for the caveman family; orthogonal to caching; the durable-spec idea composes with /clear (06) for free.
Validation protocol: build the same 3 small features via cavekit and via a vanilla plan->implement session; compare total day tokens (ccusage), iteration counts, and diff quality; require equal-or-better diffs.

04. fff (dmtrKovalenko/fff) — resident file-search MCP: "Fewer grep roundtrips, less wasted context, faster answers" (README); 8,351 stars (gh api)

Layer: turn-structure (tool-call count) + tool-result tokens. Mechanism: always-warm Rust index (no per-query spawn or .gitignore re-read); MCP tools ffgrep/fffind with frecency ranking, git-aware annotations, weak-match detection ("flags scattered fuzzy noise before it floods the agent's context"), fuzzy fallback avoiding zero-result dead ends — each avoided dead end is one avoided tool_use/tool_result pair.
Expected savings: token side unquantified by the project — README text publishes latency ("3-9 SECONDS per ripgrep spawn and sub-10 ms per FFF query" on a 500k-file Chromium checkout; ~26 MB resident on 14k files) and a token-efficiency chart image, but no percentage in text. Tool results ride the 29%+32% cache lines; direction favorable, magnitude unknown.
Evidence tier: T1 for latency (published, reproducible); T4 for tokens (qualitative: "faster and more token-efficient than the built-in one").
Quality risk: potential NEGATIVE-COST — better-ranked results mean fewer wrong-file reads; unproven on the token axis. Degradation = fuzzy false-positives polluting context; falsify via the A/B below.
Availability: CLAUDE-CODE-TODAY (MCP; also Neovim/Rust/C/Node). Effort to adopt: minutes.
Composability: competes with native Grep/Glob and LSP-style code intelligence (official docs: a "go to definition" call replaces grep + reading candidates); opposite philosophy to aider's map (16): ranked search vs ranked map.
Validation protocol: fixed 20-query search suite on this repo, native Grep/Glob vs fff MCP; sum tool_result tokens + call counts from session JSONL; require equal-or-better target-file hit rate.

Industry standard (automatic or near-universal)

05. Prompt caching — reads 0.1x, writes 1.25x/2x, automatic in Claude Code; the single biggest shipped lever

Layer: cache / price multiplier. Mechanism: prefix storage with free refresh-on-hit; 512-token minimum on Fable 5; Claude Code places breakpoints automatically (platform.claude.com/docs/en/build-with-claude/prompt-caching).
Expected savings: already banked. Counterfactual on the modeled day: without caching, reads bill 10x — day ~$84 vs $22 = -74% (ESTIMATE). Vendor launch table: up to -90% (book chat), -53% (multi-turn) (claude.com/blog/prompt-caching). The remaining bill is still 61% reads+writes — the target of 13-caching-exploitation.md.
Evidence tier: T1 (live pricing + launch table); locally corroborated: heavy-session prompt mix 0.44% uncached / 6.73% write / 92.83% read (local measurement).
Quality risk: NEGATIVE-COST (pure price, no behavior change). Failure mode is economic: prefix churn converts 0.1x reads into 1.25x writes; falsify by diffing usage fields across a deliberately churned prefix.
Availability: CLAUDE-CODE-TODAY (automatic) / SDK (explicit breakpoints, 1h TTL). Effort to adopt: zero in Claude Code; hours of prefix-stability discipline in SDK.
Composability: stacks with batch (13); undermined by anything mutating the prefix — model switch (08), compaction (06), dynamic system prompts, LLMLingua (19).
Validation protocol: from session JSONL, compute hit ratio and re-price the same usage at no-cache rates (confirm ~74% on your traffic); check whether Claude Code ever issues 1h-TTL writes (open question affecting the 29% write line).

06. Context hygiene — /clear, steered /compact, auto-compact at ~83% capacity (~167k) with ~33k reserved buffer

Layer: turn-structure / context window. Mechanism: /clear resets ("Stale context wastes tokens on every subsequent message" — code.claude.com/docs/en/costs); /compact takes focus instructions; auto-compact summarizes near the threshold.
Expected savings: unquantified by Anthropic. Independent JSONL trace: typical sessions = 76% useful work / 7% compaction summaries / 17% reserved headroom (dev.to slima4). Earlier /clear directly shrinks the 1.17M cache-read line (32% of modeled-day dollars) on every subsequent call.
Evidence tier: T1 (official docs) + T3 (trace with thresholds).
Quality risk: QUALITY-TRADE — summaries lose detail; degradation = re-asking answered questions post-compaction. Falsify by quizzing post-compaction state against pre-compaction facts.
Availability: CLAUDE-CODE-TODAY. Effort to adopt: minutes (habits + compact instructions in CLAUDE.md).
Composability: compaction rewrites the prefix — a hidden cache-write spike (anti-synergy with 05); /clear is cheaper than /compact when continuity is not needed; cavekit's durable spec (03) reduces what compaction must preserve.
Validation protocol: capture usage.cache_creation_input_tokens around 10 compaction events to quantify the write spike; compare task success for /clear-per-task vs run-long-and-compact.

07. CLAUDE.md slimming + path-scoped rules + skills offload — official guidance "Aim to keep CLAUDE.md under 200 lines"

Layer: input (per-call fixed prefix). Mechanism: memory file rides every call's prefix; path-scoped rules load per subtree (this repo already does this via per-directory agent rule files); skills carry only name+description (30–100 tok) until invoked (code.claude.com/docs/en/costs).
Expected savings: honest arithmetic is small in dollars: root AGENTS.md = 2,744 tok (local) -> x19 calls x6 sessions x $1/MTok cache-read = $0.31/day rent (~1.4% of the modeled day); an 80% trim saves ~$0.25/day plus window headroom. Firecrawl: 3,847 -> 312 tok = 91.9% on the file itself, "no quality regression"; 41% from path-scoping (firecrawl.dev blog).
Evidence tier: T1 (official guidance) + T3 (Firecrawl, vendor-interested) + local file measurement.
Quality risk: RISKY if overdone — one bad PR from a dropped rule erases months of rent savings; degradation = rule violations. Falsify by replaying rule-sensitive tasks against the slimmed file.
Availability: CLAUDE-CODE-TODAY. Effort to adopt: hours (editorial; caveman-compress automates lossily at claimed ~46%).
Composability: multiplies with caching (smaller prefix = smaller perpetual rent); synergy with skills/path-scoping; feeds record 21's denominator.
Validation protocol: ct.py the file before/after; replay the 10 most rule-sensitive recent tasks on both versions; zero new rule violations allowed; log the $-rent delta.

08. Model tiering — Fable 5 $10/$50, Opus 4.8 $5/$25, Sonnet 4.6 $3/$15, Haiku 4.5 $1/$5 — with the tokenizer trap

Layer: infra / price-per-token and token count. Mechanism: /model switching, plan-on-big/build-on-small, subagent model pinning. Pricing-page note: "Opus 4.7 and later use a new tokenizer... may use up to 35% more tokens for the same fixed text" (platform.claude.com/docs/en/about-claude/pricing).
Expected savings: independent trace: same session ~$5.50 on Sonnet vs ~$10+ on Opus (dev.to slima4). Corrected cross-boundary math: use 3.33x Fable:Sonnet on code/CJK-heavy text and ~4.3x on prose/ASCII-heavy text after the tokenizer premium (ESTIMATE; verification). The sweep-agent draft of this file said "effectively ~2.4–2.9x" — that is the same arithmetic inverted (division for multiplication); corrected here: downgrades save at least sticker, and prose-heavy downgrades save more.
Evidence tier: T1 (prices + official note) + local measurement; both fresh local samples exceed the official 35% ceiling — flagged.
Quality risk: QUALITY-TRADE — capability cliff on hard tasks; degradation = failed/looping attempts on the small model (retries can cost more than the discount). Falsify with $-per-solved-task, never $/MTok.
Availability: CLAUDE-CODE-TODAY. Effort to adopt: minutes.
Composability: model switch invalidates the prompt cache (anti-synergy with 05) and changes the tokenizer basis mid-ledger; $/MTok routers (17) mis-price cross-boundary moves by up to ~1.45x.
Validation protocol: fixed 10-task set on Fable 5 / Opus 4.8 / Sonnet 4.6; record solved-or-not + total $ from JSONL; tokenizer-normalize via ct.py on one canonical text per model; report $/solved-task.

09. Thinking/effort control (/effort, MAX_THINKING_TOKENS) — the only shipped lever on the largest output component

Layer: output (thinking; bills as output, $50/MTok on Fable 5). Mechanism: effort low/medium/high shapes reasoning depth; MAX_THINKING_TOKENS=8000 for fixed-budget models; adaptive-reasoning models ignore nonzero budgets; Fable 5 cannot disable thinking; default budgets "can be tens of thousands of tokens per request" (code.claude.com/docs/en/costs).
Expected savings: vendor: "Set to a medium effort level, Opus 4.5 matches Sonnet 4.5's best score on SWE-bench Verified, but uses 76% fewer output tokens"; highest effort +4.3 points at -48% (anthropic.com/news/claude-opus-4-5; no Fable 5 curve published). Local ceiling: thinking = 54.8% of output tokens, 20% of modeled-day dollars ($4.40); halving thinking ~= $2.20/day (ESTIMATE).
Evidence tier: T1 (vendor numbers, model-version-specific) + local decomposition (n=1).
Quality risk: QUALITY-TRADE — extended thinking "significantly improves performance on complex planning" (docs); degradation = shallow plans/missed edge cases on hard tasks. Falsify per task type: easy tasks at low effort should show zero quality delta.
Availability: CLAUDE-CODE-TODAY (/effort, env var) / SDK (effort parameter). Effort to adopt: minutes.
Composability: orthogonal to all input-side levers; the ONLY shipped tool touching the bucket caveman (01) explicitly skips. White-space: per-task-type effort routing (Gaps 1).
Validation protocol: 20 tasks stratified easy/hard at high vs medium effort; thinking tokens = JSONL output_tokens minus count_tokens of visible blocks; require no hard-task regression; report $/day delta.

10. MCP tool deferral / Tool Search Tool — the proven negative-cost optimization, now a Claude Code DEFAULT

Layer: input (tool schemas). Mechanism: schemas stay out of context (names only) until needed; API Tool Search Tool queries a registry; Claude Code defers MCP definitions by default ("only tool names enter context until Claude uses a specific tool" — code.claude.com/docs/en/costs). This research session itself runs on ToolSearch.
Expected savings: vendor: "an 85% reduction in token usage" (context preserved 191,300 vs 122,800 of 200k), MCP-eval accuracy 49% -> 74% (Opus 4) and 79.5% -> 88.1% (Opus 4.5); five-server example ~55K schema tokens; internal fleets hit 134K pre-optimization (anthropic.com/engineering/advanced-tool-use). Local: 11 schemas = 1,420 tok loaded vs ~60 deferred = 96% (local measurement). Modeled day: an undeferred 20k fleet = 20k x19 x6 x $1/MTok = $2.28/day (~10%) read rent (ESTIMATE).
Evidence tier: T1 + local reproduction (consistent direction and magnitude).
Quality risk: NEGATIVE-COST — vendor evals show accuracy rises (less irrelevant-tool confusion). Residual failure mode: a needed deferred tool goes unfound; falsify with task suites requiring rare tools.
Availability: CLAUDE-CODE-TODAY (default) / SDK. Effort to adopt: zero; audit via /context and /mcp.
Composability: makes large MCP fleets viable; complements CLI-over-MCP (20); absorbs the schema cost of cavemem (02) and fff (04).
Validation protocol: /context audit loaded-vs-deferred; fixed task set exercising 3 rarely-used tools; require 100% tool-discovery success; report prefix-token delta.

Proven tactics (vendor or peer-reviewed numbers)

11. Programmatic tool calling / code-execution orchestration — 37% average cut with accuracy gains

Layer: turn-structure / tool-result tokens. Mechanism: model writes code calling tools in a sandbox; intermediate results never enter context; community "Code Mode" collapses 12-turn MCP workflows into 4 (thenewstack.io).
Expected savings: "Average usage dropped from 43,588 to 27,297 tokens, a 37% reduction on complex research tasks"; accuracy up (25.6 -> 28.5 internal retrieval; 46.5 -> 51.2 GIA) (anthropic.com/engineering/advanced-tool-use). On the modeled day, tool results ride the 29%+32% cache lines; a 37% cut on tool-heavy turns plausibly reaches low-double-digit % of day (ESTIMATE, workload-dependent).
Evidence tier: T1 (vendor evals).
Quality risk: NEGATIVE-COST per vendor numbers; failure mode is sandbox bugs silently eating results. Falsify by checksumming sandbox outputs against direct calls.
Availability: SDK (API feature); Claude Code analog = Bash piping, hooks, subagent offload — not the API feature itself (absent from the costs-page levers). Effort to adopt: days (sandbox + re-plumbing).
Composability: stacks with Tool Search (10); same philosophy as preprocessing (20).
Validation protocol: port 5 multi-tool workflows to code-execution; require bit-identical final artifacts; compare total tokens.

12. Context editing + memory tool (API beta) — 84% token cut on a 100-turn eval while improving performance 39%

Layer: cache / context window (API-level). Mechanism: server-side auto-clearing of stale tool calls near the limit; memory tool persists state outside the window.
Expected savings: "reducing token consumption by 84%" on a 100-turn web-search eval; +39% (memory+editing) / +29% (editing alone) performance (claude.com/blog/context-management). Maps onto the 32% read line for SDK agents.
Evidence tier: T1 (vendor evals; search-task domain, NOT code-editing).
Quality risk: NEGATIVE-COST on agentic search per vendor; unproven on code workloads where an evicted tool result may be needed 40 turns later. Falsify on SWE-bench-style tasks before trusting.
Availability: SDK (beta header); NOT-USER-ACCESSIBLE as a tunable inside Claude Code (which ships compaction instead). Effort to adopt: hours (flag + config).
Composability: edits invalidate cached prefix sections — same cache-vs-pruning tension as compaction (06); the user-side gap is Gaps 4.
Validation protocol: replicate the eval shape on 10 long code sessions; measure tokens + task success vs a no-editing control.

13. Batch API (50% off) + agent-team cost discipline (~7x warning)

Layer: infra / price multiplier and session multiplicity. Mechanism: async batch = "50% discount on both input and output tokens", stacks with caching (platform.claude.com/docs/en/about-claude/pricing); teams = "approximately 7x more tokens than standard sessions when teammates run in plan mode" (code.claude.com/docs/en/costs).
Expected savings: moving offline-able work (CI review, nightly triage, evals) to batch: $22 -> $11 on the modeled day upper bound (realistic = the offline fraction); batch+Haiku stack ~5% of Fable-interactive unit cost (ESTIMATE: 0.5 x 0.1 input ratio). Team discipline is loss-avoidance: one env flag (CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1) swings cost more than every compression plugin in this scan combined.
Evidence tier: T1 (both verbatim from live Anthropic pages).
Quality risk: NEUTRAL (batch trades latency only); teams degrade economically, not qualitatively. Falsify batch quality by diffing batch-vs-interactive outputs on identical jobs.
Availability: SDK (batch) / CLAUDE-CODE-TODAY (teams flag). Effort to adopt: days (re-architect offline jobs); zero (don't over-spawn).
Composability: batch x caching x Haiku is the deepest legitimate price stack; teams anti-compose with everything.
Validation protocol: run the nightly PR-review job both ways for a week; require equal review-finding recall; report $ delta.

14. Serialization-format engineering: TOON + JSON minification [LOCALLY VERIFIED] — a third of the headline is just minifying

Layer: input (structured data in prompts / tool results). Mechanism: TOON replaces repeated JSON keys/braces with a schema header + CSV-like rows.
Expected savings: LOCAL TABLE ABOVE : TOON -61.4% vs pretty, -41.2% vs minified; minification alone -34.3% (tabular) / -28.7% (nested) — free and riskless. Official: 39.9% fewer tokens at equal-or-better retrieval accuracy (76.4% vs 75.0%) (toonformat.dev/guide/benchmarks); nested/irregular reality check: 1.8–20% (dev.to tejas_page). Affects only the structured-data fraction of your prompts.
Evidence tier: T3 + local reproduction (two independent same-day local runs agree within ~1 point).
Quality risk: NEUTRAL on uniform data (accuracy equal-or-better in official benchmark); RISKY for deep nesting (format breaks down, less training exposure). Falsify with retrieval-QA over your own TOON-ized payloads.
Availability: GATEWAY-OR-SELF-HOST for your own MCP servers/hooks; NOT adoptable for Claude Code's native tool results. Effort to adopt: hours (own tools); minification is minutes.
Composability: natural for MCP servers returning tabular data; orthogonal to caching.
Validation protocol: ct.py your three largest real tool-result payloads in pretty/minified/TOON; 20-question retrieval quiz per encoding; adopt only where accuracy is flat.

Significant industry use, weaker or absent numbers

15. claude-mem (thedotmack/claude-mem) — compressed cross-session memory; adoption king: 81,976 stars (gh api), bigger than caveman

Layer: input (memory injection). Mechanism: 5 lifecycle hooks + Bun worker (:37777) over SQLite + Chroma; tiered retrieval: search ~50–100 tok/result, get_observations ~500–1,000; claims "~10x token savings by filtering before fetching details" (README).
Expected savings: the ~10x is a retrieval-path claim, not whole-session; no independent net accounting (injection + compression-call cost vs re-exploration saved) exists — same hole as cavemem (02).
Evidence tier: T3 (massive adoption, self-reported number).
Quality risk: NEUTRAL-to-RISKY — stale/wrong memories mislead sessions; degradation = acting on outdated project facts. Falsify by auditing 50 injected memories for currency.
Availability: CLAUDE-CODE-TODAY (also Gemini CLI, OpenCode). Effort to adopt: minutes-to-hours (persistent local service).
Composability: competitor to cavemem; SessionStart injection enlarges the cached prefix (interacts with the 29% write line).
Validation protocol: cavemem's week-long A/B, plus meter the worker's own SessionEnd compression calls.

16. aider repo map — graph-ranked code context: whole-repo awareness for ~1k tokens

Layer: input (codebase context selection). Mechanism: "a concise map of your whole git repository" with key signatures, ranked on a dependency graph; --map-tokens "defaults to 1k" (aider.chat/docs/repomap.html).
Expected savings: never published as a single number; influence is the evidence — Claude Code's Explore agent and code-intelligence plugins are descendants.
Evidence tier: T1 for mechanism/budget; T4 for savings magnitude.
Quality risk: NEGATIVE-COST by design intent (fewer wrong-file reads); degradation = stale symbols in the map. Falsify via tokens-per-solved-task, map on/off.
Availability: GATEWAY-OR-SELF-HOST (aider itself); pattern portable as a Claude Code skill. Effort to adopt: zero in aider; days to port well.
Composability: alternative to fff (04); pairs with grep retrieval.
Validation protocol: aider benchmark harness, --map-tokens 1024 vs 0, fixed task set; report tokens/solved.

17. LiteLLM gateway — spend tracking, exact + semantic caching, routing; name-checked in Anthropic's costs docs

Layer: infra / gateway. Mechanism: per-key spend tracking; response caching in 7 backends (semantic similarity_threshold 0.8); routing/fallbacks (docs.litellm.ai/docs/proxy/caching).
Expected savings: none quantified in its own docs; semantic-cache hits on coding-agent traffic (unique diffs, file contents) are near-worst-case — expect ~0 (ESTIMATE; kill K8).
Evidence tier: T1 product, T4 for any coding-agent savings number.
Quality risk: RISKY (semantic cache can return stale code for similar-but-different questions); NEUTRAL for spend tracking. Pass-through danger: a gateway that drops cache_control destroys the 74% caching floor (05).
Availability: GATEWAY-OR-SELF-HOST. Effort to adopt: days.
Composability: must preserve cache_control; tokenizer-blind routing mis-prices cross-boundary moves (08).
Validation protocol: replay a week of real traffic through the proxy; measure semantic hit rate (expect ~0 — kill the pattern with data) and verify end-to-end cache-read ratios unchanged.

18. Usage observability — ccusage (16,080 stars, gh api), /usage, JSONL tracing: saves nothing, underwrites everything

Layer: infra / observability. Mechanism: ccusage parses local transcript JSONL into daily/session/model costs; /usage attributes usage to skills, subagents, plugins, individual MCP servers (code.claude.com/docs/en/costs).
Expected savings: none direct; reveals the official denominators: "~$13/dev/active day", $150–250/month, 90% of users under $30/day, background burn <$0.04/session (same page).
Evidence tier: T1 (shipped tools on real local data); the dev.to JSONL trace is the strongest independent accounting published.
Quality risk: NEUTRAL. Failure mode: misattribution; falsify by reconciling ccusage totals against console billing.
Availability: CLAUDE-CODE-TODAY. Effort to adopt: minutes.
Composability: foundation for every validation protocol here; see 31-validation-harness.md.
Validation protocol: reconcile one week of ccusage vs console invoice; require <5% divergence before trusting any A/B in this file.

19. LLMLingua / LongLLMLingua / LLMLingua-2 — peer-reviewed ceiling ("up to 20x"), absent from every shipped coding agent

Layer: input (prompt body). Mechanism: small-LM perplexity pruning (EMNLP'23); question-aware coarse-to-fine for RAG (ACL'24: +21.4% RAG performance at 1/4 tokens); distilled classifier 3–6x faster (ACL'24 Findings) (github.com/microsoft/LLMLingua; 6,284 stars — maintenance slowing).
Expected savings: 20x headline is NL-domain. Killer arithmetic on the modeled day: compression that breaks prefix caching must clear (70.4+5.1+0.44)/x + 8.14 = 22 -> x ~= 5.5 (82% compression) just to break even against plain Claude Code caching; 4x compression loses money ($27.1 > $22) (ESTIMATE).
Evidence tier: T2 (peer-reviewed) for NL benchmarks; T4 for code-agent use (no code-domain validation found, searched).
Quality risk: RISKY for code — identifiers and syntax are what perplexity pruning drops; degradation = hallucinated symbol names. Falsify on SWE-bench-style tasks before any use.
Availability: GATEWAY-OR-SELF-HOST (Python lib; LangChain/LlamaIndex integrations). Effort to adopt: project (compression model in the hot path; GPU + latency).
Composability: CONFLICTS with prompt caching (05) — the defining anti-synergy of the input-compression category.
Validation protocol: run the break-even formula on your own JSONL profile first; only if x > 5.5 is plausible, run 20 code tasks compressed-vs-not and diff symbol-level accuracy.

20. Preprocessing pipeline — hooks filtering, markdown-not-HTML, CLI-over-MCP: shrink content BEFORE the model sees it

Layer: input (tool results). Mechanism: PreToolUse hooks rewrite commands (official example pipes test output through grep — "tens of thousands of tokens to hundreds", code.claude.com/docs/en/costs); fetch web as markdown; max_content_tokens caps runaway pages (10 kB page ~2,500 tok, 500 kB PDF ~125,000 tok — pricing docs); prefer gh/aws/gcloud/sentry-cli over equivalent MCP servers ("still more context-efficient" — official).
Expected savings: LOCAL: 630-line synthetic test log 10,108 -> 590 tok = -94.2% with all 3 failures preserved (local measurement, table above). Cited: 94% per web page (38,381 -> 2,788), "80-99% compression on build and test logs" (firecrawl.dev blog). Tool results ride the 29%+32% cache lines.
Evidence tier: T1 (official pattern) + local reproduction + T3 (Firecrawl, vendor-interested).
Quality risk: RISKY if filters hide the error the model needed; NEGATIVE-COST when failure signal is preserved — the local run shows both at once are achievable. Falsify by injecting novel error shapes and checking the filter passes them.
Availability: CLAUDE-CODE-TODAY (hooks, settings.json); this repo's filtered test runners (TESTING.md) are the same idea already shipped. Effort to adopt: hours.
Composability: stacks with everything; the Claude Code-native sibling of programmatic tool calling (11).
Validation protocol: per filter, a red-team set of 10 unusual failure modes; require 10/10 surfaced post-filter; track tokens saved per call from JSONL.

21. Fixed-overhead accounting — the ~14,328-token per-call baseline + MCP fleet bloat (diagnosis, not cure)

Layer: input (fixed per-call). Mechanism (of the measurement): independent JSONL traces ("consistently using approximately 14,328 tokens" — dev.to slima4); per-version prompt catalog (Piebald-AI/claude-code-system-prompts: Explore 575 tok, Plan mode 715, statusline 2,433); issue #52979 reports 20–30k on trivial prompts; community 15–20k multi-server MCP overhead, one deployment 143k/200k; Anthropic internal 134K pre-optimization; Anthropic cut its own tool-use system prompt 675 -> 290 tok (-57%) between Opus 4.7 and 4.8 (pricing page).
Expected savings: N/A directly; rent arithmetic: 14,328 x19 x6 x $1/MTok = $1.63/day (~7.4% of the modeled day) in cache-read rent (ESTIMATE). Cures are records 07 and 10.
Evidence tier: T3 (converging independent traces) + T1 (Anthropic's own figures). All cached after first write — overhead seeds the 32% read line, not full-price input (kill K9).
Quality risk: NEUTRAL (a fact, not an action); mis-reading it as uncached cost overstates the problem 10x.
Availability: CLAUDE-CODE-TODAY (audit via /context). Effort to adopt: minutes to audit.
Composability: the denominator for every percentage in this file; this machine's own numbers in 02-baseline-audit.md.
Validation protocol: /context on this setup; decompose system prompt / tools / memory / skills; reconcile with first-call cache_creation_input_tokens in JSONL; re-audit after applying 07 + 10.

Claims to kill (folklore ledger)

#	Claim in the wild	Verdict and corrected number
K1	"caveman cuts ~75% of your tokens"	Three-layer overstatement. 75% headline = pooled benchmark ratio (1214 -> 294 = 75.8%); the same table's per-task mean is 65% (range 22–87%) (README); local replication 58.5% (ultra). And it targets visible prose only = 17% of heavy-session dollars; whole-bill ~4–6% (ESTIMATE; mayhemcode ~4%). The README itself calls cost savings "a bonus".
K2	"wenyan/Classical-Chinese mode saves ~80%"	Character-token confusion: 80.9% char cut = 56.6% token cut (CJK ~1.47 chars/tok vs 3.35 English; local). Billing is tokens. wenyan-ultra reaches 74.5% tokens at maximum lossiness.
K3	"TOON cuts 60% of data tokens"	Locally 61.4% only vs pretty JSON on uniform tabular; vs minified it is 41.2%, and minification alone is 34.3% (free, riskless). Nested/irregular real payloads: 1.8–20% (independent).
K4	"Prompt caching saves 90%"	90% is the best launch-table row (book chat); the multi-turn row is -53%; writes bill 1.25–2x. The local heavy session was already maximally cached (92.83% reads) and reads+writes still cost 61% of dollars. Modeled-day actual: -74% vs no caching (ESTIMATE). Not an available saving for an already-cached session.
K5	"Model price ratios = cost ratios"	Partly false: code/CJK-heavy work can be close to sticker ratios, but prose/ASCII-heavy text gets the Opus-4.7+ tokenizer premium. Effective Fable:Sonnet input ratio is ~3.33x for code-heavy probes and ~4.3x for prose-heavy probes. Also kills the sweep-draft's inverted "~2.4–2.9x" (arithmetic error, corrected in record 08). Routers do not tokenizer-normalize.
K6	"Perplexity dropped MCP citing 72% context waste"	Single secondary source (nevo.systems); no primary statement found. Do not repeat.
K7	"fff saves X% tokens"	No token percentage exists in the README text — latency numbers and a chart image only. Cite the mechanism, not a number.
K8	"Semantic caching will cut your coding-agent bill"	LiteLLM's own docs claim no percentage; coding prompts are near-worst-case for similarity caches and hits risk stale code. No published coding-agent hit rate (searched).
K9	"The 14.3k overhead costs 14.3k input tokens per call"	It is cached: marginal cost 0.1x after first write (~$1.63/day rent on the modeled profile, not ~$16). Directionally real, economically 10x smaller than the naive reading.
K10	Phase-0 internal note: "caveman's 75% is character-level folklore"	Partially corrected: the repo's benchmark claims Claude-API token methodology with reproduction scripts; the 75%-pooled vs 65%-mean distinction explains the headline. The honest kill is K1 ("75 != 65 != 58.5, smallest bucket"), not "they counted characters".

White space — what nobody ships

Thinking-token optimization — 54.8% of output tokens, 20% of dollars, not disableable on Fable 5; only blunt levers exist (/effort, MAX_THINKING_TOKENS). No per-task-type effort routing, no thinking-waste detection. Caveman's README concedes the bucket explicitly.
Cache-economics tooling — reads+writes = 61% of heavy-session dollars; nothing optimizes breakpoint placement, predicts 5-min-vs-1-h TTL value, schedules compaction to minimize write spikes, or reports write amplification (ccusage shows totals, not causes). See 13-caching-exploitation.md.
Tokenizer-aware routing — the prose/ASCII tokenizer premium means every $/MTok router can mis-price cross-boundary switches; nobody normalizes, and code-heavy work needs separate counts.
Mid-session selective context pruning in Claude Code — the API has context editing (84% cut, vendor-proven); end users get only whole-conversation compaction.
Net-accounted memory — claude-mem (82k stars) and cavemem inject context and run compression calls; neither publishes injection-cost-vs-exploration-saved accounting.
Format-aware tool-result serialization — the locally-verified 41.2%/34.3% wins are not integrated into any agent's tool-result path or popular MCP server.
Independent replication infrastructure — every vendor self-reports; a public ct.py-style harness re-measuring plugin claims against live tokenizers does not exist (this dossier's method is itself novel prior art; see 31-validation-harness.md).
Output brevity with quality gates — caveman trades readability blindly; nothing compresses output only when a verifier confirms zero information loss.

Surprising findings

Token-cutting can be quality-improving, with vendor proof: Tool Search -85% tokens with accuracy 49 -> 74; programmatic calling -37% with accuracy up. Negative-cost optimization is shipped, not theoretical.
Anthropic admitted internal tool definitions hit 134K tokens pre-optimization, and cut its own tool-use system prompt 57% (675 -> 290) between model versions — vendor-side overhead shrinks release-over-release.
The memory niche out-adopted the compression niche: claude-mem 81,976 stars vs caveman 71,891 — and caveman 5x'd from ~14k in ~10 weeks since early April 2026, the fastest-growing plugin category of 2026.
The optimizer taxes itself: caveman's always-on skill listing costs 940 tokens of prefix per session (local), ~0.5% of the modeled day — while plain JSON minification (-34.3%, free) goes unmarketed and one experimental flag (agent teams, ~7x) moves cost more than every compression plugin in this scan combined.

Verification ledger

Number	Source / method
Stars: caveman 71,891; cavemem 521; cavekit 1,014; fff 8,351; claude-mem 81,976; ccusage 16,080; LLMLingua 6,284	`gh api repos/<repo> --jq .stargazers_count`, run locally
caveman "~75% of output tokens"; mean 65% (range 22–87%); pooled 1214 -> 294 (=75.8%); "Answer concisely." baseline; "thinking... untouched"; "cost savings a bonus"; compress ~46%; cavecrew ~60%; caveman-code "~2x fewer than Codex"; arXiv 2604.00025 "+26 points" (paper not independently verified)	curl of raw.githubusercontent.com/JuliusBrussee/caveman/main/README.md, lines 27/136/151/154/168/126/128/88/170
caveman local: ultra 58.5%; wenyan-full 56.6% tok / 80.9% chars; wenyan-ultra 74.5%; CJK ~1.47 chars/tok; tool deferral 1,420 -> ~60 (96%); prompt mix 0.44/6.73/92.83%; dollar split 32/29/20/17/2; thinking 54.8% of output; day ~$22	local measurement (phase-0); profile detailed in 01-economics-and-measurement.md
caveman ~4% of session total (independent); ~14k stars April 2026	mayhemcode.com/2026/04/caveman-claude-code-how-to-save-tokens.html
cavemem "~75% fewer prose tokens", architecture; family positioning	raw cavemem README + getcaveman.dev
fff "3-9 SECONDS per ripgrep spawn and sub-10 ms"; ~26 MB/14k files; "fewer grep roundtrips"; no token % in text	curl of raw fff README lines 22/72/673/714
Cache multipliers 0.1x/1.25x/2x; 512-tok Fable 5 minimum; free refresh; launch table -90/-86/-53	platform.claude.com/docs/en/build-with-claude/prompt-caching + claude.com/blog/prompt-caching + pricing page
Modeled no-cache day ~~$84 (-74%); LLMLingua break-even x~~=5.5; CLAUDE.md rent $0.31/day; 14.3k rent $1.63/day; 20k-fleet rent $2.28/day; caveman ceiling ~10% / realistic 4–6%; thinking ceiling $4.40/day; caveman listing ~0.5%/day; batch+Haiku ~5% unit cost	ESTIMATE — arithmetic shown in records 05/19/07/21/10/01/09/13 and the measurements section, on the modeled profile
Auto-compact ~83% (~167k), ~33k buffer; 76/7/17 split; 14,328 fixed overhead; $5.50 vs $10+ session	dev.to/slima4/where-do-your-claude-code-tokens-actually-go-we-traced-every-single-one-423e
"under 200 lines"; skills 30–100 tok; deferral default; CLI-over-MCP; /effort + MAX_THINKING_TOKENS=8000; Fable 5 thinking not disableable; ~7x teams; $13/day; $150–250/mo; <$30/day for 90%; <$0.04 background; LiteLLM name-check; grep-hook example; /usage attribution	code.claude.com/docs/en/costs
Firecrawl 3,847 -> 312 (91.9%); 41% path-scoping; 38,381 -> 2,788 (94%); 80–99% logs	firecrawl.dev/blog/claude-code-token-efficiency (vendor-interested)
Prices Fable 5 $10/$50, Opus 4.8 $5/$25, Sonnet 4.6 $3/$15, Haiku 4.5 $1/$5; "up to 35% more tokens" note; 675 -> 290 tool-use prompt (-57%); batch 50%; max_content_tokens page scale	platform.claude.com/docs/en/about-claude/pricing
Local tokenizer premium +42.7% prose / +39.2% one Rust sample raw (+44.7%/+39.7% net of 7-tok wrapper); Sonnet=Haiku identical counts; correction: prose premium holds, code/CJK can be near-neutral	/tmp/ct.py on Fable-family / Sonnet-family models, fixed prose + `crates/jackin-build-meta/src/lib.rs` head
Effort: 76% fewer output tokens at medium; +4.3 pts at -48% (Opus 4.5-era)	anthropic.com/news/claude-opus-4-5
Tool Search: "85% reduction"; 191,300 vs 122,800; 49 -> 74; 79.5 -> 88.1; ~55K five-server; 134K internal; programmatic 43,588 -> 27,297 (37%); retrieval 25.6 -> 28.5, GIA 46.5 -> 51.2	anthropic.com/engineering/advanced-tool-use
Context editing 84% / +39% / +29% (100-turn search eval)	claude.com/blog/context-management
Code Mode 12 -> 4 turns; 15–20k MCP overhead; 143k/200k deployment	thenewstack.io/how-to-reduce-mcp-token-bloat/
20–30k trivial-prompt reports; prompt catalog (Explore 575 / Plan 715 / statusline 2,433 tok)	github.com/anthropics/claude-code/issues/52979; github.com/Piebald-AI/claude-code-system-prompts
aider map: 1k default, graph ranking	aider.chat/docs/repomap.html
LiteLLM: 7 cache backends, similarity_threshold 0.8, no savings %	docs.litellm.ai/docs/proxy/caching
LLMLingua: 20x headline; +21.4% RAG at 1/4 tokens; 3–6x speedup; EMNLP'23/ACL'24 venues	github.com/microsoft/LLMLingua README
TOON official 39.9% fewer at 76.4% vs 75.0% accuracy; nested reality check 1.8–20%	toonformat.dev/guide/benchmarks; dev.to/tejas_page/toon-vs-json-when-60-token-savings-becomes-18-a-reality-check-3e60
Local format table 484/318/187 tabular (-34.3%/-61.4%/-41.2%), 272/194 nested (-28.7%); sweep same-day -60.6/-40.8/-33.6; log filter 10,108 -> 590 (-94.2%, 3/3 failures kept); `AGENTS.md` 2,744 tok (phase-0 2,738); caveman skill listing 940 tok; wrapper constant 7 tok	/tmp/ct.py claude-fable-5 on generated fixtures (/tmp/tok/gen.py), repo `AGENTS.md`, reconstructed skill listing, 1-char probe; grep filter shown above

03 — Prior art and market scan

On this page