# 03 — Prior art and market scan (https://jackin.tailrocks.com/research/token-optimization/03-prior-art-and-market-scan/)


# 03 — Prior art and market scan [#03--prior-art-and-market-scan]

**TL;DR**

* All four operator-named tools exist, verified from primary pages: **caveman** (71,891 stars) compresses visible prose; **cavemem** (521) stores pre-compressed memory; **cavekit** (1,014) is a spec-driven build loop; **fff** (8,351) is a resident file-search MCP with latency numbers (3–9 s -> sub-10 ms) but **no published token percentage**.
* The market crowds onto the two smallest cost buckets of a real heavy session — visible prose (17% of dollars) and memory files (slice of the 2% uncached input) — while **nothing ships** for the big three: cache reads (32%), cache writes (29%), thinking (20%) (local measurement; see 01-economics-and-measurement.md).
* The best-evidenced technique is negative-cost and already a Claude Code default: MCP tool deferral / Tool Search Tool — &#x2A;*85% tool-token cut with accuracy up 49% -> 74%** (Anthropic); locally 1,420 -> \~60 tok (96%) on an 11-tool fleet.
* Caching is the floor, not free money: on the modeled heavy day it already turns \~$84 into \~$22 (**-74%, ESTIMATE** from local profile + live multipliers) — the remaining bill is 61% cache reads+writes, and any technique that breaks prefix stability must beat \~5.5x compression just to break even (ESTIMATE, record 19).
* Local kills (fresh, /tmp/ct.py, Fable-family tokenizer): TOON saves **41.2% vs minified JSON**, not "60%" (34.3% of the headline is just minification); caveman's "\~75%" replicates as &#x2A;*58.5%** on tokens; the Fable-vs-Sonnet tokenizer gap is content-specific — prose/ASCII gets the premium, but code/CJK probes were near-neutral.

## Scope and method [#scope-and-method]

Scanned: the four operator-named tools, Claude Code's official cost levers, the plugin/MCP ecosystem, gateways, and research-grade prompt compression. Every external claim carries URL +. Local numbers use `/tmp/ct.py` (free `count_tokens` endpoint; the single-user-message wrapper adds a constant 7 tokens — percentages below are conservative by \<1 point). Dollar arithmetic uses the modeled heavy-day profile from 01-economics-and-measurement.md: \~6 sessions/day, \~19 calls/session, 5.5k uncached input, 85k cache-write, 1.17M cache-read, 27k output (54.8% thinking), \~$22/day split 32% cache-read / 29% cache-write / 20% thinking / 17% visible output / 2% uncached input. Validation protocols are designed to run on the harness in 31-validation-harness.md; this machine's own overhead baseline is in 02-baseline-audit.md.

## Market map [#market-map]

| Bucket                                                                 | Members (record #)                                                                                                                          |
| ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
| Industry standard (automatic or near-universal)                        | prompt caching (05), /clear+/compact hygiene (06), CLAUDE.md slimming (07), model tiering (08), effort control (09), MCP tool deferral (10) |
| Significant industry use                                               | caveman (01), fff (04), TOON serialization (14), claude-mem (15), aider repo map (16), LiteLLM gateway (17), ccusage + /usage (18)          |
| Proven tactics (vendor or peer-reviewed numbers, partial availability) | programmatic tool calling (11), context editing + memory tool (12), batch API + team discipline (13), LLMLingua family (19)                 |
| Engineer-verified facts and patterns                                   | preprocessing/hooks filtering (20), fixed-overhead accounting (21)                                                                          |
| Named-tool early adopters (low adoption, weak numbers)                 | cavemem (02), cavekit (03)                                                                                                                  |

## Local measurements run for this file (/tmp/ct.py, claude-fable-5) [#local-measurements-run-for-this-file-tmpctpy-claude-fable-5]

Serialization — identical 10-row x 4-field uniform dataset, plus a nested-config control:

| Sample                  | Tokens | vs pretty JSON | vs minified JSON |
| ----------------------- | ------ | -------------- | ---------------- |
| Pretty JSON (2-space)   | 484    | —              | —                |
| Minified JSON           | 318    | -34.3%         | —                |
| TOON                    | 187    | -61.4%         | -41.2%           |
| Nested config, pretty   | 272    | —              | —                |
| Nested config, minified | 194    | -28.7%         | —                |

An independent same-day run (sweep agent, different dataset, same shape) got -60.6%/-40.8%/-33.6% — agreement within \~1 point.

Tool-result filtering — synthetic 630-line cargo-test log (577 pass, 3 failure blocks), filtered with `grep -B1 -A6 -E "FAILED|panicked|error\[" | head -100`. Repetitive pass-lines are the favorable (and typical) case; the honest general claim stays "80–99%":

| Sample                                    | Lines | Tokens | Cut        |
| ----------------------------------------- | ----- | ------ | ---------- |
| Raw log                                   | 630   | 10,108 | —          |
| Filtered (all 3 failures fully preserved) | 35    | 590    | **-94.2%** |

Cross-tokenizer, identical fixed text (7-tok wrapper included):

| Sample                                                         | Fable 5 | Sonnet 4.6 | Haiku 4.5 | Fable premium                      |
| -------------------------------------------------------------- | ------- | ---------- | --------- | ---------------------------------- |
| English prose, 730 chars                                       | 224     | 157        | 157       | **+42.7%** (+44.7% net of wrapper) |
| Rust code, 1,800 bytes (`crates/jackin-build-meta/src/lib.rs`) | 777     | 558        | 558       | **+39.2%** (+39.7% net)            |

Sonnet 4.6 and Haiku 4.5 return byte-identical counts — shared tokenizer confirmed. Phase-0 (same day, other samples) got +15% (code) / +38% (prose); a tighter probe gives prose +35%, code −3%, CJK −4% — the premium is prose/ASCII-specific, not code-wide. Treat the Rust row above as sample-specific.

Self-cost of the optimizer: the caveman plugin family's always-on skill listing (7 skills, names + descriptions as injected this session) = **940 tokens** of prefix per session — \~0.5% of the modeled day in cache-read rent before it saves anything. Root <RepoFile path="AGENTS.md">AGENTS.md</RepoFile> re-measured at **2,744 tokens** (phase-0: 2,738; 0.2% drift).

***

## The four operator-named tools [#the-four-operator-named-tools]

### 01. caveman (JuliusBrussee/caveman) — terse-register output compression: "why use many token when few do trick"; README headline "cuts \~75% of output tokens"; 71,891 stars (gh api) [#01-caveman-juliusbrusseecaveman--terse-register-output-compression-why-use-many-token-when-few-do-trick-readme-headline-cuts-75-of-output-tokens-71891-stars-gh-api]

* **Layer:** output (visible prose only). &#x2A;*Mechanism:** skill-prompt register change — lite/full/ultra strip filler, wenyan variants answer in Classical Chinese; companions: caveman-compress (memory files, "\~46% input tokens"), cavecrew subagents ("\~60% fewer tokens than vanilla"), caveman-commit/review; sibling caveman-code claims "\~2x fewer tokens than Codex".
* **Expected savings:** visible output = 17% of modeled-day dollars ($3.74). Ultra locally cuts 58.5% of prose tokens; code/diffs pass through verbatim, so realistic whole-bill effect \~4–6%/day (ESTIMATE: 17% x 58.5% x prose share; independent mayhemcode estimate \~4%), hard ceiling \~10%.
* **Evidence tier:** T3, partially replicated locally. README benchmark : 10 tasks, per-task mean 65% (range 22–87%), pooled 1214 -> 294 = 75.8% — the "\~75%" headline is the pooled ratio, 65% the per-task mean, both off one table, against an "Answer concisely." baseline, reproduction scripts in `benchmarks/`. Local : ultra 58.5%; wenyan-full 56.6% tokens (80.9% chars); wenyan-ultra 74.5%.
* **Quality risk:** QUALITY-TRADE at ultra/wenyan (lossy register, unreadable transcripts); \~NEUTRAL at lite/full. The README concedes the economics: "Caveman only affects output tokens — thinking/reasoning tokens untouched... cost savings a bonus", and cites arXiv 2604.00025 (repo-cited, unverified) claiming brevity *improved* accuracy 26 points. Degradation = dropped caveats/rationale; falsify by blind-grading paired answers.
* **Availability:** CLAUDE-CODE-TODAY (plugin marketplace; installed in this environment). &#x2A;*Effort to adopt:** minutes.
* **Composability:** stacks with caching/hygiene/deferral; does not touch thinking (54.8% of output tokens locally); its own skill listing costs 940 tok/session of prefix (above); wenyan modes meet an unfavorable CJK tokenizer (\~1.47 chars/tok).
* **Validation protocol:** 20 fixed tasks, caveman-ultra vs "answer concisely"; visible-output tokens from JSONL usage minus thinking; blind-grade completeness on a 5-point rubric; require \<=0.5-point drop; report per-register deltas.

### 02. cavemem (JuliusBrussee/cavemem) — pre-compressed persistent memory over MCP; claims "\~75% fewer prose tokens"; 521 stars (gh api) [#02-cavemem-juliusbrusseecavemem--pre-compressed-persistent-memory-over-mcp-claims-75-fewer-prose-tokens-521-stars-gh-api]

* **Layer:** input (memory/context injection). &#x2A;*Mechanism:** session event -> redact -> caveman-compress -> SQLite+FTS5; MCP retrieval (BM25 + vector blend, alpha 0.5), progressive disclosure (snippets first, bodies on demand); code/URLs/paths preserved verbatim.
* **Expected savings:** unmeasurable from published data — no benchmark table, one worked example. On the modeled day, injection lands in the 29% cache-write + 32% cache-read lines; the offset (avoided re-exploration) is unquantified by anyone. Given local register data, expect \~55–60% on stored prose, not 75% (ESTIMATE).
* **Evidence tier:** T4 verging T3-weak (README; no independent replication found; adds its own MCP schema overhead).
* **Quality risk:** RISKY — lossy memory drops nuance; retrieval ranking unmeasured; degradation = confidently-wrong recalled "facts". Falsify by quizzing the store against original transcripts.
* **Availability:** CLAUDE-CODE-TODAY (MCP; also Cursor, Gemini CLI, OpenCode, Codex). &#x2A;*Effort to adopt:** minutes-to-hours.
* **Composability:** family stack ("cavekit orchestrates, caveman compresses output, cavemem compresses memory" — getcaveman.dev); competes with claude-mem (15); deferral (10) absorbs its schema cost.
* **Validation protocol:** one-week A/B (memory on/off) on one project; ccusage day totals; injected-context tokens vs re-exploration tokens saved; separately ct.py-verify the \~75% claim on 50 real observations.

### 03. cavekit (JuliusBrussee/cavekit) — spec-driven autonomous build loop; caveman-encoded blueprints "\~75% fewer tokens than prose"; 1,014 stars (gh api) [#03-cavekit-juliusbrusseecavekit--spec-driven-autonomous-build-loop-caveman-encoded-blueprints-75-fewer-tokens-than-prose-1014-stars-gh-api]

* **Layer:** turn-structure / orchestration artifacts. &#x2A;*Mechanism:** NL -> compressed blueprint kits -> parallel build -> verification; durable repo-root SPEC.md means /clear or compaction does not force re-deriving intent. Token relevance is secondary to the workflow change.
* **Expected savings:** no end-to-end number exists anywhere . Artifact compression touches a slice of the 2% uncached + write lines; the loop itself can *multiply* total calls — net sign unknown.
* **Evidence tier:** T4 — token claims inherited from caveman encoding; no published cavekit-vs-vanilla measurement.
* **Quality risk:** RISKY — autonomous iteration can spend far more than it saves; degradation = ballooning call counts per feature. Falsify with the protocol below.
* **Availability:** CLAUDE-CODE-TODAY (plugin). &#x2A;*Effort to adopt:** hours-to-days (whole-workflow change).
* **Composability:** designed for the caveman family; orthogonal to caching; the durable-spec idea composes with /clear (06) for free.
* **Validation protocol:** build the same 3 small features via cavekit and via a vanilla plan->implement session; compare total day tokens (ccusage), iteration counts, and diff quality; require equal-or-better diffs.

### 04. fff (dmtrKovalenko/fff) — resident file-search MCP: "Fewer grep roundtrips, less wasted context, faster answers" (README); 8,351 stars (gh api) [#04-fff-dmtrkovalenkofff--resident-file-search-mcp-fewer-grep-roundtrips-less-wasted-context-faster-answers-readme-8351-stars-gh-api]

* **Layer:** turn-structure (tool-call count) + tool-result tokens. &#x2A;*Mechanism:** always-warm Rust index (no per-query spawn or .gitignore re-read); MCP tools ffgrep/fffind with frecency ranking, git-aware annotations, weak-match detection ("flags scattered fuzzy noise before it floods the agent's context"), fuzzy fallback avoiding zero-result dead ends — each avoided dead end is one avoided tool\_use/tool\_result pair.
* **Expected savings:** token side unquantified by the project — README text publishes latency ("3-9 SECONDS per ripgrep spawn and sub-10 ms per FFF query" on a 500k-file Chromium checkout; \~26 MB resident on 14k files) and a token-efficiency *chart image*, but no percentage in text. Tool results ride the 29%+32% cache lines; direction favorable, magnitude unknown.
* **Evidence tier:** T1 for latency (published, reproducible); T4 for tokens (qualitative: "faster and more token-efficient than the built-in one").
* **Quality risk:** potential NEGATIVE-COST — better-ranked results mean fewer wrong-file reads; unproven on the token axis. Degradation = fuzzy false-positives polluting context; falsify via the A/B below.
* **Availability:** CLAUDE-CODE-TODAY (MCP; also Neovim/Rust/C/Node). &#x2A;*Effort to adopt:** minutes.
* **Composability:** competes with native Grep/Glob and LSP-style code intelligence (official docs: a "go to definition" call replaces grep + reading candidates); opposite philosophy to aider's map (16): ranked search vs ranked map.
* **Validation protocol:** fixed 20-query search suite on this repo, native Grep/Glob vs fff MCP; sum tool\_result tokens + call counts from session JSONL; require equal-or-better target-file hit rate.

***

## Industry standard (automatic or near-universal) [#industry-standard-automatic-or-near-universal]

### 05. Prompt caching — reads 0.1x, writes 1.25x/2x, automatic in Claude Code; the single biggest shipped lever [#05-prompt-caching--reads-01x-writes-125x2x-automatic-in-claude-code-the-single-biggest-shipped-lever]

* **Layer:** cache / price multiplier. &#x2A;*Mechanism:** prefix storage with free refresh-on-hit; 512-token minimum on Fable 5; Claude Code places breakpoints automatically (platform.claude.com/docs/en/build-with-claude/prompt-caching).
* **Expected savings:*&#x2A; already banked. Counterfactual on the modeled day: without caching, reads bill 10x — day \~$84 vs $22 = &#x2A;*-74% (ESTIMATE)**. Vendor launch table: up to -90% (book chat), -53% (multi-turn) (claude.com/blog/prompt-caching). The remaining bill is still 61% reads+writes — the target of 13-caching-exploitation.md.
* **Evidence tier:** T1 (live pricing + launch table); locally corroborated: heavy-session prompt mix 0.44% uncached / 6.73% write / 92.83% read (local measurement).
* **Quality risk:** NEGATIVE-COST (pure price, no behavior change). Failure mode is economic: prefix churn converts 0.1x reads into 1.25x writes; falsify by diffing usage fields across a deliberately churned prefix.
* **Availability:** CLAUDE-CODE-TODAY (automatic) / SDK (explicit breakpoints, 1h TTL). &#x2A;*Effort to adopt:** zero in Claude Code; hours of prefix-stability discipline in SDK.
* **Composability:** stacks with batch (13); undermined by anything mutating the prefix — model switch (08), compaction (06), dynamic system prompts, LLMLingua (19).
* **Validation protocol:** from session JSONL, compute hit ratio and re-price the same usage at no-cache rates (confirm \~74% on your traffic); check whether Claude Code ever issues 1h-TTL writes (open question affecting the 29% write line).

### 06. Context hygiene — /clear, steered /compact, auto-compact at \~83% capacity (\~167k) with \~33k reserved buffer [#06-context-hygiene--clear-steered-compact-auto-compact-at-83-capacity-167k-with-33k-reserved-buffer]

* **Layer:** turn-structure / context window. &#x2A;*Mechanism:** /clear resets ("Stale context wastes tokens on every subsequent message" — code.claude.com/docs/en/costs); /compact takes focus instructions; auto-compact summarizes near the threshold.
* **Expected savings:** unquantified by Anthropic. Independent JSONL trace: typical sessions = 76% useful work / 7% compaction summaries / 17% reserved headroom (dev.to slima4). Earlier /clear directly shrinks the 1.17M cache-read line (32% of modeled-day dollars) on every subsequent call.
* **Evidence tier:** T1 (official docs) + T3 (trace with thresholds).
* **Quality risk:** QUALITY-TRADE — summaries lose detail; degradation = re-asking answered questions post-compaction. Falsify by quizzing post-compaction state against pre-compaction facts.
* **Availability:** CLAUDE-CODE-TODAY. &#x2A;*Effort to adopt:** minutes (habits + compact instructions in CLAUDE.md).
* **Composability:** compaction rewrites the prefix — a hidden cache-write spike (anti-synergy with 05); /clear is cheaper than /compact when continuity is not needed; cavekit's durable spec (03) reduces what compaction must preserve.
* **Validation protocol:** capture usage.cache\_creation\_input\_tokens around 10 compaction events to quantify the write spike; compare task success for /clear-per-task vs run-long-and-compact.

### 07. CLAUDE.md slimming + path-scoped rules + skills offload — official guidance "Aim to keep CLAUDE.md under 200 lines" [#07-claudemd-slimming--path-scoped-rules--skills-offload--official-guidance-aim-to-keep-claudemd-under-200-lines]

* **Layer:** input (per-call fixed prefix). &#x2A;*Mechanism:** memory file rides every call's prefix; path-scoped rules load per subtree (this repo already does this via per-directory agent rule files); skills carry only name+description (30–100 tok) until invoked (code.claude.com/docs/en/costs).
* **Expected savings:** honest arithmetic is small in dollars: root <RepoFile path="AGENTS.md">AGENTS.md</RepoFile&#x3E; = 2,744 tok (local) -> x19 calls x6 sessions x $1/MTok cache-read = &#x2A;*$0.31/day rent (\~1.4% of the modeled day)**; an 80% trim saves \~$0.25/day plus window headroom. Firecrawl: 3,847 -> 312 tok = 91.9% on the file itself, "no quality regression"; 41% from path-scoping (firecrawl.dev blog).
* **Evidence tier:** T1 (official guidance) + T3 (Firecrawl, vendor-interested) + local file measurement.
* **Quality risk:** RISKY if overdone — one bad PR from a dropped rule erases months of rent savings; degradation = rule violations. Falsify by replaying rule-sensitive tasks against the slimmed file.
* **Availability:** CLAUDE-CODE-TODAY. &#x2A;*Effort to adopt:** hours (editorial; caveman-compress automates lossily at claimed \~46%).
* **Composability:** multiplies with caching (smaller prefix = smaller perpetual rent); synergy with skills/path-scoping; feeds record 21's denominator.
* **Validation protocol:** ct.py the file before/after; replay the 10 most rule-sensitive recent tasks on both versions; zero new rule violations allowed; log the $-rent delta.

### 08. Model tiering — Fable 5 $10/$50, Opus 4.8 $5/$25, Sonnet 4.6 $3/$15, Haiku 4.5 $1/$5 — with the tokenizer trap [#08-model-tiering--fable-5-1050-opus-48-525-sonnet-46-315-haiku-45-15--with-the-tokenizer-trap]

* **Layer:** infra / price-per-token and token count. &#x2A;*Mechanism:** /model switching, plan-on-big/build-on-small, subagent model pinning. Pricing-page note: "Opus 4.7 and later use a new tokenizer... may use up to 35% more tokens for the same fixed text" (platform.claude.com/docs/en/about-claude/pricing).
* **Expected savings:** independent trace: same session \~$5.50 on Sonnet vs \~$10+ on Opus (dev.to slima4). &#x2A;*Corrected cross-boundary math:** use **3.33x Fable:Sonnet*&#x2A; on code/CJK-heavy text and **\~4.3x** on prose/ASCII-heavy text after the tokenizer premium (ESTIMATE; verification). The sweep-agent draft of this file said "effectively \~2.4–2.9x" — that is the same arithmetic *inverted* (division for multiplication); corrected here: downgrades save at least sticker, and prose-heavy downgrades save more.
* **Evidence tier:** T1 (prices + official note) + local measurement; both fresh local samples *exceed* the official 35% ceiling — flagged.
* **Quality risk:** QUALITY-TRADE — capability cliff on hard tasks; degradation = failed/looping attempts on the small model (retries can cost more than the discount). Falsify with $-per-solved-task, never $/MTok.
* **Availability:** CLAUDE-CODE-TODAY. &#x2A;*Effort to adopt:** minutes.
* **Composability:** model switch invalidates the prompt cache (anti-synergy with 05) and changes the tokenizer basis mid-ledger; $/MTok routers (17) mis-price cross-boundary moves by up to \~1.45x.
* **Validation protocol:** fixed 10-task set on Fable 5 / Opus 4.8 / Sonnet 4.6; record solved-or-not + total $ from JSONL; tokenizer-normalize via ct.py on one canonical text per model; report $/solved-task.

### 09. Thinking/effort control (/effort, MAX\_THINKING\_TOKENS) — the only shipped lever on the largest output component [#09-thinkingeffort-control-effort-max_thinking_tokens--the-only-shipped-lever-on-the-largest-output-component]

* **Layer:** output (thinking; bills as output, $50/MTok on Fable 5). &#x2A;*Mechanism:** effort low/medium/high shapes reasoning depth; MAX\_THINKING\_TOKENS=8000 for fixed-budget models; adaptive-reasoning models ignore nonzero budgets; **Fable 5 cannot disable thinking**; default budgets "can be tens of thousands of tokens per request" (code.claude.com/docs/en/costs).
* **Expected savings:** vendor: "Set to a medium effort level, Opus 4.5 matches Sonnet 4.5's best score on SWE-bench Verified, but uses 76% fewer output tokens"; highest effort +4.3 points at -48% (anthropic.com/news/claude-opus-4-5; no Fable 5 curve published). Local ceiling: thinking = 54.8% of output tokens, &#x2A;*20% of modeled-day dollars ($4.40)**; halving thinking \~= $2.20/day (ESTIMATE).
* **Evidence tier:** T1 (vendor numbers, model-version-specific) + local decomposition (n=1).
* **Quality risk:** QUALITY-TRADE — extended thinking "significantly improves performance on complex planning" (docs); degradation = shallow plans/missed edge cases on hard tasks. Falsify per task type: easy tasks at low effort should show zero quality delta.
* **Availability:** CLAUDE-CODE-TODAY (/effort, env var) / SDK (effort parameter). &#x2A;*Effort to adopt:** minutes.
* **Composability:** orthogonal to all input-side levers; the ONLY shipped tool touching the bucket caveman (01) explicitly skips. White-space: per-task-type effort routing (Gaps 1).
* **Validation protocol:** 20 tasks stratified easy/hard at high vs medium effort; thinking tokens = JSONL output\_tokens minus count\_tokens of visible blocks; require no hard-task regression; report $/day delta.

### 10. MCP tool deferral / Tool Search Tool — the proven negative-cost optimization, now a Claude Code DEFAULT [#10-mcp-tool-deferral--tool-search-tool--the-proven-negative-cost-optimization-now-a-claude-code-default]

* **Layer:** input (tool schemas). &#x2A;*Mechanism:** schemas stay out of context (names only) until needed; API Tool Search Tool queries a registry; Claude Code defers MCP definitions by default ("only tool names enter context until Claude uses a specific tool" — code.claude.com/docs/en/costs). This research session itself runs on ToolSearch.
* **Expected savings:** vendor: "an 85% reduction in token usage" (context preserved 191,300 vs 122,800 of 200k), MCP-eval accuracy &#x2A;*49% -> 74%** (Opus 4) and 79.5% -> 88.1% (Opus 4.5); five-server example \~55K schema tokens; internal fleets hit 134K pre-optimization (anthropic.com/engineering/advanced-tool-use). Local: 11 schemas = 1,420 tok loaded vs \~60 deferred = &#x2A;*96%*&#x2A; (local measurement). Modeled day: an undeferred 20k fleet = 20k x19 x6 x $1/MTok = **$2.28/day (\~10%) read rent** (ESTIMATE).
* **Evidence tier:** T1 + local reproduction (consistent direction and magnitude).
* **Quality risk:** NEGATIVE-COST — vendor evals show accuracy *rises* (less irrelevant-tool confusion). Residual failure mode: a needed deferred tool goes unfound; falsify with task suites requiring rare tools.
* **Availability:** CLAUDE-CODE-TODAY (default) / SDK. &#x2A;*Effort to adopt:** zero; audit via /context and /mcp.
* **Composability:** makes large MCP fleets viable; complements CLI-over-MCP (20); absorbs the schema cost of cavemem (02) and fff (04).
* **Validation protocol:** /context audit loaded-vs-deferred; fixed task set exercising 3 rarely-used tools; require 100% tool-discovery success; report prefix-token delta.

***

## Proven tactics (vendor or peer-reviewed numbers) [#proven-tactics-vendor-or-peer-reviewed-numbers]

### 11. Programmatic tool calling / code-execution orchestration — 37% average cut with accuracy gains [#11-programmatic-tool-calling--code-execution-orchestration--37-average-cut-with-accuracy-gains]

* **Layer:** turn-structure / tool-result tokens. &#x2A;*Mechanism:** model writes code calling tools in a sandbox; intermediate results never enter context; community "Code Mode" collapses 12-turn MCP workflows into 4 (thenewstack.io).
* **Expected savings:** "Average usage dropped from 43,588 to 27,297 tokens, a 37% reduction on complex research tasks"; accuracy up (25.6 -> 28.5 internal retrieval; 46.5 -> 51.2 GIA) (anthropic.com/engineering/advanced-tool-use). On the modeled day, tool results ride the 29%+32% cache lines; a 37% cut on tool-heavy turns plausibly reaches low-double-digit % of day (ESTIMATE, workload-dependent).
* **Evidence tier:** T1 (vendor evals).
* **Quality risk:** NEGATIVE-COST per vendor numbers; failure mode is sandbox bugs silently eating results. Falsify by checksumming sandbox outputs against direct calls.
* **Availability:** SDK (API feature); Claude Code analog = Bash piping, hooks, subagent offload — not the API feature itself (absent from the costs-page levers). &#x2A;*Effort to adopt:** days (sandbox + re-plumbing).
* **Composability:** stacks with Tool Search (10); same philosophy as preprocessing (20).
* **Validation protocol:** port 5 multi-tool workflows to code-execution; require bit-identical final artifacts; compare total tokens.

### 12. Context editing + memory tool (API beta) — 84% token cut on a 100-turn eval while improving performance 39% [#12-context-editing--memory-tool-api-beta--84-token-cut-on-a-100-turn-eval-while-improving-performance-39]

* **Layer:** cache / context window (API-level). &#x2A;*Mechanism:** server-side auto-clearing of stale tool calls near the limit; memory tool persists state outside the window.
* **Expected savings:** "reducing token consumption by 84%" on a 100-turn web-search eval; +39% (memory+editing) / +29% (editing alone) performance (claude.com/blog/context-management). Maps onto the 32% read line for SDK agents.
* **Evidence tier:** T1 (vendor evals; search-task domain, NOT code-editing).
* **Quality risk:** NEGATIVE-COST on agentic search per vendor; unproven on code workloads where an evicted tool result may be needed 40 turns later. Falsify on SWE-bench-style tasks before trusting.
* **Availability:** SDK (beta header); NOT-USER-ACCESSIBLE as a tunable inside Claude Code (which ships compaction instead). &#x2A;*Effort to adopt:** hours (flag + config).
* **Composability:** edits invalidate cached prefix sections — same cache-vs-pruning tension as compaction (06); the user-side gap is Gaps 4.
* **Validation protocol:** replicate the eval shape on 10 long code sessions; measure tokens + task success vs a no-editing control.

### 13. Batch API (50% off) + agent-team cost discipline (\~7x warning) [#13-batch-api-50-off--agent-team-cost-discipline-7x-warning]

* **Layer:** infra / price multiplier and session multiplicity. &#x2A;*Mechanism:** async batch = "50% discount on both input and output tokens", stacks with caching (platform.claude.com/docs/en/about-claude/pricing); teams = "approximately 7x more tokens than standard sessions when teammates run in plan mode" (code.claude.com/docs/en/costs).
* **Expected savings:** moving offline-able work (CI review, nightly triage, evals) to batch: $22 -> $11 on the modeled day upper bound (realistic = the offline fraction); batch+Haiku stack \~5% of Fable-interactive unit cost (ESTIMATE: 0.5 x 0.1 input ratio). Team discipline is loss-avoidance: one env flag (CLAUDE\_CODE\_EXPERIMENTAL\_AGENT\_TEAMS=1) swings cost more than every compression plugin in this scan combined.
* **Evidence tier:** T1 (both verbatim from live Anthropic pages).
* **Quality risk:** NEUTRAL (batch trades latency only); teams degrade economically, not qualitatively. Falsify batch quality by diffing batch-vs-interactive outputs on identical jobs.
* **Availability:** SDK (batch) / CLAUDE-CODE-TODAY (teams flag). &#x2A;*Effort to adopt:** days (re-architect offline jobs); zero (don't over-spawn).
* **Composability:** batch x caching x Haiku is the deepest legitimate price stack; teams anti-compose with everything.
* **Validation protocol:** run the nightly PR-review job both ways for a week; require equal review-finding recall; report $ delta.

### 14. Serialization-format engineering: TOON + JSON minification \[LOCALLY VERIFIED] — a third of the headline is just minifying [#14-serialization-format-engineering-toon--json-minification-locally-verified--a-third-of-the-headline-is-just-minifying]

* **Layer:** input (structured data in prompts / tool results). &#x2A;*Mechanism:** TOON replaces repeated JSON keys/braces with a schema header + CSV-like rows.
* **Expected savings:*&#x2A; LOCAL TABLE ABOVE : TOON -61.4% vs pretty, **-41.2% vs minified**; minification alone -34.3% (tabular) / -28.7% (nested) — free and riskless. Official: 39.9% fewer tokens at equal-or-better retrieval accuracy (76.4% vs 75.0%) (toonformat.dev/guide/benchmarks); nested/irregular reality check: 1.8–20% (dev.to tejas\_page). Affects only the structured-data fraction of your prompts.
* **Evidence tier:** T3 + local reproduction (two independent same-day local runs agree within \~1 point).
* **Quality risk:** NEUTRAL on uniform data (accuracy equal-or-better in official benchmark); RISKY for deep nesting (format breaks down, less training exposure). Falsify with retrieval-QA over your own TOON-ized payloads.
* **Availability:** GATEWAY-OR-SELF-HOST for your own MCP servers/hooks; NOT adoptable for Claude Code's native tool results. &#x2A;*Effort to adopt:** hours (own tools); minification is minutes.
* **Composability:** natural for MCP servers returning tabular data; orthogonal to caching.
* **Validation protocol:** ct.py your three largest real tool-result payloads in pretty/minified/TOON; 20-question retrieval quiz per encoding; adopt only where accuracy is flat.

***

## Significant industry use, weaker or absent numbers [#significant-industry-use-weaker-or-absent-numbers]

### 15. claude-mem (thedotmack/claude-mem) — compressed cross-session memory; adoption king: 81,976 stars (gh api), bigger than caveman [#15-claude-mem-thedotmackclaude-mem--compressed-cross-session-memory-adoption-king-81976-stars-gh-api-bigger-than-caveman]

* **Layer:** input (memory injection). &#x2A;*Mechanism:** 5 lifecycle hooks + Bun worker (:37777) over SQLite + Chroma; tiered retrieval: search \~50–100 tok/result, get\_observations \~500–1,000; claims "\~10x token savings by filtering before fetching details" (README).
* **Expected savings:** the \~10x is a retrieval-path claim, not whole-session; no independent net accounting (injection + compression-call cost vs re-exploration saved) exists — same hole as cavemem (02).
* **Evidence tier:** T3 (massive adoption, self-reported number).
* **Quality risk:** NEUTRAL-to-RISKY — stale/wrong memories mislead sessions; degradation = acting on outdated project facts. Falsify by auditing 50 injected memories for currency.
* **Availability:** CLAUDE-CODE-TODAY (also Gemini CLI, OpenCode). &#x2A;*Effort to adopt:** minutes-to-hours (persistent local service).
* **Composability:** competitor to cavemem; SessionStart injection enlarges the cached prefix (interacts with the 29% write line).
* **Validation protocol:** cavemem's week-long A/B, plus meter the worker's own SessionEnd compression calls.

### 16. aider repo map — graph-ranked code context: whole-repo awareness for \~1k tokens [#16-aider-repo-map--graph-ranked-code-context-whole-repo-awareness-for-1k-tokens]

* **Layer:** input (codebase context selection). &#x2A;*Mechanism:** "a concise map of your whole git repository" with key signatures, ranked on a dependency graph; `--map-tokens` "defaults to 1k" (aider.chat/docs/repomap.html).
* **Expected savings:** never published as a single number; influence is the evidence — Claude Code's Explore agent and code-intelligence plugins are descendants.
* **Evidence tier:** T1 for mechanism/budget; T4 for savings magnitude.
* **Quality risk:** NEGATIVE-COST by design intent (fewer wrong-file reads); degradation = stale symbols in the map. Falsify via tokens-per-solved-task, map on/off.
* **Availability:** GATEWAY-OR-SELF-HOST (aider itself); pattern portable as a Claude Code skill. &#x2A;*Effort to adopt:** zero in aider; days to port well.
* **Composability:** alternative to fff (04); pairs with grep retrieval.
* **Validation protocol:** aider benchmark harness, `--map-tokens 1024` vs `0`, fixed task set; report tokens/solved.

### 17. LiteLLM gateway — spend tracking, exact + semantic caching, routing; name-checked in Anthropic's costs docs [#17-litellm-gateway--spend-tracking-exact--semantic-caching-routing-name-checked-in-anthropics-costs-docs]

* **Layer:** infra / gateway. &#x2A;*Mechanism:** per-key spend tracking; response caching in 7 backends (semantic similarity\_threshold 0.8); routing/fallbacks (docs.litellm.ai/docs/proxy/caching).
* **Expected savings:** **none quantified in its own docs**; semantic-cache hits on coding-agent traffic (unique diffs, file contents) are near-worst-case — expect \~0 (ESTIMATE; kill K8).
* **Evidence tier:** T1 product, T4 for any coding-agent savings number.
* **Quality risk:** RISKY (semantic cache can return stale code for similar-but-different questions); NEUTRAL for spend tracking. Pass-through danger: a gateway that drops cache\_control destroys the 74% caching floor (05).
* **Availability:** GATEWAY-OR-SELF-HOST. &#x2A;*Effort to adopt:** days.
* **Composability:** must preserve cache\_control; tokenizer-blind routing mis-prices cross-boundary moves (08).
* **Validation protocol:** replay a week of real traffic through the proxy; measure semantic hit rate (expect \~0 — kill the pattern with data) and verify end-to-end cache-read ratios unchanged.

### 18. Usage observability — ccusage (16,080 stars, gh api), /usage, JSONL tracing: saves nothing, underwrites everything [#18-usage-observability--ccusage-16080-stars-gh-api-usage-jsonl-tracing-saves-nothing-underwrites-everything]

* **Layer:** infra / observability. &#x2A;*Mechanism:** ccusage parses local transcript JSONL into daily/session/model costs; /usage attributes usage to skills, subagents, plugins, individual MCP servers (code.claude.com/docs/en/costs).
* **Expected savings:** none direct; reveals the official denominators: "\~$13/dev/active day", $150–250/month, 90% of users under $30/day, background burn \<$0.04/session (same page).
* **Evidence tier:** T1 (shipped tools on real local data); the dev.to JSONL trace is the strongest independent accounting published.
* **Quality risk:** NEUTRAL. Failure mode: misattribution; falsify by reconciling ccusage totals against console billing.
* **Availability:** CLAUDE-CODE-TODAY. &#x2A;*Effort to adopt:** minutes.
* **Composability:** foundation for every validation protocol here; see 31-validation-harness.md.
* **Validation protocol:** reconcile one week of ccusage vs console invoice; require \<5% divergence before trusting any A/B in this file.

### 19. LLMLingua / LongLLMLingua / LLMLingua-2 — peer-reviewed ceiling ("up to 20x"), absent from every shipped coding agent [#19-llmlingua--longllmlingua--llmlingua-2--peer-reviewed-ceiling-up-to-20x-absent-from-every-shipped-coding-agent]

* **Layer:** input (prompt body). &#x2A;*Mechanism:** small-LM perplexity pruning (EMNLP'23); question-aware coarse-to-fine for RAG (ACL'24: +21.4% RAG performance at 1/4 tokens); distilled classifier 3–6x faster (ACL'24 Findings) (github.com/microsoft/LLMLingua; 6,284 stars — maintenance slowing).
* **Expected savings:** 20x headline is NL-domain. &#x2A;*Killer arithmetic on the modeled day:** compression that breaks prefix caching must clear (70.4+5.1+0.44)/x + 8.14 = 22 -> **x \~= 5.5 (82% compression) just to break even** against plain Claude Code caching; 4x compression *loses money* ($27.1 > $22) (ESTIMATE).
* **Evidence tier:** T2 (peer-reviewed) for NL benchmarks; T4 for code-agent use (no code-domain validation found, searched).
* **Quality risk:** RISKY for code — identifiers and syntax are what perplexity pruning drops; degradation = hallucinated symbol names. Falsify on SWE-bench-style tasks before any use.
* **Availability:** GATEWAY-OR-SELF-HOST (Python lib; LangChain/LlamaIndex integrations). &#x2A;*Effort to adopt:** project (compression model in the hot path; GPU + latency).
* **Composability:** CONFLICTS with prompt caching (05) — the defining anti-synergy of the input-compression category.
* **Validation protocol:** run the break-even formula on your own JSONL profile first; only if x > 5.5 is plausible, run 20 code tasks compressed-vs-not and diff symbol-level accuracy.

### 20. Preprocessing pipeline — hooks filtering, markdown-not-HTML, CLI-over-MCP: shrink content BEFORE the model sees it [#20-preprocessing-pipeline--hooks-filtering-markdown-not-html-cli-over-mcp-shrink-content-before-the-model-sees-it]

* **Layer:** input (tool results). &#x2A;*Mechanism:** PreToolUse hooks rewrite commands (official example pipes test output through grep — "tens of thousands of tokens to hundreds", code.claude.com/docs/en/costs); fetch web as markdown; `max_content_tokens` caps runaway pages (10 kB page \~2,500 tok, 500 kB PDF \~125,000 tok — pricing docs); prefer gh/aws/gcloud/sentry-cli over equivalent MCP servers ("still more context-efficient" — official).
* **Expected savings:*&#x2A; LOCAL: 630-line synthetic test log 10,108 -> 590 tok = **-94.2% with all 3 failures preserved** (local measurement, table above). Cited: 94% per web page (38,381 -> 2,788), "80-99% compression on build and test logs" (firecrawl.dev blog). Tool results ride the 29%+32% cache lines.
* **Evidence tier:** T1 (official pattern) + local reproduction + T3 (Firecrawl, vendor-interested).
* **Quality risk:** RISKY if filters hide the error the model needed; NEGATIVE-COST when failure signal is preserved — the local run shows both at once are achievable. Falsify by injecting novel error shapes and checking the filter passes them.
* **Availability:** CLAUDE-CODE-TODAY (hooks, settings.json); this repo's filtered test runners (<RepoFile path="TESTING.md">TESTING.md</RepoFile>) are the same idea already shipped. &#x2A;*Effort to adopt:** hours.
* **Composability:** stacks with everything; the Claude Code-native sibling of programmatic tool calling (11).
* **Validation protocol:** per filter, a red-team set of 10 unusual failure modes; require 10/10 surfaced post-filter; track tokens saved per call from JSONL.

### 21. Fixed-overhead accounting — the \~14,328-token per-call baseline + MCP fleet bloat (diagnosis, not cure) [#21-fixed-overhead-accounting--the-14328-token-per-call-baseline--mcp-fleet-bloat-diagnosis-not-cure]

* **Layer:** input (fixed per-call). &#x2A;*Mechanism (of the measurement):** independent JSONL traces ("consistently using approximately 14,328 tokens" — dev.to slima4); per-version prompt catalog (Piebald-AI/claude-code-system-prompts: Explore 575 tok, Plan mode 715, statusline 2,433); issue #52979 reports 20–30k on trivial prompts; community 15–20k multi-server MCP overhead, one deployment 143k/200k; Anthropic internal 134K pre-optimization; Anthropic cut its own tool-use system prompt 675 -> 290 tok (-57%) between Opus 4.7 and 4.8 (pricing page).
* **Expected savings:*&#x2A; N/A directly; rent arithmetic: 14,328 x19 x6 x $1/MTok = **$1.63/day (\~7.4% of the modeled day) in cache-read rent** (ESTIMATE). Cures are records 07 and 10.
* **Evidence tier:** T3 (converging independent traces) + T1 (Anthropic's own figures). All cached after first write — overhead seeds the 32% read line, not full-price input (kill K9).
* **Quality risk:** NEUTRAL (a fact, not an action); mis-reading it as uncached cost overstates the problem 10x.
* **Availability:** CLAUDE-CODE-TODAY (audit via /context). &#x2A;*Effort to adopt:** minutes to audit.
* **Composability:** the denominator for every percentage in this file; this machine's own numbers in 02-baseline-audit.md.
* **Validation protocol:** /context on this setup; decompose system prompt / tools / memory / skills; reconcile with first-call cache\_creation\_input\_tokens in JSONL; re-audit after applying 07 + 10.

***

## Claims to kill (folklore ledger) [#claims-to-kill-folklore-ledger]

| #   | Claim in the wild                                                  | Verdict and corrected number                                                                                                                                                                                                                                                                                                                                                 |
| --- | ------------------------------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| K1  | "caveman cuts \~75% of your tokens"                                | Three-layer overstatement. 75% headline = *pooled* benchmark ratio (1214 -> 294 = 75.8%); the same table's per-task mean is 65% (range 22–87%) (README); local replication 58.5% (ultra). And it targets visible prose only = 17% of heavy-session dollars; whole-bill \~4–6% (ESTIMATE; mayhemcode \~4%). The README itself calls cost savings "a bonus".                   |
| K2  | "wenyan/Classical-Chinese mode saves \~80%"                        | Character-token confusion: 80.9% char cut = 56.6% token cut (CJK \~1.47 chars/tok vs 3.35 English; local). Billing is tokens. wenyan-ultra reaches 74.5% tokens at maximum lossiness.                                                                                                                                                                                        |
| K3  | "TOON cuts 60% of data tokens"                                     | Locally 61.4% only vs *pretty* JSON on uniform tabular; vs minified it is 41.2%, and minification alone is 34.3% (free, riskless). Nested/irregular real payloads: 1.8–20% (independent).                                                                                                                                                                                    |
| K4  | "Prompt caching saves 90%"                                         | 90% is the best launch-table row (book chat); the multi-turn row is -53%; writes bill 1.25–2x. The local heavy session was already maximally cached (92.83% reads) and reads+writes still cost **61% of dollars**. Modeled-day actual: -74% vs no caching (ESTIMATE). Not an available saving for an already-cached session.                                                 |
| K5  | "Model price ratios = cost ratios"                                 | Partly false: code/CJK-heavy work can be close to sticker ratios, but prose/ASCII-heavy text gets the Opus-4.7+ tokenizer premium. Effective Fable:Sonnet input ratio is \~3.33x for code-heavy probes and \~4.3x for prose-heavy probes. Also kills the sweep-draft's inverted "\~2.4–2.9x" (arithmetic error, corrected in record 08). Routers do not tokenizer-normalize. |
| K6  | "Perplexity dropped MCP citing 72% context waste"                  | Single secondary source (nevo.systems); no primary statement found. Do not repeat.                                                                                                                                                                                                                                                                                           |
| K7  | "fff saves X% tokens"                                              | No token percentage exists in the README text — latency numbers and a chart image only. Cite the mechanism, not a number.                                                                                                                                                                                                                                                    |
| K8  | "Semantic caching will cut your coding-agent bill"                 | LiteLLM's own docs claim no percentage; coding prompts are near-worst-case for similarity caches and hits risk stale code. No published coding-agent hit rate (searched).                                                                                                                                                                                                    |
| K9  | "The 14.3k overhead costs 14.3k input tokens per call"             | It is cached: marginal cost 0.1x after first write (\~$1.63/day rent on the modeled profile, not \~$16). Directionally real, economically 10x smaller than the naive reading.                                                                                                                                                                                                |
| K10 | Phase-0 internal note: "caveman's 75% is character-level folklore" | Partially corrected: the repo's benchmark claims Claude-API *token* methodology with reproduction scripts; the 75%-pooled vs 65%-mean distinction explains the headline. The honest kill is K1 ("75 != 65 != 58.5, smallest bucket"), not "they counted characters".                                                                                                         |

## White space — what nobody ships [#white-space--what-nobody-ships]

1. **Thinking-token optimization** — 54.8% of output tokens, 20% of dollars, not disableable on Fable 5; only blunt levers exist (/effort, MAX\_THINKING\_TOKENS). No per-task-type effort routing, no thinking-waste detection. Caveman's README concedes the bucket explicitly.
2. **Cache-economics tooling** — reads+writes = 61% of heavy-session dollars; nothing optimizes breakpoint placement, predicts 5-min-vs-1-h TTL value, schedules compaction to minimize write spikes, or reports write amplification (ccusage shows totals, not causes). See 13-caching-exploitation.md.
3. **Tokenizer-aware routing** — the prose/ASCII tokenizer premium means every $/MTok router can mis-price cross-boundary switches; nobody normalizes, and code-heavy work needs separate counts.
4. **Mid-session selective context pruning in Claude Code** — the API has context editing (84% cut, vendor-proven); end users get only whole-conversation compaction.
5. **Net-accounted memory** — claude-mem (82k stars) and cavemem inject context and run compression calls; neither publishes injection-cost-vs-exploration-saved accounting.
6. **Format-aware tool-result serialization** — the locally-verified 41.2%/34.3% wins are not integrated into any agent's tool-result path or popular MCP server.
7. **Independent replication infrastructure** — every vendor self-reports; a public ct.py-style harness re-measuring plugin claims against live tokenizers does not exist (this dossier's method is itself novel prior art; see 31-validation-harness.md).
8. **Output brevity with quality gates** — caveman trades readability blindly; nothing compresses output only when a verifier confirms zero information loss.

## Surprising findings [#surprising-findings]

* Token-cutting can be quality-improving, with vendor proof: Tool Search -85% tokens with accuracy 49 -> 74; programmatic calling -37% with accuracy up. Negative-cost optimization is shipped, not theoretical.
* Anthropic admitted internal tool definitions hit 134K tokens pre-optimization, and cut its own tool-use system prompt 57% (675 -> 290) between model versions — vendor-side overhead shrinks release-over-release.
* The memory niche out-adopted the compression niche: claude-mem 81,976 stars vs caveman 71,891 — and caveman 5x'd from \~14k in \~10 weeks since early April 2026, the fastest-growing plugin category of 2026.
* The optimizer taxes itself: caveman's always-on skill listing costs 940 tokens of prefix per session (local), \~0.5% of the modeled day — while plain JSON minification (-34.3%, free) goes unmarketed and one experimental flag (agent teams, \~7x) moves cost more than every compression plugin in this scan combined.

## Verification ledger [#verification-ledger]

| Number                                                                                                                                                                                                                                                                                                           | Source / method                                                                                                                                                                             |
| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Stars: caveman 71,891; cavemem 521; cavekit 1,014; fff 8,351; claude-mem 81,976; ccusage 16,080; LLMLingua 6,284                                                                                                                                                                                                 | `gh api repos/&lt;repo&gt; --jq .stargazers_count`, run locally                                                                                                                             |
| caveman "\~75% of output tokens"; mean 65% (range 22–87%); pooled 1214 -> 294 (=75.8%); "Answer concisely." baseline; "thinking... untouched"; "cost savings a bonus"; compress \~46%; cavecrew \~60%; caveman-code "\~2x fewer than Codex"; arXiv 2604.00025 "+26 points" (paper not independently verified)    | curl of raw\.githubusercontent.com/JuliusBrussee/caveman/main/README.md, lines 27/136/151/154/168/126/128/88/170                                                                            |
| caveman local: ultra 58.5%; wenyan-full 56.6% tok / 80.9% chars; wenyan-ultra 74.5%; CJK \~1.47 chars/tok; tool deferral 1,420 -> \~60 (96%); prompt mix 0.44/6.73/92.83%; dollar split 32/29/20/17/2; thinking 54.8% of output; day \~$22                                                                       | local measurement (phase-0); profile detailed in 01-economics-and-measurement.md                                                                                                            |
| caveman \~4% of session total (independent); \~14k stars April 2026                                                                                                                                                                                                                                              | mayhemcode.com/2026/04/caveman-claude-code-how-to-save-tokens.html                                                                                                                          |
| cavemem "\~75% fewer prose tokens", architecture; family positioning                                                                                                                                                                                                                                             | raw cavemem README + getcaveman.dev                                                                                                                                                         |
| fff "3-9 SECONDS per ripgrep spawn and sub-10 ms"; \~26 MB/14k files; "fewer grep roundtrips"; no token % in text                                                                                                                                                                                                | curl of raw fff README lines 22/72/673/714                                                                                                                                                  |
| Cache multipliers 0.1x/1.25x/2x; 512-tok Fable 5 minimum; free refresh; launch table -90/-86/-53                                                                                                                                                                                                                 | platform.claude.com/docs/en/build-with-claude/prompt-caching + claude.com/blog/prompt-caching + pricing page                                                                                |
| Modeled no-cache day ~~$84 (-74%); LLMLingua break-even x~~=5.5; CLAUDE.md rent $0.31/day; 14.3k rent $1.63/day; 20k-fleet rent $2.28/day; caveman ceiling \~10% / realistic 4–6%; thinking ceiling $4.40/day; caveman listing \~0.5%/day; batch+Haiku \~5% unit cost                                            | ESTIMATE — arithmetic shown in records 05/19/07/21/10/01/09/13 and the measurements section, on the modeled profile                                                                         |
| Auto-compact \~83% (\~167k), \~33k buffer; 76/7/17 split; 14,328 fixed overhead; $5.50 vs $10+ session                                                                                                                                                                                                           | dev.to/slima4/where-do-your-claude-code-tokens-actually-go-we-traced-every-single-one-423e                                                                                                  |
| "under 200 lines"; skills 30–100 tok; deferral default; CLI-over-MCP; /effort + MAX\_THINKING\_TOKENS=8000; Fable 5 thinking not disableable; \~7x teams; $13/day; $150–250/mo; \<$30/day for 90%; \<$0.04 background; LiteLLM name-check; grep-hook example; /usage attribution                                 | code.claude.com/docs/en/costs                                                                                                                                                               |
| Firecrawl 3,847 -> 312 (91.9%); 41% path-scoping; 38,381 -> 2,788 (94%); 80–99% logs                                                                                                                                                                                                                             | firecrawl.dev/blog/claude-code-token-efficiency (vendor-interested)                                                                                                                         |
| Prices Fable 5 $10/$50, Opus 4.8 $5/$25, Sonnet 4.6 $3/$15, Haiku 4.5 $1/$5; "up to 35% more tokens" note; 675 -> 290 tool-use prompt (-57%); batch 50%; max\_content\_tokens page scale                                                                                                                         | platform.claude.com/docs/en/about-claude/pricing                                                                                                                                            |
| Local tokenizer premium +42.7% prose / +39.2% one Rust sample raw (+44.7%/+39.7% net of 7-tok wrapper); Sonnet=Haiku identical counts; correction: prose premium holds, code/CJK can be near-neutral                                                                                                             | /tmp/ct.py on Fable-family / Sonnet-family models, fixed prose + `crates/jackin-build-meta/src/lib.rs` head                                                                                 |
| Effort: 76% fewer output tokens at medium; +4.3 pts at -48% (Opus 4.5-era)                                                                                                                                                                                                                                       | anthropic.com/news/claude-opus-4-5                                                                                                                                                          |
| Tool Search: "85% reduction"; 191,300 vs 122,800; 49 -> 74; 79.5 -> 88.1; \~55K five-server; 134K internal; programmatic 43,588 -> 27,297 (37%); retrieval 25.6 -> 28.5, GIA 46.5 -> 51.2                                                                                                                        | anthropic.com/engineering/advanced-tool-use                                                                                                                                                 |
| Context editing 84% / +39% / +29% (100-turn search eval)                                                                                                                                                                                                                                                         | claude.com/blog/context-management                                                                                                                                                          |
| Code Mode 12 -> 4 turns; 15–20k MCP overhead; 143k/200k deployment                                                                                                                                                                                                                                               | thenewstack.io/how-to-reduce-mcp-token-bloat/                                                                                                                                               |
| 20–30k trivial-prompt reports; prompt catalog (Explore 575 / Plan 715 / statusline 2,433 tok)                                                                                                                                                                                                                    | github.com/anthropics/claude-code/issues/52979; github.com/Piebald-AI/claude-code-system-prompts                                                                                            |
| aider map: 1k default, graph ranking                                                                                                                                                                                                                                                                             | aider.chat/docs/repomap.html                                                                                                                                                                |
| LiteLLM: 7 cache backends, similarity\_threshold 0.8, no savings %                                                                                                                                                                                                                                               | docs.litellm.ai/docs/proxy/caching                                                                                                                                                          |
| LLMLingua: 20x headline; +21.4% RAG at 1/4 tokens; 3–6x speedup; EMNLP'23/ACL'24 venues                                                                                                                                                                                                                          | github.com/microsoft/LLMLingua README                                                                                                                                                       |
| TOON official 39.9% fewer at 76.4% vs 75.0% accuracy; nested reality check 1.8–20%                                                                                                                                                                                                                               | toonformat.dev/guide/benchmarks; dev.to/tejas\_page/toon-vs-json-when-60-token-savings-becomes-18-a-reality-check-3e60                                                                      |
| Local format table 484/318/187 tabular (-34.3%/-61.4%/-41.2%), 272/194 nested (-28.7%); sweep same-day -60.6/-40.8/-33.6; log filter 10,108 -> 590 (-94.2%, 3/3 failures kept); <RepoFile path="AGENTS.md">AGENTS.md</RepoFile> 2,744 tok (phase-0 2,738); caveman skill listing 940 tok; wrapper constant 7 tok | /tmp/ct.py claude-fable-5 on generated fixtures (/tmp/tok/gen.py), repo <RepoFile path="AGENTS.md">AGENTS.md</RepoFile>, reconstructed skill listing, 1-char probe; grep filter shown above |