Token-optimization tools — a deep comparison

This folder is a dedicated, self-contained, diagram-driven teardown of the token-optimization tools an operator most often weighs for a Claude Code workflow: caveman, headroom, RTK, and lean-ctx. It collects, consolidates, and deepens the material that was previously spread across the token-optimization dossier (the headroom deep-dive, the compression-market sweep, and the RTK chapter) into one place, and adds what those files did not have: equal-depth design teardowns of every tool, flowcharts of how each one actually works, a feature-by-feature has/lacks matrix, and a straight answer to the operator's question — is there one product that combines them all, and if not, where does each win?

The folder began as a three-way (caveman/headroom/RTK) comparison; lean-ctx was added in a later round because it is the missing fourth data point — the integrated "context runtime" that tries to be all three at once and ships the code-intelligence lever the three-way concluded none of the three had. The folder name and page structure are deliberately generic so further tools can join the same matrices without another rename.

Most claims trace to primary sources already cited in the dossier; this folder is largely a re-synthesis. Research dates: headroom and the compression market were swept 2026-06-15; RTK was swept 2026-06-18; lean-ctx was swept, source-audited, and locally built + benchmarked 2026-06-20. Product percentages are vendor self-reported unless explicitly marked as locally reproduced or independently measured, and are tiered T1–T4 throughout. The broader economics (the modeled session profile, the dollar split, the 10× verdict) live in the token-optimization dossier; this folder assumes them and focuses on the tools.

This is the canonical, deepest comparison of these tools. The dossier chapters it grew out of — 53 (headroom), 54 (compression market), and 56 (RTK) — remain in place for their broader scope and cross-references, and now point here for the head-to-head.

TL;DR

Three of the four are points on one pipeline; the fourth tries to be the whole pipeline. Caveman compresses what the model writes (visible output, ~17% of a heavy session's dollars). Headroom and RTK both compress what the model reads (input, the ~61% cache buckets) at opposite ends of a breadth/determinism trade. lean-ctx is a different category — an integrated context runtime that occupies every interception point the other three split between them (shell hook + MCP read + proxy) and adds a persistent code graph plus a verification layer.
lean-ctx is the superset monolith this hub argued no one was building — and it largely confirms the prediction. It does reach further than any single specialist (it is the only one that bundles a persistent, queryable code graph — the structural-retrieval lever the three-way said all three lacked). But it pays for that reach with the largest footprint of the four by far: a 64.7 MB binary, a daemon, a dashboard, SQLite stores, 77 MCP tools, and host writes across up to 34 agents. The combining page weighs "integrated runtime" against "layered specialists" honestly.
No single tool does everything well, and the layered stack still composes. The "best of each" remains a stack: one output tool (caveman), one input layer (RTK at the Bash boundary, headroom on the wire, or lean-ctx as the runtime). Caveman never overlaps the input tools, so it always composes. The input tools overlap and should not be doubled blindly. Headroom even ships a `tokens_saved_rtk` data plane, which is vendor-side proof that the input tools are designed to layer, not merge.
Every tool is real on its target bucket and every tool is over-marketed by the same category error. "75%" (caveman), "60–95%" (headroom), "60–90%" (RTK), and "up to 99%" (lean-ctx) are per-payload or per-session best cases, not whole-bill dollar savings. Corrected to the whole bill, each saves a low-double-digit percentage of dollars at best, because most input tokens are already read at the 0.1× cache price and none of the four touches the thinking bucket, which is 20% of dollars. There is no 10× here; the dossier's ≈2.5× / ≈5–6.2×-with-routing verdict is unmoved.
Rank by evidence, not stars. Three repos carry PR-inflated star counts (caveman 74.4k, RTK 63.6k, headroom 33.4k) with abnormally low watcher ratios; lean-ctx is the youngest and least-inflated (2.8k★, README-honest about it) but has no independent third-party benchmark. On evidence: caveman has a locally reproduced 58.5% output cut; headroom has production telemetry (median 4.8% whole-session) plus one independent 47.5%; RTK has no whole-session telemetry and no independent benchmark; lean-ctx reproduces locally here (96–99% on code reads, <10% on prose/config) with the most honest self-instrumentation (bounce-netted, signed ledger) but is self-measured on a GPT tokenizer.
The sweet spot is still two layers, not the kitchen sink. Caveman (output) plus one input layer captures the bulk of the realizable win. Reach for lean-ctx when you specifically want the code-graph / memory / verification surface and can carry its footprint; otherwise the lean caveman + RTK (or caveman + headroom-MCP) stack is lower-risk.
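
The whole-bill correction above is plain arithmetic: a headline cut applies only to the bucket a tool reaches, so the whole-bill saving is roughly the bucket's share of dollars times the cut. The sketch below uses the dollar shares of the modeled heavy session quoted in this folder and caveman's locally reproduced 58.5% output cut. The function name, the `coverage` parameter, and the 25% coverage and 90% cut in the second example are illustrative assumptions, not measurements. It also assumes a cut inside a bucket translates one-to-one into dollars, which cache pricing can make optimistic.

```python
# Dollar shares of the modeled heavy session (from this folder's bill breakdown).
BUCKET_SHARE = {
    "cached_input": 0.61,    # cache-write 29% + cache-read 32%
    "thinking": 0.20,
    "visible_output": 0.17,
    "uncached_input": 0.02,
}


def whole_bill_saving(bucket: str, bucket_cut: float, coverage: float = 1.0) -> float:
    """Fraction of the whole bill saved.

    bucket_cut: fraction removed from the part of the bucket the tool reaches.
    coverage:   fraction of the bucket the tool reaches (assumed, 1.0 = all of it).
    """
    return BUCKET_SHARE[bucket] * coverage * bucket_cut


# Caveman: 58.5% cut (locally reproduced) on the visible-output bucket.
caveman = whole_bill_saving("visible_output", 0.585)

# Hypothetical input tool: a "90%" per-payload cut that reaches only an
# assumed quarter of the cached-input bucket.
input_tool = whole_bill_saving("cached_input", 0.90, coverage=0.25)

print(f"caveman whole-bill saving:      {caveman:.1%}")
print(f"hypothetical input-tool saving: {input_tool:.1%}")
```

Under these assumptions the 58.5% output cut comes out just under 10% of the bill, and the hypothetical "90%" input tool lands in the low teens, which is the gap between a per-payload headline and a whole-bill number.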

The one idea that organizes everything: where each tool intercepts

A coding agent's token bill splits into a small number of buckets. Each tool's entire character — its risk, its reach, its cache behavior — follows from where in the pipeline it acts. Three of the four bite on one place each; lean-ctx bites on several at once.

                THE CODING-AGENT TOKEN BILL  (heavy modeled session)
   ┌──────────────────────────────────────────────────────────────────────┐
   │  INPUT  (what the model READS)                          61% of $       │
   │    • system prompt + CLAUDE.md + tool schemas                          │
   │    • conversation history                                              │
   │    • tool outputs / logs / build & test output                        │
   │    • file reads / RAG chunks / search results                         │
   │      └── billed as cache-write (29%) + cache-read (32%)                │
   │                                                                        │
   │  THINKING  (reasoning tokens, billed as output, invisible)  20% of $   │
   │                                                                        │
   │  VISIBLE OUTPUT  (the prose the model writes back)          17% of $   │
   │                                                                        │
   │  uncached input                                              2% of $   │
   └──────────────────────────────────────────────────────────────────────┘

        CAVEMAN  ─────────────────────────────────►  VISIBLE OUTPUT
                 shrinks what the model writes              (17%)

        RTK      ─────────────────────────────────►  a SLICE of INPUT:
                 Bash command output, at the tool          shell output only
                 boundary, before it enters context        (part of the 61%)

        HEADROOM ─────────────────────────────────►  BROAD INPUT:
                 tool outputs, files, RAG, history,        most of the 61%
                 on the API wire or via MCP

        LEAN-CTX ════╗═══════════════════════════►  Bash output (shell hook)
                     ╠═══════════════════════════►  native reads (MCP, 10 modes)
                     ╠═══════════════════════════►  history/prose (proxy, opt-in)
                     ╚═══════════════════════════►  + PERSISTENT CODE GRAPH below
                 one runtime spanning every input point + structural retrieval

        (nothing here)  ─────────────────────────►  THINKING  (20%)
                 only the effort / model-routing levers reach it

Read off this diagram the facts the rest of the folder elaborates:

Caveman never overlaps with any of the others. It works on output; they work on input. Running caveman plus any input tool is strictly additive — no double-counting is even possible.
Headroom and RTK look like they overlap, but they act at different points. RTK filters a command's output at the Bash tool boundary, before it is ever sent; headroom compresses the request on the API wire (or an observation via MCP), which includes whatever RTK already filtered plus everything RTK never sees. They compose; the only redundancy is re-compressing the exact same bytes twice.
RTK's reach is a strict subset of headroom's reach. Anything that does not flow through a shell — native Read, Grep, Glob, RAG, conversation history — is invisible to RTK and visible to headroom. RTK buys cache-safety and zero ML cost by giving up reach.
lean-ctx tries to occupy all of those points in one process — shell hook (RTK's slot), MCP read (headroom-MCP's slot), proxy (headroom-proxy's slot) — and then adds a layer below the read that none of the others have: a persistent, queryable code graph (the structural-retrieval lever). It is not a fourth point on the line; it is a runtime drawn across the line. The cost of that is footprint, detailed throughout.
The biggest single bucket after the two cache lines (cache-read 32%, cache-write 29%) is thinking, at 20%, and it is untouched by all four. This is why no stack of these tools reaches 10×.
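
The ceiling implied by the untouched thinking bucket has the same shape as Amdahl's law: whatever share of the bill a stack cannot reach bounds the multiplier. The sketch below is an idealized bound under the modeled 20% thinking share, not a forecast. The function name and the 50% example cut are assumptions for illustration.

```python
def bill_multiplier(untouched_share: float, cut_on_rest: float) -> float:
    """Cost-reduction multiplier for a stack that removes `cut_on_rest` of every
    reachable dollar while `untouched_share` of the bill stays fixed."""
    remaining = untouched_share + (1.0 - untouched_share) * (1.0 - cut_on_rest)
    return 1.0 / remaining


THINKING_SHARE = 0.20  # modeled heavy session; none of the four tools reaches it

# Idealized ceiling: every non-thinking dollar removed entirely.
ceiling = bill_multiplier(THINKING_SHARE, cut_on_rest=1.0)

# A still-generous assumed case: half of every reachable dollar removed.
half = bill_multiplier(THINKING_SHARE, cut_on_rest=0.5)

print(f"ceiling with thinking untouched: {ceiling:.1f}x")
print(f"with a 50% cut everywhere else:  {half:.2f}x")
```

Even if input and output were compressed to zero cost, the multiplier would stop at 5×. Getting past that requires a lever that reaches thinking itself, which in this folder means effort and model routing rather than any of the four tools.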

The determinism gradient

The tools line up on a single axis from "no machinery at all" to "a full ML runtime." This gradient predicts their latency, their host footprint, their failure modes, and their reach. The first three are points on it; lean-ctx spans it — its default core is as deterministic as RTK, but its opt-in layers (embeddings, proxy prose rewrite) reach into headroom's ML/proxy territory.

   LESS MACHINERY                                                MORE MACHINERY
   (less reach, less risk)                          (more reach, more reversibility)
   ◄──────────────────────────────────────────────────────────────────────────►

   CAVEMAN                 RTK                  HEADROOM            LEAN-CTX
   ───────                 ───                  ────────            ────────
   a PROMPT, not           12 deterministic     router + typed      a RUNTIME spanning
   a program               Rust filters         compressors + a     the whole axis:
   (model compresses       keyed on the         trained ML model    deterministic core
    its own decoding)      command; no model    (kompress-base)     (tree-sitter/BM25/
                                                                    entropy) + OPT-IN
                                                                    embeddings & proxy

   ML in loop:  NO         NO                   YES (always)        NO by default;
                                                                    opt-in embeddings/proxy
   recovery:    none       tee on FAIL          CCR retrieve        archive + ctx_expand
   runtime:     ~0         5–15 ms/cmd          P50 52ms/P99 4.2s   daemon + 64.7 MB binary
   reach:       output     Bash output          everything on wire  every input point +
                                                                    persistent code graph

More machinery buys breadth and reversibility, but it costs latency, host effects, and a real attack surface. Caveman is the zero-machinery extreme; RTK is maximum determinism without an ML stage; headroom pays for an ML model to buy reach across every input source; lean-ctx is deterministic-by-default but the broadest of all, paying in footprint rather than mandatory ML.
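
To make "deterministic filters keyed on the command" concrete, here is a toy sketch of that style of compressor: a lookup table from program name to a pure text filter, with no model in the loop. It is not RTK's code (RTK is written in Rust and its filters are not reproduced here). The filter functions, the `FILTERS` mapping, and the choice of commands are all hypothetical.

```python
from collections import Counter
from typing import Callable


def dedup_lines(output: str) -> str:
    """Collapse repeated lines into one line with a count, in first-seen order."""
    counts = Counter(output.splitlines())
    return "\n".join(
        f"{line}  (x{n})" if n > 1 else line for line, n in counts.items()
    )


def truncate_lines(output: str, keep: int = 20) -> str:
    """Keep the first `keep` lines and note how many were dropped."""
    lines = output.splitlines()
    if len(lines) <= keep:
        return output
    return "\n".join(lines[:keep] + [f"... {len(lines) - keep} more lines omitted"])


# Hypothetical mapping: the program name alone selects the filter.
FILTERS: dict[str, Callable[[str], str]] = {
    "pytest": dedup_lines,
    "ls": truncate_lines,
}


def filter_output(command: str, output: str) -> str:
    """Apply the filter registered for the command's program; pass through otherwise."""
    parts = command.split()
    handler = FILTERS.get(parts[0]) if parts else None
    return handler(output) if handler else output


print(filter_output("pytest -q", "warning: slow\nwarning: slow\n3 passed"))
```

The same input always yields the same output, and an unknown command passes through untouched. That is the trade this end of the gradient makes: predictable and cheap, but blind to anything without a registered filter.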

Master comparison table

Axis	Caveman	Headroom	RTK	lean-ctx
One-line identity	Output-register compressor	Broad input-compression pipeline	Deterministic Bash-output compressor	Integrated context runtime
Direction	Output — what the model writes	Input — what the model reads (broad)	Input — what the model reads (Bash only)	Input, all of it + code-graph retrieval (no output)
Primary bucket	Visible prose (~17% of $)	Tool outputs / logs / RAG / files / history (the 61% cache lines)	Shell-command output (a slice of the 61%)	Native reads + shell + history + providers (most of the 61%)
Engine type	Markdown instruction (no code path)	Router + typed compressors + ML model	12 deterministic Rust filters keyed on the command	Tree-sitter AST + entropy/TF-IDF + 56 shell patterns + BM25/graph; opt-in embeddings
Mechanism	Skill-prompt register change (terse English / Classical Chinese)	Typed compressors (AST/JSON/log/search) + trained text model + drop-low-value	Filter / group / truncate / dedup on 100+ known command formats	10 read modes + handle cache + CFT Φ-scoring + knapsack compiler
ML in the hot path	No	Yes (`kompress-base`, auto-downloaded)	No	No by default (deterministic core); opt-in embeddings + proxy prose
Lossiness	Lossy, no recovery	Reversible (originals in CCR via `headroom_retrieve`)	Lossy; tee-recovery on command failure only	Reversible (archive + `ctx_expand`, FTS5-searchable)
Touches code?	Passes code/diffs verbatim	Outlines function bodies; raw code passes through at ~0% reduction	Trims `cat`/`grep`; language-aware regex code filter (10 langs, per-read)	Its strongest case: tree-sitter outline 96–99% on code (per-read)
Touches thinking?	No	No	No	No
Cache interaction	Neutral (output side)	Safe in MCP/library, risk in proxy mode	Safe by construction (write-time at the tool boundary)	Safe in MCP/hook (+~13-tok handle re-reads); proxy cache-safe-by-design but lossy
MCP schema rent	~940 tok/session skill listing	Yes, in MCP mode	None	Yes — 77 tools (dynamic loading: core+session only at startup)
Reach limit	Visible prose only	Everything in the request	Bash calls only — native `Read`/`Edit`/`Grep`/`Glob` bypass it	Broadest — reaches native reads (MCP), shell (hook), history (proxy); no output
Form factor	Claude Code plugin / skills	pip/npm/docker + Rust core + local ML runtime	Single Rust binary + a PreToolUse hook	64.7 MB Rust binary + daemon + dashboard + SQLite + MCP/hook/proxy/HTTP
Persistent state	None (hooks track tokens)	CCR store + cross-agent memory + `learn`	SQLite history (`rtk gain`)	CCP session + knowledge graph + property graph + BM25 + Context OS
Self-cost	~940-tok prefix + 2 hooks	Per-request ML + proxy latency + MCP rent	Hook host-write + hook-conflict surface; ~5–15 ms/cmd	Largest — daemon, dashboard, DBs, 77-tool schema, host writes ×34 agents
Best evidence	Output ultra −58.5% local (token) vs 75% claim	−66.1% self-report mix; median 4.8% whole-session (50k+ sessions); 47.5% independent	60–90% per-command, vendor-only; no whole-session telemetry, no independent benchmark	96–99% on code reads locally reproduced here; bounce-netted signed ledger; no independent benchmark, GPT tokenizer
Adoption (2026-06-18/20)	74,446★ / 166 watchers	33,359★ / 111 watchers	63,608★ / 146 watchers	2,800★ / 19 watchers — youngest, least-inflated
License	Plugin/skill model (MIT)	Apache-2.0	Apache-2.0	Apache-2.0 (local free; paid cloud sync)
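
The "low watcher ratio" caveat can be checked directly from the Adoption row. The sketch below computes stars per watcher for each repo from the counts in the table. The metric is only a rough heuristic, assumed here for illustration, and a high ratio on its own does not prove inflation.

```python
# (stars, watchers) as listed in the Adoption row of the table above.
ADOPTION = {
    "caveman": (74_446, 166),
    "headroom": (33_359, 111),
    "RTK": (63_608, 146),
    "lean-ctx": (2_800, 19),
}


def stars_per_watcher(stars: int, watchers: int) -> float:
    """Stars divided by watchers; a higher value means fewer watchers per star."""
    return stars / watchers


ratios = {name: stars_per_watcher(s, w) for name, (s, w) in ADOPTION.items()}

# Print from the highest ratio (fewest watchers per star) to the lowest.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:9s} {ratio:6.0f} stars per watcher")
```

On these counts the three older repos sit at roughly 300 to 450 stars per watcher, while lean-ctx is near 150, which matches the table's "least-inflated" label.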

The verdict in one paragraph

No one product combines everything and gets the best of each — and lean-ctx, the one tool that genuinely tries to, illustrates why rather than refuting it. The three specialists win because they specialize: caveman by being a free prompt that touches the 5×-priced output class with zero machinery; RTK by being deterministic and cache-safe-by-construction on the one input slice that is both large and concrete (shell output), at zero ML and zero MCP cost; headroom by paying for an ML stage and a proxy to reach the input sources RTK cannot (native reads, RAG, history) and to make compression reversible. lean-ctx adds something real none of them have — a persistent, queryable code graph (the structural-retrieval lever) plus a verification layer that proves the saving — but to get there it carries a 64.7 MB binary, a daemon, a dashboard, several databases, a 77-tool schema, and host writes across dozens of agents: every cost the combining page predicted a monolith would inherit. So the choice is not "specialists vs the one that does it all"; it is layered specialists vs an integrated runtime — the lean stack when you want the cheapest cache-safe win, lean-ctx when you specifically want its code-graph/memory/verification surface and can carry the footprint. Either way, none of the four touches thinking, so none is a 10× story.

How to read this folder

Page	What it answers
01 — Caveman design	How the output-register compressor works: the "prompt, not a program" magic, the six levels, the hooks, the cavecrew/cavemem/cavekit ecosystem.
02 — Headroom design	How the broad input pipeline works: the ContentRouter, the typed compressors, the `kompress-base` ML stage, the live-zone cache-stabilization magic, CCR, learn, the four deployment modes.
03 — RTK design	How the deterministic Bash-boundary compressor works: the six-phase lifecycle, the 12 strategies, the code filter, the hook modes, the reach limit.
04 — lean-ctx design	How the integrated context runtime works: the superset thesis, what it productizes from each of the other three, the code graph + RRF search the three lacked, CFT Φ-scoring, the verification/proof layer, and the monolith-tax footprint — with first-party build + benchmark numbers.
05 — Head-to-head	The feature has/lacks matrix (now four-way), the internals side-by-side, and the best case for each — where each beats the others.
06 — Combining	Is there one product? lean-ctx as the real test of the monolith thesis, the layered stack, the published head-to-head numbers, integrated-runtime-vs-stack by project shape, and the jackin' adoption order.
07 — Evidence and claims	Benchmarks, what is real vs self-report, the consolidated claim graveyard, adoption-stat caveats, and the validation harness.
08 — Records, ledger & unverified	The formal per-technique records (C1 / H1–H4 / R1 / L1), the full consolidated source ledger, and the unverified-claims register — vague or vendor-only numbers kept and marked "not proven" so nothing prior is lost.
09 — Gaps, open questions & next brief	What this research still misses: uncompared axes, capabilities a fresh sweep surfaced, the unanswered questions, and a ready-to-paste `/goal` brief for the next round.
10 — First-party measurements	The first measured (not self-reported) numbers: this repo's token decomposition (94% cache-read), RTK's reach ceiling at 16.5% of observation tokens vs native `Read` at 76.2%, and a locally built lean-ctx benchmark (96–99% on code reads, <10% on prose). Full 6-arm A/B still INCOMPLETE.
11 — Extended comparison axes	The six axes the earlier pages never compared: security/privacy/supply-chain, project health & sustainability, interaction with Claude Code native context features, build-vs-buy vs an output-style, subscriber tasks-per-cap (which inverts the $-per-task order), and non-coding generality.
12 — Implementation deep-dive	The first source-level pass: all four repos cloned and read, comparing rival implementations of each shared feature (shell compression, code outlining, cache-safety, memory, savings accounting, output register) with `file:line` evidence — whose code is better and why. Confirms lean-ctx out-engineers the specialists on most input features (tree-sitter vs regex, bounce-netted signed ledger vs `SUM`, measured cache-safety ratio) yet still has no output register, and logs the source-verified corrections to earlier pages.
