# Token-optimization tools — caveman, headroom, RTK & lean-ctx, compared (https://jackin.tailrocks.com/research/token-optimization-tools/)

# Token-optimization tools — a deep comparison [#token-optimization-tools--a-deep-comparison]

This folder is a dedicated, self-contained, diagram-driven teardown of the token-optimization tools an operator most often weighs for a Claude Code workflow: **caveman**, **headroom**, **RTK**, and **lean-ctx**. It collects, consolidates, and deepens the material that was previously spread across the token-optimization dossier (the headroom deep-dive, the compression-market sweep, and the RTK chapter) into one place, and adds what those files did not have: equal-depth design teardowns of every tool, flowcharts of how each one actually works, a feature-by-feature has/lacks matrix, and a straight answer to the operator's question: *is there one product that combines them all, and if not, where does each win?*

The folder began as a three-way (caveman/headroom/RTK) comparison; **lean-ctx** was added in a later round because it is the missing fourth data point: the integrated "context runtime" that tries to be all three at once *and* ships the code-intelligence lever the three-way concluded none of the three had. The folder name and page structure are deliberately generic so further tools can join the same matrices without another rename.

Most claims trace to primary sources already cited in the dossier; this folder is largely a re-synthesis. Research dates: headroom and the compression market were swept 2026-06-15; RTK was swept 2026-06-18; **lean-ctx was swept, source-audited, and locally built + benchmarked 2026-06-20**. Product percentages are vendor self-reported unless explicitly marked as locally reproduced or independently measured, and are tiered T1–T4 throughout.
The broader economics (the modeled session profile, the dollar split, the 10× verdict) live in the [token-optimization dossier](/research/token-optimization/); this folder assumes them and focuses on the tools.

## TL;DR [#tldr]

* **Three of the four are points on one pipeline; the fourth tries to be the whole pipeline.** Caveman compresses what the model **writes** (visible output, \~17% of a heavy session's dollars). Headroom and RTK both compress what the model **reads** (input, the \~61% cache buckets) at opposite ends of a breadth/determinism trade. **lean-ctx is a different category**: an integrated context *runtime* that occupies every interception point the other three split between them (shell hook + MCP read + proxy) *and* adds a persistent code graph plus a verification layer.
* **lean-ctx is the superset monolith this hub argued no one was building, and it largely confirms the prediction.** It does reach further than any single specialist (it is the only one that bundles a persistent, queryable code graph, the structural-retrieval lever the three-way said all three lacked). But it pays for that reach with the largest footprint of the four by far: a 64.7 MB binary, a daemon, a dashboard, SQLite stores, 77 MCP tools, and host writes across up to 34 agents. The [combining page](/research/token-optimization-tools/06-combining/) weighs "integrated runtime" against "layered specialists" honestly.
* **No single tool does everything well, and the layered stack still composes.** The "best of each" remains a stack: one output tool (caveman), one input layer (RTK at the Bash boundary, headroom on the wire, or lean-ctx as the runtime). Caveman never overlaps the input tools, so it always composes. The input tools overlap and should not be doubled blindly. Headroom even ships a `tokens_saved_rtk` data plane, vendor-side proof that the input tools are designed to layer, not merge.
* **Every tool is real on its target bucket and every tool is over-marketed by the same category error.** "75%" (caveman), "60–95%" (headroom), "60–90%" (RTK), and "up to 99%" (lean-ctx) are per-payload or per-session best cases, not whole-bill dollar savings. Corrected to the whole bill, each lands at a low double-digit percentage of dollars at best, because most input tokens are already billed at the 0.1× cache price and **none of the four touches the 20%-of-dollars thinking bucket**. There is no 10× here; the dossier's ≈2.5× / ≈5–6.2×-with-routing verdict is unmoved.
* **Rank by evidence, not stars.** Three repos carry PR-inflated star counts (caveman 74.4k, RTK 63.6k, headroom 33.4k) with abnormally low watcher ratios; **lean-ctx is the youngest and least-inflated** (2.8k★, README-honest about it) but has *no* independent third-party benchmark. On evidence: caveman has a locally reproduced 58.5% output cut; headroom has production telemetry (median 4.8% whole-session) plus one independent 47.5%; RTK has no whole-session telemetry and no independent benchmark; lean-ctx reproduces locally here (96–99% on *code* reads, \<10% on prose/config) with the most honest self-instrumentation (bounce-netted, signed ledger) but is self-measured on a GPT tokenizer.
* **The sweet spot is still two layers, not the kitchen sink.** Caveman (output) plus one input layer captures the bulk of the realizable win. Reach for lean-ctx when you specifically want the code-graph / memory / verification surface and can carry its footprint; otherwise the lean caveman + RTK (or caveman + headroom-MCP) stack is lower-risk.

## The one idea that organizes everything: where each tool intercepts [#the-one-idea-that-organizes-everything-where-each-tool-intercepts]

A coding agent's token bill splits into a small number of buckets. Each tool's entire character (its risk, its reach, its cache behavior) follows from *where* in the pipeline it acts.
Three of the four bite on one place each; lean-ctx bites on several at once.

```text
THE CODING-AGENT TOKEN BILL (heavy modeled session)
┌──────────────────────────────────────────────────────────────────────┐
│  INPUT (what the model READS)                              61% of $  │
│    • system prompt + CLAUDE.md + tool schemas                        │
│    • conversation history                                            │
│    • tool outputs / logs / build & test output                       │
│    • file reads / RAG chunks / search results                        │
│    └── billed as cache-write (29%) + cache-read (32%)                │
│                                                                      │
│  THINKING (reasoning tokens, billed as output, invisible)  20% of $  │
│                                                                      │
│  VISIBLE OUTPUT (the prose the model writes back)          17% of $  │
│                                                                      │
│  uncached input                                             2% of $  │
└──────────────────────────────────────────────────────────────────────┘

CAVEMAN  ─────────────────────────────────►  VISIBLE OUTPUT
  shrinks what the model writes              (17%)

RTK      ─────────────────────────────────►  a SLICE of INPUT:
  Bash command output, at the tool           shell output only
  boundary, before it enters context         (part of the 61%)

HEADROOM ─────────────────────────────────►  BROAD INPUT:
  tool outputs, files, RAG, history,         most of the 61%
  on the API wire or via MCP

LEAN-CTX ════╗═══════════════════════════►  Bash output (shell hook)
             ╠═══════════════════════════►  native reads (MCP, 10 modes)
             ╠═══════════════════════════►  history/prose (proxy, opt-in)
             ╚═══════════════════════════►  + PERSISTENT CODE GRAPH below
  one runtime spanning every input point     + structural retrieval

(nothing here) ─────────────────────────►  THINKING (20%)
  only the effort / model-routing levers reach it
```

Read off this diagram the facts the rest of the folder elaborates:

1. **Caveman never overlaps with any of the others.** It works on output; they work on input. Running caveman plus any input tool is strictly additive; no double-counting is even possible.
2. **Headroom and RTK look like they overlap, but they act at different points.** RTK filters a command's output at the Bash tool boundary, *before* it is ever sent; headroom compresses the request on the API wire (or an observation via MCP), which includes whatever RTK already filtered plus everything RTK never sees. They compose; the only redundancy is re-compressing the exact same bytes twice.
3. **RTK's reach is a strict subset of headroom's reach.** Anything that does not flow through a shell (native `Read`, `Grep`, `Glob`, RAG, conversation history) is invisible to RTK and visible to headroom. RTK buys cache-safety and zero ML cost by giving up reach.
4. **lean-ctx tries to occupy all of those points in one process**: shell hook (RTK's slot), MCP read (headroom-MCP's slot), proxy (headroom-proxy's slot). It then adds a layer *below* the read that none of the others have: a persistent, queryable code graph (the structural-retrieval lever). It is not a fourth point on the line; it is a runtime drawn *across* the line. The cost of that is footprint, detailed throughout.
5. **The biggest single bucket after the two cache buckets (thinking, 20%) is untouched by all four.** This is why no stack of these tools reaches 10×.

## The determinism gradient [#the-determinism-gradient]

The tools line up on a single axis from "no machinery at all" to "a full ML runtime." This gradient predicts their latency, their host footprint, their failure modes, and their reach. The first three are points on it; **lean-ctx spans it**: its default core is as deterministic as RTK, but its opt-in layers (embeddings, proxy prose rewrite) reach into headroom's ML/proxy territory.
```text
LESS MACHINERY                                                  MORE MACHINERY
(less reach, less risk)                        (more reach, more reversibility)
◄──────────────────────────────────────────────────────────────────────────►

             CAVEMAN             RTK                 HEADROOM            LEAN-CTX
             ───────             ───                 ────────            ────────
             a PROMPT, not       12 deterministic    router + typed      a RUNTIME spanning
             a program           Rust filters        compressors + a     the whole axis:
             (model compresses   keyed on the        trained ML model    deterministic core
             its own decoding)   command; no model   (kompress-base)     (tree-sitter/BM25/
                                                                         entropy) + OPT-IN
                                                                         embeddings & proxy

ML in loop:  NO                  NO                  YES (always)        NO by default; opt-in
                                                                         embeddings/proxy
recovery:    none                tee on FAIL         CCR retrieve        archive + ctx_expand
runtime:     ~0                  5–15 ms/cmd         P50 52ms/P99 4.2s   daemon + 64.7 MB binary
reach:       output              Bash output         everything on wire  every input point +
                                                                         persistent code graph
```

More machinery buys breadth and reversibility, but it costs latency, host effects, and a real attack surface. Caveman is the zero-machinery extreme; RTK is maximum determinism without an ML stage; headroom pays for an ML model to buy reach across every input source; **lean-ctx is deterministic-by-default but the broadest of all, paying in footprint rather than mandatory ML.**

## Master comparison table [#master-comparison-table]

| Axis | **Caveman** | **Headroom** | **RTK** | **lean-ctx** |
| --- | --- | --- | --- | --- |
| One-line identity | Output-register compressor | Broad input-compression pipeline | Deterministic Bash-output compressor | Integrated context runtime |
| Direction | **Output**: what the model writes | **Input**: what the model reads (broad) | **Input**: what the model reads (Bash only) | **Input, all of it** + code-graph retrieval (no output) |
| Primary bucket | Visible prose (\~17% of $) | Tool outputs / logs / RAG / files / history (the 61% cache lines) | Shell-command output (a slice of the 61%) | Native reads + shell + history + providers (most of the 61%) |
| Engine type | **Markdown instruction** (no code path) | Router + typed compressors + ML model | 12 deterministic Rust filters keyed on the command | Tree-sitter AST + entropy/TF-IDF + 56 shell patterns + BM25/graph; opt-in embeddings |
| Mechanism | Skill-prompt register change (terse English / Classical Chinese) | Typed compressors (AST/JSON/log/search) + trained text model + drop-low-value | Filter / group / truncate / dedup on 100+ known command formats | 10 read modes + handle cache + CFT Φ-scoring + knapsack compiler |
| ML in the hot path | No | **Yes** (`kompress-base`, auto-downloaded) | No | **No by default** (deterministic core); opt-in embeddings + proxy prose |
| Lossiness | **Lossy, no recovery** | **Reversible** (originals in CCR via `headroom_retrieve`) | Lossy; tee-recovery on command **failure** only | **Reversible** (archive + `ctx_expand`, FTS5-searchable) |
| Touches code? | Passes code/diffs verbatim | Outlines bodies, passes raw code \~0% | Trims `cat`/`grep`; language-aware regex code filter (10 langs, per-read) | **Its strongest case**: tree-sitter outline 96–99% on code (per-read) |
| Touches thinking? | **No** | **No** | **No** | **No** |
| Cache interaction | Neutral (output side) | Safe in MCP/library, risk in proxy mode | **Safe by construction** (write-time at the tool boundary) | Safe in MCP/hook (+\~13-tok handle re-reads); proxy cache-safe-by-design but lossy |
| MCP schema rent | \~940 tok/session skill listing | Yes, in MCP mode | **None** | Yes, 77 tools (dynamic loading: core+session only at startup) |
| Reach limit | Visible prose only | Everything in the request | **Bash calls only**: native `Read`/`Edit`/`Grep`/`Glob` bypass it | Broadest: reaches native reads (MCP), shell (hook), history (proxy); no output |
| Form factor | Claude Code plugin / skills | pip/npm/docker + Rust core + local ML runtime | Single Rust binary + a PreToolUse hook | **64.7 MB Rust binary** + daemon + dashboard + SQLite + MCP/hook/proxy/HTTP |
| Persistent state | None (hooks track tokens) | CCR store + cross-agent memory + `learn` | SQLite history (`rtk gain`) | CCP session + knowledge graph + property graph + BM25 + Context OS |
| Self-cost | \~940-tok prefix + 2 hooks | Per-request ML + proxy latency + MCP rent | Hook host-write + hook-conflict surface; \~5–15 ms/cmd | **Largest**: daemon, dashboard, DBs, 77-tool schema, host writes ×34 agents |
| Best evidence | Output ultra **−58.5% local** (token) vs 75% claim | **−66.1% self-report** mix; **median 4.8% whole-session** (50k+ sessions); **47.5% independent** | **60–90% per-command, vendor-only**; **no whole-session telemetry, no independent benchmark** | **96–99% on code reads locally reproduced here**; bounce-netted signed ledger; **no independent benchmark, GPT tokenizer** |
| Adoption (2026-06-18/20) | 74,446★ / 166 watchers | 33,359★ / 111 watchers | 63,608★ / 146 watchers | 2,800★ / 19 watchers; youngest, least-inflated |
| License | Plugin/skill model (MIT) | Apache-2.0 | Apache-2.0 | Apache-2.0 (local free; paid cloud sync) |

## The verdict in one paragraph [#the-verdict-in-one-paragraph]

No one product combines everything and gets the best of each, and lean-ctx, the one tool that genuinely *tries* to, illustrates why rather than refuting it. The three specialists win *because* they specialize: caveman by being a free prompt that touches the 5×-priced output class with zero machinery; RTK by being deterministic and cache-safe-by-construction on the one input slice that is both large and concrete (shell output), at zero ML and zero MCP cost; headroom by paying for an ML stage and a proxy to reach the input sources RTK cannot (native reads, RAG, history) and to make compression reversible. lean-ctx adds something real that none of them have (a persistent, queryable code graph, the structural-retrieval lever, plus a verification layer that *proves* the saving), but to get there it carries a 64.7 MB binary, a daemon, a dashboard, several databases, a 77-tool schema, and host writes across dozens of agents: every cost the [combining page](/research/token-optimization-tools/06-combining/) predicted a monolith would inherit. So the choice is not "specialists vs the one that does it all"; it is **layered specialists vs an integrated runtime**: the lean stack when you want the cheapest cache-safe win, lean-ctx when you specifically want its code-graph/memory/verification surface and can carry the footprint. Either way, none of the four touches thinking, so none is a 10× story.
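The whole-bill correction behind this verdict is plain weighted arithmetic. The following is a minimal Python sketch, assuming the dollar shares from the token-bill diagram and assuming (optimistically) that a per-payload cut applies uniformly to its whole bucket. `BILL_SHARES`, `whole_bill_saving`, and `cost_multiple` are illustrative names, not part of any of the four tools.

```python
# Dollar shares of the heavy modeled session, copied from the token-bill diagram.
BILL_SHARES = {
    "cache_write": 0.29,
    "cache_read": 0.32,
    "thinking": 0.20,
    "visible_output": 0.17,
    "uncached_input": 0.02,
}


def whole_bill_saving(bucket_cuts):
    """Fraction of total dollars saved, given a fractional cut per bucket.

    Assumption: the cut applies uniformly to every dollar in the bucket,
    which is an upper bound for a real tool.
    """
    return sum(BILL_SHARES[bucket] * cut for bucket, cut in bucket_cuts.items())


def cost_multiple(saving):
    """How many times cheaper the session becomes for a whole-bill saving."""
    return 1.0 / (1.0 - saving)


# Caveman touches visible output only: the 75% claim vs the 58.5% local result.
caveman_claimed = whole_bill_saving({"visible_output": 0.75})
caveman_local = whole_bill_saving({"visible_output": 0.585})

# Hypothetical ceiling: every bucket except thinking is removed entirely.
ceiling = whole_bill_saving({b: 1.0 for b in BILL_SHARES if b != "thinking"})

print(f"caveman, claimed cut:  {caveman_claimed:.2%} of dollars")
print(f"caveman, local cut:    {caveman_local:.2%} of dollars")
print(f"ceiling w/o thinking:  {cost_multiple(ceiling):.1f}x cheaper")
```

Under these assumptions the claimed 75% output cut is worth 12.75% of dollars and the locally reproduced 58.5% cut about 9.9%. Even deleting every non-thinking bucket outright caps the gain at 5×, so the untouched thinking bucket alone rules out a 10× result for any stack of these tools.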
## How to read this folder [#how-to-read-this-folder]

| Page | What it answers |
| --- | --- |
| [01 — Caveman design](/research/token-optimization-tools/01-caveman-design/) | How the output-register compressor works: the "prompt, not a program" magic, the six levels, the hooks, the cavecrew/cavemem/cavekit ecosystem. |
| [02 — Headroom design](/research/token-optimization-tools/02-headroom-design/) | How the broad input pipeline works: the ContentRouter, the typed compressors, the `kompress-base` ML stage, the live-zone cache-stabilization magic, CCR, learn, the four deployment modes. |
| [03 — RTK design](/research/token-optimization-tools/03-rtk-design/) | How the deterministic Bash-boundary compressor works: the six-phase lifecycle, the 12 strategies, the code filter, the hook modes, the reach limit. |
| [04 — lean-ctx design](/research/token-optimization-tools/04-leanctx-design/) | How the integrated context runtime works: the superset thesis, what it productizes from each of the other three, the code graph + RRF search the three lacked, CFT Φ-scoring, the verification/proof layer, and the monolith-tax footprint, with first-party build + benchmark numbers. |
| [05 — Head-to-head](/research/token-optimization-tools/05-head-to-head/) | The feature has/lacks matrix (now four-way), the internals side-by-side, and the best case for each: where each beats the others. |
| [06 — Combining](/research/token-optimization-tools/06-combining/) | Is there one product? lean-ctx as the real test of the monolith thesis, the layered stack, the published head-to-head numbers, integrated-runtime-vs-stack by project shape, and the jackin' adoption order. |
| [07 — Evidence and claims](/research/token-optimization-tools/07-evidence-and-claims/) | Benchmarks, what is real vs self-report, the consolidated claim graveyard, adoption-stat caveats, and the validation harness. |
| [08 — Records, ledger & unverified](/research/token-optimization-tools/08-records-ledger-and-unverified/) | The formal per-technique records (C1 / H1–H4 / R1 / L1), the full consolidated source ledger, and the unverified-claims register: vague or vendor-only numbers kept and marked "not proven" so nothing prior is lost. |
| [09 — Gaps, open questions & next brief](/research/token-optimization-tools/09-gaps-open-questions-and-next-brief/) | What this research still misses: uncompared axes, capabilities a fresh sweep surfaced, the unanswered questions, and a ready-to-paste `/goal` brief for the next round. |
| [10 — First-party measurements](/research/token-optimization-tools/10-first-party-measurements/) | The first *measured* (not self-reported) numbers: this repo's token decomposition (94% cache-read), RTK's reach ceiling at **16.5%** of observation tokens vs native `Read` at **76.2%**, and a locally built lean-ctx benchmark (96–99% on code reads, \<10% on prose). Full 6-arm A/B still INCOMPLETE. |
| [11 — Extended comparison axes](/research/token-optimization-tools/11-extended-comparison-axes/) | The six axes the earlier pages never compared: security/privacy/supply-chain, project health & sustainability, interaction with Claude Code native context features, build-vs-buy vs an output-style, subscriber tasks-per-cap (which *inverts* the $-per-task order), and non-coding generality. |
| [12 — Implementation deep-dive](/research/token-optimization-tools/12-implementation-comparison/) | The first *source-level* pass: all four repos cloned and read, comparing rival **implementations** of each shared feature (shell compression, code outlining, cache-safety, memory, savings accounting, output register) with `file:line` evidence: whose code is better and why. Confirms lean-ctx out-engineers the specialists on most input features (tree-sitter vs regex, bounce-netted signed ledger vs `SUM`, measured cache-safety ratio) yet still has no output register, and logs the source-verified corrections to earlier pages. |
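
The reach figures quoted for page 10 bound what a Bash-only tool can save, whatever its per-command ratio. The following is a minimal Python sketch, assuming the vendor's 60–90% per-command range applies uniformly to everything RTK reaches (an optimistic assumption). `reach_bounded_cut` is a hypothetical helper name, not an RTK command or API.

```python
def reach_bounded_cut(reach_share, per_payload_cut):
    """Cut as a fraction of ALL observation tokens.

    reach_share:     fraction of observation tokens the tool can see at all.
    per_payload_cut: fraction the tool removes from what it does see.
    """
    if not (0.0 <= reach_share <= 1.0 and 0.0 <= per_payload_cut <= 1.0):
        raise ValueError("both arguments must be fractions in [0, 1]")
    return reach_share * per_payload_cut


RTK_REACH = 0.165    # share of observation tokens passing through Bash (page 10)
NATIVE_READ = 0.762  # share arriving via native Read, which RTK never sees (page 10)

low = reach_bounded_cut(RTK_REACH, 0.60)   # vendor's low per-command figure
high = reach_bounded_cut(RTK_REACH, 0.90)  # vendor's high per-command figure
perfect = reach_bounded_cut(RTK_REACH, 1.0)  # a filter that deleted everything it saw

print(f"RTK cut of observation tokens: {low:.2%} to {high:.2%}")
print(f"hard ceiling for a Bash-only tool: {perfect:.1%}")
print(f"outside its reach via native Read: {NATIVE_READ:.1%}")
```

On those assumptions RTK's cut lands between 9.9% and 14.85% of observation tokens, and no Bash-only filter could exceed 16.5% in this repo's workload.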