05 — Head-to-head: where each wins, what each lacks

The four design teardowns (caveman, headroom, RTK, lean-ctx) described each tool on its own terms. This page sets them against each other directly: a feature-by-feature has/lacks matrix, the internals side-by-side, and a clear statement of the best case for each — the workload where it beats the others outright.

The framing to keep in mind throughout: three of the four are points on one pipeline; lean-ctx is a runtime drawn across it. Caveman is output; headroom and RTK are two opposite points on input; lean-ctx occupies the input points the other two split and adds a code-graph layer none of them have. So most "comparisons" between caveman and the others are category errors — caveman owns output alone — and the real rivalry is on the input side, where RTK, headroom, and lean-ctx overlap but act at different interception points.

The feature matrix — has (✓), lacks (✗), partial (◐)

Capability	Caveman	Headroom	RTK	lean-ctx
Compresses output (what the model writes)	✓	◐ (optional shaper, off by default)	✗	✗
Compresses input (what the model reads)	✗	✓ (broad)	✓ (Bash only)	✓ (broad)
Reaches native `Read`/`Grep`/`Glob`	✗	✓	✗	✓ (via MCP `ctx_read`)
Reaches RAG chunks / external providers	✗	✓	✗	✓ (provider framework)
Reaches conversation history	✗	✓	✗	◐ (proxy mode, opt-in)
Reaches shell / test / build / log output	✗	✓	✓	✓ (56 pattern modules)
Touches thinking (20% of dollars)	✗	✗	✗	✗
Hits the 5×-priced output class	✓	◐	✗	✗
Deterministic (no ML in the loop)	✓ (it is a prompt)	✗ (`kompress-base`)	✓	✓ by default (ML opt-in)
Reversible / recoverable compression	✗ (re-ask)	✓ (CCR `headroom_retrieve`)	◐ (tee on failure only)	✓ (archive + `ctx_expand`)
Language-aware code outlining (per-read)	✗ (passes code verbatim)	✓ (tree-sitter, 8 langs)	✓ (regex, 10 langs)	✓ (tree-sitter, 21 langs)
Persistent queryable symbol index / code graph	✗	✗	✗	✓ — unique (property graph + BM25 + RRF)
LSP refactoring (rename/references/definition)	✗	✗	✗	✓ — unique
Cross-agent shared memory	◐ (cavemem, single-agent, lossy)	✓ (dedup, reversible)	✗	✓ (CCP + Context OS)
Failure-mining into memory files	✗	✓ (`learn`)	✗	◐ (knowledge/gotcha capture)
Bounce-netted / signed savings ledger	✗	✗	✗ (raw `rtk gain`)	✓ — unique
Cache-safe on Claude Code	✓ (output side, always)	◐ (MCP/library yes; proxy risky)	✓ (by construction)	◐ (MCP/hook yes; proxy cache-safe-by-design but lossy)
Zero MCP schema rent	✗ (~940 tok skill listing)	✗ (in MCP mode)	✓	✗ (77 tools; dynamic loading mitigates)
Zero host-state write	✗ (2 hooks)	◐ (config + model download)	✗ (PreToolUse hook)	✗ (hooks/skills ×34 agents + daemon autostart)
Zero runtime compute	✓	✗ (P50 52 ms / P99 4.2 s)	◐ (~5–15 ms/cmd)	✗ (long-lived daemon + DBs)
Single self-contained artifact	◐ (plugin + hooks)	✗ (Rust core + ML runtime + Python)	✓ (one ~4.1 MB binary)	◐ (one binary, but 64.7 MB + daemon + dashboard + DBs)
CI-safe (preserves exit codes)	n/a (output side)	n/a	✓	✓ (shell hook preserves exit codes)
Multi-surface ecosystem	✓ (the broadest family)	◐ (memory + learn)	◐ (read/grep/find wrappers)	✓ (77 tools, providers, dashboard, team server)
Whole-session telemetry	✗	✓ (50k+ sessions)	✗	◐ (local dashboard; no published fleet telemetry)
Independent third-party benchmark	✗	◐ (one: 47.5%)	✗	✗ (youngest tool)
Locally reproduced headline	✓ (58.5% output)	◐ (mechanisms yes; product no)	✗	✓ (96–99% on code reads, here)

Read the matrix as one output tool, two input specialists, and one input runtime:

- Only caveman compresses output. Headroom's output shaper is off by default; RTK and lean-ctx do not touch output at all. This row is uncontested.
- Headroom and lean-ctx both reach the non-shell input sources (native reads, RAG, history); RTK does not. Headroom reaches history natively; lean-ctx reaches it only in its opt-in proxy mode.
- lean-ctx owns three rows alone: the persistent code graph / symbol index (the structural-retrieval lever the earlier three-way comparison said no one had), LSP refactoring, and the bounce-netted, signed savings ledger. The first two are the genuine capability the fourth tool adds to the comparison.
- The bottom of the matrix is where cost diverges most: caveman is zero-runtime; RTK is one tiny deterministic binary; headroom pays for ML plus a proxy; lean-ctx pays the most (a 64.7 MB binary, a daemon, databases, and the widest host-write surface) in exchange for being the only one that spans the whole input side plus code intelligence.

The internals side-by-side

Primitive	Caveman	Headroom	RTK	lean-ctx
Interception point	Model's own decoder (a prompt rule)	API request (proxy) or observation (MCP/lib)	Bash tool boundary (PreToolUse hook)	All of them: shell hook + MCP read + proxy
Engine type	Markdown instruction (no code path)	Router + typed compressors + ML model	12 deterministic Rust filters keyed on the command	Tree-sitter AST + entropy/TF-IDF + 56 patterns + BM25/graph; CFT Φ-scoring
Parser / structural	none	per-type (AST outline, JSON, log)	per-command + a `filter.rs` regex code filter (10 langs)	tree-sitter (21 langs) + persistent property graph + call graph
ML in the loop	No	Yes (`kompress-base`, auto-downloaded)	No	No by default; opt-in embeddings + proxy prose
Persistent state	none (hooks only track tokens)	CCR store + cross-agent memory + `learn`	SQLite history (`rtk gain`)	CCP session + knowledge graph + property graph + BM25 + archive
Token counter	tiktoken `o200k_base` (eval only)	own counter, no stated tokenizer	~4 chars/token heuristic	tiktoken `o200k_base` / `cl100k_base` (GPT, not Claude BPE)
Recovery on loss	none (re-ask)	CCR `headroom_retrieve` (reversible)	tee on failure only	archive + `ctx_expand` (reversible, FTS5-searchable)
Host-state write	`~/.claude` hooks ×2	MCP/proxy config + model download	`~/.claude` PreToolUse hook	hooks/skills ×34 agents + daemon autostart (LaunchAgent/systemd)
Runtime cost	~0 compute + ~940-tok prefix	P50 52 ms / P99 4.17 s + ML + MCP rent	~5–15 ms/cmd, ~4 MB binary	daemon + 64.7 MB binary; read 4–12 ms; BM25 ~0.5 ms
Hardest failure	over-terse, unrecoverable	ML drops an identifier; proxy cache-bust	truncates a needed line on a successful command	`map`-mode over-compression (77% quality); stale graph; proxy prose loss

The teardowns confirm the determinism gradient from a new angle: caveman is a zero-machinery prompt; RTK is maximum determinism (fixed rules, no model, single tiny binary); headroom buys breadth by paying for an ML stage, a proxy, and a reversible store; lean-ctx buys the most breadth — every input point plus a code graph — while keeping a deterministic default core, paying instead in footprint. More machinery → more reach and reversibility, but also more latency, more host effects, and a real attack surface.

Where each one wins — the best case for each

Caveman wins when the waste is the model talking too much

   BEST CASE: CAVEMAN
   ───────────────────
   symptom   the model writes long explanations, restates code it just
             edited, narrates what it is about to do
   why it    output is the 5×-priced token class AND cache-neutral, so every
   wins      token shaved is worth ~5× an input token and costs nothing in
             cache risk; it is a free prompt with zero runtime
   margin    the ONLY tool that touches output at all; headroom's shaper is
   over      off-by-default and weaker, RTK and lean-ctx can't see output.
   rivals    No contest — caveman owns this slice outright.
   also      works under any agent/model (it is just a register instruction),
   unique    and the family extends to commits, reviews, and subagent reports

Caveman is uncontested on output. It is also the first tool to adopt for a separate reason: it is the only one that is unconditionally cache-safe and requires no runtime, no binary, no host service — minutes to adopt, nothing to provision.

RTK wins when the waste is verbose shell output and you want zero footprint

   BEST CASE: RTK
   ──────────────
   symptom   Bash-heavy workload: repeated `cargo test`, `git status`/`diff`,
             build logs, `pytest`/`go test`, lint output flooding context
   why it    deterministic (no ML to mis-fire), cache-safe BY CONSTRUCTION,
   wins      zero MCP rent, ONE ~4 MB binary, CI-safe (exit codes preserved)
   margin    vs headroom on the SAME shell output: no ML attack surface, no
   over      model latency, no proxy. vs lean-ctx: same write-time safety in
   rivals    1/16th the footprint — no daemon, no DBs, no 77-tool schema.
   also      the MOST container-adoptable of the four (tiny single binary,
   unique    deterministic, nothing to provision); 100+ command formats turnkey

RTK wins by being the cheapest way to compress the largest concrete input slice, shell output, deterministically and cache-safely. lean-ctx does the same shell compression, but RTK does only that, at roughly one-sixteenth of the footprint; when shell output is the whole problem, RTK's minimalism beats lean-ctx's breadth.
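The CI-safety property of a tool-boundary filter (compress the output, pass the exit code through untouched) can be sketched in a few lines. This is an illustration only, not RTK's implementation: RTK is a Rust binary with its own per-command filters, and the names `filter_test_output`, `run_filtered`, and the `KEEP_MARKERS` keep-rule below are hypothetical.

```python
import subprocess
import sys

# Hypothetical keep-rule: only lines containing one of these markers survive.
KEEP_MARKERS = ("FAILED", "error", "panicked", "test result:")


def filter_test_output(text: str) -> str:
    """Deterministically keep only lines that carry a failure or a summary."""
    kept = [line for line in text.splitlines()
            if any(marker in line for marker in KEEP_MARKERS)]
    return "\n".join(kept)


def run_filtered(argv: list[str]) -> tuple[int, str]:
    """Run a command, compress its output, and return the exit code unchanged."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    combined = proc.stdout + proc.stderr
    # The exit code is passed through as-is, so CI still sees pass/fail.
    return proc.returncode, filter_test_output(combined)


if __name__ == "__main__":
    # Stand-in for a noisy test run: three passing lines, one failure, exit 1.
    script = (
        "print('test a ... ok'); print('test b ... ok'); "
        "print('test c ... FAILED'); print('test d ... ok'); "
        "raise SystemExit(1)"
    )
    code, out = run_filtered([sys.executable, "-c", script])
    print(code, repr(out))
```

Because the rule is a fixed string match, the same input always yields the same output, which is the sense of "deterministic" used above. The risk named in the internals table applies here too: a keep-rule like this can drop a line the agent needed from a command that succeeded.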

Headroom wins when the waste is history and RAG on the wire, reversibly

   BEST CASE: HEADROOM
   ───────────────────
   symptom   large JSON/API payloads, RAG chunks, long conversation history
             on the wire; multi-tool (Claude+Codex+Gemini) workflows needing
             shared, reversible, deduplicated memory
   why it    reaches everything in the request (incl. history) reversibly via
   wins      CCR, with production telemetry and one independent measurement —
             the best-evidenced of the four
   margin    vs RTK: sees non-shell input RTK is blind to. vs lean-ctx: a
   over      proven cross-agent memory + the only published whole-session
   rivals    telemetry; lean-ctx's equivalents are younger and unbenchmarked.
   also      `learn` failure-mining + cross-agent dedup memory + the most
   unique    independent evidence of any tool here

Headroom's win is reach-with-evidence on the API wire, especially conversation history (which lean-ctx reaches only in its opt-in proxy) and cross-agent memory, backed by the only third-party measurement and fleet telemetry in the group.
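What "reversible" means in practice is that the compressed view carries a handle, and the handle recovers the exact original. The sketch below shows that pattern under loud assumptions: `ReversibleStore`, `compress`, and `retrieve` are hypothetical names, the "keep the first few lines" rule is a placeholder, and nothing here reflects how headroom's CCR store or `headroom_retrieve` is actually built.

```python
import hashlib


class ReversibleStore:
    """Toy content-addressed store: short view out, full text back on demand."""

    def __init__(self) -> None:
        self._originals: dict[str, str] = {}

    def compress(self, text: str, keep_lines: int = 3) -> str:
        """Return a short view plus a handle that can recover the original."""
        handle = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        self._originals[handle] = text
        lines = text.splitlines()
        if len(lines) <= keep_lines:
            return text  # short enough already; nothing is elided
        head = "\n".join(lines[:keep_lines])
        omitted = len(lines) - keep_lines
        return f"{head}\n[{omitted} lines omitted; retrieve with handle {handle}]"

    def retrieve(self, handle: str) -> str:
        """Give back the exact original for a handle issued by compress()."""
        return self._originals[handle]


if __name__ == "__main__":
    store = ReversibleStore()
    payload = "\n".join(f"row {i}" for i in range(100))
    view = store.compress(payload)
    handle = view.rsplit(" ", 1)[-1].rstrip("]")
    print(view)
    print(store.retrieve(handle) == payload)
```

The design point is that loss in the view is recoverable without re-running the tool call, which is what separates this column from "re-ask" (caveman) and "tee on failure only" (RTK) in the matrix.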

lean-ctx wins when you want the code graph and memory in one runtime

   BEST CASE: LEAN-CTX
   ───────────────────
   symptom   large code-read-heavy work in a medium/large repo where you ALSO
             want "where does this ripple to?", ranked search, cross-session
             memory, and an auditable savings receipt — all at once
   why it    the ONLY tool that bundles a persistent code graph (impact/
   wins      callgraph/RRF search) + LSP refactor + CCP memory + a signed,
             bounce-netted savings ledger behind one deterministic-by-default
             binary, while also doing RTK's shell + headroom's reads
   margin    vs all three: it is the only one with structural retrieval and
   over      verification. vs the layered stack: one install, one config,
   rivals    one savings ledger instead of three tools to reconcile.
   also      reproduced here at 96–99% on code reads; cleanest open-core
   unique    (local free forever); most honest savings accounting (bounce-net)

lean-ctx's win is consolidation plus the code-graph lever: when the workload genuinely needs structural retrieval, memory, and broad input compression together — and you are willing to run a daemon-class tool — one runtime beats assembling three. Its cost is the footprint and the lack of independent evidence; its edge is being the only tool here that answers "where is foo used?" without a re-read and proves what it saved.
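The "RRF" in the code-graph rows refers to reciprocal rank fusion, a standard way to merge several ranked lists (for example a BM25 text ranking and a graph-based ranking) into one. A minimal sketch of the general technique follows; it is not lean-ctx's code, the file names are made up, and `k = 60` is the conventional default from the RRF literature rather than a value taken from this page.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each item scores sum(1 / (k + rank)) over the lists."""
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    bm25_hits = ["parse.rs", "lexer.rs", "main.rs"]   # hypothetical text ranking
    graph_hits = ["lexer.rs", "ast.rs", "parse.rs"]   # hypothetical graph ranking
    print(reciprocal_rank_fusion([bm25_hits, graph_hits]))
```

An item that appears near the top of both lists outranks one that tops only a single list, which is why fusion helps a query like "where does `foo` ripple to?" combine lexical and structural evidence.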

Quick selection guide

If the waste is…	Reach for	Why
The model writing too much prose / restating code	caveman	output class, 5×-priced, cache-neutral
Verbose `cargo test` / `git` / build / log output run through Bash	RTK (or lean-ctx hook)	deterministic, cache-safe at the tool boundary, zero MCP rent — RTK if footprint matters
Big native-tool file reads, RAG chunks, long history on the wire	headroom (MCP / live-zone)	broad API-layer reach + reversible recall + the best evidence
Whole files re-read just to see structure; "where does `foo` ripple?"	lean-ctx (or a standalone code-intelligence tool)	persistent code graph / RRF search — the structural-retrieval lever, now bundled
Code-read-heavy work that ALSO needs memory + verification, one tool	lean-ctx	consolidates code graph + memory + broad compression + a signed ledger
Thinking tokens (20% of dollars)	none of them	effort routing / model selection — the unmoved wall

Cache-safety, compared

Cache interaction is the make-or-break axis for any input compressor running under Claude Code, which already caches its prompt prefix, and it separates the four:

Tool / mode	Where it acts	Cache interaction	ML in hot path	MCP rent
caveman	Model's generated prose (output)	Neutral — never touches the prefix	no	~940 tok skill listing
RTK	New Bash command output, at the tool boundary	Safe by construction — the compressed text is what gets cached	no	none
lean-ctx (hook + MCP read)	Shell output + native reads, write-time; ~13-tok handle re-reads	Safe (write-time) + prefix-friendly ordering	no (default)	yes (77 tools, dynamic)
headroom (MCP)	A new observation, on demand	Safe (write-time)	yes (`kompress-base`)	yes
lean-ctx (proxy)	Frozen-region prose rewrite `[prefix, boundary)`	Cache-safe by design (instrumented ratio) but lossy on prose	opt-in	n/a
headroom (proxy)	Rewrites the whole request	Risk — can churn the prefix Claude Code already caches	yes	n/a
Whole-prompt proxy (LLMLingua-style)	Rewrites the whole request	Breaks the cache — must beat ~5.5–10×	yes	n/a

RTK occupies the safest corner: write-time, deterministic, native-hook, no model, tiny. lean-ctx and headroom are both safe in the modes that matter (write-time MCP/hook) and carry proxy modes that need care — headroom's proxy is the riskiest (whole-request), lean-ctx's is cache-safe-by-design but still lossy on prose. Caveman is trivially safe because it is output-side.
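This page does not spell out how the break-even multiple for a cache-busting rewrite is derived. One plausible reading, offered as an assumption rather than the author's method: if cached input bills at 0.1× and a rewrite forces every remaining token back to full price, the rewrite must shrink the prompt by at least the ratio below just to break even. The function name, the `cached_fraction` parameter, and the decision to ignore any cache-write premium are all choices made for this sketch.

```python
def break_even_ratio(cached_fraction: float, cache_read_price: float = 0.1) -> float:
    """Compression ratio a cache-busting rewrite needs to match doing nothing.

    Baseline cost per token: the cached part bills at cache_read_price,
    the uncached part at 1.0. After the rewrite, every surviving token
    bills at 1.0, so the prompt must shrink by 1 / baseline to break even.
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    baseline = cached_fraction * cache_read_price + (1.0 - cached_fraction)
    return 1.0 / baseline


if __name__ == "__main__":
    for hit_rate in (0.90, 0.95, 1.00):
        print(f"cache hit rate {hit_rate:.0%}: "
              f"must compress at least {break_even_ratio(hit_rate):.1f}x")
```

Under these assumptions, cache hit rates between about 90% and 100% put the bar between roughly 5× and 10×, the same neighbourhood as the figure in the table. Write-time tools never face this bar, because the compressed text is what enters the cache in the first place.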

Evidence quality, compared

Adoption stars are PR-inflated for three of the four and must be ignored as a quality signal. What separates them is the kind of evidence behind the headline:

Tool	Best evidence	Weakest spot
caveman	Locally reproduced 58.5% output-token cut; mechanism is transparent (it is a prompt)	No agentic-task quality benchmark of register-compressed output exists anywhere
headroom	Production telemetry across 50k+ sessions (median 4.8%) + one independent 47.5% + academic backing for the write-time pattern	Product percentages are vendor self-report; the ML stage is unbenchmarked on code quality
RTK	The underlying levers (log filter −94.2%, JSON minify −34.3%) are locally reproduced in the dossier	No whole-session telemetry and no independent benchmark of RTK itself
lean-ctx	Reproduced here: 96–99% on code reads, <10% on prose/config; the most honest self-accounting (bounce-netted, signed ledger); 2,900+ tests	No independent third-party benchmark; youngest tool; GPT-tokenizer self-measurement; `map`-mode quality only 77%

On evidence, headroom is the best-externally-instrumented, caveman is the most transparent (you can read the mechanism), lean-ctx is the best-self-instrumented (it nets out its own waste and signs the ledger) but the least externally verified, and RTK is the least verified of all.

What none of them can do

Two limits still bind all four, and lean-ctx removes a third that bound the original three:

- None touches thinking (20% of dollars). Thinking bills as output, is invisible in the transcript, and on Fable 5 cannot even be disabled. No register instruction, no input filter, no observation compressor, and no code graph reaches it; only the effort lever and model routing do. This is the largest single bucket none of the four moves.
- The persistent-symbol-index gap is now half-closed. The three-way version said none of the three could answer "where is foo defined?" without re-reading. lean-ctx changes that: its property graph + call graph + RRF search are exactly that structural-retrieval lever (the ast-grep / codedb class the dossier's code-intelligence chapter covers). caveman, headroom, and RTK still lack it; lean-ctx is the one tool here that has it.
- None converts a per-payload ratio into a whole-bill dollar saving for free. Every headline (caveman's 75%, headroom's 60–95%, RTK's 60–90%, lean-ctx's "up to 99%") is per-payload, per-command, or per-session; the whole-bill effect is bounded by how much of the bill that class represents and by the 0.1× cache-read discount most input tokens already enjoy.
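The gap between a per-payload headline and a whole-bill saving is simple arithmetic: each class contributes its share of the bill times its reduction. The sketch below makes that explicit. Only two numbers come from this page (thinking at 20% of dollars, untouched, and caveman's reproduced 58.5% output cut); the other shares and the input reduction are invented purely to illustrate the calculation, and `whole_bill_saving` is a hypothetical helper.

```python
def whole_bill_saving(classes: dict[str, tuple[float, float]]) -> float:
    """Fraction of the total bill saved.

    `classes` maps a cost class to (share of total dollars, fractional
    reduction achieved within that class). Shares must not exceed 1.0.
    """
    total_share = sum(share for share, _ in classes.values())
    if total_share > 1.0 + 1e-9:
        raise ValueError("class shares add up to more than the whole bill")
    return sum(share * reduction for share, reduction in classes.values())


if __name__ == "__main__":
    example = {
        "thinking": (0.20, 0.0),    # from this page: 20% of dollars, unmoved
        "output":   (0.30, 0.585),  # share is illustrative; 58.5% is caveman's cut
        "input":    (0.50, 0.40),   # both numbers illustrative
    }
    print(f"whole-bill saving: {whole_bill_saving(example):.1%}")
```

Two consequences fall out directly. A 99% reduction on a class that is 5% of the bill saves under 5% of the bill, and with thinking fixed at 20% of dollars no combination of these four tools can save more than 80%.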

Next: 06 — Combining — whether one product can be the best of each, why lean-ctx is the real test of that question, and the layered stack that is still the answer for most.
