05 — Head-to-head: where each wins, what each lacks

The four design teardowns (caveman, headroom, RTK, lean-ctx) described each tool on its own terms. This page sets them against each other directly: a feature-by-feature has/lacks matrix, the internals side-by-side, and a clear statement of the best case for each — the workload where it beats the others outright.

The framing to keep in mind throughout: three of the four are points on one pipeline; lean-ctx is a runtime drawn across it. Caveman is output; headroom and RTK are two opposite points on input; lean-ctx occupies the input points the other two split and adds a code-graph layer that none of the other three has. So most "comparisons" between caveman and the others are category errors, because caveman owns output alone. The real rivalry is on the input side, where RTK, headroom, and lean-ctx overlap but act at different interception points.

The feature matrix — has (✓), lacks (✗), partial (◐)

| Capability | Caveman | Headroom | RTK | lean-ctx |
| --- | --- | --- | --- | --- |
| Compresses output (what the model writes) | ✓ | ◐ (optional shaper, off by default) | ✗ | ✗ |
| Compresses input (what the model reads) | ✗ | ✓ (broad) | ✓ (Bash only) | ✓ (broad) |
| Reaches native Read/Grep/Glob | ✗ | ✓ | ✗ | ✓ (via MCP ctx_read) |
| Reaches RAG chunks / external providers | ✗ | ✓ | ✗ | ✓ (provider framework) |
| Reaches conversation history | ✗ | ✓ | ✗ | ◐ (proxy mode, opt-in) |
| Reaches shell / test / build / log output | ✗ |  | ✓ | ✓ (56 pattern modules) |
| Touches thinking (20% of dollars) | ✗ | ✗ | ✗ | ✗ |
| Hits the 5×-priced output class | ✓ |  | ✗ | ✗ |
| Deterministic (no ML in the loop) | ✓ (it is a prompt) | ✗ (kompress-base) | ✓ | by default (ML opt-in) |
| Reversible / recoverable compression | ✗ (re-ask) | ✓ (CCR headroom_retrieve) | ◐ (tee on failure only) | ✓ (archive + ctx_expand) |
| Language-aware code outlining (per-read) | ✗ (passes code verbatim) | ✓ (tree-sitter, 8 langs) | ✓ (regex, 10 langs) | ✓ (tree-sitter, 21 langs) |
| Persistent queryable symbol index / code graph | ✗ | ✗ | ✗ | ✓, unique (property graph + BM25 + RRF) |
| LSP refactoring (rename/references/definition) | ✗ | ✗ | ✗ | ✓, unique |
| Cross-agent shared memory | ◐ (cavemem, single-agent, lossy) | ✓ (dedup, reversible) |  | ✓ (CCP + Context OS) |
| Failure-mining into memory files |  | ✓ (learn) |  | ◐ (knowledge/gotcha capture) |
| Bounce-netted / signed savings ledger | ✗ | ✗ | ✗ (raw rtk gain) | ✓, unique |
| Cache-safe on Claude Code | ✓ (output side, always) | ◐ (MCP/library yes; proxy risky) | ✓ (by construction) | ◐ (MCP/hook yes; proxy cache-safe-by-design but lossy) |
| Zero MCP schema rent | ✗ (~940 tok skill listing) | ✗ (in MCP mode) | ✓ | ✗ (77 tools; dynamic loading mitigates) |
| Zero host-state write | ✗ (2 hooks) | ◐ (config + model download) | ✗ (PreToolUse hook) | ✗ (hooks/skills ×34 agents + daemon autostart) |
| Zero runtime compute | ✓ | ✗ (P50 52 ms / P99 4.2 s) | ◐ (~5–15 ms/cmd) | ✗ (long-lived daemon + DBs) |
| Single self-contained artifact | ◐ (plugin + hooks) | ✗ (Rust core + ML runtime + Python) | ✓ (one ~4.1 MB binary) | ◐ (one binary, but 64.7 MB + daemon + dashboard + DBs) |
| CI-safe (preserves exit codes) | n/a (output side) | n/a | ✓ | ✓ (shell hook preserves exit codes) |
| Multi-surface ecosystem | ✓ (the broadest family) | ◐ (memory + learn) | ◐ (read/grep/find wrappers) | ✓ (77 tools, providers, dashboard, team server) |
| Whole-session telemetry | ✗ | ✓ (50k+ sessions) | ✗ | ◐ (local dashboard; no published fleet telemetry) |
| Independent third-party benchmark | ✗ | ◐ (one: 47.5%) | ✗ | ✗ (youngest tool) |
| Locally reproduced headline | ✓ (58.5% output) |  | ◐ (mechanisms yes; product no) | ✓ (96–99% on code reads, here) |

Read the matrix as one output tool, two input specialists, and one input runtime:

  • Only caveman compresses output. Headroom's output shaper is off by default; RTK and lean-ctx do not touch output at all. This row is uncontested.
  • headroom and lean-ctx both reach the non-shell input sources (native reads, RAG, history); RTK does not. headroom reaches history natively; lean-ctx reaches it only in proxy mode.
  • lean-ctx owns two rows alone: the persistent code graph / symbol index (the structural-retrieval lever that the earlier three-way version of this comparison said no tool had) and LSP refactoring. This is the genuine capability the fourth tool adds to the comparison.
  • The bottom of the matrix is where cost diverges most: caveman is zero-runtime; RTK is one tiny deterministic binary; headroom pays ML+proxy; lean-ctx pays the most — a 64.7 MB binary, a daemon, databases, and the widest host-write surface — in exchange for being the only one that spans the whole input side plus code intelligence.

The internals side-by-side

| Primitive | Caveman | Headroom | RTK | lean-ctx |
| --- | --- | --- | --- | --- |
| Interception point | Model's own decoder (a prompt rule) | API request (proxy) or observation (MCP/lib) | Bash tool boundary (PreToolUse hook) | All of them: shell hook + MCP read + proxy |
| Engine type | Markdown instruction (no code path) | Router + typed compressors + ML model | 12 deterministic Rust filters keyed on the command | Tree-sitter AST + entropy/TF-IDF + 56 patterns + BM25/graph; CFT Φ-scoring |
| Parser / structural | none | per-type (AST outline, JSON, log) | per-command + a filter.rs regex code filter (10 langs) | tree-sitter (21 langs) + persistent property graph + call graph |
| ML in the loop | No | Yes (kompress-base, auto-downloaded) | No | No by default; opt-in embeddings + proxy prose |
| Persistent state | none (hooks only track tokens) | CCR store + cross-agent memory + learn | SQLite history (rtk gain) | CCP session + knowledge graph + property graph + BM25 + archive |
| Token counter | tiktoken o200k_base (eval only) | own counter, no stated tokenizer | ~4 chars/token heuristic | tiktoken o200k_base / cl100k_base (GPT, not Claude BPE) |
| Recovery on loss | none (re-ask) | CCR headroom_retrieve (reversible) | tee on failure only | archive + ctx_expand (reversible, FTS5-searchable) |
| Host-state write | ~/.claude hooks ×2 | MCP/proxy config + model download | ~/.claude PreToolUse hook | hooks/skills ×34 agents + daemon autostart (LaunchAgent/systemd) |
| Runtime cost | ~0 compute + ~940-tok prefix | P50 52 ms / P99 4.17 s + ML + MCP rent | ~5–15 ms/cmd, ~4 MB binary | daemon + 64.7 MB binary; read 4–12 ms; BM25 ~0.5 ms |
| Hardest failure | over-terse, unrecoverable | ML drops an identifier; proxy cache-bust | truncates a needed line on a successful command | map-mode over-compression (77% quality); stale graph; proxy prose loss |

The teardowns confirm the determinism gradient from a new angle: caveman is a zero-machinery prompt; RTK is maximum determinism (fixed rules, no model, single tiny binary); headroom buys breadth by paying for an ML stage, a proxy, and a reversible store; lean-ctx buys the most breadth — every input point plus a code graph — while keeping a deterministic default core, paying instead in footprint. More machinery → more reach and reversibility, but also more latency, more host effects, and a real attack surface.
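The internals table lists a "~4 chars/token heuristic" as RTK's token counter. A minimal sketch of what such a heuristic looks like is below. It is not RTK's code: the function names are hypothetical, and rounding up is an assumption.

```python
import math

# Heuristic ratio taken from the internals table; this is not a real tokenizer.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Estimate a token count as ceil(len(text) / 4). Rounding up is an assumption."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimated_saving(before: str, after: str) -> float:
    """Fraction of estimated tokens removed by a filter (0.0 for empty input)."""
    base = estimate_tokens(before)
    if base == 0:
        return 0.0
    return 1.0 - estimate_tokens(after) / base


# Hypothetical before/after pair standing in for a filtered test log.
raw = "test result: ok. 12 passed; 0 failed\n" * 50
short = "50 runs: all ok (12 passed each)"
print(estimate_tokens(raw), estimate_tokens(short), round(estimated_saving(raw, short), 3))
```

A character-count estimate is cheap and deterministic, but it can diverge from any real BPE tokenizer, so a saving computed this way is best read as a relative figure.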

Where each one wins — the best case for each

Caveman wins when the waste is the model talking too much

   BEST CASE: CAVEMAN
   ───────────────────
   symptom   the model writes long explanations, restates code it just
             edited, narrates what it is about to do
   why it    output is the 5×-priced token class AND cache-neutral, so every
   wins      token shaved is worth ~5× an input token and costs nothing in
             cache risk; it is a free prompt with zero runtime
   margin    the ONLY tool that touches output at all; headroom's shaper is
   over      off-by-default and weaker, RTK and lean-ctx can't see output.
   rivals    No contest — caveman owns this slice outright.
   also      works under any agent/model (it is just a register instruction),
   unique    and the family extends to commits, reviews, and subagent reports

Caveman is uncontested on output. It is also the first tool to adopt for a separate reason: it is the only one that is unconditionally cache-safe and requires no runtime, no binary, no host service — minutes to adopt, nothing to provision.

RTK wins when the waste is verbose shell output and you want zero footprint

   BEST CASE: RTK
   ──────────────
   symptom   Bash-heavy workload: repeated `cargo test`, `git status`/`diff`,
             build logs, `pytest`/`go test`, lint output flooding context
   why it    deterministic (no ML to mis-fire), cache-safe BY CONSTRUCTION,
   wins      zero MCP rent, ONE ~4 MB binary, CI-safe (exit codes preserved)
   margin    vs headroom on the SAME shell output: no ML attack surface, no
   over      model latency, no proxy. vs lean-ctx: same write-time safety in
   rivals    1/16th the footprint — no daemon, no DBs, no 77-tool schema.
   also      the MOST container-adoptable of the four (tiny single binary,
   unique    deterministic, nothing to provision); 100+ command formats turnkey

RTK's win is the cheapest way to compress the largest concrete input slice — shell output — deterministically and cache-safely. lean-ctx does the same shell compression, but RTK does only that, in a fraction of the footprint; when shell output is the whole problem, RTK's minimalism beats lean-ctx's breadth.
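The properties named in the RTK best case (a deterministic filter, exit codes preserved, full output kept only on failure) can be sketched in a few lines. This is an illustration of the pattern, not RTK's implementation: `run_filtered` and `keep_last` are hypothetical names, and plain tail truncation stands in for RTK's per-command filters.

```python
import subprocess
import sys


def run_filtered(cmd: list[str], keep_last: int = 5) -> tuple[int, str]:
    """Run cmd and return (exit_code, text).

    On success the output is cut to its last `keep_last` lines. On failure the
    full output is returned untouched, so nothing needed for debugging is lost.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    output = proc.stdout + proc.stderr
    if proc.returncode != 0:
        return proc.returncode, output  # failure: pass everything through
    lines = output.splitlines()
    if len(lines) <= keep_last:
        return 0, output
    dropped = len(lines) - keep_last
    tail = "\n".join(lines[-keep_last:])
    return 0, f"[{dropped} lines omitted]\n{tail}\n"


if __name__ == "__main__" and len(sys.argv) > 1:
    code, text = run_filtered(sys.argv[1:])
    sys.stdout.write(text)
    sys.exit(code)  # the caller (for example a CI job) sees the real exit status
```

The sketch also shows the failure mode the internals table attributes to this design: on a successful command, a line the model needed may be among those omitted, and nothing is kept to recover it.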

Headroom wins when the waste is history and RAG on the wire, reversibly

   BEST CASE: HEADROOM
   ───────────────────
   symptom   large JSON/API payloads, RAG chunks, long conversation history
             on the wire; multi-tool (Claude+Codex+Gemini) workflows needing
             shared, reversible, deduplicated memory
   why it    reaches everything in the request (incl. history) reversibly via
   wins      CCR, with production telemetry and one independent measurement —
             the best-evidenced of the four
   margin    vs RTK: sees non-shell input RTK is blind to. vs lean-ctx: a
   over      proven cross-agent memory + the only published whole-session
   rivals    telemetry; lean-ctx's equivalents are younger and unbenchmarked.
   also      `learn` failure-mining + cross-agent dedup memory + the most
   unique    independent evidence of any tool here

Headroom's win is reach-with-evidence on the API wire, especially conversation history (which lean-ctx reaches only in its opt-in proxy) and cross-agent memory, backed by the only third-party measurement and fleet telemetry in the group.
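"Reversible" here means the compressed view carries a handle that can be exchanged for the original. A minimal sketch of that pattern follows. It is not headroom's CCR or its `headroom_retrieve` call: the class, method names, and handle format are all hypothetical, and truncation stands in for a real compressor.

```python
import hashlib


class ReversibleStore:
    """Keeps originals so a compressed view can be expanded again."""

    def __init__(self) -> None:
        self._originals: dict[str, str] = {}

    def compress(self, text: str, max_chars: int = 80) -> str:
        """Return short text unchanged, else a truncated view plus a handle."""
        if len(text) <= max_chars:
            return text
        # Content-derived handle: identical payloads share one stored copy.
        handle = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        self._originals[handle] = text
        return f"{text[:max_chars]}... [truncated, handle={handle}]"

    def retrieve(self, handle: str) -> str:
        """Return the full original for a handle; raises KeyError if unknown."""
        return self._originals[handle]


store = ReversibleStore()
view = store.compress("line of log output\n" * 50)
handle = view.rsplit("handle=", 1)[1].rstrip("]")
print(len(view), len(store.retrieve(handle)))
```

The trade is the one the page describes: the model reads the short view by default and pays for the full payload only when it asks for it, at the price of keeping a store alive.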

lean-ctx wins when you want the code graph and memory in one runtime

   BEST CASE: LEAN-CTX
   ───────────────────
   symptom   large code-read-heavy work in a medium/large repo where you ALSO
             want "where does this ripple to?", ranked search, cross-session
             memory, and an auditable savings receipt — all at once
   why it    the ONLY tool that bundles a persistent code graph (impact/
   wins      callgraph/RRF search) + LSP refactor + CCP memory + a signed,
             bounce-netted savings ledger behind one deterministic-by-default
             binary, while also doing RTK's shell + headroom's reads
   margin    vs all three: it is the only one with structural retrieval and
   over      verification. vs the layered stack: one install, one config,
   rivals    one savings ledger instead of three tools to reconcile.
   also      reproduced here at 96–99% on code reads; cleanest open-core
   unique    (local free forever); most honest savings accounting (bounce-net)

lean-ctx's win is consolidation plus the code-graph lever: when the workload genuinely needs structural retrieval, memory, and broad input compression together — and you are willing to run a daemon-class tool — one runtime beats assembling three. Its cost is the footprint and the lack of independent evidence; its edge is being the only tool here that answers "where is foo used?" without a re-read and proves what it saved.
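The "RRF" in lean-ctx's search stack refers to reciprocal rank fusion, a standard way to merge several ranked lists into one. The sketch below is the generic algorithm, not lean-ctx's implementation; the input lists are invented, and k = 60 is a commonly used default rather than a value taken from the tool.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank), rank from 1."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Highest score first; ties broken by name so the result is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical inputs: one ranking from a keyword search, one from a call-graph walk.
keyword_hits = ["src/foo.py", "src/bar.py", "tests/test_foo.py"]
graph_hits = ["src/bar.py", "src/foo.py", "src/baz.py"]
for path, score in reciprocal_rank_fusion([keyword_hits, graph_hits]):
    print(f"{score:.5f}  {path}")
```

Because the fusion uses only ranks, it can combine a text ranker and a graph ranker whose raw scores are not comparable.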

Quick selection guide

| If the waste is… | Reach for | Why |
| --- | --- | --- |
| The model writing too much prose / restating code | caveman | output class, 5×-priced, cache-neutral |
| Verbose cargo test / git / build / log output run through Bash | RTK (or lean-ctx hook) | deterministic, cache-safe at the tool boundary, zero MCP rent; RTK if footprint matters |
| Big native-tool file reads, RAG chunks, long history on the wire | headroom (MCP / live-zone) | broad API-layer reach + reversible recall + the best evidence |
| Whole files re-read just to see structure; "where does foo ripple?" | lean-ctx (or a standalone code-intelligence tool) | persistent code graph / RRF search: the structural-retrieval lever, now bundled |
| Code-read-heavy work that ALSO needs memory + verification, one tool | lean-ctx | consolidates code graph + memory + broad compression + a signed ledger |
| Thinking tokens (20% of dollars) | none of them | effort routing / model selection: the unmoved wall |
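The selection guide can also be read as a lookup. The sketch below encodes the rows above; the enum, the function name, and the `footprint_matters` flag are hypothetical, introduced only to make the "RTK if footprint matters" condition explicit.

```python
from enum import Enum
from typing import Optional


class Waste(Enum):
    OUTPUT_PROSE = "model writes too much prose or restates code"
    SHELL_OUTPUT = "verbose test / git / build / log output through Bash"
    WIRE_INPUT = "native-tool reads, RAG chunks, long history on the wire"
    STRUCTURE_READS = "whole files re-read just to see structure"
    CODE_WITH_MEMORY = "code-read-heavy work that also needs memory and verification"
    THINKING = "thinking tokens"


# Recommendations copied from the selection guide; None means "none of them".
_GUIDE = {
    Waste.OUTPUT_PROSE: "caveman",
    Waste.WIRE_INPUT: "headroom (MCP / live-zone)",
    Waste.STRUCTURE_READS: "lean-ctx (or a standalone code-intelligence tool)",
    Waste.CODE_WITH_MEMORY: "lean-ctx",
    Waste.THINKING: None,
}


def reach_for(waste: Waste, footprint_matters: bool = True) -> Optional[str]:
    """Return the guide's recommendation for one kind of waste."""
    if waste is Waste.SHELL_OUTPUT:
        return "RTK" if footprint_matters else "RTK or lean-ctx hook"
    return _GUIDE[waste]


for waste in Waste:
    print(f"{waste.name:<17} -> {reach_for(waste)}")
```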

Cache-safety, compared

Cache interaction is the make-or-break axis for any input compressor on an already-caching Claude Code, and it separates the four:

| Tool / mode | Where it acts | Cache interaction | ML in hot path | MCP rent |
| --- | --- | --- | --- | --- |
| caveman | Model's generated prose (output) | Neutral: never touches the prefix | no | ~940 tok skill listing |
| RTK | New Bash command output, at the tool boundary | Safe by construction: the compressed text is what gets cached | no | none |
| lean-ctx (hook + MCP read) | Shell output + native reads, write-time; ~13-tok handle re-reads | Safe (write-time) + prefix-friendly ordering | no (default) | yes (77 tools, dynamic) |
| headroom (MCP) | A new observation, on demand | Safe (write-time) | yes (kompress-base) | yes |
| lean-ctx (proxy) | Frozen-region prose rewrite [prefix, boundary) | Cache-safe by design (instrumented ratio) but lossy on prose | opt-in | n/a |
| headroom (proxy) | Rewrites the whole request | Risk: can churn the prefix Claude Code already caches | yes | n/a |
| Whole-prompt proxy (LLMLingua-style) | Rewrites the whole request | Breaks the cache: must beat ~5.5–10× | yes | n/a |

RTK occupies the safest corner: write-time, deterministic, native-hook, no model, tiny. lean-ctx and headroom are both safe in the modes that matter (write-time MCP/hook) and carry proxy modes that need care — headroom's proxy is the riskiest (whole-request), lean-ctx's is cache-safe-by-design but still lossy on prose. Caveman is trivially safe because it is output-side.
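Why a cache-breaking rewrite has to clear such a high bar can be shown with simple arithmetic. The sketch assumes cached prefix tokens bill at 0.1× the base input rate (the figure this page uses elsewhere) and that a rewritten prefix bills at the full 1× rate; cache-write premiums and later re-caching are ignored. Under those assumptions it reproduces only the 10× end of the range in the table; the lower end depends on assumptions not modeled here. The function names and the prefix size are hypothetical.

```python
def breakeven_factor(cached_rate: float = 0.1, uncached_rate: float = 1.0) -> float:
    """Shrink factor at which a rewritten, uncached prefix costs the same as a cache hit."""
    if cached_rate <= 0 or uncached_rate <= 0:
        raise ValueError("rates must be positive")
    return uncached_rate / cached_rate


def prefix_cost(tokens: float, rate: float) -> float:
    """Cost in base-input-token equivalents."""
    return tokens * rate


prefix_tokens = 100_000  # hypothetical prefix size
cached = prefix_cost(prefix_tokens, 0.1)
for factor in (2, 5, 10, 20):
    rewritten = prefix_cost(prefix_tokens / factor, 1.0)
    verdict = "cheaper" if rewritten < cached else "not cheaper"
    print(f"{factor:>2}x smaller: {rewritten:>8.0f} vs cached {cached:.0f} -> {verdict}")
```

Write-time compressors avoid this test entirely, because the text they produce is the text that gets cached.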

Evidence quality, compared

Adoption stars are PR-inflated for three of the four and must be ignored as a quality signal. What separates them is the kind of evidence behind the headline:

| Tool | Best evidence | Weakest spot |
| --- | --- | --- |
| caveman | Locally reproduced 58.5% output-token cut; mechanism is transparent (it is a prompt) | No agentic-task quality benchmark of register-compressed output exists anywhere |
| headroom | Production telemetry across 50k+ sessions (median 4.8%) + one independent 47.5% + academic backing for the write-time pattern | Product percentages are vendor self-report; the ML stage is unbenchmarked on code quality |
| RTK | The underlying levers (log filter −94.2%, JSON minify −34.3%) are locally reproduced in the dossier | No whole-session telemetry and no independent benchmark of RTK itself |
| lean-ctx | Reproduced here: 96–99% on code reads, <10% on prose/config; the most honest self-accounting (bounce-netted, signed ledger); 2,900+ tests | No independent third-party benchmark; youngest tool; GPT-tokenizer self-measurement; map-mode quality only 77% |

On evidence, headroom is the best-externally-instrumented, caveman is the most transparent (you can read the mechanism), lean-ctx is the best-self-instrumented (it nets out its own waste and signs the ledger) but the least externally verified, and RTK is the least verified of all.

What none of them can do

Two limits still bind all four, and lean-ctx removes a third that bound the original three:

  1. None touches thinking (20% of dollars). Thinking bills as output, is invisible in the transcript, and on Fable 5 cannot even be disabled. No register instruction, no input filter, no observation compressor, and no code graph reaches it — only the effort lever and model routing do. This is the largest single bucket none of the four moves.
  2. The persistent-symbol-index gap is now half-closed. The three-way version said none of the three could answer "where is foo defined?" without re-reading. lean-ctx changes that — its property graph + call graph + RRF search are exactly that structural-retrieval lever (the ast-grep / codedb class the dossier's code-intelligence chapter covers). caveman, headroom, and RTK still lack it; lean-ctx is the one tool here that has it.
  3. None converts a per-payload ratio into a whole-bill dollar saving for free. Every headline (caveman's 75%, headroom's 60–95%, RTK's 60–90%, lean-ctx's "up to 99%") is per-payload, per-command, or per-session. The whole-bill effect is bounded by how much of the bill that class represents and by the cache-read rate of 0.1× that most input tokens already enjoy.
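The bound in the last item is a single multiplication: the whole-bill saving is the class's share of the bill times the ratio removed from that class. A minimal sketch follows. The 58.5% cut and the 20% thinking share come from this page; the 15% bill share for output prose is a made-up figure for illustration.

```python
def whole_bill_saving(bill_share: float, payload_ratio: float) -> float:
    """Fraction of the total bill saved when `payload_ratio` of one token class
    is removed and that class makes up `bill_share` of the bill (both in 0..1)."""
    if not (0.0 <= bill_share <= 1.0 and 0.0 <= payload_ratio <= 1.0):
        raise ValueError("bill_share and payload_ratio must be within 0..1")
    return bill_share * payload_ratio


# A 58.5% cut on a class that is (hypothetically) 15% of the bill.
print(f"{whole_bill_saving(0.15, 0.585):.2%} of the whole bill")

# Thinking is 20% of dollars and no tool touches it, so its ratio is 0.
print(f"{whole_bill_saving(0.20, 0.0):.2%} of the whole bill")
```

Even a near-total per-payload ratio cannot save more than the share of the bill that its token class represents.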

Next: 06 — Combining — whether one product can be the best of each, why lean-ctx is the real test of that question, and the layered stack that is still the answer for most.
