
Token-optimization tools — caveman, headroom, RTK & lean-ctx, compared

Token-optimization tools — a deep comparison

This folder is a self-contained, diagram-driven teardown of the four token-optimization tools an operator most often weighs for a Claude Code workflow: caveman, headroom, RTK, and lean-ctx. It consolidates and deepens material previously spread across the token-optimization dossier (the headroom deep-dive, the compression-market sweep, and the RTK chapter), and adds what those files lacked: equal-depth design teardowns of every tool, flowcharts of how each one actually works, a feature-by-feature has/lacks matrix, and a straight answer to the operator's question: is there one product that combines them all, and if not, where does each win?

The folder began as a three-way (caveman/headroom/RTK) comparison; lean-ctx was added in a later round because it is the missing fourth data point — the integrated "context runtime" that tries to be all three at once and ships the code-intelligence lever the three-way concluded none of the three had. The folder name and page structure are deliberately generic so further tools can join the same matrices without another rename.

Most claims trace to primary sources already cited in the dossier; this folder is largely a re-synthesis. Research dates: headroom and the compression market were swept 2026-06-15; RTK was swept 2026-06-18; lean-ctx was swept, source-audited, and locally built + benchmarked 2026-06-20. Product percentages are vendor self-reported unless explicitly marked as locally reproduced or independently measured, and are tiered T1–T4 throughout. The broader economics (the modeled session profile, the dollar split, the 10× verdict) live in the token-optimization dossier; this folder assumes them and focuses on the tools.

This is the canonical, deepest comparison of these tools. The dossier chapters it grew out of — 53 (headroom), 54 (compression market), and 56 (RTK) — remain in place for their broader scope and cross-references, and now point here for the head-to-head.

TL;DR

  • Three of the four are points on one pipeline; the fourth tries to be the whole pipeline. Caveman compresses what the model writes (visible output, ~17% of a heavy session's dollars). Headroom and RTK both compress what the model reads (input, the ~61% cache buckets) at opposite ends of a breadth/determinism trade. lean-ctx is a different category — an integrated context runtime that occupies every interception point the other three split between them (shell hook + MCP read + proxy) and adds a persistent code graph plus a verification layer.
  • lean-ctx is the superset monolith this hub argued no one was building — and it largely confirms the prediction. It does reach further than any single specialist (it is the only one that bundles a persistent, queryable code graph — the structural-retrieval lever the three-way said all three lacked). But it pays for that reach with the largest footprint of the four by far: a 64.7 MB binary, a daemon, a dashboard, SQLite stores, 77 MCP tools, and host writes across up to 34 agents. The combining page weighs "integrated runtime" against "layered specialists" honestly.
  • No single tool does everything well, and the layered stack still composes. The "best of each" remains a stack: one output tool (caveman) plus one input layer (RTK at the Bash boundary, headroom on the wire, or lean-ctx as the runtime). Caveman never overlaps the input tools, so it always composes. The input tools overlap and should not be doubled blindly. Headroom even ships a tokens_saved_rtk data plane, which is vendor-side proof that the input tools are designed to layer, not merge.
  • Every tool is real on its target bucket, and every tool is over-marketed by the same category error. "75%" (caveman), "60–95%" (headroom), "60–90%" (RTK), and "up to 99%" (lean-ctx) are per-payload or per-session best cases, not whole-bill dollar savings. Corrected to the whole bill, each lands at a low double-digit percentage of dollars at best, because most input tokens are already read at the 0.1× cache price and none of the four touches the 20%-of-dollars thinking bucket. There is no 10× here; the dossier's ≈2.5× / ≈5–6.2×-with-routing verdict is unmoved.
  • Rank by evidence, not stars. Three repos carry PR-inflated star counts (caveman 74.4k, RTK 63.6k, headroom 33.4k) with abnormally low watcher ratios; lean-ctx is the youngest and least-inflated (2.8k★, README-honest about it) but has no independent third-party benchmark. On evidence: caveman has a locally reproduced 58.5% output cut; headroom has production telemetry (median 4.8% whole-session) plus one independent 47.5%; RTK has no whole-session telemetry and no independent benchmark; lean-ctx reproduces locally here (96–99% on code reads, <10% on prose/config) with the most honest self-instrumentation (bounce-netted, signed ledger) but is self-measured on a GPT tokenizer.
  • The sweet spot is still two layers, not the kitchen sink. Caveman (output) plus one input layer captures the bulk of the realizable win. Reach for lean-ctx when you specifically want the code-graph / memory / verification surface and can carry its footprint; otherwise the lean caveman + RTK (or caveman + headroom-MCP) stack is lower-risk.
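The category error named in the TL;DR comes down to one multiplication: a cut measured on a single bucket has to be scaled by that bucket's share of the bill. The sketch below applies this to caveman, using only figures quoted on this page (visible output at roughly 17% of dollars, the 75% claim, the locally reproduced 58.5%). The function name is illustrative, and the linear model is an assumption: it treats dollars in a bucket as shrinking in proportion to the tokens removed.

```python
def whole_bill_saving(bucket_share: float, bucket_cut: float) -> float:
    """Fraction of the total bill saved when one bucket shrinks by bucket_cut.

    Assumption: dollars in the bucket scale linearly with tokens removed.
    """
    if not (0.0 <= bucket_share <= 1.0 and 0.0 <= bucket_cut <= 1.0):
        raise ValueError("shares and cuts must be fractions in [0, 1]")
    return bucket_share * bucket_cut


# Visible output is ~17% of a heavy modeled session's dollars (figure from this page).
VISIBLE_OUTPUT_SHARE = 0.17

claimed = whole_bill_saving(VISIBLE_OUTPUT_SHARE, 0.75)    # vendor headline
measured = whole_bill_saving(VISIBLE_OUTPUT_SHARE, 0.585)  # locally reproduced cut

print(f"claimed 75% output cut    -> {claimed:.4f} of the whole bill")
print(f"measured 58.5% output cut -> {measured:.4f} of the whole bill")
```

Under these assumptions the headline "75%" becomes roughly 13% of the bill, and the measured cut roughly 10%.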

The one idea that organizes everything: where each tool intercepts

A coding agent's token bill splits into a small number of buckets. Each tool's entire character — its risk, its reach, its cache behavior — follows from where in the pipeline it acts. Three of the four bite on one place each; lean-ctx bites on several at once.

                THE CODING-AGENT TOKEN BILL  (heavy modeled session)
   ┌──────────────────────────────────────────────────────────────────────┐
   │  INPUT  (what the model READS)                          61% of $       │
   │    • system prompt + CLAUDE.md + tool schemas                          │
   │    • conversation history                                              │
   │    • tool outputs / logs / build & test output                        │
   │    • file reads / RAG chunks / search results                         │
   │      └── billed as cache-write (29%) + cache-read (32%)                │
   │                                                                        │
   │  THINKING  (reasoning tokens, billed as output, invisible)  20% of $   │
   │                                                                        │
   │  VISIBLE OUTPUT  (the prose the model writes back)          17% of $   │
   │                                                                        │
   │  uncached input                                              2% of $   │
   └──────────────────────────────────────────────────────────────────────┘

        CAVEMAN  ─────────────────────────────────►  VISIBLE OUTPUT
                 shrinks what the model writes              (17%)

        RTK      ─────────────────────────────────►  a SLICE of INPUT:
                 Bash command output, at the tool          shell output only
                 boundary, before it enters context        (part of the 61%)

        HEADROOM ─────────────────────────────────►  BROAD INPUT:
                 tool outputs, files, RAG, history,        most of the 61%
                 on the API wire or via MCP

        LEAN-CTX ════╗═══════════════════════════►  Bash output (shell hook)
                     ╠═══════════════════════════►  native reads (MCP, 10 modes)
                     ╠═══════════════════════════►  history/prose (proxy, opt-in)
                     ╚═══════════════════════════►  + PERSISTENT CODE GRAPH below
                 one runtime spanning every input point + structural retrieval

        (nothing here)  ─────────────────────────►  THINKING  (20%)
                 only the effort / model-routing levers reach it

The diagram yields five facts that the rest of the folder elaborates:

  1. Caveman never overlaps with any of the others. It works on output; they work on input. Running caveman plus any input tool is strictly additive — no double-counting is even possible.
  2. Headroom and RTK look like they overlap, but they act at different points. RTK filters a command's output at the Bash tool boundary, before it is ever sent; headroom compresses the request on the API wire (or an observation via MCP), which includes whatever RTK already filtered plus everything RTK never sees. They compose; the only redundancy is re-compressing the exact same bytes twice.
  3. RTK's reach is a strict subset of headroom's reach. Anything that does not flow through a shell — native Read, Grep, Glob, RAG, conversation history — is invisible to RTK and visible to headroom. RTK buys cache-safety and zero ML cost by giving up reach.
  4. lean-ctx tries to occupy all of those points in one process — shell hook (RTK's slot), MCP read (headroom-MCP's slot), proxy (headroom-proxy's slot) — and then adds a layer below the read that none of the others have: a persistent, queryable code graph (the structural-retrieval lever). It is not a fourth point on the line; it is a runtime drawn across the line. The cost of that is footprint, detailed throughout.
  5. Thinking, the largest bucket outside the two cache lines at 20% of dollars, is untouched by all four. This is why no stack of these tools reaches 10×.
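The ceiling implied by the untouched thinking bucket can be checked with Amdahl-style arithmetic on the bucket shares from the diagram above. This is a minimal sketch, not the dossier's model: the bucket names and the `bill_multiplier` helper are illustrative, and it assumes each bucket's dollars shrink linearly with the fraction cut.

```python
# Dollar shares of the heavy modeled session, taken from the diagram on this page.
BILL_SHARES = {
    "cache_write": 0.29,
    "cache_read": 0.32,
    "thinking": 0.20,
    "visible_output": 0.17,
    "uncached_input": 0.02,
}


def bill_multiplier(cuts: dict[str, float]) -> float:
    """Return old_bill / new_bill given a fractional cut per bucket.

    Buckets missing from `cuts` are left untouched (cut = 0).
    """
    remaining = sum(
        share * (1.0 - cuts.get(bucket, 0.0))
        for bucket, share in BILL_SHARES.items()
    )
    if remaining <= 0.0:
        raise ValueError("the whole bill was cut to zero; multiplier is undefined")
    return 1.0 / remaining


# Best imaginable case: every bucket except thinking is eliminated entirely.
ceiling = bill_multiplier({b: 1.0 for b in BILL_SHARES if b != "thinking"})
print(f"ceiling with thinking untouched: {ceiling:.1f}x")
```

Even with every other bucket removed outright, the multiplier tops out at about 5×, because the 20% thinking share is all that remains. No real tool achieves a 100% cut anywhere, so realistic stacks land well below that.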

The determinism gradient

The tools line up on a single axis from "no machinery at all" to "a full ML runtime." This gradient predicts their latency, their host footprint, their failure modes, and their reach. The first three are points on it; lean-ctx spans it — its default core is as deterministic as RTK, but its opt-in layers (embeddings, proxy prose rewrite) reach into headroom's ML/proxy territory.

   LESS MACHINERY                                                MORE MACHINERY
   (less reach, less risk)                          (more reach, more reversibility)
   ◄──────────────────────────────────────────────────────────────────────────►

   CAVEMAN                 RTK                  HEADROOM            LEAN-CTX
   ───────                 ───                  ────────            ────────
   a PROMPT, not           12 deterministic     router + typed      a RUNTIME spanning
   a program               Rust filters         compressors + a     the whole axis:
   (model compresses       keyed on the         trained ML model    deterministic core
    its own decoding)      command; no model    (kompress-base)     (tree-sitter/BM25/
                                                                    entropy) + OPT-IN
                                                                    embeddings & proxy

   ML in loop:  NO         NO                   YES (always)        NO by default;
                                                                    opt-in embeddings/proxy
   recovery:    none       tee on FAIL          CCR retrieve        archive + ctx_expand
   runtime:     ~0         5–15 ms/cmd          P50 52ms/P99 4.2s   daemon + 64.7 MB binary
   reach:       output     Bash output          everything on wire  every input point +
                                                                    persistent code graph

More machinery buys breadth and reversibility, but it costs latency, host effects, and a real attack surface. Caveman is the zero-machinery extreme; RTK is maximum determinism without an ML stage; headroom pays for an ML model to buy reach across every input source; lean-ctx is deterministic-by-default but the broadest of all, paying in footprint rather than mandatory ML.
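To make the "deterministic, no model" end of the gradient concrete, here is a minimal sketch of the kind of filter the diagram attributes to RTK: dedup plus truncate, with no learned component. It is written in Python for illustration and is not RTK's code (RTK's filters are Rust and keyed on the command). The function name, the `[xN]` marker, and the `max_lines` default are all hypothetical.

```python
from itertools import groupby


def compress_output(text: str, max_lines: int = 40) -> str:
    """Collapse consecutive duplicate lines, then keep only the head and tail.

    Purely deterministic: the same input always yields the same output.
    """
    if max_lines < 2:
        raise ValueError("max_lines must be at least 2")

    # Dedup step: a run of identical adjacent lines becomes one annotated line.
    collapsed = []
    for line, run in groupby(text.splitlines()):
        count = sum(1 for _ in run)
        collapsed.append(line if count == 1 else f"{line}  [x{count}]")

    # Truncate step: keep the first and last lines, mark what was dropped.
    if len(collapsed) > max_lines:
        head = max_lines // 2
        tail = max_lines - head
        omitted = len(collapsed) - max_lines
        collapsed = (
            collapsed[:head]
            + [f"... {omitted} lines omitted ..."]
            + collapsed[-tail:]
        )
    return "\n".join(collapsed)


noisy = "warning: unused variable\n" * 3 + "error: build failed"
print(compress_output(noisy))
```

A filter like this is lossy, since the omitted lines are gone. That is the trade the gradient describes: nothing to download and no ML latency, but also no way to recover the original without a separate mechanism.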

Master comparison table

| Axis | Caveman | Headroom | RTK | lean-ctx |
|---|---|---|---|---|
| One-line identity | Output-register compressor | Broad input-compression pipeline | Deterministic Bash-output compressor | Integrated context runtime |
| Direction | Output: what the model writes | Input: what the model reads (broad) | Input: what the model reads (Bash only) | Input, all of it + code-graph retrieval (no output) |
| Primary bucket | Visible prose (~17% of $) | Tool outputs / logs / RAG / files / history (the 61% cache lines) | Shell-command output (a slice of the 61%) | Native reads + shell + history + providers (most of the 61%) |
| Engine type | Markdown instruction (no code path) | Router + typed compressors + ML model | 12 deterministic Rust filters keyed on the command | Tree-sitter AST + entropy/TF-IDF + 56 shell patterns + BM25/graph; opt-in embeddings |
| Mechanism | Skill-prompt register change (terse English / Classical Chinese) | Typed compressors (AST/JSON/log/search) + trained text model + drop-low-value | Filter / group / truncate / dedup on 100+ known command formats | 10 read modes + handle cache + CFT Φ-scoring + knapsack compiler |
| ML in the hot path | No | Yes (kompress-base, auto-downloaded) | No | No by default (deterministic core); opt-in embeddings + proxy prose |
| Lossiness | Lossy, no recovery | Reversible (originals in CCR via headroom_retrieve) | Lossy; tee-recovery on command failure only | Reversible (archive + ctx_expand, FTS5-searchable) |
| Touches code? | Passes code/diffs verbatim | Outlines bodies, passes raw code ~0% | Trims cat/grep; language-aware regex code filter (10 langs, per-read) | Its strongest case: tree-sitter outline 96–99% on code (per-read) |
| Touches thinking? | No | No | No | No |
| Cache interaction | Neutral (output side) | Safe in MCP/library, risk in proxy mode | Safe by construction (write-time at the tool boundary) | Safe in MCP/hook (+~13-tok handle re-reads); proxy cache-safe-by-design but lossy |
| MCP schema rent | ~940 tok/session skill listing | Yes, in MCP mode | None | Yes: 77 tools (dynamic loading: core+session only at startup) |
| Reach limit | Visible prose only | Everything in the request | Bash calls only; native Read/Edit/Grep/Glob bypass it | Broadest: reaches native reads (MCP), shell (hook), history (proxy); no output |
| Form factor | Claude Code plugin / skills | pip/npm/docker + Rust core + local ML runtime | Single Rust binary + a PreToolUse hook | 64.7 MB Rust binary + daemon + dashboard + SQLite + MCP/hook/proxy/HTTP |
| Persistent state | None (hooks track tokens) | CCR store + cross-agent memory + learn | SQLite history (rtk gain) | CCP session + knowledge graph + property graph + BM25 + Context OS |
| Self-cost | ~940-tok prefix + 2 hooks | Per-request ML + proxy latency + MCP rent | Hook host-write + hook-conflict surface; ~5–15 ms/cmd | Largest: daemon, dashboard, DBs, 77-tool schema, host writes ×34 agents |
| Best evidence | Output ultra −58.5% local (token) vs 75% claim | −66.1% self-report mix; median 4.8% whole-session (50k+ sessions); 47.5% independent | 60–90% per-command, vendor-only; no whole-session telemetry, no independent benchmark | 96–99% on code reads locally reproduced here; bounce-netted signed ledger; no independent benchmark, GPT tokenizer |
| Adoption (2026-06-18/20) | 74,446★ / 166 watchers | 33,359★ / 111 watchers | 63,608★ / 146 watchers | 2,800★ / 19 watchers (youngest, least-inflated) |
| License | Plugin/skill model (MIT) | Apache-2.0 | Apache-2.0 | Apache-2.0 (local free; paid cloud sync) |
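The "abnormally low watcher ratios" behind the adoption caveat can be recomputed from the adoption row of the table. The sketch below does only that arithmetic, on the star and watcher counts quoted on this page. The watchers-per-1,000-stars metric and the helper name are illustrative choices, and a low ratio is suggestive of inflation, not proof.

```python
# (stars, watchers) as quoted in the adoption row of the comparison table.
ADOPTION = {
    "caveman": (74_446, 166),
    "headroom": (33_359, 111),
    "RTK": (63_608, 146),
    "lean-ctx": (2_800, 19),
}


def watchers_per_thousand_stars(stars: int, watchers: int) -> float:
    """Normalize watcher count by star count so repos of any size compare."""
    if stars <= 0:
        raise ValueError("stars must be positive")
    return 1000.0 * watchers / stars


ratios = {
    name: watchers_per_thousand_stars(stars, watchers)
    for name, (stars, watchers) in ADOPTION.items()
}

# Print from lowest ratio (most star-heavy) to highest.
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:9s} {ratio:5.2f} watchers per 1,000 stars")
```

On these figures lean-ctx has the highest ratio of the four and caveman the lowest, which matches the "least-inflated" reading in the table.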

The verdict in one paragraph

No one product combines everything and gets the best of each — and lean-ctx, the one tool that genuinely tries to, illustrates why rather than refuting it. The three specialists win because they specialize: caveman by being a free prompt that touches the 5×-priced output class with zero machinery; RTK by being deterministic and cache-safe-by-construction on the one input slice that is both large and concrete (shell output), at zero ML and zero MCP cost; headroom by paying for an ML stage and a proxy to reach the input sources RTK cannot (native reads, RAG, history) and to make compression reversible. lean-ctx adds something real none of them have — a persistent, queryable code graph (the structural-retrieval lever) plus a verification layer that proves the saving — but to get there it carries a 64.7 MB binary, a daemon, a dashboard, several databases, a 77-tool schema, and host writes across dozens of agents: every cost the combining page predicted a monolith would inherit. So the choice is not "specialists vs the one that does it all"; it is layered specialists vs an integrated runtime — the lean stack when you want the cheapest cache-safe win, lean-ctx when you specifically want its code-graph/memory/verification surface and can carry the footprint. Either way, none of the four touches thinking, so none is a 10× story.

How to read this folder

| # | Page | What it answers |
|---|---|---|
| 01 | Caveman design | How the output-register compressor works: the "prompt, not a program" magic, the six levels, the hooks, the cavecrew/cavemem/cavekit ecosystem. |
| 02 | Headroom design | How the broad input pipeline works: the ContentRouter, the typed compressors, the kompress-base ML stage, the live-zone cache-stabilization magic, CCR, learn, the four deployment modes. |
| 03 | RTK design | How the deterministic Bash-boundary compressor works: the six-phase lifecycle, the 12 strategies, the code filter, the hook modes, the reach limit. |
| 04 | lean-ctx design | How the integrated context runtime works: the superset thesis, what it productizes from each of the other three, the code graph + RRF search the three lacked, CFT Φ-scoring, the verification/proof layer, and the monolith-tax footprint, with first-party build + benchmark numbers. |
| 05 | Head-to-head | The feature has/lacks matrix (now four-way), the internals side-by-side, and the best case for each: where each beats the others. |
| 06 | Combining | Is there one product? lean-ctx as the real test of the monolith thesis, the layered stack, the published head-to-head numbers, integrated-runtime-vs-stack by project shape, and the jackin' adoption order. |
| 07 | Evidence and claims | Benchmarks, what is real vs self-report, the consolidated claim graveyard, adoption-stat caveats, and the validation harness. |
| 08 | Records, ledger & unverified | The formal per-technique records (C1 / H1–H4 / R1 / L1), the full consolidated source ledger, and the unverified-claims register: vague or vendor-only numbers kept and marked "not proven" so nothing prior is lost. |
| 09 | Gaps, open questions & next brief | What this research still misses: uncompared axes, capabilities a fresh sweep surfaced, the unanswered questions, and a ready-to-paste /goal brief for the next round. |
| 10 | First-party measurements | The first measured (not self-reported) numbers: this repo's token decomposition (94% cache-read), RTK's reach ceiling at 16.5% of observation tokens vs native Read at 76.2%, and a locally built lean-ctx benchmark (96–99% on code reads, <10% on prose). Full 6-arm A/B still INCOMPLETE. |
| 11 | Extended comparison axes | The six axes the earlier pages never compared: security/privacy/supply-chain, project health & sustainability, interaction with Claude Code native context features, build-vs-buy vs an output-style, subscriber tasks-per-cap (which inverts the $-per-task order), and non-coding generality. |
| 12 | Implementation deep-dive | The first source-level pass: all four repos cloned and read, comparing rival implementations of each shared feature (shell compression, code outlining, cache-safety, memory, savings accounting, output register) with file:line evidence on whose code is better and why. Confirms lean-ctx out-engineers the specialists on most input features (tree-sitter vs regex, bounce-netted signed ledger vs SUM, measured cache-safety ratio) yet still has no output register, and logs the source-verified corrections to earlier pages. |
