02 — Headroom: design teardown

Headroom is the broad input-side member of the original trio, and — until lean-ctx joined this comparison — the only one that is a genuine runtime system rather than a prompt or a single filter binary. Where caveman is a markdown rule and RTK is a deterministic command filter, headroom is a Rust compression core with a content router, a fleet of typed compressors, a trained ML model, a reversible store, and a provider-aware proxy. That engineering depth is the reason it can reach token sources the other two cannot — and the reason it carries costs the other two do not.

Field	Value
Repository	`chopratejas/headroom`
Pitch	"Compress tool outputs, logs, files, and RAG chunks before they reach the LLM. 60-95% fewer tokens, same answers. Library, proxy, MCP server."
Languages	Python 78% (API/integrations), Rust 17.3% (`headroom-core`, `headroom-proxy` — the hot path), TypeScript 2.5%
Companion model	`chopratejas/kompress-base` on HuggingFace — a transformer trained on agentic traces, auto-downloaded
Latest seen	`v0.26.0` (~190 PyPI releases — fast cadence)
Adoption (2026-06-18)	33,359★ / 111 watchers — PR-inflated; see evidence
License	Apache-2.0
Bucket hit	Tool outputs / logs / RAG / files / history (the 61% cache lines)
Cache interaction	Safe in MCP/library mode; risk in whole-prompt proxy mode

The magic: route each payload to a compressor that understands it, then stabilize the cache

Headroom's central design idea is that there is no single best way to compress an arbitrary payload — there is a best way per content type. A request is a mix of logs, JSON, source code, search results, HTML, and conversation history, and each of those compresses well only under a transform that understands its shape. So headroom classifies each payload with a ContentRouter and dispatches it to a typed compressor, with a trained text model as the fallback for free-form prose. Then — and this is the part that distinguishes it from the LLMLingua family the dossier had previously written off — it does the compression in a way that does not break prompt caching, by stabilizing the cached prefix and compressing only the volatile tail.

                       HEADROOM REQUEST PIPELINE

   incoming request (mixed payloads)
        │
        ▼
   ┌────────────────────┐   classify each payload by content type
   │   ContentRouter     │   (the dominant cost: ~11.7 ms, 91–98% of the
   │                     │    16.9 ms median pipeline)
   └────────────────────┘
        │
        ├──► LogCompressor ............ keep errors/levels, drop passing noise
        ├──► CodeAwareCompressor ...... keep imports/signatures/types, collapse bodies
        ├──► SearchCompressor ......... reduce to file:line:content
        ├──► SmartCrusher ............. JSON arrays → sampled/typed, keep anomalies
        ├──► HTMLCompressor ........... strip tag structure to content
        ├──► IntelligentContext ....... score msgs by recency/relevance/error,
        │                               drop low-value turns
        └──► TextCompressor / kompress-base ... ML perplexity-style prose
                                               compression  (the ONE ML stage)
        │
        ▼
   ┌──────────────────────────────────────────────┐
   │  cache_stabilization/  +  live_zone           │   THE CACHE-SAFETY MAGIC
   │   • volatile_detector.rs  (find the tail)     │   keep the prefix byte-
   │   • tool_def_normalize.rs (stabilize tools)   │   identical; compress only
   │   • anthropic_cache_control.rs (breakpoints)  │   the volatile live zone;
   │   • drift_detector.rs     (catch churn)       │   insert cache_control at
   │   • live_zone_anthropic.rs (compress tail)    │   stable boundaries
   └──────────────────────────────────────────────┘
        │
        ▼
   compressed request ──► provider (Anthropic / OpenAI / Bedrock / Gemini)
        │
        └──► originals stored in CCR  ──►  headroom_retrieve (reversible)
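
The routing step in the diagram can be sketched in a few lines. This is a minimal illustration, not headroom's ContentRouter: the content-type labels, the regex heuristics, and the function names `classify`, `compress_log`, `compress_json`, and `route` are all assumptions made for the example, and only two of the typed compressors are stubbed in.

```python
import json
import re
from typing import Callable, Dict, Tuple

# Hypothetical heuristics; headroom's real classifier is not reproduced here.
LOG_LEVEL = re.compile(r"\b(DEBUG|INFO|WARN|WARNING|ERROR|FATAL)\b")
LOG_KEEP = re.compile(r"\b(WARN|WARNING|ERROR|FATAL)\b")
SEARCH_HIT = re.compile(r"^[^\s:]+:\d+:")  # path:line:content


def classify(payload: str) -> str:
    """Guess a payload's content type with cheap, ordered checks."""
    text = payload.strip()
    if not text:
        return "text"
    if text[0] in "[{":
        try:
            json.loads(text)
            return "json"
        except ValueError:
            pass  # looked like JSON but is not; fall through
    if text.lower().startswith(("<!doctype html", "<html")):
        return "html"
    lines = text.splitlines()
    if all(SEARCH_HIT.match(line) for line in lines):
        return "search"
    if sum(bool(LOG_LEVEL.search(line)) for line in lines) >= len(lines) / 2:
        return "log"
    return "text"


def compress_log(payload: str) -> str:
    """Keep warning/error lines, drop passing noise."""
    return "\n".join(l for l in payload.splitlines() if LOG_KEEP.search(l))


def compress_json(payload: str) -> str:
    """Minify JSON; a real crusher would also sample arrays."""
    return json.dumps(json.loads(payload), separators=(",", ":"))


COMPRESSORS: Dict[str, Callable[[str], str]] = {
    "log": compress_log,
    "json": compress_json,
}


def route(payload: str) -> Tuple[str, str]:
    """Classify, then dispatch; unknown types pass through untouched."""
    kind = classify(payload)
    return kind, COMPRESSORS.get(kind, lambda p: p)(payload)
```

Passing unknown types through unchanged mirrors the conservative default the published benchmarks describe for code; the fallback to an ML text compressor is deliberately left out of this sketch.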

The typed compressors — and what each one really is

The most important finding for an operator deciding whether headroom is worth its cost is that headroom is largely a productization of levers the dossier had already validated by hand — usually with stronger (locally reproduced) evidence than headroom's own self-report. Headroom's value is not a new compression physics; it is packaging six proven transforms behind one router with cache-safety and reversibility.

Headroom component	What it does	The proven lever it productizes	Strongest existing evidence
LogCompressor	Keep errors/stack traces/levels, drop passing noise	Hook/preprocessing log filtering	local −94.2% on a cargo log, all failures preserved
CodeAwareCompressor	Keep imports/signatures/types, collapse bodies	Repo-map / outline context	local −91% outline vs whole-file read
SearchCompressor	`file:line:content`, drop verbose detail	Symbol-search retrieval	local −98% symbol-search vs file read
SmartCrusher	JSON arrays → sampled/typed, keep anomalies	TOON + JSON minification	local −34.3% minify / −41.2% TOON
HTMLCompressor	Strip tag structure to content	markdown-not-HTML, `max_content_tokens`	official pattern + Firecrawl 94%
IntelligentContext	Score by recency/relevance/error, drop low-value messages	Context editing + compaction	vendor −84%/+29% (search domain; unproven on code)
TextCompressor / kompress-base	ML perplexity-style prose compression	LLMLingua family	T2 NL only — the RISKY one for code

Two rows deserve caution flags, and they are the two that separate headroom from the deterministic RTK:

TextCompressor / kompress-base is the lossy perplexity-style compressor wearing a trained-model coat. It is the component most likely to drop a load-bearing identifier or caveat, and it runs an auto-downloaded model on every request through the proxy. This is the one place in any of the three tools where an ML model sits in the hot path — which is both headroom's reach advantage (it can compress free-form prose the deterministic tools cannot) and its biggest risk surface.
IntelligentContext is vendor-proven only on agentic search, never on code. An evicted tool result that turns out to matter 40 turns later is its silent failure.
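
By contrast, the deterministic rows in the table are easy to reason about. As an illustration of the "keep imports/signatures/types, collapse bodies" lever, here is a minimal outline pass over Python source using the standard `ast` module (Python 3.9+ for `ast.unparse`). It is a sketch of the technique, not headroom's CodeAwareCompressor, and the function name `outline` is hypothetical.

```python
import ast
from typing import List


def outline(source: str) -> str:
    """Keep imports and def/class signatures; replace every body with '...'."""
    out: List[str] = []

    def visit(nodes, depth: int) -> None:
        pad = "    " * depth
        for node in nodes:
            if isinstance(node, (ast.Import, ast.ImportFrom)):
                out.append(pad + ast.unparse(node).strip())
            elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                kw = "async def" if isinstance(node, ast.AsyncFunctionDef) else "def"
                args = ast.unparse(node.args)
                ret = f" -> {ast.unparse(node.returns)}" if node.returns else ""
                out.append(f"{pad}{kw} {node.name}({args}){ret}: ...")
            elif isinstance(node, ast.ClassDef):
                bases = ", ".join(ast.unparse(b) for b in node.bases)
                head = f"class {node.name}({bases}):" if bases else f"class {node.name}:"
                out.append(pad + head)
                visit(node.body, depth + 1)  # keep method signatures
            # Everything else (statements, assignments) is dropped.

    visit(ast.parse(source).body, 0)
    return "\n".join(out)
```

The output is an outline for the model to read, not valid Python to execute: a class with no methods collapses to a bare header. Because the transform is a pure function of the syntax tree, the same input always yields the same outline, which is the property the ML stage cannot offer.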

Why `kompress-base` works on hosted Claude when the "real" research compressors do not

A subtle but decisive design choice: kompress-base compresses to natural-language-ish text that the hosted model reads normally. This is why it works on a hosted Claude API where the academically more impressive soft-prompt compressors (Gist, ICAE, 500xCompressor, xRAG, PISCO, Cartridges) cannot run at all — those compress into embeddings or KV state that the model must be trained to read, and no hosted API exposes that channel. The category insight headroom embodies: on hosted APIs, only text-to-text compression is usable, and text-to-text compression is inherently lossy. Headroom accepts that lossiness and mitigates it with reversibility (CCR) rather than pretending it away.

The cache-safety machinery: live-zone compression

This is the part of headroom worth the deepest look, because it is the thing that refutes the dossier's earlier blanket verdict that "input compression breaks the cache."

The prior position was correct for whole-prompt recompression: a compressor that rewrites the whole prompt every turn mutates the cached prefix and converts cheap 0.1× cache reads back into 1.25–2× cache writes. On the modeled day, such a compressor must clear ~5.5× compression on a mixed prompt, or ~10× on a fully cacheable prefix, just to break even. A pre-registered 358-run Claude Sonnet 4.5 RCT (arXiv 2603.23525) found that moderate input compression cut cost 27.9%, but aggressive input compression actually raised cost 1.8% because output expanded, and that study did not even price in the cache misses such compression also causes.
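
The break-even logic can be made concrete with a simplified model. The sketch below assumes the uncompressed prompt pays the cache-read price on its cached fraction and the base price (1.0×) on the rest, while the recompressed prompt loses the cache and pays `rewritten_price` on every surviving token. The function name and both default prices are assumptions of this example; the dossier's modeled-day mix is not reproduced here, so only the fully cacheable case should be read against the figures above.

```python
def break_even_ratio(cached_fraction: float,
                     read_price: float = 0.1,
                     rewritten_price: float = 1.0) -> float:
    """Compression ratio at which a cache-busting compressor merely ties.

    Baseline cost per token: cached tokens at read_price, the rest at 1.0.
    Compressed cost per original token: rewritten_price / ratio.
    Setting the two equal and solving for ratio gives the return value.
    """
    baseline = cached_fraction * read_price + (1.0 - cached_fraction) * 1.0
    return rewritten_price / baseline


# Fully cacheable prefix, rewritten tokens billed at base input price.
print(break_even_ratio(1.0))                        # 10.0
# Same prefix, but rewritten tokens billed as 1.25x cache writes.
print(break_even_ratio(1.0, rewritten_price=1.25))  # 12.5
```

Under this model the hurdle climbs steeply as the cached fraction approaches 1, which is why a workload that is already heavily cached is the worst possible place for a whole-prompt recompressor.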

Headroom does not do that. Its cache_stabilization subsystem and live_zone_anthropic compression split each request into a stable prefix and a volatile live zone, and compress only the live zone while keeping the prefix byte-identical. The evidence is in the Rust source, not just the marketing:

   WHOLE-PROMPT RECOMPRESSION (the trap)      LIVE-ZONE COMPRESSION (headroom)
   ─────────────────────────────────────      ────────────────────────────────
   [ prefix | history | new obs ]             [ STABLE PREFIX (untouched) ]
        rewrite the WHOLE thing                    │  byte-identical → 0.1× reads survive
        every turn                                 │
        │                                     [ VOLATILE LIVE ZONE ]
        ▼                                          │  compress ONLY this, once,
   prefix bytes change                             ▼  before it is first cached
        │                                     cache_control breakpoint inserted
        ▼                                          at the stable boundary
   cache BUSTED → 0.1× reads                       │
   become 1.25–2× writes                           ▼
   (must beat ~5.5–10× to win)               cache PRESERVED; only the new
                                              observation's write+reads shrink

The subsystem — headroom names the prefix-stabilizing component CacheAligner ("extracts dynamic content and moves it to the end of the message, keeping the prefix stable… so the provider's KV cache can reuse previously computed attention states") — is concrete: volatile_detector.rs finds the tail, tool_def_normalize.rs stabilizes tool definitions, anthropic_cache_control.rs inserts breakpoints at stable boundaries, drift_detector.rs catches prefix churn, and a suite of prefix_cache_benchmark.py / cache_bust_trace_report.py tests actively guards against cache-bust regressions. In production this design measurably holds: one independent month-long deployment recorded a 96% prefix-cache-hit rate while headroom was compressing.
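
The core invariant is simple enough to sketch: find the longest run of messages that is unchanged since the previous turn, leave it alone, and compress only what follows. This is an illustration of the idea, not headroom's Rust implementation; the message shape (dicts with a string `content`), the helper names, and the use of a SHA-256 digest as the "byte-identical" check are all assumptions of the example.

```python
import hashlib
import json
from typing import Callable, Dict, List, Tuple

Message = Dict[str, str]


def _digest(messages: List[Message]) -> str:
    """Stable fingerprint of a message list, used to prove it was untouched."""
    blob = json.dumps(messages, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def split_live_zone(previous: List[Message],
                    current: List[Message]) -> Tuple[List[Message], List[Message]]:
    """Return (stable prefix, volatile tail) relative to the previous turn."""
    n = 0
    for old, new in zip(previous, current):
        if old != new:
            break  # first divergence ends the stable prefix
        n += 1
    return current[:n], current[n:]


def compress_live_zone(previous: List[Message],
                       current: List[Message],
                       compress: Callable[[str], str]) -> List[Message]:
    """Compress only the tail; assert the prefix fingerprint is unchanged."""
    prefix, live = split_live_zone(previous, current)
    before = _digest(prefix)
    new_live = [{**m, "content": compress(m["content"])} for m in live]
    assert _digest(prefix) == before, "prefix must stay identical"
    return prefix + new_live
```

If an early message changes between turns, `split_live_zone` returns an empty prefix and the whole request becomes live zone. That is the drift case a detector exists to catch, because it means the provider cache is about to miss regardless of what the compressor does.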

The catch is that this story is only clean in MCP and library mode, and gets risky in whole-prompt proxy mode in front of Claude Code:

Headroom mode	Cache interaction on Claude Code	Verdict
MCP (`headroom_compress` on observations)	Compresses the tool output before it is cached; prefix untouched	Cache-safe — the recommended way to use it
Library (`compress()` on a payload pre-append)	Same as MCP; you control what gets compressed	Cache-safe
Agent wrapper (`headroom wrap claude`)	Depends on whether it intercepts as a proxy	Audit before trusting
Whole-prompt proxy in front of Claude Code	Rewrites traffic Claude Code already caches; can churn the prefix; double-compaction risk	Cache-risk — do not default

The reasons proxy mode is risky in front of Claude Code: Claude Code already stabilizes its own prefix and places cache_control breakpoints, so a second stabilizer is redundant at best and can disagree at worst; a proxy that rewrites bodies can silently invalidate the exact prefix Claude Code intended to cache (you simply stop seeing cache_read); and Claude Code runs its own compaction, so headroom's independent IntelligentContext dropping can double-compact and evict content the client still expects.
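
Because the failure is silent, the practical defense is to watch the usage block the provider returns. The Anthropic Messages API reports `input_tokens`, `cache_creation_input_tokens`, and `cache_read_input_tokens` per response; the sketch below turns those into a per-turn cache-read share and flags turns where it collapses. The helper names and the 0.5 floor are hypothetical choices for illustration, not a recommended threshold.

```python
from typing import Dict, List


def cache_read_share(usage: Dict[str, int]) -> float:
    """Fraction of this turn's input tokens that were served from cache."""
    read = usage.get("cache_read_input_tokens", 0) or 0
    written = usage.get("cache_creation_input_tokens", 0) or 0
    fresh = usage.get("input_tokens", 0) or 0
    total = read + written + fresh
    return read / total if total else 0.0


def flag_cache_regressions(usages: List[Dict[str, int]],
                           floor: float = 0.5) -> List[int]:
    """Indices of turns (after the first) whose read share is below floor.

    The first turn is skipped because it normally writes the cache
    rather than reading it.
    """
    return [i for i, u in enumerate(usages)
            if i > 0 and cache_read_share(u) < floor]
```

Comparing the flagged turns with and without the proxy in the path is a cheap way to audit whether a wrapper or proxy is rewriting the prefix Claude Code meant to cache.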

The genuinely new ideas: reversibility, cross-agent memory, failure-mining

Beyond productizing known levers, headroom ships four ideas that are new relative to the dossier — and they are the features that have no equivalent in caveman or RTK:

H1 — Live-zone input compression. Documented above: the cache-safe input-compression design point that the dossier had said barely existed. Refines the old "no compressor in the hot path" kill to "no whole-prompt recompressor in the hot path."
H2 — Reversible compression with on-demand retrieval (CCR). Compressed content is stored verbatim in a CCR store (SQLite/Redis/in-memory backends); the model receives a compressed view plus a headroom_retrieve tool and can fetch the original within a TTL when it needs full detail. Lossy compression becomes recoverable lossy compression — which in principle removes the "confidently-wrong recalled fact" failure mode that makes lossy memory tools risky, if the model reliably knows when to retrieve. This is the single biggest architectural advantage headroom has over both RTK (tee on failure only) and caveman (no recovery at all).
H3 — Failure-mining into memory files (headroom learn). Analyze past failed sessions across Claude/Codex/Gemini and write durable corrections into CLAUDE.md/AGENTS.md, so the always-loaded prefix improves over time instead of repeating mistakes. A closed self-correction loop with no equivalent anywhere else in the trio. Its risk: an auto-written rule that is wrong or over-general is one bad commit that can erase months of savings, so it demands a human gate.
H4 — Cross-agent deduplicated shared memory. A single store shared across Claude, Codex, and Gemini with automatic dedup, so a fact learned in one agent is available once to the others instead of being re-derived per tool. Genuinely useful only for multi-tool operators — which is exactly the niche where it beats caveman's single-agent cavemem.
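
The reversibility idea in H2 reduces to a keyed store with an expiry. The following is a minimal in-memory sketch under stated assumptions: the class name `ReversibleStore`, the content-hash key, and the marker text appended to the compressed view are all hypothetical, and headroom's actual CCR schema and `headroom_retrieve` wire format are not reproduced here.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple


class ReversibleStore:
    """Keep originals for a TTL so a lossy view can be expanded on demand."""

    def __init__(self, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._items: Dict[str, Tuple[float, str]] = {}

    def put(self, original: str, compressed: str) -> Tuple[str, str]:
        """Store the original; return (key, view the model will see)."""
        key = hashlib.sha256(original.encode("utf-8")).hexdigest()[:16]
        self._items[key] = (self._clock() + self._ttl, original)
        view = f"{compressed}\n[compressed; retrieve id={key} for full text]"
        return key, view

    def retrieve(self, key: str) -> Optional[str]:
        """Return the original, or None if unknown or past its TTL."""
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, original = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired: recovery is no longer possible
            return None
        return original
```

The TTL is where the guarantee ends: once an entry expires, the compression is lossy again, so the retention window has to outlast the longest gap between compressing a result and needing it back.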

Deployment modes

Headroom is the most deployment-flexible of the three, and the flexibility is real surface area, not marketing:

library — compress(messages) in your own code.
proxy — headroom proxy --port 8787, rewrites all traffic (the risky mode in front of Claude Code).
agent wrapper — headroom wrap claude|codex|cursor|aider|copilot.
MCP server — exposes headroom_compress / headroom_retrieve / headroom_stats (the recommended, cache-safe mode).

It targets Anthropic, OpenAI, Bedrock (with SigV4), and Gemini, and integrates with LangChain, LiteLLM, Agno, Strands, the Vercel AI SDK, and the major coding agents.

What headroom has, and what it lacks

Feature	Headroom
Compresses broad input (tool output, files, RAG, history)	Yes — the only one of the three that reaches all of these
Reaches native-tool reads (not just Bash)	Yes — acts on the API request, so it sees everything
Reversible / recoverable compression (CCR)	Yes — unique among the three
Cross-agent shared memory with dedup	Yes — unique
Failure-mining into memory files (`learn`)	Yes — unique
Cache-safe input compression	Yes, in MCP/library mode (live-zone design)
Typed, content-aware compressors	Yes (6 typed compressors + the ML TextCompressor)
Compresses output (what the model writes)	Partial — an optional output shaper, off by default; caveman is better at this
Touches thinking (20% of dollars)	No
Deterministic / no ML in the loop	No — `kompress-base` is in the hot path
Zero host effects	No — fetches an ONNX runtime + model over TLS, runs local processes
Cache-safe in proxy mode in front of Claude Code	No — double-stabilization / cache-bust risk
Independent whole-session benchmark	Partial — one independent 47.5% measurement; the rest is vendor self-report

Self-cost (measured, not guessed)

Headroom is the only one of the three with published latency telemetry, and it is candid. Across 50k+ sessions (v0.5.18), proxy overhead is P50 52 ms / P90 309 ms / P99 4,172 ms / mean 161 ms. The internal pipeline runs at a 16.9 ms median, with the ContentRouter alone at 11.7 ms (reported as 91–98% of pipeline time); SmartCrusher adds ~50.1 ms and TextCompressor ~32.0 ms on the payloads that actually hit them. A third party measured +200–500 tokens of passthrough metadata per request. On top of that come three further costs: MCP schema rent in MCP mode; the auto-downloaded kompress-base model as a hot-path attack surface (a compressor in the request path is exactly the integrity boundary a "CompressionAttack" targets); and, for sandboxed roles, model assets that must be provisioned for offline or SSL-restricted use.

The failure modes follow from the machinery: the ML stage can drop an identifier on code; proxy mode can silently bust the cache; and IntelligentContext can double-compact against Claude Code. The two content losses are recoverable via CCR if the model knows to retrieve, which is the load-bearing "if"; a busted cache is not recoverable that way.

Evidence and the headline corrections

Headroom's numbers are internally consistent and, unusually, honest about the easy-vs-hard split — but they are the maintainer's own, and two corrections matter:

"60–95% fewer tokens" is a per-payload ratio, not a whole-bill number. Headroom's own benchmarks show it: repetitive logs/JSON compress 87–94%, but grep results and source code compressed 0% in the published v0.5.18 run ("code passes through to preserve correctness"). The representative mixed figure is 66.1%. Its own production telemetry settles the whole-session reality: median 4.8% / P75 6.9% / mean 11.3%, reaching 40–80% only on heavy tool-use sessions. One independent deploy measured 47.5% whole-session on a tool-heavy coding session (RAG prose 0%, logs 31%); an HN user reported "~50%."
"96.2% total savings" double-counts caching Claude Code already banks. That figure multiplies headroom's compression by prompt-caching's 90%-off — but Claude Code already runs maximally cached (the local heavy session measured 92.83% cache reads), so the 90%-off is the floor, not a marginal saving. Headroom's incremental lever on Claude Code is the compression fraction on the live zone alone.

Headroom's evidence tier is T1 for the mechanisms (the underlying log/outline/minify/search levers are locally reproduced and even academically backed for the write-time pattern) and T3-weak for the specific product percentages (vendor self-report plus one independent measurement). Its full benchmark tables, the H1–H4 records, and the headroom-specific claim graveyard live in the dossier's headroom chapter, with the surrounding market in the compression-literature chapter.
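
The double-counting correction above can be checked with a few lines of arithmetic. The sketch assumes, as the correction describes, that the headline figure combines compression and the cache discount multiplicatively. The function names and the `live_zone_share` parameter are hypothetical, and the implied compression figure is a back-calculation under that assumption, not a number headroom publishes.

```python
def combined_saving(compression: float, cache_discount: float = 0.90) -> float:
    """Headline-style saving: compression stacked on the cache discount."""
    return 1.0 - (1.0 - compression) * (1.0 - cache_discount)


def implied_compression(combined: float, cache_discount: float = 0.90) -> float:
    """Invert combined_saving to recover the compression it assumes."""
    return 1.0 - (1.0 - combined) / (1.0 - cache_discount)


def marginal_saving(compression: float, live_zone_share: float) -> float:
    """Saving on a bill that is already cached: only the live zone shrinks.

    live_zone_share is the fraction of the bill attributable to the
    volatile live zone (an input you must measure for your own sessions).
    """
    return compression * live_zone_share


# A 96.2% headline with a 90% cache discount implies about 62% compression.
print(round(implied_compression(0.962), 3))  # 0.62
```

The gap between `combined_saving` and `marginal_saving` is the part of the headline that a fully cached client was already collecting before the compressor was installed.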

Next: 03 — RTK design, the deterministic mirror of this pipeline — same kinds of transform, no ML, one binary.
