# 02 — Headroom: design teardown (https://jackin.tailrocks.com/research/token-optimization-tools/02-headroom-design/)




Headroom is the **broad input-side** member of the original trio, and — until [lean-ctx](/research/token-optimization-tools/04-leanctx-design/) joined this comparison — the only one that is a genuine runtime system rather than a prompt or a single filter binary. Where caveman is a markdown rule and RTK is a deterministic command filter, headroom is a Rust compression core with a content router, a fleet of typed compressors, a trained ML model, a reversible store, and a provider-aware proxy. That engineering depth is the reason it can reach token sources the other two cannot — and the reason it carries costs the other two do not.

| Field                 | Value                                                                                                                                          |
| --------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
| Repository            | `chopratejas/headroom`                                                                                                                         |
| Pitch                 | "Compress tool outputs, logs, files, and RAG chunks before they reach the LLM. 60-95% fewer tokens, same answers. Library, proxy, MCP server." |
| Languages             | Python 78% (API/integrations), Rust 17.3% (`headroom-core`, `headroom-proxy` — the hot path), TypeScript 2.5%                                  |
| Companion model       | `chopratejas/kompress-base` on HuggingFace — a transformer trained on agentic traces, auto-downloaded                                          |
| Latest seen           | `v0.26.0` (\~190 PyPI releases — fast cadence)                                                                                                 |
| Adoption (2026-06-18) | 33,359★ / 111 watchers — PR-inflated; see [evidence](/research/token-optimization-tools/07-evidence-and-claims/)                               |
| License               | Apache-2.0                                                                                                                                     |
| Bucket hit            | Tool outputs / logs / RAG / files / history (the 61% cache lines)                                                                              |
| Cache interaction     | **Safe** in MCP/library mode; **risk** in whole-prompt proxy mode                                                                              |

## The magic: route each payload to a compressor that understands it, then stabilize the cache [#the-magic-route-each-payload-to-a-compressor-that-understands-it-then-stabilize-the-cache]

Headroom's central design idea is that **there is no single best way to compress an arbitrary payload; there is a best way per content type.** A request is a mix of logs, JSON, source code, search results, HTML, and conversation history, and each of those compresses well only under a transform that understands its shape. So headroom classifies each payload with a `ContentRouter` and dispatches it to a *typed* compressor, with a trained text model as the fallback for free-form prose. Then (and this is the part that distinguishes it from the LLMLingua family the dossier had previously written off) it does the compression in a way that **does not break prompt caching**, by stabilizing the cached prefix and compressing only the volatile tail.

```text
                       HEADROOM REQUEST PIPELINE

   incoming request (mixed payloads)
        │
        ▼
   ┌────────────────────┐   classify each payload by content type
   │   ContentRouter     │   (the dominant cost: ~11.7 ms, 91–98% of the
   │                     │    16.9 ms median pipeline)
   └────────────────────┘
        │
        ├──► LogCompressor ............ keep errors/levels, drop passing noise
        ├──► CodeAwareCompressor ...... keep imports/signatures/types, collapse bodies
        ├──► SearchCompressor ......... reduce to file:line:content
        ├──► SmartCrusher ............. JSON arrays → sampled/typed, keep anomalies
        ├──► HTMLCompressor ........... strip tag structure to content
        ├──► IntelligentContext ....... score msgs by recency/relevance/error,
        │                               drop low-value turns
        └──► TextCompressor / kompress-base ... ML perplexity-style prose
                                               compression  (the ONE ML stage)
        │
        ▼
   ┌──────────────────────────────────────────────┐
   │  cache_stabilization/  +  live_zone           │   THE CACHE-SAFETY MAGIC
   │   • volatile_detector.rs  (find the tail)     │   keep the prefix byte-
   │   • tool_def_normalize.rs (stabilize tools)   │   identical; compress only
   │   • anthropic_cache_control.rs (breakpoints)  │   the volatile live zone;
   │   • drift_detector.rs     (catch churn)       │   insert cache_control at
   │   • live_zone_anthropic.rs (compress tail)    │   stable boundaries
   └──────────────────────────────────────────────┘
        │
        ▼
   compressed request ──► provider (Anthropic / OpenAI / Bedrock / Gemini)
        │
        └──► originals stored in CCR  ──►  headroom_retrieve (reversible)
```
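The classify-then-dispatch step at the top of the pipeline can be sketched in a few lines. This is a minimal illustration of the routing idea only, not headroom's `ContentRouter`: the function names (`classify`, `route`, `compress_json`, `compress_log`) and the detection heuristics are hypothetical, and the real router distinguishes more content types.

```python
import json
import re
from typing import Callable, Dict, Tuple


def compress_json(payload: str) -> str:
    # Hypothetical stand-in: re-serialize without insignificant whitespace.
    return json.dumps(json.loads(payload), separators=(",", ":"))


def compress_log(payload: str) -> str:
    # Hypothetical stand-in: keep only lines carrying a WARN or ERROR level.
    keep = [ln for ln in payload.splitlines() if re.search(r"\b(ERROR|WARN)\b", ln)]
    return "\n".join(keep)


def passthrough(payload: str) -> str:
    # Unknown content is returned unchanged rather than risk damaging it.
    return payload


def classify(payload: str) -> str:
    """Crude content-type guess; an illustrative heuristic only."""
    text = payload.strip()
    if text.startswith(("{", "[")):
        try:
            json.loads(text)
            return "json"
        except ValueError:
            pass  # looked like JSON but did not parse; fall through
    if re.search(r"\b(INFO|WARN|ERROR|DEBUG)\b", text):
        return "log"
    return "text"


ROUTES: Dict[str, Callable[[str], str]] = {
    "json": compress_json,
    "log": compress_log,
    "text": passthrough,
}


def route(payload: str) -> Tuple[str, str]:
    """Classify one payload and dispatch it to the matching compressor."""
    kind = classify(payload)
    return kind, ROUTES[kind](payload)
```

The design point the sketch preserves is that a misclassified payload falls through to a conservative default instead of being forced through the wrong transform.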

## The typed compressors — and what each one really is [#the-typed-compressors--and-what-each-one-really-is]

The most important finding for an operator deciding whether headroom is worth its cost is that **headroom is largely a productization of levers the dossier had already validated by hand** — usually with stronger (locally reproduced) evidence than headroom's own self-report. Headroom's value is not a new compression physics; it is packaging six proven transforms behind one router with cache-safety and reversibility.

| Headroom component                 | What it does                                              | The proven lever it productizes         | Strongest existing evidence                                  |
| ---------------------------------- | --------------------------------------------------------- | --------------------------------------- | ------------------------------------------------------------ |
| **LogCompressor**                  | Keep errors/stack traces/levels, drop passing noise       | Hook/preprocessing log filtering        | local **−94.2%** on a cargo log, all failures preserved      |
| **CodeAwareCompressor**            | Keep imports/signatures/types, collapse bodies            | Repo-map / outline context              | local **−91%** outline vs whole-file read                    |
| **SearchCompressor**               | `file:line:content`, drop verbose detail                  | Symbol-search retrieval                 | local **−98%** symbol-search vs file read                    |
| **SmartCrusher**                   | JSON arrays → sampled/typed, keep anomalies               | TOON + JSON minification                | local **−34.3%** minify / **−41.2%** TOON                    |
| **HTMLCompressor**                 | Strip tag structure to content                            | markdown-not-HTML, `max_content_tokens` | official pattern + Firecrawl 94%                             |
| **IntelligentContext**             | Score by recency/relevance/error, drop low-value messages | Context editing + compaction            | vendor −84%/+29% (search domain; unproven on code)           |
| **TextCompressor / kompress-base** | ML perplexity-style prose compression                     | LLMLingua family                        | T2 NL only — **the RISKY one for code**                      |

Two rows deserve caution flags, and they are the two that separate headroom from the deterministic RTK:

* **`TextCompressor` / `kompress-base`** is the lossy perplexity-style compressor wearing a trained-model coat. It is the component most likely to drop a load-bearing identifier or caveat, and it runs an auto-downloaded model on every request through the proxy. This is the one place in any of the three tools where an ML model sits in the hot path — which is both headroom's reach advantage (it can compress free-form prose the deterministic tools cannot) and its biggest risk surface.
* **`IntelligentContext`** is vendor-proven only on agentic *search*, never on code. An evicted tool result that turns out to matter 40 turns later is its silent failure.
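In contrast to the two flagged rows, the deterministic rows in the table are easy to reason about because nothing is guessed. As an illustration of the outline lever that `CodeAwareCompressor` productizes, here is a minimal sketch for Python source using the standard-library `ast` module. The helper names `outline` and `_signature` are hypothetical, and this is not headroom's implementation; it handles only top-level functions and one level of class methods.

```python
import ast


def _signature(fn) -> str:
    """Render a def line (name, arguments, return annotation) without its body."""
    prefix = "async def" if isinstance(fn, ast.AsyncFunctionDef) else "def"
    returns = f" -> {ast.unparse(fn.returns)}" if fn.returns else ""
    return f"{prefix} {fn.name}({ast.unparse(fn.args)}){returns}:"


def outline(source: str) -> str:
    """Keep imports and def/class signatures; collapse every body to '...'."""
    tree = ast.parse(source)
    lines = []
    for node in tree.body:
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            lines.append(ast.unparse(node))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            lines.append(_signature(node) + " ...")
        elif isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}:")
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    lines.append("    " + _signature(item) + " ...")
    return "\n".join(lines)
```

Because the transform is a pure function of the parse tree, the same file always yields the same outline, which is the property the ML stage cannot offer.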

## Why `kompress-base` works on hosted Claude when the "real" research compressors do not [#why-kompress-base-works-on-hosted-claude-when-the-real-research-compressors-do-not]

A subtle but decisive design choice: `kompress-base` compresses to **natural-language-ish text** that the hosted model reads normally. This is why it works on a hosted Claude API where the academically more impressive soft-prompt compressors (Gist, ICAE, 500xCompressor, xRAG, PISCO, Cartridges) cannot run at all: those compress into embeddings or KV state that the model must be *trained* to read, and no hosted API exposes that channel. The category insight headroom embodies: **on hosted APIs, only text-to-text compression is usable, and text-to-text compression is inherently lossy.** Headroom accepts that lossiness and mitigates it with reversibility (CCR) rather than pretending it away.

## The cache-safety machinery: live-zone compression [#the-cache-safety-machinery-live-zone-compression]

This is the part of headroom worth the deepest look, because it is the thing that refutes the dossier's earlier blanket verdict that "input compression breaks the cache."

The prior position was correct for *whole-prompt recompression*: a compressor that rewrites the whole prompt every turn mutates the cached prefix and converts cheap 0.1× cache reads back into 1.25–2× cache writes. On the modeled day, such a compressor must clear **\~5.5× compression on a mixed prompt, \~10× on a fully-cacheable prefix, just to break even**. A pre-registered 358-run Claude Sonnet 4.5 RCT (arXiv 2603.23525) found that *moderate* input compression cut cost 27.9% but *aggressive* input compression actually *raised* cost 1.8%, because output expanded; and that study did not even price the cache that such compression also breaks.
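The break-even arithmetic can be made concrete with a simplified model. The sketch below is an assumption-laden illustration, not the dossier's modeled day: it treats every prompt token as either a 0.1× cache read or a 1× uncached input, and assumes the rewritten prompt is billed entirely at one multiplier (`rebilled_mult`, a hypothetical parameter) because the cache no longer matches.

```python
def break_even_ratio(cached_fraction: float,
                     read_mult: float = 0.1,
                     rebilled_mult: float = 1.0) -> float:
    """Compression ratio at which a cache-busting rewrite costs the same
    as leaving the prompt alone.

    cached_fraction: share of prompt tokens currently billed as cache reads.
    read_mult:       price multiplier for a cache read (0.1x in the text).
    rebilled_mult:   multiplier applied to the rewritten prompt; 1.0 assumes
                     plain uncached input, 1.25 assumes a cache write.
    """
    # Cost per original token if nothing is rewritten.
    baseline = cached_fraction * read_mult + (1.0 - cached_fraction) * 1.0
    # After compression by ratio r the cost is rebilled_mult / r per original
    # token; setting that equal to the baseline and solving gives r.
    return rebilled_mult / baseline
```

With a fully cached prefix (`cached_fraction=1.0`) the function returns 10, matching the \~10× figure; billing the rewrite as a 1.25× cache write pushes it to 12.5. Under the same simplification, a cached share near 0.91 gives roughly 5.5×, though the dossier's mixed-prompt model may be built differently.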

Headroom avoids whole-prompt recompression. Its `cache_stabilization` subsystem and `live_zone_anthropic` compression split each request into a **stable prefix** and a **volatile live zone**, and compress only the live zone while keeping the prefix byte-identical. The evidence is in the Rust source, not just the marketing:

```text
   WHOLE-PROMPT RECOMPRESSION (the trap)      LIVE-ZONE COMPRESSION (headroom)
   ─────────────────────────────────────      ────────────────────────────────
   [ prefix | history | new obs ]             [ STABLE PREFIX (untouched) ]
        rewrite the WHOLE thing                    │  byte-identical → 0.1× reads survive
        every turn                                 │
        │                                     [ VOLATILE LIVE ZONE ]
        ▼                                          │  compress ONLY this, once,
   prefix bytes change                             ▼  before it is first cached
        │                                     cache_control breakpoint inserted
        ▼                                          at the stable boundary
   cache BUSTED → 0.1× reads                       │
   become 1.25–2× writes                           ▼
   (must beat ~5.5–10× to win)               cache PRESERVED; only the new
                                              observation's write+reads shrink
```

The subsystem — headroom names the prefix-stabilizing component **CacheAligner** ("extracts dynamic content and moves it to the end of the message, keeping the prefix stable… so the provider's KV cache can reuse previously computed attention states") — is concrete: `volatile_detector.rs` finds the tail, `tool_def_normalize.rs` stabilizes tool definitions, `anthropic_cache_control.rs` inserts breakpoints at stable boundaries, `drift_detector.rs` catches prefix churn, and a suite of `prefix_cache_benchmark.py` / `cache_bust_trace_report.py` tests actively guards against cache-bust regressions. In production this design measurably holds: one independent month-long deployment recorded a **96% prefix-cache-hit rate** while headroom was compressing.
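The core invariant (prefix untouched, tail compressed) is small enough to sketch. The code below is a hedged illustration, not a port of headroom's Rust: `compress_live_zone` and `prefix_bytes` are hypothetical names, message content is assumed to be a plain string, the stable boundary is passed in rather than detected, and breakpoint insertion is omitted.

```python
import copy
import json
from typing import Callable, List


def compress_live_zone(messages: List[dict], stable_count: int,
                       compress: Callable[[str], str]) -> List[dict]:
    """Return a new message list in which the first `stable_count` messages
    are left untouched and only later messages have their content compressed."""
    out = []
    for i, msg in enumerate(messages):
        if i < stable_count:
            out.append(msg)  # same object: the serialized prefix cannot drift
        else:
            new_msg = copy.deepcopy(msg)  # never mutate the caller's message
            new_msg["content"] = compress(msg["content"])
            out.append(new_msg)
    return out


def prefix_bytes(messages: List[dict], stable_count: int) -> bytes:
    """Serialize the stable prefix so two requests can be compared byte for byte."""
    return json.dumps(messages[:stable_count], sort_keys=True).encode("utf-8")
```

A regression guard in the spirit of the cache-bust tests would assert that `prefix_bytes` is equal before and after compression for every request in a recorded trace.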

The catch is that this story is only clean in **MCP and library mode**, and gets risky in **whole-prompt proxy mode in front of Claude Code**:

| Headroom mode                                  | Cache interaction on Claude Code                                                          | Verdict                                        |
| ---------------------------------------------- | ----------------------------------------------------------------------------------------- | ---------------------------------------------- |
| MCP (`headroom_compress` on observations)      | Compresses the tool output *before* it is cached; prefix untouched                        | **Cache-safe** — the recommended way to use it |
| Library (`compress()` on a payload pre-append) | Same as MCP; you control what gets compressed                                             | **Cache-safe**                                 |
| Agent wrapper (`headroom wrap claude`)         | Depends on whether it intercepts as a proxy                                               | **Audit before trusting**                      |
| Whole-prompt proxy in front of Claude Code     | Rewrites traffic Claude Code already caches; can churn the prefix; double-compaction risk | **Cache-risk — do not default**                |

The reasons proxy mode is risky in front of Claude Code: Claude Code *already* stabilizes its own prefix and places `cache_control` breakpoints, so a second stabilizer is redundant at best and can disagree at worst; a proxy that rewrites bodies can silently invalidate the exact prefix Claude Code intended to cache (you simply stop seeing `cache_read`); and Claude Code runs its own compaction, so headroom's independent `IntelligentContext` dropping can double-compact and evict content the client still expects.
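Since a busted prefix shows up only as missing cache reads, a simple way to audit a proxy is to track the cache-read share per response. The sketch below assumes the usage fields reported by Anthropic's Messages API (`input_tokens`, `cache_read_input_tokens`, `cache_creation_input_tokens`); the function names and the `floor` and `window` thresholds are hypothetical choices for illustration.

```python
from typing import List


def cache_read_share(usage: dict) -> float:
    """Fraction of one response's input tokens that were served from cache."""
    read = usage.get("cache_read_input_tokens", 0)
    written = usage.get("cache_creation_input_tokens", 0)
    fresh = usage.get("input_tokens", 0)
    total = read + written + fresh
    return read / total if total else 0.0


def looks_busted(shares: List[float], floor: float = 0.5, window: int = 3) -> bool:
    """True if each of the last `window` turns fell below `floor`.

    A single low turn is normal (a new breakpoint is being written);
    a sustained run of low turns suggests the prefix keeps changing.
    """
    recent = shares[-window:]
    return len(recent) == window and all(s < floor for s in recent)
```

Comparing these shares for the same session with and without the proxy in the path is a direct check of whether the proxy preserves the prefix Claude Code intended to cache.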

## The genuinely new ideas: reversibility, cross-agent memory, failure-mining [#the-genuinely-new-ideas-reversibility-cross-agent-memory-failure-mining]

Beyond productizing known levers, headroom ships four ideas that are new relative to the dossier — and they are the features that have *no equivalent* in caveman or RTK:

* **H1 — Live-zone input compression.** Documented above: the cache-safe input-compression design point that the dossier had said barely existed. Refines the old "no compressor in the hot path" kill to "no *whole-prompt recompressor* in the hot path."
* **H2 — Reversible compression with on-demand retrieval (CCR).** Compressed content is stored verbatim in a CCR store (SQLite/Redis/in-memory backends); the model receives a compressed *view* plus a `headroom_retrieve` tool and can fetch the original within a TTL when it needs full detail. Lossy compression becomes *recoverable* lossy compression — which in principle removes the "confidently-wrong recalled fact" failure mode that makes lossy memory tools risky, *if* the model reliably knows when to retrieve. This is the single biggest architectural advantage headroom has over both RTK (tee on failure only) and caveman (no recovery at all).
* **H3 — Failure-mining into memory files (`headroom learn`).** Analyze past *failed* sessions across Claude/Codex/Gemini and write durable corrections into CLAUDE.md/AGENTS.md, so the always-loaded prefix improves over time instead of repeating mistakes. A closed self-correction loop with no equivalent anywhere else in the trio. Its risk: an auto-written rule that is wrong or over-general is one bad commit that can erase months of savings, so it demands a human gate.
* **H4 — Cross-agent deduplicated shared memory.** A single store shared across Claude, Codex, and Gemini with automatic dedup, so a fact learned in one agent is available once to the others instead of being re-derived per tool. Genuinely useful only for multi-tool operators — which is exactly the niche where it beats caveman's single-agent cavemem.
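The reversibility idea in H2 can be sketched as a keyed store with a TTL plus a compressed view that carries the key. This is an in-memory illustration under stated assumptions, not headroom's CCR: the class `ReversibleStore`, the helper `compressed_view`, the key format, and the truncation-based "compression" are all hypothetical.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple


class ReversibleStore:
    """In-memory stand-in for a store that keeps originals for a limited time."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested deterministically
        self._items: Dict[str, Tuple[float, str]] = {}

    def put(self, original: str) -> str:
        """Store the verbatim original and return a short content-derived key."""
        key = hashlib.sha256(original.encode("utf-8")).hexdigest()[:16]
        self._items[key] = (self._clock() + self._ttl, original)
        return key

    def retrieve(self, key: str) -> Optional[str]:
        """Return the original if it exists and has not expired, else None."""
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, original = entry
        if self._clock() >= expires_at:
            del self._items[key]
            return None
        return original


def compressed_view(original: str, store: ReversibleStore,
                    keep_chars: int = 40) -> str:
    """Store the original, then return a truncated view that names its key."""
    key = store.put(original)
    return f"{original[:keep_chars]}... [truncated; retrieve key={key}]"
```

The sketch also makes the limits visible: after the TTL the original is gone, and nothing in the store can force the model to ask for it.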

## Deployment modes [#deployment-modes]

Headroom is the most deployment-flexible of the three, and the flexibility is real surface area, not marketing:

* **library** — `compress(messages)` in your own code.
* **proxy** — `headroom proxy --port 8787`, rewrites all traffic (the risky mode in front of Claude Code).
* **agent wrapper** — `headroom wrap claude|codex|cursor|aider|copilot`.
* **MCP server** — exposes `headroom_compress` / `headroom_retrieve` / `headroom_stats` (the recommended, cache-safe mode).

It targets Anthropic, OpenAI, Bedrock (with SigV4), and Gemini, and integrates with LangChain, LiteLLM, Agno, Strands, the Vercel AI SDK, and the major coding agents.

## What headroom has, and what it lacks [#what-headroom-has-and-what-it-lacks]

| Feature                                                   | Headroom                                                                       |
| --------------------------------------------------------- | ------------------------------------------------------------------------------ |
| Compresses broad input (tool output, files, RAG, history) | **Yes — the only one of the three that reaches all of these**                  |
| Reaches native-tool reads (not just Bash)                 | **Yes** — acts on the API request, so it sees everything                       |
| Reversible / recoverable compression (CCR)                | **Yes — unique among the three**                                               |
| Cross-agent shared memory with dedup                      | **Yes — unique**                                                               |
| Failure-mining into memory files (`learn`)                | **Yes — unique**                                                               |
| Cache-safe input compression                              | **Yes, in MCP/library mode** (live-zone design)                                |
| Typed, content-aware compressors                          | **Yes** (6 typed compressors + 1 ML stage)                                     |
| Compresses output (what the model writes)                 | Partial — an optional output shaper, off by default; caveman is better at this |
| Touches thinking (20% of dollars)                         | **No**                                                                         |
| Deterministic / no ML in the loop                         | **No** — `kompress-base` is in the hot path                                    |
| Zero host effects                                         | **No** — fetches an ONNX runtime + model over TLS, runs local processes        |
| Cache-safe in proxy mode in front of Claude Code          | **No** — double-stabilization / cache-bust risk                                |
| Independent whole-session benchmark                       | Partial — one independent 47.5% measurement; the rest is vendor self-report    |

## Self-cost (measured, not guessed) [#self-cost-measured-not-guessed]

Headroom is the only one of the three with published latency telemetry, and it is candid. Across 50k+ sessions (v0.5.18), proxy overhead is **P50 52 ms / P90 309 ms / P99 4,172 ms / mean 161 ms**; the internal pipeline runs **16.9 ms median**, of which the `ContentRouter` alone is **11.7 ms (91–98%)** (with `SmartCrusher` at \~50.1 ms and `TextCompressor` at \~32.0 ms on the payloads that actually hit them). A third party measured **+200–500 tokens** of passthrough metadata per request. On top of that come three further costs: MCP schema rent in MCP mode; the auto-downloaded `kompress-base` model as a hot-path attack surface (a compressor in the request path is exactly the integrity boundary a "CompressionAttack" targets); and an offline/SSL asset to provision for sandboxed roles.

The failure modes follow from the machinery: the ML stage can drop an identifier on code; proxy mode can silently bust the cache; and `IntelligentContext` can double-compact against Claude Code — all *reversible via CCR if the model knows to retrieve*, which is the load-bearing "if."

## Evidence and the headline corrections [#evidence-and-the-headline-corrections]

Headroom's numbers are internally consistent and, unusually, honest about the easy-vs-hard split — but they are the maintainer's own, and two corrections matter:

* **"60–95% fewer tokens" is a per-payload ratio, not a whole-bill number.** Headroom's own benchmarks show it: repetitive logs/JSON compress 87–94%, but **grep results and source code compressed 0%** in the published v0.5.18 run ("code passes through to preserve correctness"). The representative mixed figure is **66.1%**. Its own production telemetry settles the whole-session reality: **median 4.8% / P75 6.9% / mean 11.3%**, reaching 40–80% only on heavy tool-use sessions. One independent deploy measured **47.5%** whole-session on a tool-heavy coding session (RAG prose 0%, logs 31%); an HN user reported "\~50%."
* **"96.2% total savings" double-counts caching Claude Code already banks.** That figure multiplies headroom's compression by prompt-caching's 90%-off — but Claude Code already runs maximally cached (the local heavy session measured 92.83% cache reads), so the 90%-off is the floor, not a marginal saving. Headroom's *incremental* lever on Claude Code is the compression fraction on the live zone alone.
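The double-counting in the second correction is easiest to see as arithmetic. The sketch below assumes the headline was formed by multiplying the two discounts, as the bullet describes. The 0.62 compression input is back-solved here from the quoted 96.2% under that model and is not a figure from the source; `live_zone_share` is a hypothetical parameter for the fraction of spend that the live zone represents.

```python
def stacked_savings(compression: float, cache_discount: float = 0.9) -> float:
    """Headline-style figure: compression and caching multiplied together,
    measured against a baseline with no caching at all."""
    return 1.0 - (1.0 - compression) * (1.0 - cache_discount)


def incremental_savings(compression: float, live_zone_share: float) -> float:
    """Saving against a baseline that is already cached: only the
    live-zone share of spend shrinks by the compression fraction."""
    return compression * live_zone_share
```

Under these assumptions `stacked_savings(0.62)` is about 0.962, while the same compression applied to a live zone that is, say, 10% of spend (an illustrative value) yields about 0.062. The two numbers answer different questions, and only the second describes what headroom adds on top of a client that already caches.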

Headroom's evidence tier is **T1 for the mechanisms** (the underlying log/outline/minify/search levers are locally reproduced and even academically backed for the write-time pattern) and **T3-weak for the specific product percentages** (vendor self-report plus one independent measurement). Its full benchmark tables, the H1–H4 records, and the headroom-specific claim graveyard live in the dossier's [headroom chapter](/research/token-optimization/53-headroom-and-context-compression/), with the surrounding market in the [compression-literature chapter](/research/token-optimization/54-context-compression-literature-and-market/).

***

Next: [03 — RTK design](/research/token-optimization-tools/03-rtk-design/), the deterministic mirror of this pipeline — same kinds of transform, no ML, one binary.
