# 53 — Headroom and the context-compression layer (vs the caveman ecosystem) (https://jackin.tailrocks.com/research/token-optimization/53-headroom-and-context-compression/)



# 53 — Headroom and the context-compression layer (vs the caveman ecosystem) [#53--headroom-and-the-context-compression-layer-vs-the-caveman-ecosystem]

Volume III deep-dive requested after the code-intelligence sweep (51) and the vector-database follow-up (52): analyze `chopratejas/headroom`, compare it in depth to the caveman ecosystem the operator already runs, re-sweep the internet for context-compression projects the dossier missed, and fold the result back into the dossier without editing the frozen Volume I/II files. Research conducted 2026-06-15; every external claim carries a source + access date in the ledger; headroom's product numbers are vendor self-reported and are tiered accordingly.

<Aside type="note">
  The **canonical, deepest cross-tool head-to-head** now lives in its own folder: [token-optimization tools](/research/token-optimization-tools/) — equal-depth design teardowns of caveman, headroom, RTK, and **lean-ctx**, a feature has/lacks matrix, best-case-of-each, and the combinability verdict. This chapter remains the full headroom deep-dive and source ledger that comparison points back to.
</Aside>

## TL;DR [#tldr]

* **Headroom is the input-side counterpart to caveman, not a competitor to it.** Caveman compresses what the model *writes* (visible prose, \~17% of heavy-session dollars); headroom compresses what the model *reads* (tool outputs, logs, RAG chunks, files, history — the content that rides the **29% cache-write + 32% cache-read lines = 61% of dollars**). They operate on different token classes, stack cleanly, and &#x2A;*neither touches thinking (20%)** — the dossier's largest unaddressed bucket stays unaddressed.
* **Headroom partially refutes the dossier's blanket "input compression breaks the cache" kill (record 19, file 46 FL3) — by design, not by magic.** Its Rust `cache_stabilization` subsystem (`anthropic_cache_control.rs`, `volatile_detector.rs`, `tool_def_normalize.rs`) plus **live-zone compression** (`live_zone_anthropic.rs`) compress only the volatile tail and keep the cached prefix byte-identical. This is **cache-safe in MCP/library mode** (compress an observation once, before it is ever cached) and **cache-risky in whole-prompt proxy mode** in front of an already-caching Claude Code.
* **The "60–95% fewer tokens" headline is a per-compressible-payload ratio, not a whole-bill number** — the same category error the dossier corrected for caveman (K1). Headroom's own benchmarks show it: repetitive logs/JSON compress 87–94%, but &#x2A;*grep results and source code compressed 0%** in the published v0.5.18 run ("code passes through to preserve correctness"). The honest whole-bill effect is `compressible-observation share × (write-share + 0.1×read-share)` of the 61% bucket — real, bounded, low-double-digit percent at best, not 60–95% of the bill.
* **"96.2% total savings" double-counts caching Claude Code already banks.** That figure multiplies headroom's compression by prompt-caching's 90%-off — but Claude Code already runs maximally cached (dossier K4: caching is the floor, not an available saving). Headroom's *incremental* lever on Claude Code is the compression fraction on the live zone alone.
* **Headroom's own production telemetry settles the headline: median whole-session compression is 4.8%.** Across 50,000+ proxy sessions / 250+ instances (Mar–Apr 2026) the vendor reports &#x2A;*median 4.8% / P75 6.9% / mean 11.3%** whole-session compression, reaching 40–80% only on heavy tool-use sessions; the limitations page says it outright — "Short conversational exchanges (median 4.8% compression)." Two independent hands-on deploys land in the tool-heavy band: Miya-Gadget (2026-06-03) measured 59,742→31,358 tokens (&#x2A;*47.5%**) with RAG prose compressed &#x2A;*0%** and logs only 31%, calling the "95%" claim "oversold"; an HN user reported "\~50%." The "60–95%" headline is the per-redundant-payload best case, not the whole-session reality. The headline survives only as a per-payload ratio on redundant JSON/logs (see the benchmarks section). The *mechanisms*, meanwhile, inherit **T1** from the dossier's own local reproductions (log-filter −94.2%, outline −91%, symbol-search −98%, JSON-minify −34.3%/−41.2%). Headroom productizes proven levers; it does not invent a new physics.
* **Two ideas in headroom are genuinely new to the dossier:** reversible compression with on-demand retrieval (CCR + `headroom_retrieve`) answers white-space #8 ("output brevity with quality gates"), and `headroom learn` (mine failed sessions → write corrections to CLAUDE.md/AGENTS.md) is a self-improving-memory lever not present anywhere in Volume I–III.
* **jackin' verdict:** pilot headroom's **MCP mode** as an A/B arm against the levers the dossier already recommends (hook filtering — record 20; code-intelligence outlines — file 51; serialization — record 14), measured on incremental tokens-per-solved-task. **Do not default to proxy mode** in a jackin' container: it adds a cache-bust risk, a per-request ML model in the hot path, a CompressionAttack surface (file 46 FL3), and a double-compaction conflict with Claude Code's own context management.

## What headroom is [#what-headroom-is]

| Field            | Value                                                                                                                                                                                                                                                                                                                                                                                                                      |
| ---------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Repository       | [github.com/chopratejas/headroom](https://github.com/chopratejas/headroom)                                                                                                                                                                                                                                                                                                                                                 |
| Pitch            | "Compress tool outputs, logs, files, and RAG chunks before they reach the LLM. 60-95% fewer tokens, same answers. Library, proxy, MCP server."                                                                                                                                                                                                                                                                             |
| Created / latest | 2026-01-07 / v0.25.0 (190 PyPI releases — fast cadence)                                                                                                                                                                                                                                                                                                                                                                    |
| Adoption         | 28,199 stars / **95 watchers** / 1,909 forks / 30 contributors / 268 issues (gh api 2026-06-15). &#x2A;*Stars are a PR artifact, not adoption:** \~87% landed in a 14-day window after a 2026-05-31 Register article + a Trendshift slot; the 95:28,199 watcher:star ratio is \~10× more skewed than a healthy repo, and the maintainer's next-most-starred repo has 25 stars. Treat the star count as noise (file 54 §A). |
| Languages        | Python 78% (API/integrations), Rust 17.3% (`headroom-core`, `headroom-proxy` — the hot path), TypeScript 2.5%                                                                                                                                                                                                                                                                                                              |
| License          | Apache-2.0 (permissive; contrast caveman's plugin-skill model)                                                                                                                                                                                                                                                                                                                                                             |
| Companion model  | `chopratejas/kompress-base` on HuggingFace — a transformer trained on agentic traces, auto-downloaded, used as the default text compressor                                                                                                                                                                                                                                                                                 |
| Deployment modes | **library** (`compress(messages)`), **proxy** (`headroom proxy --port 8787`, rewrites all traffic), **agent wrapper** (`headroom wrap claude\|codex\|cursor\|aider\|copilot`), **MCP server** (`headroom_compress` / `headroom_retrieve` / `headroom_stats`)                                                                                                                                                               |
| Targets          | Anthropic, OpenAI, Bedrock, Gemini; LangChain, LiteLLM, Agno, Strands, Vercel AI SDK; Claude Code, Codex, Cursor, Aider, Copilot CLI, OpenClaw                                                                                                                                                                                                                                                                             |

Headroom is a real engineering effort, not a prompt pack: a Rust core with a compression policy engine, a content router, typed transforms, a reversible store with SQLite/Redis/in-memory backends, and a proxy with provider-specific cache-stabilization and streaming (including Bedrock SigV4). That maturity is the reason it deserves a deeper treatment than a single market-scan row.

## The crux: does input compression conflict with prompt caching? [#the-crux-does-input-compression-conflict-with-prompt-caching]

This is the question that decides whether headroom belongs in the recommended stack or the graveyard, because the dossier's strongest standing verdict against input compression is exactly this conflict.

**The dossier's prior position.** Record 19 (LLMLingua family) and file 46 FL3 establish that a compressor in the request hot path that recompresses the whole prompt **fights*&#x2A; prompt caching: it mutates the stable prefix every turn, converting 0.1× cache reads back into 1.25–2× cache writes. On the modeled day a cache-breaking compressor must clear **\~5.5× compression (≈82%) just to break even*&#x2A; on a mixed prompt — and closer to **\~10× on a fully-cacheable prefix**, because cache reads cost 0.1× (`1.0·M = 0.1·N → N/M = 10`); 4× compression *loses money*. Two 2026 results harden this: "Don't Break the Cache" (arXiv 2601.06007) measures prompt caching saving 41–80% of agent cost across Anthropic/OpenAI/Google and prescribes putting dynamic content at the *end* of the system prompt (exactly what CacheAligner does), and the 358-run Claude Sonnet 4.5 RCT (arXiv 2603.23525) found aggressive input compression *raised* cost 1.8%. File 46 FL3 adds a security reason (CompressionAttack, arXiv 2510.22963, ≤80% attack-success-rate on prompt-compression modules). The blanket verdict was: keep compressors out of the hot path for coding.

**Headroom's answer, verified in source.** Headroom does not recompress the whole prompt. It splits the request into a stable prefix and a volatile **live zone**, and compresses only the live zone while keeping the prefix byte-identical. The evidence is in the code, not just the marketing:

* `crates/headroom-proxy/src/cache_stabilization/anthropic_cache_control.rs`, `volatile_detector.rs`, `tool_def_normalize.rs`, `drift_detector.rs` — a dedicated prefix-stabilization subsystem.
* `crates/headroom-proxy/src/compression/live_zone_anthropic.rs` (and OpenAI/Responses twins) — compression scoped to the live zone.
* `benchmarks/prefix_cache_benchmark.py`, `cache_bust_trace_report.py`, `synthetic_token_cache_bust_report.py`, `cache_validation_bundle.py` — they actively test for cache-bust regressions.

In headroom's own words (docs, accessed 2026-06-15): **CacheAligner** "solves this by extracting dynamic content and moving it to the end of the message, keeping the prefix stable," so "the prefix stays byte-identical across requests, so the provider's KV cache can reuse previously computed attention states." For Anthropic it "automatically inserts `cache_control` breakpoints at the right positions."

**What this changes in the dossier.** The record-19 / FL3 kill was correct for *whole-prompt recompression*, but it was stated too broadly. There is a cache-compatible design point, and headroom occupies it: &#x2A;*stabilize the prefix, compress only the volatile tail, and do it once before that tail is first cached.** A freshly-arriving tool output is going to be a cache *write* regardless; compressing it before it is written shrinks the write and every subsequent 0.1× read of it, *without* touching the already-cached prefix. This is the input-side analogue of the CAG pattern (file 46 FL1: compose with caching, don't fight it).

**But the mode matters, and the marketing blurs it.** The cache-safe story holds cleanly for **MCP mode** (the agent calls `headroom_compress` on an observation before it enters context) and **library mode** (compress a payload before you append it). It is much weaker for **whole-prompt proxy mode in front of Claude Code**, because:

1. Claude Code *already* stabilizes its own prefix and places `cache_control` breakpoints automatically. A second stabilizer in the path is redundant at best and can disagree with the client's breakpoints at worst.
2. A proxy that rewrites message bodies risks invalidating the exact prefix Claude Code intended to cache — the failure mode is silent (you simply stop seeing `cache_read`), and the 5.5× break-even applies the moment it happens.
3. Claude Code runs its own compaction/`/clear` hygiene; a proxy doing independent context dropping (`IntelligentContext`) can double-compact and evict content the client still expects.

| Headroom mode                                  | Cache interaction on Claude Code                                                          | Verdict                                        |
| ---------------------------------------------- | ----------------------------------------------------------------------------------------- | ---------------------------------------------- |
| MCP (`headroom_compress` on observations)      | Compresses the tool output *before* it is cached; prefix untouched                        | **Cache-safe** — the recommended way to use it |
| Library (`compress()` on a payload pre-append) | Same as MCP; you control what gets compressed                                             | **Cache-safe**                                 |
| Agent wrapper (`headroom wrap claude`)         | Depends on whether it intercepts as a proxy; needs JSONL audit                            | **Audit before trusting**                      |
| Whole-prompt proxy in front of Claude Code     | Rewrites traffic Claude Code already caches; can churn the prefix; double-compaction risk | **Cache-risk — do not default**                |

**The "96.2% total" double-count.** Headroom's docs advertise "96.2% total savings" by layering compression on top of provider caching. On a custom SDK app with no caching, that framing is fair. On Claude Code it is the dossier's K4 error: caching's \~90%-off is *already banked* (the local heavy session measured 92.83% cache reads), so it is not an available marginal saving. Headroom's honest incremental contribution on Claude Code is the **compression fraction applied to the live zone**, weighted by the write/read split — not 96.2%.

## Headroom's compressors map onto levers the dossier already validated [#headrooms-compressors-map-onto-levers-the-dossier-already-validated]

The most important finding for the dossier is that headroom is largely a *productization*, not a discovery. Each component lines up with an existing record, and the existing record usually carries stronger (locally-reproduced) evidence than headroom's self-report.

| Headroom component                 | What it does                                                       | Dossier lever it productizes                           | Strongest existing evidence                                                     |
| ---------------------------------- | ------------------------------------------------------------------ | ------------------------------------------------------ | ------------------------------------------------------------------------------- |
| **LogCompressor**                  | Keep errors/stack traces/levels, drop passing noise                | Record 20 (preprocessing/hooks filtering)              | **T1, local −94.2%** on a synthetic cargo log, all failures preserved (file 03) |
| **CodeAwareCompressor**            | Keep imports/signatures/types, collapse bodies                     | Record 16 (aider repo map) + file 12 + file 51         | **T1, local −91%** outline vs whole-file read (file 51)                         |
| **SearchCompressor**               | `file:line:content`, drop verbose detail                           | File 51 (codedb/fff symbol search)                     | **T1, local −98%** symbol-search vs file read (file 51)                         |
| **SmartCrusher**                   | JSON arrays → sampled/typed, keep anomalies                        | Record 14 (TOON + JSON minification)                   | **T1, local −34.3% minify / −41.2% TOON** vs minified (file 03)                 |
| **HTMLCompressor**                 | Strip tag structure to content                                     | Record 20 (markdown-not-HTML, `max_content_tokens`)    | T1 (official pattern) + Firecrawl 94% (T3)                                      |
| **IntelligentContext**             | Score by recency/relevance/error, drop low-value messages          | Record 12 (context editing) + record 06 (compaction)   | T1 vendor −84%/+29% (search domain; unproven on code)                           |
| **TextCompressor / kompress-base** | ML perplexity-style prose compression                              | Record 19 (LLMLingua family)                           | **T2 NL only; the RISKY one for code** — perplexity pruning drops identifiers   |
| **CacheAligner**                   | Extract volatile content, stabilize prefix, insert `cache_control` | Record 05/13 (cache hygiene) + file 46 FL1/FL2         | T1 (Anthropic caching mechanics)                                                |
| **CCR + `headroom_retrieve`**      | Store originals, retrieve on demand                                | White-space #8 + progressive disclosure (record 02/15) | New productization (see H2)                                                     |
| **Cross-agent memory**             | Shared, auto-dedup store across Claude/Codex/Gemini                | Record 02 (cavemem) + 15 (claude-mem) + file 45        | New cross-agent angle (see H4)                                                  |
| **`headroom learn`**               | Mine failed sessions → write CLAUDE.md/AGENTS.md fixes             | Nothing in Volume I–III                                | **Genuinely new** (see H3)                                                      |

The two rows worth caution: **TextCompressor/kompress-base** is the record-19 lossy compressor wearing a trained-model coat — it is the component most likely to drop a load-bearing identifier or caveat, and it runs an auto-downloaded model on every request through the proxy (latency + an attack surface). **IntelligentContext** is vendor-proven only on agentic *search*, never on code; an evicted tool result that mattered 40 turns later is the silent failure.

## Benchmarks: what is real and what is self-report [#benchmarks-what-is-real-and-what-is-self-report]

Headroom's numbers are internally consistent and, importantly, *honest about the easy-vs-hard split* — but they are all the maintainer's own, run on the maintainer's own harness, with no stated tokenizer and no third-party replication.

| Workload (headroom self-report)      | Before  | After   | Saved     | Honest reading                                                             |
| ------------------------------------ | ------- | ------- | --------- | -------------------------------------------------------------------------- |
| Code search (100 results)            | 17,765  | 1,408   | **92%**   | Repetitive search results — matches the dossier's −98% symbol-search lever |
| SRE incident debugging               | 65,694  | 5,118   | **92%**   | Logs — matches the dossier's −94.2% log filter                             |
| GitHub issue triage                  | 54,174  | 14,761  | **73%**   | Mixed text + metadata                                                      |
| Codebase exploration                 | 78,502  | 41,254  | **47%**   | Code-heavy — the hard case, and the number drops by half                   |
| 6-content-type mix                   | 23,921  | 8,110   | **66.1%** | The most representative single figure                                      |
| v0.5.18: build log (200 lines)       | 2,412 B | 148 B   | **\~94%** | Repetitive                                                                 |
| v0.5.18: grep results (150 hits)     | 2,624 B | 2,624 B | **0%**    | Pass-through                                                               |
| v0.5.18: Python source (\~480 lines) | 2,958 B | 2,958 B | **0%**    | "code passes through to preserve correctness"                              |

Accuracy (headroom self-report, 100-sample tests): GSM8K 0.870 → 0.870 (±0.000); TruthfulQA 0.530 → 0.560 (+0.030); SQuAD v2 97% at &#x2A;*19%** compression; BFCL tools 97% at &#x2A;*32%** compression; HTML extraction F1 0.919 (recall 0.982) at 94.9% compression on a structured benchmark. The pattern is the dossier's thesis restated by the vendor: &#x2A;*accuracy is preserved at low compression on prose/QA, and high compression is only safe on highly-repetitive content.** The headline "same answers" is true in the regime where the content was redundant to begin with; it is untested at high compression on code and reasoning.

**Headroom's own production telemetry is the most decisive number, and it is the vendor's.** Across 50,000+ proxy sessions / 250+ instances (Mar–Apr 2026), headroom's benchmarks page reports &#x2A;*median 4.8% / P75 6.9% / mean 11.3%** whole-session compression, rising to 40–80% only on heavy tool-use sessions; the limitations page states it plainly — "Short conversational exchanges (median 4.8% compression)." So the vendor itself measures the typical whole-session effect at single digits. That is the per-payload-vs-whole-bill split in the maker's own production data: the 60–95% headline is the best case on redundant JSON/logs, not what a representative session sees.

**Independent measurement corroborates on the tool-heavy end.** A third party (Miya-Gadget, 2026-06-03) deployed headroom on a real coding session and measured **59,742 → 31,358 tokens = 47.5% overall**, broken down as code 79.8%, JSON 59.2%, logs 31.0%, and &#x2A;*RAG/prose 0.0% (untouched by default)** — concluding the "95% token reduction" marketing "feels oversold," with realistic sessions at \~20–30% and 80%+ only on high-redundancy JSON/logs. An HN user independently reported "\~50%." A tool-heavy coding session sits in the 40–80% band; the whole-traffic median sits at 4.8%. The press that drove headroom's visibility — a 2026-05-31 Register piece repeating a vendor "$700K saved / 90% redundant" figure, echoed by \~20 downstream outlets — ran no independent test (file 54 §A).

The whole-bill correction (the K1 move, applied to the input side): "60–95%" is a per-payload ratio on compressible observations. On the modeled heavy day, tool outputs/observations are only part of the 61% cache traffic (the rest is the system prefix, CLAUDE.md, conversation history, and code reads that headroom passes through at \~0%). The realistic whole-bill effect is `(compressible-observation share of the 61%) × compression% × (write-share + 0.1×read-share)`. With most observation tokens already living at the 0.1× read price after first write, the read-side win is worth a tenth of its face value — so even an aggressive deployment lands in the low double digits of the day's dollars, not 60–95% of the bill. That is still a real lever on the *largest* bucket; it is simply not the headline number.

## How headroom compares to caveman and RTK [#how-headroom-compares-to-caveman-and-rtk]

The full side-by-side comparison — headroom vs the caveman ecosystem vs RTK, the axis-by-axis table, the cache-safety asymmetry (output brevity is cache-neutral; input compression is cache-breaking unless done at write-time on new content), the family-overlap mapping, and the memory either/or — is consolidated as a **single source of truth** in the dedicated folder, not duplicated here:

* [Token-optimization tools — overview + master comparison table](/research/token-optimization-tools/)
* [Head-to-head: the feature has/lacks matrix and best-case-of-each](/research/token-optimization-tools/05-head-to-head/)
* [Combining: the layered stack, the ecosystem overlap, and the memory either/or](/research/token-optimization-tools/06-combining/)

The one structural point worth restating here because it drives headroom's whole design: **output brevity is cache-neutral, but input compression is cache-breaking unless it is done at write-time on new content** — which is exactly why headroom's live-zone / MCP path (compress a new observation before it is first cached) is cache-safe while its whole-prompt proxy in front of an already-caching Claude is not. The rest of this chapter is the dossier's **headroom record**: its typed compressors, the live-zone cache machinery, the H1–H4 technique records, the benchmarks, and the source ledger.

## Genuinely new techniques (per-technique records) [#genuinely-new-techniques-per-technique-records]

These use the §10 record schema. Levers headroom merely productizes (log filtering, outlines, minification, context editing) are already recorded in Volume I–III and are not repeated here.

### H1. Live-zone input compression — the cache-safe design point record 19 said did not exist [#h1-live-zone-input-compression--the-cache-safe-design-point-record-19-said-did-not-exist]

* **Coverage-delta:** Refines record 19 + file 46 FL3. Volume I/II treated input compression as monolithically cache-hostile; this records the cache-compatible sub-design.
* **Layer:** input + cache.
* **Mechanism:** split each request into a stable prefix and a volatile live zone; stabilize the prefix (extract volatile content to a tail, normalize tool definitions, insert `cache_control` at stable boundaries) and compress only the live zone, once, before it is first cached. The cached prefix stays byte-identical, so 0.1× reads survive; the compression shrinks the cache *write* of the new content and all future reads of it.
* **Expected savings:** on the modeled day, `(compressible-observation share of the 61% cache bucket) × compression% × (write-share + 0.1×read-share)`. Real on the largest bucket, bounded to low-double-digit % of dollars because most observation tokens already read at 0.1×. NOT the 60–95% per-payload headline (ESTIMATE; arithmetic in the benchmarks section).
* **Evidence tier:** T1 for the mechanism (the underlying log/outline/minify levers are locally reproduced in files 03/51); **T3-weak for headroom's specific product numbers** (vendor self-report, no independent replication); **T2 academic backing for the write-time pattern itself** — Squeez (arXiv 2604.04979 — 92% tool-output token removal at 0.86 recall, run as a write-time Unix pipe, cache-safe, code-domain) and AgentDiet (arXiv 2509.23586 — Claude 4 Sonnet 64.5%→66.5% with input −40–60%, the only paper in the class that nets out the compressor's own +5–15% cost, the net-accounting white-space #5 demanded).
* **Quality risk:** **NEUTRAL on rule-based transforms** (log/JSON/search/diff), **RISKY on the ML text compressor** (kompress-base can drop identifiers/caveats — the record-19 failure mode), **RISKY in proxy mode** (silent cache-bust if the prefix churns). Falsify by A/B on JSONL: confirm `cache_read` continuity is preserved and tokens-per-solved-task drops net of overhead.
* **Availability:** `CLAUDE-CODE-TODAY` via MCP (`headroom_compress`) / `SDK` (library) / `GATEWAY-OR-SELF-HOST` (proxy).
* **Effort to adopt:** minutes (MCP) to hours (proxy + offline asset provisioning).
* **Composability:** composes *with* prompt caching (unlike record 19's LLMLingua) when scoped to the live zone; anti-synergy with proxy-mode-in-front-of-Claude-Code (double-stabilization) and with anything that mutates the prefix.
* **Validation protocol:** 20 tool-heavy tasks, native vs headroom-MCP; from JSONL require (a) `cache_read` ratio unchanged or better, (b) tool-result tokens down, (c) task success unchanged, (d) net tokens-per-solved-task down ≥20% after subtracting MCP schema + retrieve round-trips.

### H2. Reversible compression with on-demand retrieval (CCR) [#h2-reversible-compression-with-on-demand-retrieval-ccr]

* **Coverage-delta:** New productization of white-space #8 ("output brevity with quality gates") and the progressive-disclosure idea behind record 02/15.
* **Layer:** input / retrieval.
* **Mechanism:** compressed content is stored verbatim in a CCR store (SQLite/Redis/in-memory backends in `headroom-core`); the model receives a compressed view plus a `headroom_retrieve` tool and can fetch the original within a TTL when it needs full detail. Lossy compression becomes *recoverable* lossy compression.
* **Expected savings:** the compression saving of H1, minus the cost of retrievals actually triggered. Net-positive only if retrieval rate is low; each retrieve is a tool-call round-trip (schema + request + the original payload re-entering context).
* **Evidence tier:** T3 (mechanism shipped and benchmarked by the vendor: `ccr_regression_benchmark.py`, `adversarial_ccr_tests.py`); no independent measurement of net effect.
* **Quality risk:** **NEGATIVE-COST in principle** (it removes the lossy-memory failure mode that makes cavemem/claude-mem RISKY) — *if* the model reliably knows when to retrieve. Failure mode: the model trusts a compressed view it should have expanded, or over-retrieves and erases the saving. Falsify by seeding tasks whose answer hinges on a detail that compression dropped; measure retrieve recall and net tokens.
* **Availability:** `CLAUDE-CODE-TODAY` (MCP exposes `headroom_retrieve`).
* **Effort to adopt:** minutes (MCP); the store needs a backend choice for persistence.
* **Composability:** strengthens any lossy input/memory compressor; pairs with cross-agent memory (H4); orthogonal to caching.
* **Validation protocol:** detail-dependent canary suite (numbers, negations, "don't do X" buried in a compressed payload); require retrieve-or-correct behavior on 10/10 and net-positive tokens.

### H3. Failure-mining into memory files (`headroom learn`) [#h3-failure-mining-into-memory-files-headroom-learn]

* **Coverage-delta:** New — no equivalent in Volume I–III.
* **Layer:** input (memory) / meta.
* **Mechanism:** analyze past *failed* sessions across Claude/Codex/Gemini and write durable corrections into CLAUDE.md/AGENTS.md, so the always-loaded prefix improves over time instead of repeating mistakes. A closed self-correction loop over the memory file the dossier already prices.
* **Expected savings:** indirect — fewer repeated failures = fewer wasted retry turns (the most expensive waste, since retries pay full thinking + output). No published number; the *cost* is added prefix mass (record 07 rent: every CLAUDE.md line is cache-read rent on every call) and a risk of bloating the file past the "under 200 lines" guidance.
* **Evidence tier:** T4 (plausible mechanism, no measured net effect; failure-mining quality unverified).
* **Quality risk:** **RISKY** — an auto-written rule that is wrong or over-general is one bad PR that erases months of savings (record 07's failure mode), and unbounded auto-append violates CLAUDE.md slimming. Falsify by reviewing every auto-written rule before commit and replaying the rule-sensitive task set.
* **Availability:** `CLAUDE-CODE-TODAY` (CLI command).
* **Effort to adopt:** minutes to run; ongoing editorial discipline to keep the file lean.
* **Composability:** feeds record 07 (CLAUDE.md) and the jackin' `[token_policy]` idea (file 32); anti-synergy with prefix slimming if left unbounded.
* **Validation protocol:** human-gate every correction; cap the file size; A/B the failure rate on the task class the correction targets, and confirm the added prefix rent is smaller than the retries it prevents.

### H4. Cross-agent deduplicated shared memory [#h4-cross-agent-deduplicated-shared-memory]

* **Coverage-delta:** Extends record 02 (cavemem) / record 15 (claude-mem) / file 45 (cross-agent portability) with a cross-tool, auto-dedup angle none of them cover.
* **Layer:** input (memory).
* **Mechanism:** a single store shared across Claude, Codex, and Gemini, with automatic deduplication, so a fact learned in one agent is available (once) to the others instead of being re-derived per tool.
* **Expected savings:** unquantified by the vendor; the dossier's standing objection to all memory tools applies — no injection-cost-vs-re-exploration-saved accounting exists (white-space #5).
* **Evidence tier:** T4 (no net-accounting published, here or upstream).
* **Quality risk:** **RISKY** — the cavemem/claude-mem failure mode (stale or wrong recalled facts mislead a session) plus a cross-agent blast radius (a bad memory now corrupts three tools). Reversibility (H2) mitigates but does not remove it. Falsify by quizzing the store against source transcripts and auditing currency.
* **Availability:** `CLAUDE-CODE-TODAY` (MCP/library), genuinely useful only for multi-tool operators.
* **Effort to adopt:** minutes–hours (persistent store).
* **Composability:** competes with cavemem/claude-mem (pick one); pairs with CCR (H2) for recoverable recall.
* **Validation protocol:** the week-long memory A/B from record 02, run across two agents, metering the store's own compression/injection calls against re-exploration avoided.

## Market delta — other context-compression projects (internet re-sweep) [#market-delta--other-context-compression-projects-internet-re-sweep]

A clean-room sweep for compression-layer projects the dossier's 03/46/51/52 do not already cover. Code-intelligence retrievers (codedb, fff, Serena, Code Context Engine, Claude Context, Sourcegraph, Augment, Qodo) are in file 51; vector backends are in file 52; this section is the *compression/proxy/memory* layer specifically. Numbers are vendor self-report unless marked; none is locally reproduced here.

| Project                                             | Category                          | What it compresses                      | Claimed saving                    | Works with Claude Code?  | Tier                                             |
| --------------------------------------------------- | --------------------------------- | --------------------------------------- | --------------------------------- | ------------------------ | ------------------------------------------------ |
| **headroom** (`chopratejas/headroom`)               | Compression library + proxy + MCP | Tool outputs, logs, RAG, files, history | 60–95% (per-payload); 66.1% mixed | Yes (MCP/library/proxy)  | T3-weak                                          |
| **LLMLingua / LLMLingua-2** (`microsoft/LLMLingua`) | Prompt-compression proxy          | Whole prompt (perplexity pruning)       | up to 20× (NL)                    | Self-host; cache-hostile | T2 (record 19)                                   |
| **CompactPrompt**                                   | Prompt compression guide/lib      | Prune + abbreviate + quantize data      | "up to 60%"                       | Self-host                | T4 (file 46)                                     |
| **claude-mem** (`thedotmack/claude-mem`)            | Cross-session memory              | Compressed memory observations          | "\~10×" retrieval-path            | Yes                      | T3 (record 15)                                   |
| **cavemem** (`JuliusBrussee/cavemem`)               | Compressed memory MCP             | caveman-compressed memory               | "\~75% prose"                     | Yes                      | T4 (record 02)                                   |
| **Mem0**                                            | Agent memory layer                | Extracted/compressed memories           | vendor benchmarks                 | Yes (API)                | T4 (dossier K-mem: files-only beat it on LoCoMo) |

This list is the focused compression/memory layer; the broader retrieval and serialization market is covered in files 51, 52, and record 14. The pending internet re-sweep (parallel research streams) augments this table with any additional 2025–2026 compression proxies, observation compressors, or cross-agent memory systems that survive the skeptic pass; load-bearing additions will be merged here with their sources before this file's verdict is treated as final.

The standing pattern from the dossier holds: &#x2A;*the compression-layer market crowds the buckets that are easy to demo (prose, memory, repetitive logs) and self-reports per-payload ratios as if they were whole-bill numbers.** Headroom is the most serious engineering in the category and the only one with a credible cache-safe design, but it shares the category's two weaknesses — no independent net-accounting, and a hot-path security/latency cost.

## Fresh-literature delta [#fresh-literature-delta]

Headroom is a direct test of the literature trends file 46 already tracked, and it sharpens two of them:

* **Prefix-stable / cache-aware compression is now shipped, not just hypothesized.** File 46 framed CAG (preload-and-cache, FL1) and the LLMLingua cache-conflict (FL3) as the two poles. Headroom's live-zone design is the missing middle: compress the variable tail while preserving the cached prefix. This does not overturn FL3 (whole-prompt recompression is still cache-hostile and still an attack surface) — it bounds it to the proxy-recompression case.
* **The security axis (FL3 / CompressionAttack) applies to headroom directly.** A compressor in the request path — especially the auto-downloaded `kompress-base` model in proxy mode — is exactly the integrity boundary CompressionAttack (arXiv 2510.22963, ≤80% ASR) targets. This is a concrete reason to prefer MCP mode (compress specific, agent-chosen observations) over a transparent proxy that compresses everything.
* **The soft-prompt / learned-compression family (Gist, ICAE, 500xCompressor, LTSC) remains self-host-only** for a hosted Claude operator (file 46 D): a frontier hosted model cannot read meta-tokens it was not trained on. Headroom's kompress-base is *not* in this family — it compresses to natural-language-ish text the hosted model reads normally, which is why it works on hosted Claude where the soft-prompt methods cannot. That is the category insight: on hosted APIs, only *text-to-text* compression is usable, and it is inherently lossy.

The parallel literature re-sweep (running) will extend this with any 2025–2026 work specifically on cache-aware or reversible compression and any independent benchmark of headroom; findings that change a verdict will be folded in with sources.

## Corrections and refinements to prior files [#corrections-and-refinements-to-prior-files]

* **Refine record 19 / file 46 FL3.** Restate the kill precisely: *whole-prompt* recompression in the hot path fights caching and is an attack surface; *live-zone* compression that stabilizes the prefix and compresses only the volatile tail is cache-compatible. Headroom is the worked example. The recommendation "no compressor in the hot path" becomes "no *whole-prompt recompressor* in the hot path; live-zone/observation compression is acceptable when it preserves `cache_read` continuity, prefers MCP/library over transparent proxy, and is measured net of its own overhead."
* **Refine file 46 D ("no new lossy compressor both user-reachable on hosted Claude and safe for code").** Superseded by 2026 code-domain results that run as preprocessing on hosted models and *raise* SWE-bench accuracy: SWEzze/OCD (arXiv 2603.28119 — AST-aware, \~6×, resolution +5.0–9.2% on SWE-bench Verified), SWE-Pruner (arXiv 2601.16746 — Claude Sonnet 4.5 70.6%→72.0%, tokens −23–38%), LongCodeZip (ASE 2025 — training-free, 5.6× with no loss on code). Corrected verdict: do not compress code *inside the cached prefix* (query-conditional code compression is cache-breaking per instance), but compressing code/tool-output *at ingestion of new content* is now an evidence-backed, accuracy-neutral-or-positive lever. The Perplexity Paradox (arXiv 2602.15843) explains why naive LLMLingua still fails on code — 86.1% of its failures are NameError from dropped function identities, recovered by deterministic signature injection (+34pp) — which is precisely headroom's CodeAwareCompressor design (keep signatures, drop bodies).
* **Refine white-space #8.** "Output brevity with quality gates" now has a shipped input-side analogue: reversible compression (CCR) is the quality gate — the model can recover what compression dropped. The white-space item is partially filled on the input side; the output side (compress *generation* only when a verifier confirms zero loss) is still open.
* **Correct the caveman evidence citation (file 03 record 01 / K1).** The repo-cited "arXiv 2604.00025, brevity improved accuracy +26 points" is now verified to *exist* (file 03 flagged it unverified) — but it is a single unaffiliated author, unreviewed, its +26.3pp is on a cherry-picked 7.7% subset where verbose large models self-sabotage, and it tests no Claude model and no code task. Keep the citation; treat it as suggestive NL-only and do **not** propagate "+26 points" as transferable. The defensible "brevity can improve accuracy" evidence is Chain-of-Draft on Claude 3.5 Sonnet (arXiv 2502.18600: +4.1pp on the sports task at −92% output) — a modest single-digit effect, consistent with the dossier's existing caveman read.
* **No change to the 10× verdict.** Headroom attacks the 61% cache bucket, which is the right target, but its realistic whole-bill effect is bounded (per the K1-style correction) and it does not touch thinking (20%). The dossier's verdict stands: ≈2.5× defensible, ≈5–6.2× with validated routing, no honest 10× at zero quality loss. Headroom is a strong *addition to the Aggressive stack's input layer*, not a new multiplier that breaks the wall.

## jackin' adoption recommendation [#jackin-adoption-recommendation]

Headroom fits the same role-scoped, opt-in, measured-locally pattern file 51 set for code-intelligence tools, with one extra guardrail because it sits closer to the model.

1. **Pilot MCP mode, not proxy mode.** Register `headroom` as an MCP server inside a role container (user scope), expose `headroom_compress` / `headroom_retrieve`, and have the agent compress large tool outputs/observations on demand. This is the cache-safe path and keeps the compression auditable per call.
2. **Never default the whole-prompt proxy in a jackin' container.** It risks busting the cache Claude Code already manages, double-compacts against Claude Code's own context management, puts an auto-downloaded model in the hot path (latency + an offline/SSL-inspection asset to provision), and creates a CompressionAttack surface. If the proxy is evaluated at all, it must be an explicit, isolated experiment with cache-read continuity checked in JSONL.
3. **A/B against the levers the dossier already banks**, not against a naive baseline: hook filtering (record 20), code-intelligence outlines (file 51), and serialization (record 14) already capture most of headroom's compressible wins, cache-safely and with no extra dependency. Headroom earns its place only if it beats that stack net of MCP schema rent and retrieve round-trips.
4. **Make host effects explicit.** Headroom fetches the ONNX runtime and kompress-base over TLS and runs local processes; per the host-write ban, install and cache assets inside the container, and pre-provision the model for offline/sandboxed roles.
5. **Choose one memory layer.** If adopting headroom memory, retire cavemem for that workflow (running both is pure overhead); keep the choice explicit and measured.

### Validation harness [#validation-harness]

Run the same shape as file 51, with cache continuity added as a first-class metric:

| Arm            | Tools allowed                                                    |
| -------------- | ---------------------------------------------------------------- |
| Native         | Claude Code defaults (hooks, Edit-diffs, deferred MCP)           |
| Hooks          | Native + record-20 grep/markdown filtering                       |
| Code-intel     | Native + file-51 outline/symbol retrieval                        |
| Headroom-MCP   | Native + `headroom_compress`/`headroom_retrieve` on observations |
| Headroom-proxy | Native behind the headroom proxy (cache-continuity watch)        |

Metrics: tool-result tokens; **`cache_read` ratio and cache-write spikes from JSONL** (the make-or-break for any input compressor); retrieve count and retrieve token cost; total tokens per solved task; task success and test pass; wall-clock; MCP schema tokens loaded per turn.

Acceptance rule:

```text
Accept headroom for token optimization only if, versus the Hooks+Code-intel arm:
  task/test success            >= baseline
  cache_read ratio             >= baseline (no silent cache-bust)
  total tokens per solved task <= baseline by at least 20%
  net of MCP schema rent and headroom_retrieve round-trips
```

Per the dossier's standing rule: a per-payload compression ratio is not a banked saving until it survives this harness on jackin' tasks at equal quality.

## Claims to kill (headroom-specific graveyard) [#claims-to-kill-headroom-specific-graveyard]

| #    | Claim in the wild                                            | Verdict and corrected reading                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| ---- | ------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| H-K1 | "headroom cuts 60–95% of your tokens"                        | Per-compressible-payload ratio, not whole-bill. Repetitive logs/JSON hit 87–94%; code and grep compressed &#x2A;*0%** in v0.5.18 ("passes through to preserve correctness"); the representative mix is 66.1%. **Headroom's own production telemetry: median 4.8% / P75 6.9% / mean 11.3% whole-session** across 50k+ proxy sessions, 40–80% only on heavy tool-use. **Independently measured at 47.5% whole-session** on a tool-heavy coding session (Miya-Gadget, 2026-06-03; RAG prose 0%, logs 31%) and "\~50%" (HN). Whole-bill effect = compressible-observation share × compression × (write + 0.1×read) — low-double-digit % of dollars, same category as the caveman K1 correction. |
| H-K2 | "96.2% total savings on Anthropic"                           | Double-counts caching Claude Code already banks (K4). Caching's 90%-off is the floor, not a marginal saving; headroom's incremental lever on Claude Code is the live-zone compression fraction only.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| H-K3 | "Input compression breaks the cache, so headroom can't help" | Too broad. Whole-prompt recompression breaks the cache (record 19 holds); headroom's live-zone design stabilizes the prefix and compresses only the volatile tail, which is cache-compatible in MCP/library mode. The kill is the *proxy-in-front-of-Claude-Code* case, not headroom as a whole.                                                                                                                                                                                                                                                                                                                                                                                            |
| H-K4 | "Same answers" (lossless)                                    | Lossless only at low compression on prose/QA, and on rule-based transforms. The ML text compressor and high-compression code paths are lossy; "same answers" is unverified at high compression on code and untested on thinking. Reversibility (CCR) mitigates *if* the model retrieves when it should.                                                                                                                                                                                                                                                                                                                                                                                     |
| H-K5 | "50–90%" (PyPI) vs "60–95%" (README)                         | The project's own headline range is inconsistent across surfaces — a sign the number is a marketing band, not a measured constant. Treat any single percentage as directional and measure locally.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
| H-K6 | "Drop it in as a proxy, zero code changes, free win"         | In front of Claude Code the proxy is a cache-bust risk, a double-compaction risk, a hot-path latency cost, and an attack surface (FL3). "Zero code changes" is true; "free" is not.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |

## Source ledger [#source-ledger]

All accessed 2026-06-15.

> The **complete consolidated ledger** for all three tools together — plus the formal per-technique records and the unverified-claims register — is maintained in the hub: [Records, ledger & unverified](/research/token-optimization-tools/08-records-ledger-and-unverified/). The chapter-specific citations are retained below as the original research record.

* headroom repo + README: [github.com/chopratejas/headroom](https://github.com/chopratejas/headroom)
* headroom stats (28,185★ / 1,908 forks / 30 contributors / 268 issues / Apache-2.0 / created 2026-01-07 / v0.25.0): `gh api repos/chopratejas/headroom`
* headroom source tree (cache\_stabilization, live\_zone, ccr, transforms, benchmarks): `gh api repos/chopratejas/headroom/git/trees/main?recursive=1`; `cache_control` in 62 files (`gh api search/code`)
* headroom docs (intro, how-compression-works, proxy, cache-optimization, architecture): [headroom-docs.vercel.app/docs](https://headroom-docs.vercel.app/docs) and the repo's `docs/content/docs/*.mdx`
* CacheAligner verbatim ("extracting dynamic content and moving it to the end... prefix stays byte-identical... KV cache can reuse"; auto `cache_control`; "96.2% total savings"): headroom `docs/content/docs/cache-optimization.mdx`
* benchmark numbers (code search 17,765→1,408; SRE 65,694→5,118; triage 54,174→14,761; exploration 78,502→41,254; mix 23,921→8,110; v0.5.18 grep/Python 0%; GSM8K/TruthfulQA/SQuAD/BFCL accuracy): headroom README + `docs/benchmarks.md`
* kompress-base model (transformer trained on agentic traces, auto-downloaded default text compressor): [huggingface.co/chopratejas/kompress-base](https://huggingface.co/chopratejas/kompress-base)
* PyPI (190 releases, v0.25.0, requires-python ≥3.10, summary "Cut costs by 50-90%"): [pypi.org/project/headroom-ai](https://pypi.org/project/headroom-ai/)
* secondary write-ups (tutorial/promotional, all repeating maintainer numbers; explicit "measure on your own workloads" caveat): subratpati.medium.com; alphamatch.ai/blog/headroom-context-compression-ai-agents-2026; andrew\.ooo/posts/headroom-context-compression-llm-agents-review; dev.to/arshtechpro
* cross-references (caveman/cavemem/cavekit/cavecrew records, K1/K4, white-space map): [`03-prior-art-and-market-scan.md`](/research/token-optimization/03-prior-art-and-market-scan/)
* cache-conflict + CompressionAttack + CAG/FL1/FL3: [`46-fresh-literature-and-market-delta.md`](/research/token-optimization/46-fresh-literature-and-market-delta/)
* log-filter −94.2% / TOON −41.2% / minify −34.3% local reproductions: [`03-prior-art-and-market-scan.md`](/research/token-optimization/03-prior-art-and-market-scan/)
* outline −91% / symbol-search −98% local reproductions: [`51-code-intelligence-tools.md`](/research/token-optimization/51-code-intelligence-tools/)
* fresh-literature sources (write-time compression, code-domain SWE-bench gains, cache break-even, output-brevity dominance, context rot) and the compression-tool market sweep: [`54-context-compression-literature-and-market.md`](/research/token-optimization/54-context-compression-literature-and-market/). Key inline IDs: Squeez arXiv 2604.04979; AgentDiet arXiv 2509.23586; "Don't Break the Cache" arXiv 2601.06007; Claude 4.5 compression RCT arXiv 2603.23525; SWEzze arXiv 2603.28119; SWE-Pruner arXiv 2601.16746; Perplexity Paradox arXiv 2602.15843; Chain-of-Draft arXiv 2502.18600; brevity-hierarchy arXiv 2604.00025.
