53 — Headroom and the context-compression layer (vs the caveman ecosystem)
53 — Headroom and the context-compression layer (vs the caveman ecosystem)
Volume III deep-dive requested after the code-intelligence sweep (51) and the vector-database follow-up (52): analyze chopratejas/headroom, compare it in depth to the caveman ecosystem the operator already runs, re-sweep the internet for context-compression projects the dossier missed, and fold the result back into the dossier without editing the frozen Volume I/II files. Research conducted 2026-06-15; every external claim carries a source + access date in the ledger; headroom's product numbers are vendor self-reported and are tiered accordingly.
The canonical, deepest cross-tool head-to-head now lives in its own folder: token-optimization tools — equal-depth design teardowns of caveman, headroom, RTK, and lean-ctx, a feature has/lacks matrix, best-case-of-each, and the combinability verdict. This chapter remains the full headroom deep-dive and source ledger that comparison points back to.
TL;DR
- Headroom is the input-side counterpart to caveman, not a competitor to it. Caveman compresses what the model writes (visible prose, ~17% of heavy-session dollars); headroom compresses what the model reads (tool outputs, logs, RAG chunks, files, history — the content that rides the 29% cache-write + 32% cache-read lines = 61% of dollars). They operate on different token classes, stack cleanly, and neither touches thinking (20%) — the dossier's largest unaddressed bucket stays unaddressed.
- Headroom partially refutes the dossier's blanket "input compression breaks the cache" kill (record 19, file 46 FL3) — by design, not by magic. Its Rust
cache_stabilizationsubsystem (anthropic_cache_control.rs,volatile_detector.rs,tool_def_normalize.rs) plus live-zone compression (live_zone_anthropic.rs) compress only the volatile tail and keep the cached prefix byte-identical. This is cache-safe in MCP/library mode (compress an observation once, before it is ever cached) and cache-risky in whole-prompt proxy mode in front of an already-caching Claude Code. - The "60–95% fewer tokens" headline is a per-compressible-payload ratio, not a whole-bill number — the same category error the dossier corrected for caveman (K1). Headroom's own benchmarks show it: repetitive logs/JSON compress 87–94%, but grep results and source code compressed 0% in the published v0.5.18 run ("code passes through to preserve correctness"). The honest whole-bill effect is
compressible-observation share × (write-share + 0.1×read-share)of the 61% bucket — real, bounded, low-double-digit percent at best, not 60–95% of the bill. - "96.2% total savings" double-counts caching Claude Code already banks. That figure multiplies headroom's compression by prompt-caching's 90%-off — but Claude Code already runs maximally cached (dossier K4: caching is the floor, not an available saving). Headroom's incremental lever on Claude Code is the compression fraction on the live zone alone.
- Headroom's own production telemetry settles the headline: median whole-session compression is 4.8%. Across 50,000+ proxy sessions / 250+ instances (Mar–Apr 2026) the vendor reports median 4.8% / P75 6.9% / mean 11.3% whole-session compression, reaching 40–80% only on heavy tool-use sessions; the limitations page says it outright — "Short conversational exchanges (median 4.8% compression)." Two independent hands-on deploys land in the tool-heavy band: Miya-Gadget (2026-06-03) measured 59,742→31,358 tokens (47.5%) with RAG prose compressed 0% and logs only 31%, calling the "95%" claim "oversold"; an HN user reported "~50%." The "60–95%" headline is the per-redundant-payload best case, not the whole-session reality. The headline survives only as a per-payload ratio on redundant JSON/logs (see the benchmarks section). The mechanisms, meanwhile, inherit T1 from the dossier's own local reproductions (log-filter −94.2%, outline −91%, symbol-search −98%, JSON-minify −34.3%/−41.2%). Headroom productizes proven levers; it does not invent a new physics.
- Two ideas in headroom are genuinely new to the dossier: reversible compression with on-demand retrieval (CCR +
headroom_retrieve) answers white-space #8 ("output brevity with quality gates"), andheadroom learn(mine failed sessions → write corrections to CLAUDE.md/AGENTS.md) is a self-improving-memory lever not present anywhere in Volume I–III. - jackin' verdict: pilot headroom's MCP mode as an A/B arm against the levers the dossier already recommends (hook filtering — record 20; code-intelligence outlines — file 51; serialization — record 14), measured on incremental tokens-per-solved-task. Do not default to proxy mode in a jackin' container: it adds a cache-bust risk, a per-request ML model in the hot path, a CompressionAttack surface (file 46 FL3), and a double-compaction conflict with Claude Code's own context management.
What headroom is
| Field | Value |
|---|---|
| Repository | github.com/chopratejas/headroom |
| Pitch | "Compress tool outputs, logs, files, and RAG chunks before they reach the LLM. 60-95% fewer tokens, same answers. Library, proxy, MCP server." |
| Created / latest | 2026-01-07 / v0.25.0 (190 PyPI releases — fast cadence) |
| Adoption | 28,199 stars / 95 watchers / 1,909 forks / 30 contributors / 268 issues (gh api 2026-06-15). Stars are a PR artifact, not adoption: ~87% landed in a 14-day window after a 2026-05-31 Register article + a Trendshift slot; the 95:28,199 watcher:star ratio is ~10× more skewed than a healthy repo, and the maintainer's next-most-starred repo has 25 stars. Treat the star count as noise (file 54 §A). |
| Languages | Python 78% (API/integrations), Rust 17.3% (headroom-core, headroom-proxy — the hot path), TypeScript 2.5% |
| License | Apache-2.0 (permissive; contrast caveman's plugin-skill model) |
| Companion model | chopratejas/kompress-base on HuggingFace — a transformer trained on agentic traces, auto-downloaded, used as the default text compressor |
| Deployment modes | library (compress(messages)), proxy (headroom proxy --port 8787, rewrites all traffic), agent wrapper (headroom wrap claude|codex|cursor|aider|copilot), MCP server (headroom_compress / headroom_retrieve / headroom_stats) |
| Targets | Anthropic, OpenAI, Bedrock, Gemini; LangChain, LiteLLM, Agno, Strands, Vercel AI SDK; Claude Code, Codex, Cursor, Aider, Copilot CLI, OpenClaw |
Headroom is a real engineering effort, not a prompt pack: a Rust core with a compression policy engine, a content router, typed transforms, a reversible store with SQLite/Redis/in-memory backends, and a proxy with provider-specific cache-stabilization and streaming (including Bedrock SigV4). That maturity is the reason it deserves a deeper treatment than a single market-scan row.
The crux: does input compression conflict with prompt caching?
This is the question that decides whether headroom belongs in the recommended stack or the graveyard, because the dossier's strongest standing verdict against input compression is exactly this conflict.
The dossier's prior position. Record 19 (LLMLingua family) and file 46 FL3 establish that a compressor in the request hot path that recompresses the whole prompt fights prompt caching: it mutates the stable prefix every turn, converting 0.1× cache reads back into 1.25–2× cache writes. On the modeled day a cache-breaking compressor must clear ~5.5× compression (≈82%) just to break even on a mixed prompt — and closer to ~10× on a fully-cacheable prefix, because cache reads cost 0.1× (1.0·M = 0.1·N → N/M = 10); 4× compression loses money. Two 2026 results harden this: "Don't Break the Cache" (arXiv 2601.06007) measures prompt caching saving 41–80% of agent cost across Anthropic/OpenAI/Google and prescribes putting dynamic content at the end of the system prompt (exactly what CacheAligner does), and the 358-run Claude Sonnet 4.5 RCT (arXiv 2603.23525) found aggressive input compression raised cost 1.8%. File 46 FL3 adds a security reason (CompressionAttack, arXiv 2510.22963, ≤80% attack-success-rate on prompt-compression modules). The blanket verdict was: keep compressors out of the hot path for coding.
Headroom's answer, verified in source. Headroom does not recompress the whole prompt. It splits the request into a stable prefix and a volatile live zone, and compresses only the live zone while keeping the prefix byte-identical. The evidence is in the code, not just the marketing:
crates/headroom-proxy/src/cache_stabilization/anthropic_cache_control.rs,volatile_detector.rs,tool_def_normalize.rs,drift_detector.rs— a dedicated prefix-stabilization subsystem.crates/headroom-proxy/src/compression/live_zone_anthropic.rs(and OpenAI/Responses twins) — compression scoped to the live zone.benchmarks/prefix_cache_benchmark.py,cache_bust_trace_report.py,synthetic_token_cache_bust_report.py,cache_validation_bundle.py— they actively test for cache-bust regressions.
In headroom's own words (docs, accessed 2026-06-15): CacheAligner "solves this by extracting dynamic content and moving it to the end of the message, keeping the prefix stable," so "the prefix stays byte-identical across requests, so the provider's KV cache can reuse previously computed attention states." For Anthropic it "automatically inserts cache_control breakpoints at the right positions."
What this changes in the dossier. The record-19 / FL3 kill was correct for whole-prompt recompression, but it was stated too broadly. There is a cache-compatible design point, and headroom occupies it: stabilize the prefix, compress only the volatile tail, and do it once before that tail is first cached. A freshly-arriving tool output is going to be a cache write regardless; compressing it before it is written shrinks the write and every subsequent 0.1× read of it, without touching the already-cached prefix. This is the input-side analogue of the CAG pattern (file 46 FL1: compose with caching, don't fight it).
But the mode matters, and the marketing blurs it. The cache-safe story holds cleanly for MCP mode (the agent calls headroom_compress on an observation before it enters context) and library mode (compress a payload before you append it). It is much weaker for whole-prompt proxy mode in front of Claude Code, because:
- Claude Code already stabilizes its own prefix and places
cache_controlbreakpoints automatically. A second stabilizer in the path is redundant at best and can disagree with the client's breakpoints at worst. - A proxy that rewrites message bodies risks invalidating the exact prefix Claude Code intended to cache — the failure mode is silent (you simply stop seeing
cache_read), and the 5.5× break-even applies the moment it happens. - Claude Code runs its own compaction/
/clearhygiene; a proxy doing independent context dropping (IntelligentContext) can double-compact and evict content the client still expects.
| Headroom mode | Cache interaction on Claude Code | Verdict |
|---|---|---|
MCP (headroom_compress on observations) | Compresses the tool output before it is cached; prefix untouched | Cache-safe — the recommended way to use it |
Library (compress() on a payload pre-append) | Same as MCP; you control what gets compressed | Cache-safe |
Agent wrapper (headroom wrap claude) | Depends on whether it intercepts as a proxy; needs JSONL audit | Audit before trusting |
| Whole-prompt proxy in front of Claude Code | Rewrites traffic Claude Code already caches; can churn the prefix; double-compaction risk | Cache-risk — do not default |
The "96.2% total" double-count. Headroom's docs advertise "96.2% total savings" by layering compression on top of provider caching. On a custom SDK app with no caching, that framing is fair. On Claude Code it is the dossier's K4 error: caching's ~90%-off is already banked (the local heavy session measured 92.83% cache reads), so it is not an available marginal saving. Headroom's honest incremental contribution on Claude Code is the compression fraction applied to the live zone, weighted by the write/read split — not 96.2%.
Headroom's compressors map onto levers the dossier already validated
The most important finding for the dossier is that headroom is largely a productization, not a discovery. Each component lines up with an existing record, and the existing record usually carries stronger (locally-reproduced) evidence than headroom's self-report.
| Headroom component | What it does | Dossier lever it productizes | Strongest existing evidence |
|---|---|---|---|
| LogCompressor | Keep errors/stack traces/levels, drop passing noise | Record 20 (preprocessing/hooks filtering) | T1, local −94.2% on a synthetic cargo log, all failures preserved (file 03) |
| CodeAwareCompressor | Keep imports/signatures/types, collapse bodies | Record 16 (aider repo map) + file 12 + file 51 | T1, local −91% outline vs whole-file read (file 51) |
| SearchCompressor | file:line:content, drop verbose detail | File 51 (codedb/fff symbol search) | T1, local −98% symbol-search vs file read (file 51) |
| SmartCrusher | JSON arrays → sampled/typed, keep anomalies | Record 14 (TOON + JSON minification) | T1, local −34.3% minify / −41.2% TOON vs minified (file 03) |
| HTMLCompressor | Strip tag structure to content | Record 20 (markdown-not-HTML, max_content_tokens) | T1 (official pattern) + Firecrawl 94% (T3) |
| IntelligentContext | Score by recency/relevance/error, drop low-value messages | Record 12 (context editing) + record 06 (compaction) | T1 vendor −84%/+29% (search domain; unproven on code) |
| TextCompressor / kompress-base | ML perplexity-style prose compression | Record 19 (LLMLingua family) | T2 NL only; the RISKY one for code — perplexity pruning drops identifiers |
| CacheAligner | Extract volatile content, stabilize prefix, insert cache_control | Record 05/13 (cache hygiene) + file 46 FL1/FL2 | T1 (Anthropic caching mechanics) |
CCR + headroom_retrieve | Store originals, retrieve on demand | White-space #8 + progressive disclosure (record 02/15) | New productization (see H2) |
| Cross-agent memory | Shared, auto-dedup store across Claude/Codex/Gemini | Record 02 (cavemem) + 15 (claude-mem) + file 45 | New cross-agent angle (see H4) |
headroom learn | Mine failed sessions → write CLAUDE.md/AGENTS.md fixes | Nothing in Volume I–III | Genuinely new (see H3) |
The two rows worth caution: TextCompressor/kompress-base is the record-19 lossy compressor wearing a trained-model coat — it is the component most likely to drop a load-bearing identifier or caveat, and it runs an auto-downloaded model on every request through the proxy (latency + an attack surface). IntelligentContext is vendor-proven only on agentic search, never on code; an evicted tool result that mattered 40 turns later is the silent failure.
Benchmarks: what is real and what is self-report
Headroom's numbers are internally consistent and, importantly, honest about the easy-vs-hard split — but they are all the maintainer's own, run on the maintainer's own harness, with no stated tokenizer and no third-party replication.
| Workload (headroom self-report) | Before | After | Saved | Honest reading |
|---|---|---|---|---|
| Code search (100 results) | 17,765 | 1,408 | 92% | Repetitive search results — matches the dossier's −98% symbol-search lever |
| SRE incident debugging | 65,694 | 5,118 | 92% | Logs — matches the dossier's −94.2% log filter |
| GitHub issue triage | 54,174 | 14,761 | 73% | Mixed text + metadata |
| Codebase exploration | 78,502 | 41,254 | 47% | Code-heavy — the hard case, and the number drops by half |
| 6-content-type mix | 23,921 | 8,110 | 66.1% | The most representative single figure |
| v0.5.18: build log (200 lines) | 2,412 B | 148 B | ~94% | Repetitive |
| v0.5.18: grep results (150 hits) | 2,624 B | 2,624 B | 0% | Pass-through |
| v0.5.18: Python source (~480 lines) | 2,958 B | 2,958 B | 0% | "code passes through to preserve correctness" |
Accuracy (headroom self-report, 100-sample tests): GSM8K 0.870 → 0.870 (±0.000); TruthfulQA 0.530 → 0.560 (+0.030); SQuAD v2 97% at 19% compression; BFCL tools 97% at 32% compression; HTML extraction F1 0.919 (recall 0.982) at 94.9% compression on a structured benchmark. The pattern is the dossier's thesis restated by the vendor: accuracy is preserved at low compression on prose/QA, and high compression is only safe on highly-repetitive content. The headline "same answers" is true in the regime where the content was redundant to begin with; it is untested at high compression on code and reasoning.
Headroom's own production telemetry is the most decisive number, and it is the vendor's. Across 50,000+ proxy sessions / 250+ instances (Mar–Apr 2026), headroom's benchmarks page reports median 4.8% / P75 6.9% / mean 11.3% whole-session compression, rising to 40–80% only on heavy tool-use sessions; the limitations page states it plainly — "Short conversational exchanges (median 4.8% compression)." So the vendor itself measures the typical whole-session effect at single digits. That is the per-payload-vs-whole-bill split in the maker's own production data: the 60–95% headline is the best case on redundant JSON/logs, not what a representative session sees.
Independent measurement corroborates on the tool-heavy end. A third party (Miya-Gadget, 2026-06-03) deployed headroom on a real coding session and measured 59,742 → 31,358 tokens = 47.5% overall, broken down as code 79.8%, JSON 59.2%, logs 31.0%, and RAG/prose 0.0% (untouched by default) — concluding the "95% token reduction" marketing "feels oversold," with realistic sessions at ~20–30% and 80%+ only on high-redundancy JSON/logs. An HN user independently reported "~50%." A tool-heavy coding session sits in the 40–80% band; the whole-traffic median sits at 4.8%. The press that drove headroom's visibility — a 2026-05-31 Register piece repeating a vendor "$700K saved / 90% redundant" figure, echoed by ~20 downstream outlets — ran no independent test (file 54 §A).
The whole-bill correction (the K1 move, applied to the input side): "60–95%" is a per-payload ratio on compressible observations. On the modeled heavy day, tool outputs/observations are only part of the 61% cache traffic (the rest is the system prefix, CLAUDE.md, conversation history, and code reads that headroom passes through at ~0%). The realistic whole-bill effect is (compressible-observation share of the 61%) × compression% × (write-share + 0.1×read-share). With most observation tokens already living at the 0.1× read price after first write, the read-side win is worth a tenth of its face value — so even an aggressive deployment lands in the low double digits of the day's dollars, not 60–95% of the bill. That is still a real lever on the largest bucket; it is simply not the headline number.
How headroom compares to caveman and RTK
The full side-by-side comparison — headroom vs the caveman ecosystem vs RTK, the axis-by-axis table, the cache-safety asymmetry (output brevity is cache-neutral; input compression is cache-breaking unless done at write-time on new content), the family-overlap mapping, and the memory either/or — is consolidated as a single source of truth in the dedicated folder, not duplicated here:
- Token-optimization tools — overview + master comparison table
- Head-to-head: the feature has/lacks matrix and best-case-of-each
- Combining: the layered stack, the ecosystem overlap, and the memory either/or
The one structural point worth restating here because it drives headroom's whole design: output brevity is cache-neutral, but input compression is cache-breaking unless it is done at write-time on new content — which is exactly why headroom's live-zone / MCP path (compress a new observation before it is first cached) is cache-safe while its whole-prompt proxy in front of an already-caching Claude is not. The rest of this chapter is the dossier's headroom record: its typed compressors, the live-zone cache machinery, the H1–H4 technique records, the benchmarks, and the source ledger.
Genuinely new techniques (per-technique records)
These use the §10 record schema. Levers headroom merely productizes (log filtering, outlines, minification, context editing) are already recorded in Volume I–III and are not repeated here.
H1. Live-zone input compression — the cache-safe design point record 19 said did not exist
- Coverage-delta: Refines record 19 + file 46 FL3. Volume I/II treated input compression as monolithically cache-hostile; this records the cache-compatible sub-design.
- Layer: input + cache.
- Mechanism: split each request into a stable prefix and a volatile live zone; stabilize the prefix (extract volatile content to a tail, normalize tool definitions, insert
cache_controlat stable boundaries) and compress only the live zone, once, before it is first cached. The cached prefix stays byte-identical, so 0.1× reads survive; the compression shrinks the cache write of the new content and all future reads of it. - Expected savings: on the modeled day,
(compressible-observation share of the 61% cache bucket) × compression% × (write-share + 0.1×read-share). Real on the largest bucket, bounded to low-double-digit % of dollars because most observation tokens already read at 0.1×. NOT the 60–95% per-payload headline (ESTIMATE; arithmetic in the benchmarks section). - Evidence tier: T1 for the mechanism (the underlying log/outline/minify levers are locally reproduced in files 03/51); T3-weak for headroom's specific product numbers (vendor self-report, no independent replication); T2 academic backing for the write-time pattern itself — Squeez (arXiv 2604.04979 — 92% tool-output token removal at 0.86 recall, run as a write-time Unix pipe, cache-safe, code-domain) and AgentDiet (arXiv 2509.23586 — Claude 4 Sonnet 64.5%→66.5% with input −40–60%, the only paper in the class that nets out the compressor's own +5–15% cost, the net-accounting white-space #5 demanded).
- Quality risk: NEUTRAL on rule-based transforms (log/JSON/search/diff), RISKY on the ML text compressor (kompress-base can drop identifiers/caveats — the record-19 failure mode), RISKY in proxy mode (silent cache-bust if the prefix churns). Falsify by A/B on JSONL: confirm
cache_readcontinuity is preserved and tokens-per-solved-task drops net of overhead. - Availability:
CLAUDE-CODE-TODAYvia MCP (headroom_compress) /SDK(library) /GATEWAY-OR-SELF-HOST(proxy). - Effort to adopt: minutes (MCP) to hours (proxy + offline asset provisioning).
- Composability: composes with prompt caching (unlike record 19's LLMLingua) when scoped to the live zone; anti-synergy with proxy-mode-in-front-of-Claude-Code (double-stabilization) and with anything that mutates the prefix.
- Validation protocol: 20 tool-heavy tasks, native vs headroom-MCP; from JSONL require (a)
cache_readratio unchanged or better, (b) tool-result tokens down, (c) task success unchanged, (d) net tokens-per-solved-task down ≥20% after subtracting MCP schema + retrieve round-trips.
H2. Reversible compression with on-demand retrieval (CCR)
- Coverage-delta: New productization of white-space #8 ("output brevity with quality gates") and the progressive-disclosure idea behind record 02/15.
- Layer: input / retrieval.
- Mechanism: compressed content is stored verbatim in a CCR store (SQLite/Redis/in-memory backends in
headroom-core); the model receives a compressed view plus aheadroom_retrievetool and can fetch the original within a TTL when it needs full detail. Lossy compression becomes recoverable lossy compression. - Expected savings: the compression saving of H1, minus the cost of retrievals actually triggered. Net-positive only if retrieval rate is low; each retrieve is a tool-call round-trip (schema + request + the original payload re-entering context).
- Evidence tier: T3 (mechanism shipped and benchmarked by the vendor:
ccr_regression_benchmark.py,adversarial_ccr_tests.py); no independent measurement of net effect. - Quality risk: NEGATIVE-COST in principle (it removes the lossy-memory failure mode that makes cavemem/claude-mem RISKY) — if the model reliably knows when to retrieve. Failure mode: the model trusts a compressed view it should have expanded, or over-retrieves and erases the saving. Falsify by seeding tasks whose answer hinges on a detail that compression dropped; measure retrieve recall and net tokens.
- Availability:
CLAUDE-CODE-TODAY(MCP exposesheadroom_retrieve). - Effort to adopt: minutes (MCP); the store needs a backend choice for persistence.
- Composability: strengthens any lossy input/memory compressor; pairs with cross-agent memory (H4); orthogonal to caching.
- Validation protocol: detail-dependent canary suite (numbers, negations, "don't do X" buried in a compressed payload); require retrieve-or-correct behavior on 10/10 and net-positive tokens.
H3. Failure-mining into memory files (headroom learn)
- Coverage-delta: New — no equivalent in Volume I–III.
- Layer: input (memory) / meta.
- Mechanism: analyze past failed sessions across Claude/Codex/Gemini and write durable corrections into CLAUDE.md/AGENTS.md, so the always-loaded prefix improves over time instead of repeating mistakes. A closed self-correction loop over the memory file the dossier already prices.
- Expected savings: indirect — fewer repeated failures = fewer wasted retry turns (the most expensive waste, since retries pay full thinking + output). No published number; the cost is added prefix mass (record 07 rent: every CLAUDE.md line is cache-read rent on every call) and a risk of bloating the file past the "under 200 lines" guidance.
- Evidence tier: T4 (plausible mechanism, no measured net effect; failure-mining quality unverified).
- Quality risk: RISKY — an auto-written rule that is wrong or over-general is one bad PR that erases months of savings (record 07's failure mode), and unbounded auto-append violates CLAUDE.md slimming. Falsify by reviewing every auto-written rule before commit and replaying the rule-sensitive task set.
- Availability:
CLAUDE-CODE-TODAY(CLI command). - Effort to adopt: minutes to run; ongoing editorial discipline to keep the file lean.
- Composability: feeds record 07 (CLAUDE.md) and the jackin'
[token_policy]idea (file 32); anti-synergy with prefix slimming if left unbounded. - Validation protocol: human-gate every correction; cap the file size; A/B the failure rate on the task class the correction targets, and confirm the added prefix rent is smaller than the retries it prevents.
H4. Cross-agent deduplicated shared memory
- Coverage-delta: Extends record 02 (cavemem) / record 15 (claude-mem) / file 45 (cross-agent portability) with a cross-tool, auto-dedup angle none of them cover.
- Layer: input (memory).
- Mechanism: a single store shared across Claude, Codex, and Gemini, with automatic deduplication, so a fact learned in one agent is available (once) to the others instead of being re-derived per tool.
- Expected savings: unquantified by the vendor; the dossier's standing objection to all memory tools applies — no injection-cost-vs-re-exploration-saved accounting exists (white-space #5).
- Evidence tier: T4 (no net-accounting published, here or upstream).
- Quality risk: RISKY — the cavemem/claude-mem failure mode (stale or wrong recalled facts mislead a session) plus a cross-agent blast radius (a bad memory now corrupts three tools). Reversibility (H2) mitigates but does not remove it. Falsify by quizzing the store against source transcripts and auditing currency.
- Availability:
CLAUDE-CODE-TODAY(MCP/library), genuinely useful only for multi-tool operators. - Effort to adopt: minutes–hours (persistent store).
- Composability: competes with cavemem/claude-mem (pick one); pairs with CCR (H2) for recoverable recall.
- Validation protocol: the week-long memory A/B from record 02, run across two agents, metering the store's own compression/injection calls against re-exploration avoided.
Market delta — other context-compression projects (internet re-sweep)
A clean-room sweep for compression-layer projects the dossier's 03/46/51/52 do not already cover. Code-intelligence retrievers (codedb, fff, Serena, Code Context Engine, Claude Context, Sourcegraph, Augment, Qodo) are in file 51; vector backends are in file 52; this section is the compression/proxy/memory layer specifically. Numbers are vendor self-report unless marked; none is locally reproduced here.
| Project | Category | What it compresses | Claimed saving | Works with Claude Code? | Tier |
|---|---|---|---|---|---|
headroom (chopratejas/headroom) | Compression library + proxy + MCP | Tool outputs, logs, RAG, files, history | 60–95% (per-payload); 66.1% mixed | Yes (MCP/library/proxy) | T3-weak |
LLMLingua / LLMLingua-2 (microsoft/LLMLingua) | Prompt-compression proxy | Whole prompt (perplexity pruning) | up to 20× (NL) | Self-host; cache-hostile | T2 (record 19) |
| CompactPrompt | Prompt compression guide/lib | Prune + abbreviate + quantize data | "up to 60%" | Self-host | T4 (file 46) |
claude-mem (thedotmack/claude-mem) | Cross-session memory | Compressed memory observations | "~10×" retrieval-path | Yes | T3 (record 15) |
cavemem (JuliusBrussee/cavemem) | Compressed memory MCP | caveman-compressed memory | "~75% prose" | Yes | T4 (record 02) |
| Mem0 | Agent memory layer | Extracted/compressed memories | vendor benchmarks | Yes (API) | T4 (dossier K-mem: files-only beat it on LoCoMo) |
This list is the focused compression/memory layer; the broader retrieval and serialization market is covered in files 51, 52, and record 14. The pending internet re-sweep (parallel research streams) augments this table with any additional 2025–2026 compression proxies, observation compressors, or cross-agent memory systems that survive the skeptic pass; load-bearing additions will be merged here with their sources before this file's verdict is treated as final.
The standing pattern from the dossier holds: the compression-layer market crowds the buckets that are easy to demo (prose, memory, repetitive logs) and self-reports per-payload ratios as if they were whole-bill numbers. Headroom is the most serious engineering in the category and the only one with a credible cache-safe design, but it shares the category's two weaknesses — no independent net-accounting, and a hot-path security/latency cost.
Fresh-literature delta
Headroom is a direct test of the literature trends file 46 already tracked, and it sharpens two of them:
- Prefix-stable / cache-aware compression is now shipped, not just hypothesized. File 46 framed CAG (preload-and-cache, FL1) and the LLMLingua cache-conflict (FL3) as the two poles. Headroom's live-zone design is the missing middle: compress the variable tail while preserving the cached prefix. This does not overturn FL3 (whole-prompt recompression is still cache-hostile and still an attack surface) — it bounds it to the proxy-recompression case.
- The security axis (FL3 / CompressionAttack) applies to headroom directly. A compressor in the request path — especially the auto-downloaded
kompress-basemodel in proxy mode — is exactly the integrity boundary CompressionAttack (arXiv 2510.22963, ≤80% ASR) targets. This is a concrete reason to prefer MCP mode (compress specific, agent-chosen observations) over a transparent proxy that compresses everything. - The soft-prompt / learned-compression family (Gist, ICAE, 500xCompressor, LTSC) remains self-host-only for a hosted Claude operator (file 46 D): a frontier hosted model cannot read meta-tokens it was not trained on. Headroom's kompress-base is not in this family — it compresses to natural-language-ish text the hosted model reads normally, which is why it works on hosted Claude where the soft-prompt methods cannot. That is the category insight: on hosted APIs, only text-to-text compression is usable, and it is inherently lossy.
The parallel literature re-sweep (running) will extend this with any 2025–2026 work specifically on cache-aware or reversible compression and any independent benchmark of headroom; findings that change a verdict will be folded in with sources.
Corrections and refinements to prior files
- Refine record 19 / file 46 FL3. Restate the kill precisely: whole-prompt recompression in the hot path fights caching and is an attack surface; live-zone compression that stabilizes the prefix and compresses only the volatile tail is cache-compatible. Headroom is the worked example. The recommendation "no compressor in the hot path" becomes "no whole-prompt recompressor in the hot path; live-zone/observation compression is acceptable when it preserves
cache_readcontinuity, prefers MCP/library over transparent proxy, and is measured net of its own overhead." - Refine file 46 D ("no new lossy compressor both user-reachable on hosted Claude and safe for code"). Superseded by 2026 code-domain results that run as preprocessing on hosted models and raise SWE-bench accuracy: SWEzze/OCD (arXiv 2603.28119 — AST-aware, ~6×, resolution +5.0–9.2% on SWE-bench Verified), SWE-Pruner (arXiv 2601.16746 — Claude Sonnet 4.5 70.6%→72.0%, tokens −23–38%), LongCodeZip (ASE 2025 — training-free, 5.6× with no loss on code). Corrected verdict: do not compress code inside the cached prefix (query-conditional code compression is cache-breaking per instance), but compressing code/tool-output at ingestion of new content is now an evidence-backed, accuracy-neutral-or-positive lever. The Perplexity Paradox (arXiv 2602.15843) explains why naive LLMLingua still fails on code — 86.1% of its failures are NameError from dropped function identities, recovered by deterministic signature injection (+34pp) — which is precisely headroom's CodeAwareCompressor design (keep signatures, drop bodies).
- Refine white-space #8. "Output brevity with quality gates" now has a shipped input-side analogue: reversible compression (CCR) is the quality gate — the model can recover what compression dropped. The white-space item is partially filled on the input side; the output side (compress generation only when a verifier confirms zero loss) is still open.
- Correct the caveman evidence citation (file 03 record 01 / K1). The repo-cited "arXiv 2604.00025, brevity improved accuracy +26 points" is now verified to exist (file 03 flagged it unverified) — but it is a single unaffiliated author, unreviewed, its +26.3pp is on a cherry-picked 7.7% subset where verbose large models self-sabotage, and it tests no Claude model and no code task. Keep the citation; treat it as suggestive NL-only and do not propagate "+26 points" as transferable. The defensible "brevity can improve accuracy" evidence is Chain-of-Draft on Claude 3.5 Sonnet (arXiv 2502.18600: +4.1pp on the sports task at −92% output) — a modest single-digit effect, consistent with the dossier's existing caveman read.
- No change to the 10× verdict. Headroom attacks the 61% cache bucket, which is the right target, but its realistic whole-bill effect is bounded (per the K1-style correction) and it does not touch thinking (20%). The dossier's verdict stands: ≈2.5× defensible, ≈5–6.2× with validated routing, no honest 10× at zero quality loss. Headroom is a strong addition to the Aggressive stack's input layer, not a new multiplier that breaks the wall.
jackin' adoption recommendation
Headroom fits the same role-scoped, opt-in, measured-locally pattern file 51 set for code-intelligence tools, with one extra guardrail because it sits closer to the model.
- Pilot MCP mode, not proxy mode. Register
headroomas an MCP server inside a role container (user scope), exposeheadroom_compress/headroom_retrieve, and have the agent compress large tool outputs/observations on demand. This is the cache-safe path and keeps the compression auditable per call. - Never default the whole-prompt proxy in a jackin' container. It risks busting the cache Claude Code already manages, double-compacts against Claude Code's own context management, puts an auto-downloaded model in the hot path (latency + an offline/SSL-inspection asset to provision), and creates a CompressionAttack surface. If the proxy is evaluated at all, it must be an explicit, isolated experiment with cache-read continuity checked in JSONL.
- A/B against the levers the dossier already banks, not against a naive baseline: hook filtering (record 20), code-intelligence outlines (file 51), and serialization (record 14) already capture most of headroom's compressible wins, cache-safely and with no extra dependency. Headroom earns its place only if it beats that stack net of MCP schema rent and retrieve round-trips.
- Make host effects explicit. Headroom fetches the ONNX runtime and kompress-base over TLS and runs local processes; per the host-write ban, install and cache assets inside the container, and pre-provision the model for offline/sandboxed roles.
- Choose one memory layer. If adopting headroom memory, retire cavemem for that workflow (running both is pure overhead); keep the choice explicit and measured.
Validation harness
Run the same shape as file 51, with cache continuity added as a first-class metric:
| Arm | Tools allowed |
|---|---|
| Native | Claude Code defaults (hooks, Edit-diffs, deferred MCP) |
| Hooks | Native + record-20 grep/markdown filtering |
| Code-intel | Native + file-51 outline/symbol retrieval |
| Headroom-MCP | Native + headroom_compress/headroom_retrieve on observations |
| Headroom-proxy | Native behind the headroom proxy (cache-continuity watch) |
Metrics: tool-result tokens; cache_read ratio and cache-write spikes from JSONL (the make-or-break for any input compressor); retrieve count and retrieve token cost; total tokens per solved task; task success and test pass; wall-clock; MCP schema tokens loaded per turn.
Acceptance rule:
Accept headroom for token optimization only if, versus the Hooks+Code-intel arm:
task/test success >= baseline
cache_read ratio >= baseline (no silent cache-bust)
total tokens per solved task <= baseline by at least 20%
net of MCP schema rent and headroom_retrieve round-tripsPer the dossier's standing rule: a per-payload compression ratio is not a banked saving until it survives this harness on jackin' tasks at equal quality.
Claims to kill (headroom-specific graveyard)
| # | Claim in the wild | Verdict and corrected reading |
|---|---|---|
| H-K1 | "headroom cuts 60–95% of your tokens" | Per-compressible-payload ratio, not whole-bill. Repetitive logs/JSON hit 87–94%; code and grep compressed 0% in v0.5.18 ("passes through to preserve correctness"); the representative mix is 66.1%. Headroom's own production telemetry: median 4.8% / P75 6.9% / mean 11.3% whole-session across 50k+ proxy sessions, 40–80% only on heavy tool-use. Independently measured at 47.5% whole-session on a tool-heavy coding session (Miya-Gadget, 2026-06-03; RAG prose 0%, logs 31%) and "~50%" (HN). Whole-bill effect = compressible-observation share × compression × (write + 0.1×read) — low-double-digit % of dollars, same category as the caveman K1 correction. |
| H-K2 | "96.2% total savings on Anthropic" | Double-counts caching Claude Code already banks (K4). Caching's 90%-off is the floor, not a marginal saving; headroom's incremental lever on Claude Code is the live-zone compression fraction only. |
| H-K3 | "Input compression breaks the cache, so headroom can't help" | Too broad. Whole-prompt recompression breaks the cache (record 19 holds); headroom's live-zone design stabilizes the prefix and compresses only the volatile tail, which is cache-compatible in MCP/library mode. The kill is the proxy-in-front-of-Claude-Code case, not headroom as a whole. |
| H-K4 | "Same answers" (lossless) | Lossless only at low compression on prose/QA, and on rule-based transforms. The ML text compressor and high-compression code paths are lossy; "same answers" is unverified at high compression on code and untested on thinking. Reversibility (CCR) mitigates if the model retrieves when it should. |
| H-K5 | "50–90%" (PyPI) vs "60–95%" (README) | The project's own headline range is inconsistent across surfaces — a sign the number is a marketing band, not a measured constant. Treat any single percentage as directional and measure locally. |
| H-K6 | "Drop it in as a proxy, zero code changes, free win" | In front of Claude Code the proxy is a cache-bust risk, a double-compaction risk, a hot-path latency cost, and an attack surface (FL3). "Zero code changes" is true; "free" is not. |
Source ledger
All accessed 2026-06-15.
The complete consolidated ledger for all three tools together — plus the formal per-technique records and the unverified-claims register — is maintained in the hub: Records, ledger & unverified. The chapter-specific citations are retained below as the original research record.
- headroom repo + README: github.com/chopratejas/headroom
- headroom stats (28,185★ / 1,908 forks / 30 contributors / 268 issues / Apache-2.0 / created 2026-01-07 / v0.25.0):
gh api repos/chopratejas/headroom - headroom source tree (cache_stabilization, live_zone, ccr, transforms, benchmarks):
gh api repos/chopratejas/headroom/git/trees/main?recursive=1;cache_controlin 62 files (gh api search/code) - headroom docs (intro, how-compression-works, proxy, cache-optimization, architecture): headroom-docs.vercel.app/docs and the repo's
docs/content/docs/*.mdx - CacheAligner verbatim ("extracting dynamic content and moving it to the end... prefix stays byte-identical... KV cache can reuse"; auto
cache_control; "96.2% total savings"): headroomdocs/content/docs/cache-optimization.mdx - benchmark numbers (code search 17,765→1,408; SRE 65,694→5,118; triage 54,174→14,761; exploration 78,502→41,254; mix 23,921→8,110; v0.5.18 grep/Python 0%; GSM8K/TruthfulQA/SQuAD/BFCL accuracy): headroom README +
docs/benchmarks.md - kompress-base model (transformer trained on agentic traces, auto-downloaded default text compressor): huggingface.co/chopratejas/kompress-base
- PyPI (190 releases, v0.25.0, requires-python ≥3.10, summary "Cut costs by 50-90%"): pypi.org/project/headroom-ai
- secondary write-ups (tutorial/promotional, all repeating maintainer numbers; explicit "measure on your own workloads" caveat): subratpati.medium.com; alphamatch.ai/blog/headroom-context-compression-ai-agents-2026; andrew.ooo/posts/headroom-context-compression-llm-agents-review; dev.to/arshtechpro
- cross-references (caveman/cavemem/cavekit/cavecrew records, K1/K4, white-space map):
03-prior-art-and-market-scan.md - cache-conflict + CompressionAttack + CAG/FL1/FL3:
46-fresh-literature-and-market-delta.md - log-filter −94.2% / TOON −41.2% / minify −34.3% local reproductions:
03-prior-art-and-market-scan.md - outline −91% / symbol-search −98% local reproductions:
51-code-intelligence-tools.md - fresh-literature sources (write-time compression, code-domain SWE-bench gains, cache break-even, output-brevity dominance, context rot) and the compression-tool market sweep:
54-context-compression-literature-and-market.md. Key inline IDs: Squeez arXiv 2604.04979; AgentDiet arXiv 2509.23586; "Don't Break the Cache" arXiv 2601.06007; Claude 4.5 compression RCT arXiv 2603.23525; SWEzze arXiv 2603.28119; SWE-Pruner arXiv 2601.16746; Perplexity Paradox arXiv 2602.15843; Chain-of-Draft arXiv 2502.18600; brevity-hierarchy arXiv 2604.00025.