# 09 — Gaps, open questions, and the next-research brief (https://jackin.tailrocks.com/research/token-optimization-tools/09-gaps-open-questions-and-next-brief/)



# 09 — Gaps, open questions, and the next-research brief [#09--gaps-open-questions-and-the-next-research-brief]

This page is the deliberate "what did we miss?" pass over the hub. Even with the comparison consolidated and the records preserved, an honest research artifact names its own blind spots. This page collects: (1) the dimensions the hub has *not* compared, (2) capabilities surfaced by a fresh web sweep that the earlier snapshot understated or got wrong, (3) the questions that remain unanswered even though the topic looks covered, and (4) a ready-to-run `/goal` brief for the next round. A fresh web sweep was run 2026-06-20; everything new is tiered and the unverified parts are marked.

<Aside type="note">
  **Biggest miss of all, now closed: an entire fourth tool.** When this hub was a three-way comparison it omitted **lean-ctx** — the integrated context runtime that is arguably the most ambitious tool in the category and the only one that ships the persistent code-graph lever the three-way said *no one* had. It is now added ([04 — lean-ctx design](/research/token-optimization-tools/04-leanctx-design/)) and folded into every matrix, with a from-source build + benchmark on [page 10](/research/token-optimization-tools/10-first-party-measurements/). That this category-defining tool was missed for an entire research round is the strongest argument for this page's existence — and for keeping the comparison's scope under active review (which adjacent tools below are *next*). The misses recorded below are the historical record of earlier rounds; lean-ctx-specific open questions are in §C.
</Aside>

## TL;DR — the biggest misses (historical record) [#tldr--the-biggest-misses-historical-record]

* **We have zero first-party numbers.** The validation harness (page 06) is designed but has never been *run*. Every percentage in this hub — caveman 58.5%, headroom 47.5%/4.8%, RTK 60–90% — is vendor or community self-report on a non-Claude tokenizer. The single highest-value next step is to run the harness on this repo and produce the first controlled measurement.
* **RTK's reach was under-stated, and it has moved.** RTK ships `rtk read` / `rtk grep` / `rtk find` / `rtk diff` wrappers (with `-l minimal|aggressive` levels) — so if the agent is *steered* to prefer them (via `AGENTS.md`/`RTK.md`/hook), RTK reaches file reads and searches, not just incidental Bash. The hard limit is narrower than "Bash-only": RTK cannot intercept the *native* `Read`/`Grep`, but it offers its own compressed equivalents. (T1, rtk-ai.app, 2026-06-20.)
* **A whole axis is uncompared: security, privacy, and supply-chain.** Data egress (headroom proxy sees all traffic), supply-chain (headroom auto-downloads an ML model; caveman writes hooks into `~/.claude`; RTK ships a binary), injection surface, the RTK permission-bypass issue, and the CompressionAttack surface are scattered across pages but never set side by side.
* **Project health and business model are uncompared.** RTK now has a commercial arm — **RTK Cloud** (team cost analytics, "Free for open-source, Teams from $15/dev/month," waitlist) — while caveman and headroom are solo-maintainer open source. Sustainability, cadence, and bus-factor are decision factors we never weighed.
* **Interaction with Claude Code's own context management is unmapped.** Microcompact, `/compact`, auto-compaction, and context-editing all touch the same tokens these tools do; only headroom's double-compaction risk was noted. There is no conflict-or-compose matrix.

## A. Capabilities the snapshot understated or got wrong (fresh sweep, 2026-06-20) [#a-capabilities-the-snapshot-understated-or-got-wrong-fresh-sweep-2026-06-20]

| Finding                     | What the hub said                                           | Corrected / added                                                                                                                                                                                                                                       | Tier                                       |
| --------------------------- | ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------ |
| RTK file/search wrappers    | Framed reach as "Bash-only; native `Read`/`Grep` bypass it" | RTK ships `rtk read`/`grep`/`find`/`diff` with `-l` compression levels; steer the agent to them and RTK *does* reach reads/searches. Native-tool interception is still impossible; the wrappers are the workaround.                                     | T1 (rtk-ai.app)                            |
| RTK Cloud                   | Not mentioned                                               | Official commercial product — team cost analytics, token-per-dev reports, rate-limit monitoring, SSO/audit logs; "Free for open-source, Teams from $15/dev/month"; coming-soon/waitlist                                                                 | T1 (rtk-ai.app) — **roadmap, not shipped** |
| RTK via MCP                 | "Not an MCP server"                                         | True for the *official* repo (hook + CLI only). A *third-party* RTK↔MCP bridge is listed on mcpmarket.com; unverified (the listing 429'd on fetch)                                                                                                      | **Not proven** (T4)                        |
| RTK version hygiene         | `v0.42.4`                                                   | Site/footer confirm `v0.42.4`, but the README's own verification text still shows `rtk 0.28.2` — a doc-drift signal that the fast (212-release) cadence outruns the docs                                                                                | T1 (data-quality caveat)                   |
| Adjacent tools in the stack | Hub compares only the three                                 | The community stack now also names **TokenSave**, **Claude Code Router** (model routing), **code-review-graph** (49× on a monorepo), **context-mode**, and a **Codebase-Memory MCP** — out of scope here but they change the "best whole stack" picture | T3 (community write-ups)                   |

These corrections are folded into pages [03](/research/token-optimization-tools/03-rtk-design/) and [07](/research/token-optimization-tools/08-records-ledger-and-unverified/); they are listed here too so the "what changed" is auditable.

## B. Dimensions the hub has not compared (missed axes) [#b-dimensions-the-hub-has-not-compared-missed-axes]

> **Update (2026-06-20):** all six axes below are now compared side-by-side in [10 — Extended comparison axes](/research/token-optimization-tools/11-extended-comparison-axes/). They are kept listed here as the record of *what was missing*; what still remains is the per-axis *token delta* under a controlled run (harness backlog — see [10](/research/token-optimization-tools/10-first-party-measurements/)).

1. **Security / privacy / supply-chain threat model.** Per tool: *data egress* (headroom's proxy mode sees every byte of every request — a remote or misconfigured proxy is a data-leak path; RTK and caveman do not move data off-box), *supply-chain* (headroom auto-downloads `kompress-base` over TLS — an unpinned model is an integrity boundary; caveman is npm + hooks into `~/.claude`; RTK is a single binary + a hook), *injection* (RTK runs subprocesses — shlex/quoting matters), *known issues* (RTK issue #886 reportedly bypasses Claude Code permission prompts), and *model-integrity* (CompressionAttack targets exactly headroom's ML stage). None of this is in one comparison.
2. **Project health and sustainability.** Maintainer bus-factor (caveman, headroom = solo; RTK = has a commercial arm), release cadence vs doc drift, open-issue backlog (RTK \~1,254 open), license posture for commercial use, and whether RTK Cloud's monetization changes the OSS tool's incentives.
3. **Interaction with Claude Code's native context management.** A conflict-or-compose matrix across microcompact (no-LLM redundant-tool-output trimming every turn), `/compact` + auto-compaction (history rewrite — a cache-write spike), and context editing. Each overlaps these tools' territory; only headroom's double-compaction was flagged.
4. **Build-vs-buy for caveman.** caveman is, mechanically, a Claude Code output-style plus two hooks. Is the plugin worth it over a hand-written \~5-line custom output style (which has no `~/.claude` hook footprint and no \~940-token skill listing)? Never weighed.
5. **Subscriber (tasks-per-cap) economics per tool.** The dossier has the metric (chapter 41); the hub never asks *which of the three most extends a Max-plan 5-hour window* — a different ranking from $-per-task, and the metric most operators actually feel.
6. **Non-coding / multi-domain behavior.** All three are evaluated on coding. Their behavior on docs, data analysis, or non-code agentic work is unknown — caveman's register on prose, headroom's typed compressors on non-code payloads, RTK on non-dev shell output.
7. **Tokenizer-trust consolidation.** RTK counts at \~4 chars/token, headroom states no tokenizer, caveman's eval uses tiktoken `o200k_base` — none uses Claude's actual BPE, and the Fable/Opus tokenizer bills \~30% more on English. Every cross-tool percentage is therefore soft in a *consistent direction*; the hub notes this per-tool but never states the combined implication for the head-to-head.

## C. Unanswered questions (even where the topic looks covered) [#c-unanswered-questions-even-where-the-topic-looks-covered]

* **Which tool wins per dollar on *our* workload?** Unknown — the harness has not been run. All rankings are reasoned, not measured.
* **What fraction of a real session's observation tokens actually flow through Bash?** This single number bounds RTK's whole-bill ceiling, and it is unmeasured on this repo.
* **What fraction of output is thinking on our traffic?** Bounds caveman's ceiling; the 20%/54.8% figures are n=1 from one environment.
* **Does register compression measurably degrade agentic task success?** No benchmark of caveman-style output against SWE-bench-style success exists *anywhere* — the largest standing quality question.
* **Does headroom's `kompress-base` drop load-bearing identifiers on real code at the compression levels actually used?** Untested at high compression on code.
* **Do these hooks/proxies survive Claude Code version updates?** The `~/.claude` hook surface and the proxy contract are version-coupled; no regression story.
* **Where exactly does each tool's data go, and who can read it?** No per-tool data-flow diagram / threat model.
* **Is the community "90%+ / 30 min → 3 hr" stack headline real in dollars on a controlled run?** Only ever self-measured per-tool, never A/B'd.

### lean-ctx-specific open questions (new this round) [#lean-ctx-specific-open-questions-new-this-round]

* **Does lean-ctx's single integrated runtime beat the two-tool stack (RTK + headroom-MCP) on tokens-per-solved-task?** The whole "integrated runtime vs layered specialists" verdict turns on this, and it has never been A/B'd — lean-ctx was too young for the published month-long head-to-head.
* **Does the persistent code graph net out positive?** The graph + BM25 + (opt-in) embedding index cost build time, disk, and `[related: …]` injection tokens on every read. Whether the structural-retrieval saving (fewer re-reads to find things) exceeds that injection cost is unmeasured — the same net-accounting complaint the dossier makes of all memory/index tools.
* **How bad is `map` mode's 77% quality on real tasks?** The measured quality drop is self-rated; whether it degrades agentic success (vs the 95.9%-quality `signatures` mode) needs the canary suite.
* **Does the Lean 4 proof crate cover anything load-bearing?** The proofs exist; their scope was not audited. Formal verification of a compressor's invariants would be genuinely novel — *if* the theorems are about the savings/safety claims rather than trivia.
* **Is the bounce-netted signed ledger trustworthy as the harness's measurement layer?** lean-ctx already nets out wasted re-reads and signs the result; if that ledger is sound it could *be* the harness's token accounting for the lean-ctx arm — but it counts on a GPT tokenizer, so cross-check against Claude `count_tokens`.
* **Does lean-ctx's footprint (daemon, dashboard, DBs, 34-agent host writes) survive a jackin' container's host-write ban and `/jackin/`-only layout?** It is the broadest host-mutating tool of the four; container-fitness is unverified.

## D. The next-research `/goal` brief (ready to paste) [#d-the-next-research-goal-brief-ready-to-paste]

Paste this into Claude Code to drive the gap-closure round. It is scoped to add value on exactly the misses above and to keep the hub the single source of truth.

```text
/goal Run the token-optimization-tools gap-closure + first-A/B round.
The dedicated hub at /research/token-optimization-tools/ is the single source of truth —
extend it in place, keep other pages referencing it, do not create a new branch.
Keep every prior number; correct only what is blatantly wrong; mark vague claims
"not proven". Every external claim needs a source + access date; every local number
needs its method. Commit and push per artifact.

Priorities, highest value first:

1. RUN the validation harness (hub page 07) for real on this repo. >=10 paired tasks
   per arm: Native / Caveman / Hooks(hand-written filter) / RTK / Headroom-MCP /
   lean-ctx(MCP+hook) / full stack. From session JSONL measure tokens-per-solved-task,
   cache_read continuity, command-re-run/bounce rate, and task success. Produce the
   FIRST controlled cross-tool numbers and record raw data + method. Settle the
   headline open question: integrated runtime (lean-ctx) vs layered stack (RTK +
   headroom-MCP) on tokens-per-solved-task. Highest-value deliverable.

2. Measure the bounding fractions on this repo's real traffic:
   (a) share of observation tokens that flow through Bash (bounds RTK),
   (b) share of output tokens that are thinking (bounds caveman), and
   (c) for lean-ctx: does the code-graph [related:] injection + index cost net out
       positive against re-reads avoided?

3. Re-verify each tool against its LATEST release (they ship fast; the snapshot is
   stale): lean-ctx (read modes, code-graph/RRF, proxy cache-safety, Cloud pricing,
   Lean-proof coverage, footprint), RTK (read/grep/find/diff wrappers, Cloud status),
   headroom (kompress-base version, learn, proxy metadata cost), caveman (levels,
   output-style parity). Update the hub; mark anything unconfirmed "not proven".

4. The six extended axes (page 11) now cover all four tools — refresh them with the
   harness's per-axis token deltas once measured, and audit whether a NEW tool has
   appeared that belongs in the comparison (the lean-ctx omission is the cautionary
   precedent).

5. Resolve hub page 09's open questions — including the new lean-ctx-specific set
   (map-quality on real tasks, Lean-proof scope, ledger trustworthiness, container
   fitness): answer with evidence or mark explicitly still-open. Add any newly-found
   gap to page 09.

Definition of done: hub has first-party multi-arm harness numbers incl. lean-ctx; the
integrated-vs-stack question is answered; every open question is answered or marked
open; nothing prior was lost; build + link checks pass.
```

## Status update (2026-06-20 gap-closure round) [#status-update-2026-06-20-gap-closure-round]

This round ran the first first-party measurement ([10](/research/token-optimization-tools/10-first-party-measurements/)) and added the six axes ([10](/research/token-optimization-tools/11-extended-comparison-axes/)). Resolution of §C's questions:

| Open question (§C)                                                               | Status                                                                                                                                     | Where                                                                    |
| -------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------ |
| Which tool wins per dollar on our workload?                                      | **OPEN (partial)** — the full 6-arm A/B is not self-runnable in one session; local-transcript measurement done                             | [10](/research/token-optimization-tools/10-first-party-measurements/)    |
| Share of observation tokens that flow through Bash?                              | **ANSWERED** — 16.5% on this (docs/research) workload; native `Read` 76.2% (RTK-blind). Workload-specific                                  | [10](/research/token-optimization-tools/10-first-party-measurements/)    |
| Share of output that is thinking?                                                | **PARTIAL** — thinking is redacted in JSONL; exact split needs `count_tokens`; dossier n=1 54.8% stands; visible prose is empirically tiny | [10](/research/token-optimization-tools/10-first-party-measurements/)    |
| Does register compression degrade agentic task success?                          | **STILL OPEN** — no benchmark of caveman-style output vs task success exists anywhere                                                      | —                                                                        |
| Does headroom `kompress-base` drop identifiers on real code at high compression? | **STILL OPEN** — untested                                                                                                                  | —                                                                        |
| Do the hooks/proxies survive Claude Code version updates?                        | **STILL OPEN** — no regression story                                                                                                       | —                                                                        |
| Where does each tool's data go (threat model)?                                   | **ANSWERED** — data-egress / supply-chain / injection compared                                                                             | [11 §a](/research/token-optimization-tools/11-extended-comparison-axes/) |
| Is the "90%+ / 30 min → 3 hr" stack real in dollars on a controlled run?         | **PARTIAL** — it is a tasks-per-cap/occupancy win, not a dollar cut; the controlled $ A/B is still open                                    | [11 §e](/research/token-optimization-tools/11-extended-comparison-axes/) |

**New gaps found this round** (added to the backlog): the build-vs-buy token delta (caveman plugin vs a hand-written output-style — reasoned, not measured); RTK Cloud's open-core risk (features possibly gated to the paid tier over time); and an exact `count_tokens`-based thinking-share measurement on this repo. The single highest-value item remains the full controlled harness run — once it lands, its numbers become the hub's primary evidence and the vendor self-reports drop to corroboration.

## Status update (2026-06-20 lean-ctx round) [#status-update-2026-06-20-lean-ctx-round]

This round added the fourth tool, **lean-ctx**, after recognizing it was missing entirely from the three-way comparison. New artifacts: the [04 — lean-ctx design](/research/token-optimization-tools/04-leanctx-design/) teardown; a from-source build + benchmark on [page 10](/research/token-optimization-tools/10-first-party-measurements/) (the first time a tool in this hub was compiled and run, not just read about); lean-ctx columns/rows across every matrix (index, [05](/research/token-optimization-tools/05-head-to-head/), [07](/research/token-optimization-tools/07-evidence-and-claims/), [11](/research/token-optimization-tools/11-extended-comparison-axes/)); an L1 record + ledger + unverified rows on [page 08](/research/token-optimization-tools/08-records-ledger-and-unverified/); and a reframed [combining page](/research/token-optimization-tools/06-combining/) treating lean-ctx as the live test of the "is there one product?" thesis. The folder was also renamed `caveman-headroom-rtk` → `token-optimization-tools` so future tools join without another rename. &#x2A;*Still open:** the controlled multi-arm A/B (now including a lean-ctx arm) and the lean-ctx-specific questions in §C — the integrated-runtime-vs-stack verdict remains *reasoned*, not *measured*.

## E. How this page stays honest [#e-how-this-page-stays-honest]

The point of this page is that it is never "done": each research round should answer some of section C, add what it finds to section B, and refresh section A against the tools' latest releases. When the harness in (D.1) finally runs, its numbers replace the vendor self-reports as the hub's primary evidence — at which point the whole comparison graduates from *reasoned* to *measured*.

***

Back to the [overview](/research/token-optimization-tools/), or the [evidence and claims](/research/token-optimization-tools/07-evidence-and-claims/) page.
