Token-Optimization Research Dossier

Definitive research dossier on extreme token optimization for coding-agent usage at zero quality loss. Every external claim carries a source URL; every local number carries its method. Specification: prompt.md.

Headline numbers

10x verdict: not defensible at zero quality loss today. Defensible: ≈2.5x with validation (Aggressive stack; ≈2.4x on code-heavy mixes), ≈5–6.2x if the Sonnet-main+advisor routing flip passes the harness on your tasks. Binding constraints: frontier-model thinking output, then the cache-read floor of genuinely-used context. (30)
Where money went in the measured heavy session: cache reads 32% / cache writes 29% / thinking ~20% / visible output ~17% / uncached 2%; an independent session measured output-heavy instead. Stable invariant: cache reads dominate token volume, while output + cache writes dominate dollars. (02, 50)
Defaults already bank ~4–5x: caching measured −86.3% input-side this very session; MCP schemas defer by default; Edit-diffs are default. Much of the market re-sells these. (13, 12)
Caveman-ultra measured 58.5% token cut on visible prose (claims say 65–75%); wenyan's 80.9% char cut collapses to 56.6% tokens. Style compression caps at ~17% of dollars — and at 1.4–1.5% of visible output in tool-heavy sessions. (02, 10)
Strongest sanctioned lever on thinking: effort (T1: Opus 4.5 medium = equal SWE-bench at 76% fewer output tokens). Strongest input lever: context architecture (tool search 85% cut with accuracy 49%→74%; context editing 84% cut with +29% performance). (15, 12)
Fable 5/Opus 4.8 tokenizer bills ~30% more tokens on English/ASCII-heavy text, but the premium is near-neutral on code/CJK probes — cross-tier routing saves list price on code-heavy work (Fable→Sonnet ≈ ÷3.3) and up to ≈÷4.3 on prose/markdown-heavy text. (11, 16, 50)

How to read

Start: 00-executive-summary.md — the stack, the math, the verdict, the graveyard.
Foundations: 01 economics + instruments, 02 measured baseline of this environment, 03 market scan (incl. operator-named caveman / cavemem / cavekit / fff).
Research areas 10–20: one file per area, every technique in a fixed record schema (110 techniques cataloged, all with validation protocols).
Synthesis: 30 composed stacks with dollar math, 31 runnable no-quality-loss protocol, 32 day-1/week-1/month-1 with automatic-vs-disciplined split.

Tier list

value = expected $ saving on the modeled profile × confidence (evidence tier) ÷ adoption effort

Tier	Techniques (file)
S	Tool search / MCP schema deferral (12) · context editing + observation masking (12/14/18) · effort tiering incl. max→high (15) · subagent model+effort pinning (16) · Edit-over-Write + no-restatement guards (15) · advisor-pattern escalation (16) — all NEGATIVE-COST or vendor-validated
A	Cache hygiene: task-boundary model/effort switching only (16/13) · batch lane for offline work (18) · repo-maps/outlines instead of file dumps (12) · structured-output sidecars (15) · state-file session resume (14) · register compression in chat-heavy workflows (10)
B	Structured-data format choice — CSV/compact lines/TOON (11) · pointer architecture & lazy instruction loading (14) · session codebooks (20) · ID/timestamp hygiene — epoch, surrogate IDs (11) · hook dedup and prefix audits (02/12) · CI token-budget linter (20)
C	Register compression in tool-heavy workflows — ceiling ~0.4% of output (02/10) · identifier-casing policy, design-time only (11) · instruction-side register compression — 50x less valuable per token than output side (10) · prefill for non-thinking sidecars, dying (15)
F	Wenyan registers — no token gain over ultra, higher risk (02/10) · LLMLingua-style proxy for coding (19) · semantic response caches for agents (14) · cache-keepalive pingers for Claude Code (20) · base64/gzip "compression" — costs 2.7–4.3x MORE (19) · cl100k-based "Claude calculators" (11) · max_tokens as an optimizer (15) · glyph/symbol prompt DSLs (10)

Assumptions (judgment calls made during the run)

All "current" claims were verified against live provider documentation.
No ANTHROPIC_API_KEY present; the free count_tokens endpoint was called with the Claude Code OAuth credential already on this machine (no billable usage). The brief explicitly mandates count_tokens use.
Only this run's transcripts existed locally; thinking-share and session decomposition are n=1-environment measurements (max-effort main loop + a 25-agent fleet), labeled as such wherever used.
"Deliverables exactly as specified" = the 19 files of §10 and nothing else in the folder; measurement scripts are embedded in reports as reproducible snippets.
Operator mid-run instructions were folded in: (a) cavemem / cavekit / fff and the industry-standard/proven/engineer-verified buckets → 03-prior-art-and-market-scan.md; (b) the request to "add this to token-optimization.md" was interpreted as the dossier (the brief forbids modifying pre-existing repo files, including the brief itself); (c) chat output kept in caveman-ultra; dossier files follow the brief's own writing rules (plain language, full sentences) as the deliverable spec.
Heavy-day profile band: $17/day (5 sessions, 45% thinking) floor and $22/day (6 sessions, 55% thinking) working figure; area files and stack math use $22; ratios are profile-invariant. (01 §5)
Mid-run, five workflow draft agents died on a session rate limit (reset 19:20 UTC); the run continued on usage credits per the operator's local action. Files 17 and 20 were re-drafted from the already-completed research JSON by follow-up agents.
An environment quirk repeatedly deleted freshly-written untracked files in the worktree (subagent cleanup race). Countermeasure: every artifact was committed from the main process within seconds of landing, and two files were restored from agent-transcript payloads. No content was lost; the incident is noted because it shaped the commit cadence.

Self-audit against the Definition of Done

Volume II — Extension

**(Volume I froze; all Volume II claims pinned to 06-13 with live re-verification, sources + access dates in each file's ledger). Volume II is an additive layer on top of the frozen Volume I (files 00–32 unedited); it fills the gaps Volume I left blank or drew too thin. Governing gap audit and extension scope: 40-extension-overview.md.

Volume II index (40–49 band)

40 — gap audit: independent six-axis taxonomy overlaid on Volume I, the blind-spot map with file:line evidence, and the Volume II index.
41 — the quota-weighted cost model for a capped subscriber (blind spot 1).
42 — image/screenshot/PDF token costs, measured locally (blind spot 2).
43 — wall-clock/human-time as a second cost axis (blind spot 3).
44 — hosted cross-container/fleet cache economics (blind spot 4).
45 — portability matrix across coding agents (blind spot 5).
46 — clean-room re-sweep; KV-eviction family, CAG, changelog drift (blind spot 6).
47 — cost of optimizing, budget governance, online quality guards (blind spot 8).
48 — 8 new frontier ideas (not duplicating K1–K16).
49 — coverage-delta ledger, verdict delta, Corrections to Volume I, stack/tier updates, Volume II graveyard.

Volume II headline numbers

10x dollar verdict unchanged: ≈2.5× / ≈5–6.2× with validated routing / no true 10×. No Volume II lever removes Volume I's binding constraints (frontier-model thinking output; the cache-read floor). (49)
The metric is wrong for a subscriber. The local credential is Max; below the cap dollars are sunk and the objective is tasks-per-cap. Volume II ships a second (quota) cost model alongside the dollar model. Cap cache-read weight ≈ 0.1× (community-triangulated, T3); the cap token denominator is unpublished (bounded INCOMPLETE). (41)
Multimodal, measured (count_tokens): image = ⌈w/28⌉·⌈h/28⌉ visual tokens, with high-resolution caps around ~4,760 (Opus/Fable) vs ~1,520–1,570 (Sonnet/Haiku), a ~3.0–3.1× per-image divergence; PDFs cost ~3,150 tok/page and ~2× the equivalent text (the "PDF tax"); a screenshot of textual content is 2–6× the text it shows. (42)
Latency is priceable: the same Opus 4.8 spans 4× on the latency axis (batch $2.50 / standard $5 / fast $10 input); buy speed only when a human is blocked (v·t·s > Δ$). (43)
Drift since 06-12 (re-verified): count_tokens rejects Fable 5 (use Opus 4.8 — its tokenizer twin); Fable 5 leaves the subscription 06-23 (operator's effective main model → Opus 4.8, ~½ the sticker); 5-hour limits doubled 06-05; 06-15 headless/SDK usage split off the cap; KV-eviction family (SnapKV/H2O/PyramidKV/KVQuant) and CAG are real but self-host-only on hosted Claude. (41, 46)
50 genuinely-new techniques (42 in files 41–47 with the full §10 record + 8 frontier), each with a coverage-delta note proving absence from 00–32. (49)

Blind-spot map (summary)

Eight seeded blind spots audited by overlaying an independent taxonomy on Volume I (14-agent coverage sweep + grep). Five confirmed thin/absent → full area files: quota (41), multimodal (42), latency-axis (43), portability (45 — no matrix existed), governance + online-quality (47). Three partial → sharpened: fleet (44 — self-host done in 19; hosted sharing was thin), fresh-lit (46 — strong scan, specific holes), and Volume I's own open questions (worked and distributed, collected in 49). Full map with file:line evidence and per-cell stake: 40.

Verdict delta (one line)

Dollars: no change (≈2.5× / ≈5–6.2× / no 10×, arithmetic in 49/50). Metric: changed — for a subscriber optimize tasks-per-cap, where the lever order re-sorts (prefix stability, window size, request-volume up; subagent fan-out partially inverts; style compression matters even less). Volume I's Fable-priced dollars are ~2× high for the operator's actual Opus 4.8, but ratios/tiers are unchanged.

Volume II Assumptions (judgment calls)

Research date. Live re-verification done; the load-bearing drift (Fable 5 not count_tokens-able; Fable promo ends 06-23; 5-hour doubling; 06-15 SDK split) is flagged where used.
Instrument: count_tokens via the OAuth credential (claudeAiOauth.accessToken), free/non-billable, rebuilt at /tmp/ct.py (Volume I's copy did not persist — fresh container). Fable-family tokenizer measured on claude-opus-4-8 (its documented twin), labeled wherever used.
Local environment: Opus 4.8 main + Haiku 4.5 subagents, effort=max, Max subscription (~/.claude/.credentials.json). Token-class decomposition from 31 transcripts / 560 calls.
Test media (images/PDFs) generated from the Python stdlib (zlib) — no PIL/ImageMagick on the box — and validated against 5 real repo PNGs and Anthropic's published cost table; the image curve was adversarially re-confirmed with a max-entropy noise image (content-independent).
Quota model carries a bounded INCOMPLETE: the cap token denominator and the exact cap cache-read weighting are unpublished (confirmed across 6 primary pages + 3 GitHub issues). The ~0.1× weight is community-triangulated (T3); true cap-% needs the unified-* response headers (/usage or a proxy), not run this pass (frontier V2).
Open questions still open (honestly): the effort→thinking-share curve (all local transcripts are a single effort level — unmeasurable this run), the per-account cap denominator (needs a header- reading proxy), and the exact SDK excludeDynamicSections byte size (reconstructed estimate ~111 tokens). Each is flagged in its file.
Seven area files (41–47) were written, exceeding the ≥5 floor; fleet (44) was kept distinct (not merged) because the hosted-fleet material proved genuinely separate from Volume I 19's self-host tier.
Multi-agent machinery: an E0 coverage-map workflow (14 read-only readers) and an E1 fresh-sweep workflow (11 web-research streams); all deliverables were written and committed from the main process within seconds of landing (Volume I's file-deletion-race countermeasure).
Cache-layer and subagent-caching caveats (44/49): the hosted server prompt cache is workspace-scoped, not machine/dir-keyed; subagent caching can be version-dependent — audit your own JSONL before relying on it.

Volume II self-audit against the Definition of Done

Volume III — tooling and external-tool comparison

Adds runnable measurement scripts and a comparison of external code-search / code-intelligence tools.

The token-optimization-tools comparison now has its own dedicated, diagram-driven folder: token-optimization tools consolidates and deepens the material in files 53/54/56 — equal-depth design teardowns of caveman, headroom, RTK, and lean-ctx (the integrated context runtime added in a later round), a feature has/lacks matrix, best-case-of-each, and a straight answer to whether one product can combine them all. Files 53/54/56 below remain for their broader scope and full source ledgers.

Runnable tools/ — count_tokens.py, image_tokens.py, session_cost.py reproduce the dossier's core numbers against the live Anthropic tokenizer: real token counts, the image-token formula, and the dollar/token split deduplicated by message.id.
51-code-intelligence-tools.md — deep dive comparing codedb, Codegraff, and fff — whether they help AI coding agents and save tokens. They productize the same context-architecture lever (serve outlines/symbols, not whole files), measured locally at ≈91% (outline) / 98% (symbol search) fewer tokens than reading the file; with setup recipes and the MCP-schema-overhead caveat.
52-qdrant-and-vector-databases.md — Qdrant/vector DB follow-up: vector search is an optional semantic-memory/RAG backend, not a replacement for fff or codedb. Default recommendation remains rust-analyzer + ast-grep + codedb + fff; pilot Qdrant only for docs/examples/decisions/pattern recall and accept it only if it beats that planned stack by ≥20% tokens per solved task at equal quality.
53-headroom-and-context-compression.md — deep dive on chopratejas/headroom (the input-side context-compression layer); the cross-tool comparison to the caveman ecosystem and RTK is consolidated in the dedicated token-optimization tools folder. Headroom compresses what the model reads (tool outputs/logs/RAG/files, the 61% cache buckets); caveman compresses what the model writes (prose, 17%) — orthogonal, they stack, neither touches thinking (20%). Headroom's live-zone design (stabilize the cached prefix, compress only the volatile tail) is the cache-safe input-compression design that refines the record-19/FL3 "no compressor in the hot path" kill; its "60–95%"/"96.2%" headlines are per-payload/double-counted and corrected here (K1-style). Verdict: pilot MCP mode as an A/B arm against existing hooks (record 20) + code-intelligence (51) + serialization (record 14); never default the whole-prompt proxy in a jackin' container.
54-context-compression-literature-and-market.md — the compression-layer internet re-sweep (other projects) + fresh literature (2024–2026), companion to 53. Headline: a cache-safety classification of every compression move (output brevity = cache-neutral; write-time observation compression = safe; whole-prompt input compression = breaks the cache, must beat ~10×). The frontier moved to code-domain, hosted-viable, write-time compressors that raise SWE-bench accuracy (Squeez, AgentDiet, SWEzze, SWE-Pruner, LongCodeZip) — refuting file 46's "no compressor safe for code." Stars are a PR artifact in this niche; rank by evidence. Credible challengers (the-complexity-trap, OpenHands batched condensation, ACON, llmtrim, claw-compactor) ranked by evidence, not stars.
55-token-observability-and-visualization.md — the observability layer (distinct from compression): a deep dive on alexgreensh/token-optimizer and a survey of full-per-token-visibility / session-visualization tools. token-optimizer reads Claude Code JSONL transcripts locally (no proxy, cache-safe) and renders the dossier's own per-turn input/output/cache-read/cache-write decomposition as a web dashboard + status line — it productizes tools/session_cost.py with a UI. Key limit: thinking stays invisible in any JSONL-only tool (must be inferred via count_tokens). Caveats: PolyForm-Noncommercial license; dollar views assume API pricing, not a Max subscription (file 41). The JSONL-reading, no-proxy class is the safe measurement front-end of the validation harness.
56-rtk-and-write-time-observation-compression.md — deep dive on rtk-ai/rtk ("Rust Token Killer") — the dossier's RTK record. The cross-tool comparison it originally carried now lives, expanded to four tools (adds lean-ctx), in the dedicated token-optimization tools folder (single source of truth). RTK is the deterministic, Claude-Code-native productization of the cache-safe write-time observation-compression design point files 53 (H1) and 54 named: it compresses shell-command output (tests/git/logs/builds) at the tool boundary via a PreToolUse hook — no ML, no MCP rent, cache-safe by construction — but reaches only Bash calls (not native Read/Grep). The "60–90%" is a per-command best case (no whole-session telemetry, no independent benchmark; 63.5k★ is PR-inflated per file 54 §A), corrected to low-double-digit whole-bill, same as the caveman K1 / headroom H-K1 moves. Verdict: caveman for output; RTK and headroom are complementary input-side layers (RTK = Bash output at the tool boundary, headroom = API-layer everything-else) the community stacks — a published month-long head-to-head measured RTK 1.33B + headroom 0.19B → 1.52B tokens, headroom at 96% prefix-cache-hit (confirming the live-zone design); adopt in risk/reach order. RTK is the most container-adoptable of the three, pilot it role-scoped with the host-write/hook-conflict guardrails. File 51's ast-grep coverage was also extended into a full verdict (structural-search token economics + the skill-vs-MCP-vs-CLI form-factor analysis).

Final completion audit

All 19 required §10 files exist; Volume II/III addenda are extra, not replacements.
Writing-rule checks passed: every Markdown report has an early TL;DR/summary surface; files 10–19 carry 110 technique records with all required fields.
Technique floors exceeded: ≥40 required, 110 in files 10–19; ≥15 complete records required, 110 complete; frontier floor exceeded with 16 K-ideas in 20 plus 8 Volume II ideas in 48.
Phase-0 audit complete: environment instruction mass, MCP schema overhead, caveman/wenyan tokenizer table, hook waste, and thinking-vs-visible decomposition are in 02.
Adversarial validation applied: the independent 50 pass found arithmetic/tokenizer/profile/cap issues; load-bearing corrections are now applied in the live summaries and affected reports.
Composed stacks and 10x verdict are current: 30 carries corrected ≈2.4–2.5x aggressive math, ≈5–6.2x validated-routing ceiling, and no defensible 10x at zero quality loss.
Negative-cost set, graveyards, harness, and roadmap are present: negative-cost set in 30, claim graveyards in 00/area files/49, runnable validation protocol in 31, adoption sequence in 32.
Evidence discipline holds by audit: external claims are cited with access dates or ledgers; local measurements name their method; bounded unknowns remain explicitly labeled INCOMPLETE.
All artifacts are committed and pushed to origin/chore/token-optimization; latest verification showed a clean worktree after pushed commits.

Addendum — Code Intelligence Tools

Focused live analysis requested after the final audit, then expanded with an internet re-sweep for alternatives: 51-code-intelligence-tools.md compares codedb, fff, the CodeGraff codedb article, the CodeGraff product/toolchain, and stronger alternatives such as Serena, Code Context Engine, Augment Context Engine, Sourcegraph MCP, Qodo Context Engine, Claude Context, and CodeGraphContext.

Verdict: these tools can save tokens only when they replace blind grep/read loops with bounded, precise retrieval. codedb has the strongest public token-saving case; fff has a strong latency case and plausible but unquantified token savings; Serena is the strongest local open-source semantic-navigation challenger, Code Context Engine has the strongest local open-source token-savings headline with baseline caveats, and Augment/Sourcegraph/Qodo are stronger commercial or enterprise context systems if vendor dependency is acceptable.
jackin' recommendation: keep the existing the-architect fff pilot, add a measured codedb A/B arm if MCP schema overhead is deferred or bounded, add Serena/Claude Context competitor arms where installable, include Code Context Engine in the token benchmark, and treat CodeGraff Pro/Augment/Sourcegraph/Qodo as explicit opt-in agent-stack experiments rather than default jackin-core dependencies.
Qdrant follow-up: 52-qdrant-and-vector-databases.md concludes Qdrant is a credible backend for semantic memory/RAG but should stay optional and scoped; a live re-check found Milvus/Zilliz, Vespa, Turbopuffer, LanceDB, Chroma, Pinecone, and pgvector are real alternatives, but none proves better coding-agent token economy than fff + codedb. The useful local case is a bounded hybrid docs/decision index over the repo's large documentation surface, not default code navigation. Qdrant should not become a default third tool unless a harness proves a ≥20% token-per-solved-task reduction against the planned stack.

Token-Optimization Research Dossier

On this page