Token-Optimization Research Brief

How to run this file: /goal Follow token-optimization.md (see the Operating Rules section for the exact constraints of the run). You — the agent reading this — are the researcher. This brief is your full specification. Execute it end to end without asking the operator questions.

1. Mission

Produce the definitive research dossier on extreme token optimization for coding-agent usage — techniques that go an order of magnitude beyond what is common practice today, without any loss in the quality of what the agent produces. The deliverable is a folder, token-optimization/ at the repository root, containing the reports specified in §10.

The bar is explicitly not "10% better." The bar is: after reading this dossier, the operator can assemble a stack of techniques whose combined effect approaches a 10x reduction in dollars spent per completed task at equal task-success quality — and every claim in the dossier is either evidence-backed, locally reproduced, or honestly labeled speculative. If a true 10x is not achievable, the dossier must say exactly where the wall is, with arithmetic, and what the best achievable number is.

The audience is a single power user of Claude Code (and similar coding agents) who already uses aggressive style compression (the caveman plugin) and wants to know what lies beyond it: what is shipping on the market, what is proven in research, what is theoretically possible, and what sounds unrealistic but is actually buildable today.

2. Role and mindset

Act as an AI-efficiency researcher writing for a technical operator. Your job is to map the entire landscape, then push past it:

Be exhaustive first, opinionated second. Catalog everything; then rank ruthlessly.
Hunt for "negative-cost" techniques — ones that save tokens and improve output quality (e.g., context pruning may counteract long-context degradation). These are the genuinely unbelievable category; flag every one you find.
Treat every claimed saving as a suspect until verified. Token-saving folklore is full of numbers that don't survive a tokenizer check. Part of your job is killing bad claims.
Prefer measurement over citation when measurement is cheap. You have a live environment and the count_tokens API; many claims can be reproduced in minutes. Do it.
Distinguish what a standard user can do today (config, prompts, plugins, skills, hooks, session habits) from what requires the SDK, a proxy/gateway, self-hosting, or provider action. All tiers are in scope, but each technique must be labeled with its availability tier.

3. Hard constraints (non-negotiable)

Zero quality loss is the floor. Every technique must carry an explicit quality-risk analysis: what could silently degrade, how it would manifest, and a concrete experiment that would expose it. A technique without a falsification plan is incomplete. Techniques that trade quality for tokens may be cataloged, but must be clearly marked QUALITY-TRADE and excluded from the recommended stacks.
No invented numbers. Every quantitative claim is either (a) cited to a dated source, (b) reproduced by you during this run with the method shown, or (c) labeled ESTIMATE with the arithmetic written out. Nothing else.
Evidence tiers on everything. T1 = shipped product with published or locally-reproduced measurements. T2 = peer-reviewed research. T3 = community-replicated (multiple independent writeups with numbers). T4 = theoretical/speculative (math is plausible, nobody has shown it).
Date-stamp the research. Provider features, pricing, and betas churn. Every report carries "Research conducted: <date of the run>" and every external source carries an access date. Verify pricing and feature availability against live documentation during the run — do not trust training data for numbers.
Optimize dollars, not just tokens. Cache pricing makes these different units. The target metric is $ per completed task at equal success rate, with tokens as the supporting detail.

4. What "10x" means precisely

Define the baseline as: a competent Claude Code user on a frontier model, with default settings, no style compression, default context habits. Define the optimized state as: the same user, same tasks, same model tier (or a routing mix that preserves quality), running the recommended stack. The claim "Nx" means: (baseline $ per completed task) / (optimized $ per completed task) ≥ N, at statistically indistinguishable task-success rates on the validation suite (§9). All stack math in the dossier uses this definition. Show the modeled session profile you use for the arithmetic (a realistic distribution of input/output/cache-read/cache-write tokens for a heavy coding day) and state its assumptions.

5. Baseline — what is already known and deployed (do not re-discover; do re-measure)

The operator's current environment already uses, and the dossier should treat as the starting line:

Style-layer compression (caveman plugin): levels lite/full/ultra and wenyan-lite/full/ultra (classical Chinese register). Claimed ~75% reduction of conversational output tokens at ultra, 80–90% character reduction at wenyan-full. These claims are inputs to verify, not facts.
Compressed subagents (cavecrew): investigator/builder/reviewer subagents whose reports are caveman-compressed before re-entering the main context (~60% smaller tool results claimed).
Memory-file compression (caveman-compress): CLAUDE.md and similar always-loaded files compressed into caveman register.
Prompt caching: automatic in Claude Code; the dominant reason long sessions are affordable.
Context summarization (/compact and auto-compaction), memory files, deferred MCP tool schemas via ToolSearch, subagent fan-out, effort levels, model selection.

Two sharp known caveats you must address head-on, because they decide whether the baseline numbers are even real:

Thinking tokens bill as output tokens and are invisible in the chat transcript. A style layer that compresses visible prose does nothing to thinking. Measure what fraction of output tokens is thinking in realistic sessions; if it dominates, the true saving of style compression is far below its face value, and the dossier must report the corrected number. This single measurement may reorder the entire tier list.
Exotic-script compression must survive the tokenizer. Wenyan/CJK and symbol substitutions can cost more BPE tokens per character. Verify with the count_tokens endpoint against the current model on real samples (English vs caveman-ultra vs wenyan variants of the same content). Publish the table.

6. The economics (grounding; re-verify live during the run)

Cached reference numbers as of 2026-06 — confirm against live docs before using in arithmetic:

Model	Input $/MTok	Output $/MTok
Claude Fable 5 (`claude-fable-5`)	$10.00	$50.00
Claude Opus 4.8	$5.00	$25.00
Claude Sonnet 4.6	$3.00	$15.00
Claude Haiku 4.5	$1.00	$5.00

Structural facts that shape every lever (verify, then build on):

Output costs ~5x input across the lineup, and thinking bills as output. Output-side discipline is disproportionately valuable per token.
Cache reads cost ~0.1x base input; cache writes 1.25x (5-minute TTL) or 2x (1-hour TTL). Caching is a prefix match: one changed byte invalidates everything after it. Render order is tools → system → messages. Minimum cacheable prefix is model-dependent (e.g. 2048 tokens on Fable 5, 4096 on Opus 4.8). Max 4 breakpoints; 20-block lookback window.
A >5-minute idle gap can expire the cache, turning the next turn into a full re-write of the entire conversation prefix at 1.25x. Idle-pattern economics (and countermeasures like pre-warming with max_tokens: 0, or 1h TTL) are a real, under-discussed lever.
Batch API is 50% off for latency-tolerant work.
The token-efficient-tools-2025-02-19 beta is built into all Claude 4+ models — it is not an available lever anymore; do not recommend it. Check for newer equivalents instead.
Total prompt size = input_tokens + cache_creation_input_tokens + cache_read_input_tokens in the API usage object; these fields are the ground truth for all measurement.

7. Research questions the dossier must answer

Where does the money actually go? Measured decomposition of a heavy Claude Code session: uncached input vs cache writes vs cache reads vs visible output vs thinking output, per turn and per session. Without this, every optimization is aimed blind.
What is the theoretical floor? For a given task and quality level, how few tokens are irreducibly needed (information-theoretic framing: the entropy of the task-relevant information vs the redundancy of how agents currently transmit it)? How close do the best techniques get?
Which techniques are negative-cost (save tokens AND improve quality), which are free (no quality effect), and which silently degrade?
What is the best composed stack, with multiplication carried through honestly (savings on different token classes don't multiply naively), and what total $ reduction does it achieve on the modeled profile? Is 10x real? Where is the wall?
What exists on the market and in research that ordinary operators don't use yet — shipped tools, papers with code, proxy/gateway products, semantic caches, prompt compressors?
What can be made automatic (hooks, skills, plugin behavior, defaults baked into an orchestrator such as this repository's jackin' project, which launches coding agents in containers) versus what requires ongoing user discipline? Automatic beats disciplined.

8. Research areas — the map (one report file per area; seeded leads are starting points, not limits)

A. Style and language compression — beyond caveman (10-style-and-language-compression.md) Where is the ceiling of register compression? Telegraphic English vs caveman-ultra vs classical Chinese vs constructed notations (math/logic symbols, APL-like dialects). In-context codebook bootstrapping: define a session dialect/abbreviation table once (~hundreds of tokens), amortize over thousands of turns — does comprehension survive? Compression of instructions to the model vs output from the model as separate problems. The thinking-token caveat from §5 belongs here. Measure everything with count_tokens.

B. Tokenizer arbitrage (11-tokenizer-arbitrage.md) Information density per BPE token across languages and scripts (seed: "All languages are NOT created (tokenized) equal", tokenizer-fairness literature). Which scripts/symbols are cheap vs deceptively expensive for the current Claude tokenizer? Identifier and format choice: JSON vs YAML vs TOON (Token-Oriented Object Notation) vs CSV vs custom DSV for structured data — measured tokens-per-record tables. Numbers, hex, base64, UUIDs as token sinks. Verdict on whether wenyan's character reduction survives as token reduction.

C. Context architecture — what enters the window at all (12-context-architecture.md) The biggest input lever: don't send it. Repo maps instead of file dumps (seed: aider's tree-sitter + PageRank repo map; ctags outlines), partial reads (offset/limit), grep-first navigation, symbol-graph/AST context instead of raw source, multi-resolution context (summaries holding pointers that expand on demand), observation masking / tool-result eviction, conversation pruning vs summarization (seed: "Lost in the Middle" — pruning as a quality-improving move), system-prompt and CLAUDE.md mass audits, MCP schema overhead (measure how many tokens connected MCP servers cost per request and what schema deferral saves), skills-as-lazy-loading.

D. Caching exploitation (13-caching-exploitation.md) Squeeze the 0.1x read price for everything it's worth: stable-prefix discipline (what silently busts Claude Code's cache in practice — editing CLAUDE.md mid-session, toggling MCP servers, model switches), session-structure habits that maximize hit rate, the idle-gap problem and keepalive / pre-warming strategies (max_tokens: 0 warmers; 1h TTL trade-offs: 2x write needs ≥3 reads to win), scheduling work within cache windows, subagent cache economics (each spawn re-writes its prefix — when does fan-out cost more than it saves?), batch-API + caching for offline work.

E. Retrieval, memory, and state offloading (14-retrieval-memory-and-state.md) Replace re-derivation and re-reading with recall: persistent memory files, the API memory tool, MemGPT/Letta-style hierarchical memory, RAG over the repo vs agentic search (the industry argument both ways, with evidence), semantic caching of whole responses (seed: GPTCache), file-based state so a new session resumes from a 2-KB state file instead of a 200-KB transcript replay, embedding pointers ("see decision #14") vs restating.

F. Output discipline and structured generation (15-output-discipline.md) The 5x-priced token class. Diff/patch-style edits vs whole-file rewrites (measure Edit-tool patches vs Write-tool rewrites on real diffs), structured-output schemas to eliminate rambling, stop sequences, effort parameter sweeps (low/medium/high/xhigh/max) — measured quality-vs-output-tokens curves per task class, suppressing restated code in explanations, terse-by-instruction vs terse-by-persona, and the limits: when does forced terseness start dropping load-bearing caveats (negation loss, skipped warnings)?

G. Model routing and tiered delegation (16-model-routing-and-delegation.md) Right model per subtask: Haiku/Sonnet subagents for grunt work under a Fable/Opus orchestrator, draft-with-cheap / verify-with-expensive patterns, effort-level routing as intra-model tiering, measured quality deltas per task class (when does Haiku exploration actually hurt?), cost math for typical fan-outs, and routing products on the market (gateways, RouteLLM-style routers) with evidence quality.

H. Multi-agent protocols and subagent compression (17-multi-agent-protocols.md) Agent-to-agent communication doesn't need human-readable English: compressed report formats (cavecrew as baseline), schema-constrained interchange, learned pidgins and emergent-communication research (seed: DroidSpeak — LLM-to-LLM communication via KV-cache/activation exchange), orchestrator context hygiene (what the main thread actually needs back), parallel-vs-serial token economics, and where compressed interchange starts corrupting information transfer.

I. Provider-level features, current and beta (18-provider-features.md) Sweep the current Anthropic feature surface for token-relevant features and quantify each: prompt caching, context editing (server-side clearing of stale tool results/thinking), server-side compaction, the memory tool, structured outputs, effort, task budgets, batch API, programmatic tool calling (intermediate tool results that never enter context), tool search, fine-grained streaming. Same sweep, briefer, for the obvious comparison points (OpenAI prompt caching, Gemini context caching) to triangulate what's possible. Everything verified against live docs with dates.

J. Infrastructure-level (self-hosted / gateway tier) (19-infrastructure-level.md) For the advanced operator: vLLM/SGLang automatic prefix caching and RadixAttention, KV-cache reuse and cross-request sharing, LLMLingua/LongLLMLingua/LLMLingua-2 as a compression proxy in front of an API (does a small-model compressor preserve coding-agent quality? evidence), local draft models, gateway-level dedup/semantic caching, and an honest section on what is NOT user-accessible on a hosted API (speculative decoding, distillation) so the operator stops chasing it.

K. Frontier — "unrealistic but maybe real" (20-frontier-ideas.md) The crazy file. At least 10 ideas, each worked through mechanism → savings math → feasibility verdict (REAL-NOW / BUILDABLE / RESEARCH-STAGE / BLOCKED-BY-<x> / PHYSICS-SAYS-NO). Seeds — extend, don't stop at: learned session codebooks negotiated at boot; self-extending abbreviation registries maintained in memory files; gist tokens / In-context Autoencoder / AutoCompressors / activation beacons / 500xCompressor (research-stage context distillation — what would it take to use today?); context distillation into fine-tunes of repeated instructions; token-aligned identifier naming (choosing names that tokenize to 1 token) applied codebase-wide; prompt-as-program (run-length encoding repeated structure); single-token semantic anchors replacing recurring multi-token phrases; cache-keepalive daemons; "compression hygiene" linters that fail CI when always-loaded context grows; an orchestrator (jackin') that bakes the entire stack into every container it launches so optimization becomes infrastructure instead of discipline.

L. Measurement tooling (folded into 01-economics-and-measurement.md) How to see token spend at all: API usage fields, Claude Code /cost and OTel metrics, session JSONL transcripts as a measurement source, ccusage-style analyzers, count_tokens as the experiment instrument, building a personal token dashboard. The dossier's own experiments must be reproducible with these tools.

9. Method

Run as phases. Use whatever parallel machinery is authorized for the run (deep-research skill, workflows, parallel subagents) — fan out wide on search, verify adversarially, synthesize centrally.

Phase 0 — Baseline audit (local, no web needed). Measure this very environment: token mass of the always-loaded context (CLAUDE.md/agent rule chain, memory files), connected MCP servers and their schema overhead, caveman plugin configuration. Run the §5 verification experiments (caveman/wenyan tokenizer check via count_tokens; thinking-vs-visible output decomposition from recent session transcripts if available). Produce 02-baseline-audit.md with real numbers.
Phase 1 — Landscape sweep. Parallel web research across areas A–K. Collect candidate techniques with sources. Breadth over depth; every claim tagged with its provisional evidence tier.
Phase 2 — Deep dives. For the strongest candidates (at least 15), build the full per-technique record (§10 schema). Where a claim is locally checkable in minutes, check it locally instead of citing it.
Phase 3 — Adversarial validation. For every headline number that will appear in the executive summary: attempt to refute it (independent skeptic pass — tokenize the actual example, find the contradicting source, re-run the arithmetic). Kill or downgrade anything that doesn't survive. Document the kills — a graveyard section of popular-but-false savings claims is itself a deliverable.
Phase 4 — Synthesis. Build the three composed stacks (§11) with end-to-end dollar math on the modeled session profile. Identify the negative-cost set. Write the executive summary last.
Phase 5 — Completeness critic + roadmap. A final pass asking "what's missing — an area not swept, a claim unverified, a file unwritten?" Whatever it finds becomes the last round of work. Then write 32-adoption-roadmap.md.

10. Deliverables

Create token-optimization/ at the repository root:

token-optimization/
├── README.md                            # index, how to read, the tier list, headline numbers, research date
├── 00-executive-summary.md              # findings in 2 pages: the stack, the math, the verdict on 10x
├── 01-economics-and-measurement.md      # token economics, measurement tooling, the modeled session profile
├── 02-baseline-audit.md                 # Phase-0 measurements of this environment + caveman verification
├── 03-prior-art-and-market-scan.md      # shipped products, papers, tools — the landscape with evidence tiers
├── 10-style-and-language-compression.md
├── 11-tokenizer-arbitrage.md
├── 12-context-architecture.md
├── 13-caching-exploitation.md
├── 14-retrieval-memory-and-state.md
├── 15-output-discipline.md
├── 16-model-routing-and-delegation.md
├── 17-multi-agent-protocols.md
├── 18-provider-features.md
├── 19-infrastructure-level.md
├── 20-frontier-ideas.md                 # ≥10 ideas with feasibility verdicts
├── 30-composed-stacks.md                # Conservative / Aggressive / Unbelievable stacks with full math
├── 31-validation-harness.md             # the no-quality-loss proof protocol, runnable
└── 32-adoption-roadmap.md               # day-1 / week-1 / month-1; manual vs automatable (hooks, skills, jackin')

Per-technique record schema (every technique in files 10–20 uses exactly this structure):

Name and one-line pitch
Layer: input / output / cache / turn-structure / infra
Mechanism: why it saves tokens, in plain language
Expected savings: % on which token class, with arithmetic on the modeled session profile
Evidence tier: T1–T4, with sources (URL + access date) or local reproduction method
Quality risk: failure modes, how degradation would manifest, and the falsification experiment; verdict: NEGATIVE-COST / NEUTRAL / RISKY / QUALITY-TRADE
Availability: CLAUDE-CODE-TODAY / SDK / GATEWAY-OR-SELF-HOST / NOT-USER-ACCESSIBLE
Effort to adopt: minutes / hours / days / project
Composability: which other techniques it stacks with, and any anti-synergies
Validation protocol: the exact experiment proving no quality loss for this technique

Writing rules: every file opens with a TL;DR of ≤5 bullets including the headline numbers; tables for enumerable comparisons; prose for reasoning; no claim without its tier; plain language — the operator should never need to decode jargon.

11. Scoring and the stacks

Score every technique: value = expected $ saving on the modeled profile × confidence (by evidence tier) ÷ adoption effort, then place it on a tier list (S/A/B/C/F) in the README. Build three stacks in 30-composed-stacks.md:

Conservative — T1/T2 only, CLAUDE-CODE-TODAY only, zero quality risk. The "do this tomorrow" stack.
Aggressive — adds T3 and SDK/gateway-tier techniques with validation plans attached.
Unbelievable — everything defensible including BUILDABLE frontier ideas; the stack that chases the full 10x. State its total honestly even if it falls short, and name the binding constraint.

For each stack: composed dollar math on the modeled profile (compose savings per token class, then convert to dollars — never multiply percentages across different token classes), the validation plan that proves quality is preserved, and the failure modes of the composition itself.

12. Validation harness (`31-validation-harness.md`)

Design a runnable protocol the operator can use to verify any technique or stack:

A paired-task benchmark: same tasks run with and without the technique, n ≥ 10 per arm, measuring task success, $ per task, tokens by class, and turns to completion.
Canary tasks targeting known compression failure modes: negation preservation, ordering- sensitive instructions, numeric precision, "don't do X" constraints, multi-step sequences.
Quality judging method (objective checks where possible — tests pass, diff correct; LLM-judge with rubric where not), and the statistical bar for declaring "no quality loss."
Where to read the numbers: API usage fields, /cost, OTel, transcript JSONL.

13. Operating rules for the run

Autonomy: do not ask the operator questions. When a judgment call arises, make it, and record it in an Assumptions section of the research index.
Git discipline — commit and push on every finding: all work happens on the chore/token-optimization branch. Never switch branches and never commit to main. Every time you complete or meaningfully update an artifact — a finished report file, a recorded measurement table, a phase milestone, a README/tier-list update — commit it and push to origin immediately, in the same step. Do not batch hours of work into one commit: the operator follows the run live through pushed commits, and an unpushed artifact is invisible progress. Do not run, wait for, or verify CI/CD on these pushes — commit and push is the entire loop. Follow CONTRIBUTING.md: Conventional Commits subject (docs(research): <what landed>) and DCO sign-off via git commit -s. Do not modify any pre-existing repository file; your writes are limited to token-optimization/.
Citations: every external claim → source URL + access date. Every local measurement → the command or method that produced it.
Live verification: pricing, feature availability, and beta status checked against live provider documentation during the run, not recalled from training data.
Scope honesty: if an area cannot be completed to the bar, ship the file with an explicit INCOMPLETE banner naming what's missing — never silently thin it out.

Token-Optimization Research Brief

Token-Optimization Research Brief

1. Mission

2. Role and mindset

3. Hard constraints (non-negotiable)

4. What "10x" means precisely

5. Baseline — what is already known and deployed (do not re-discover; do re-measure)

6. The economics (grounding; re-verify live during the run)

7. Research questions the dossier must answer

8. Research areas — the map (one report file per area; seeded leads are starting points, not limits)

9. Method

10. Deliverables

11. Scoring and the stacks

12. Validation harness (`31-validation-harness.md`)

13. Operating rules for the run

14. Definition of done

On this page