# Token-Optimization Research Brief (https://jackin.tailrocks.com/research/token-optimization/prompt/)


# Token-Optimization Research Brief [#token-optimization-research-brief]

> **How to run this file:** `/goal Follow token-optimization.md` (see the Operating Rules
> section for the exact constraints of the run). You — the agent reading this — are the researcher.
> This brief is your full specification. Execute it end to end without asking the operator questions.

***

## 1. Mission [#1-mission]

Produce the definitive research dossier on **extreme token optimization for coding-agent usage** —
techniques that go an order of magnitude beyond what is common practice today, **without any loss in
the quality of what the agent produces**. The deliverable is a folder, `token-optimization/`
at the repository root, containing the reports specified in §10.

The bar is explicitly not "10% better." The bar is: after reading this dossier, the operator can
assemble a stack of techniques whose **combined effect approaches a 10x reduction in dollars spent
per completed task at equal task-success quality** — and every claim in the dossier is either
evidence-backed, locally reproduced, or honestly labeled speculative. If a true 10x is not
achievable, the dossier must say exactly where the wall is, with arithmetic, and what the best
achievable number is.

The audience is a single power user of Claude Code (and similar coding agents) who already uses
aggressive style compression (the caveman plugin) and wants to know what lies beyond it: what is
shipping on the market, what is proven in research, what is theoretically possible, and what sounds
unrealistic but is actually buildable today.

## 2. Role and mindset [#2-role-and-mindset]

Act as an AI-efficiency researcher writing for a technical operator. Your job is to map the entire
landscape, then push past it:

* **Be exhaustive first, opinionated second.** Catalog everything; then rank ruthlessly.
* **Hunt for "negative-cost" techniques** — ones that save tokens *and improve* output quality
  (e.g., context pruning may counteract long-context degradation). These are the genuinely
  unbelievable category; flag every one you find.
* **Treat every claimed saving as a suspect until verified.** Token-saving folklore is full of
  numbers that don't survive a tokenizer check. Part of your job is killing bad claims.
* **Prefer measurement over citation when measurement is cheap.** You have a live environment and
  the `count_tokens` API; many claims can be reproduced in minutes. Do it.
* **Distinguish what a standard user can do today** (config, prompts, plugins, skills, hooks,
  session habits) from what requires the SDK, a proxy/gateway, self-hosting, or provider action.
  All tiers are in scope, but each technique must be labeled with its availability tier.

## 3. Hard constraints (non-negotiable) [#3-hard-constraints-non-negotiable]

1. **Zero quality loss is the floor.** Every technique must carry an explicit quality-risk analysis:
   what could silently degrade, how it would manifest, and a concrete experiment that would expose
   it. A technique without a falsification plan is incomplete. Techniques that trade quality for
   tokens may be cataloged, but must be clearly marked `QUALITY-TRADE` and excluded from the
   recommended stacks.
2. **No invented numbers.** Every quantitative claim is either (a) cited to a dated source,
   (b) reproduced by you during this run with the method shown, or (c) labeled `ESTIMATE` with the
   arithmetic written out. Nothing else.
3. **Evidence tiers on everything.** T1 = shipped product with published or locally-reproduced
   measurements. T2 = peer-reviewed research. T3 = community-replicated (multiple independent
   writeups with numbers). T4 = theoretical/speculative (math is plausible, nobody has shown it).
4. **Date-stamp the research.** Provider features, pricing, and betas churn. Every report carries
   "Research conducted: \<date of the run>" and every external source carries an access date. Verify
   pricing and feature availability against live documentation during the run — do not trust
   training data for numbers.
5. **Optimize dollars, not just tokens.*&#x2A; Cache pricing makes these different units. The target
   metric is **$ per completed task at equal success rate**, with tokens as the supporting detail.

## 4. What "10x" means precisely [#4-what-10x-means-precisely]

Define the baseline as: a competent Claude Code user on a frontier model, with default settings, no
style compression, default context habits. Define the optimized state as: the same user, same tasks,
same model tier (or a routing mix that preserves quality), running the recommended stack. The claim
"Nx" means: `(baseline $ per completed task) / (optimized $ per completed task) ≥ N`, at
statistically indistinguishable task-success rates on the validation suite (§9). All stack math in
the dossier uses this definition. Show the modeled session profile you use for the arithmetic (a
realistic distribution of input/output/cache-read/cache-write tokens for a heavy coding day) and
state its assumptions.

## 5. Baseline — what is already known and deployed (do not re-discover; do re-measure) [#5-baseline--what-is-already-known-and-deployed-do-not-re-discover-do-re-measure]

The operator's current environment already uses, and the dossier should treat as the starting line:

* **Style-layer compression (caveman plugin):** levels lite/full/ultra and wenyan-lite/full/ultra
  (classical Chinese register). Claimed \~75% reduction of conversational output tokens at ultra,
  80–90% character reduction at wenyan-full. These claims are *inputs to verify*, not facts.
* **Compressed subagents (cavecrew):** investigator/builder/reviewer subagents whose reports are
  caveman-compressed before re-entering the main context (\~60% smaller tool results claimed).
* **Memory-file compression (caveman-compress):** CLAUDE.md and similar always-loaded files
  compressed into caveman register.
* **Prompt caching:** automatic in Claude Code; the dominant reason long sessions are affordable.
* **Context summarization (/compact and auto-compaction), memory files, deferred MCP tool schemas
  via ToolSearch, subagent fan-out, effort levels, model selection.**

Two sharp known caveats you must address head-on, because they decide whether the baseline numbers
are even real:

* **Thinking tokens bill as output tokens** and are invisible in the chat transcript. A style layer
  that compresses visible prose does nothing to thinking. Measure what fraction of output tokens is
  thinking in realistic sessions; if it dominates, the true saving of style compression is far below
  its face value, and the dossier must report the corrected number. This single measurement may
  reorder the entire tier list.
* **Exotic-script compression must survive the tokenizer.** Wenyan/CJK and symbol substitutions can
  cost *more* BPE tokens per character. Verify with the `count_tokens` endpoint against the current
  model on real samples (English vs caveman-ultra vs wenyan variants of the same content). Publish
  the table.

## 6. The economics (grounding; re-verify live during the run) [#6-the-economics-grounding-re-verify-live-during-the-run]

Cached reference numbers as of 2026-06 — confirm against live docs before using in arithmetic:

| Model                             | Input $/MTok | Output $/MTok |
| --------------------------------- | ------------ | ------------- |
| Claude Fable 5 (`claude-fable-5`) | $10.00       | $50.00        |
| Claude Opus 4.8                   | $5.00        | $25.00        |
| Claude Sonnet 4.6                 | $3.00        | $15.00        |
| Claude Haiku 4.5                  | $1.00        | $5.00         |

Structural facts that shape every lever (verify, then build on):

* **Output costs \~5x input** across the lineup, and thinking bills as output. Output-side discipline
  is disproportionately valuable per token.
* **Cache reads cost \~0.1x base input; cache writes 1.25x (5-minute TTL) or 2x (1-hour TTL).**
  Caching is a prefix match: one changed byte invalidates everything after it. Render order is
  tools → system → messages. Minimum cacheable prefix is model-dependent (e.g. 2048 tokens on
  Fable 5, 4096 on Opus 4.8). Max 4 breakpoints; 20-block lookback window.
* **A >5-minute idle gap can expire the cache**, turning the next turn into a full re-write of the
  entire conversation prefix at 1.25x. Idle-pattern economics (and countermeasures like pre-warming
  with `max_tokens: 0`, or 1h TTL) are a real, under-discussed lever.
* **Batch API is 50% off** for latency-tolerant work.
* **The `token-efficient-tools-2025-02-19` beta is built into all Claude 4+ models** — it is not an
  available lever anymore; do not recommend it. Check for newer equivalents instead.
* Total prompt size = `input_tokens + cache_creation_input_tokens + cache_read_input_tokens` in the
  API usage object; these fields are the ground truth for all measurement.

## 7. Research questions the dossier must answer [#7-research-questions-the-dossier-must-answer]

1. **Where does the money actually go?** Measured decomposition of a heavy Claude Code session:
   uncached input vs cache writes vs cache reads vs visible output vs thinking output, per turn and
   per session. Without this, every optimization is aimed blind.
2. **What is the theoretical floor?** For a given task and quality level, how few tokens are
   irreducibly needed (information-theoretic framing: the entropy of the task-relevant information
   vs the redundancy of how agents currently transmit it)? How close do the best techniques get?
3. **Which techniques are negative-cost** (save tokens AND improve quality), which are free
   (no quality effect), and which silently degrade?
4. **What is the best composed stack**, with multiplication carried through honestly (savings on
   different token classes don't multiply naively), and what total $ reduction does it achieve on
   the modeled profile? Is 10x real? Where is the wall?
5. **What exists on the market and in research that ordinary operators don't use yet** — shipped
   tools, papers with code, proxy/gateway products, semantic caches, prompt compressors?
6. **What can be made automatic** (hooks, skills, plugin behavior, defaults baked into an
   orchestrator such as this repository's jackin' project, which launches coding agents in
   containers) versus what requires ongoing user discipline? Automatic beats disciplined.

## 8. Research areas — the map (one report file per area; seeded leads are starting points, not limits) [#8-research-areas--the-map-one-report-file-per-area-seeded-leads-are-starting-points-not-limits]

**A. Style and language compression — beyond caveman** (`10-style-and-language-compression.md`)
Where is the ceiling of register compression? Telegraphic English vs caveman-ultra vs classical
Chinese vs constructed notations (math/logic symbols, APL-like dialects). In-context codebook
bootstrapping: define a session dialect/abbreviation table once (\~hundreds of tokens), amortize over
thousands of turns — does comprehension survive? Compression of *instructions to* the model vs
*output from* the model as separate problems. The thinking-token caveat from §5 belongs here.
Measure everything with `count_tokens`.

**B. Tokenizer arbitrage** (`11-tokenizer-arbitrage.md`)
Information density per BPE token across languages and scripts (seed: "All languages are NOT
created (tokenized) equal", tokenizer-fairness literature). Which scripts/symbols are cheap vs
deceptively expensive for the current Claude tokenizer? Identifier and format choice: JSON vs YAML
vs TOON (Token-Oriented Object Notation) vs CSV vs custom DSV for structured data — measured
tokens-per-record tables. Numbers, hex, base64, UUIDs as token sinks. Verdict on whether wenyan's
character reduction survives as token reduction.

**C. Context architecture — what enters the window at all** (`12-context-architecture.md`)
The biggest input lever: don't send it. Repo maps instead of file dumps (seed: aider's tree-sitter +
PageRank repo map; ctags outlines), partial reads (offset/limit), grep-first navigation,
symbol-graph/AST context instead of raw source, multi-resolution context (summaries holding pointers
that expand on demand), observation masking / tool-result eviction, conversation pruning vs
summarization (seed: "Lost in the Middle" — pruning as a *quality-improving* move), system-prompt
and CLAUDE.md mass audits, MCP schema overhead (measure how many tokens connected MCP servers cost
per request and what schema deferral saves), skills-as-lazy-loading.

**D. Caching exploitation** (`13-caching-exploitation.md`)
Squeeze the 0.1x read price for everything it's worth: stable-prefix discipline (what silently busts
Claude Code's cache in practice — editing CLAUDE.md mid-session, toggling MCP servers, model
switches), session-structure habits that maximize hit rate, the idle-gap problem and keepalive /
pre-warming strategies (`max_tokens: 0` warmers; 1h TTL trade-offs: 2x write needs ≥3 reads to win),
scheduling work within cache windows, subagent cache economics (each spawn re-writes its prefix —
when does fan-out cost more than it saves?), batch-API + caching for offline work.

**E. Retrieval, memory, and state offloading** (`14-retrieval-memory-and-state.md`)
Replace re-derivation and re-reading with recall: persistent memory files, the API memory tool,
MemGPT/Letta-style hierarchical memory, RAG over the repo vs agentic search (the industry argument
both ways, with evidence), semantic caching of whole responses (seed: GPTCache), file-based state so
a new session resumes from a 2-KB state file instead of a 200-KB transcript replay, embedding
pointers ("see decision #14") vs restating.

**F. Output discipline and structured generation** (`15-output-discipline.md`)
The 5x-priced token class. Diff/patch-style edits vs whole-file rewrites (measure Edit-tool patches
vs Write-tool rewrites on real diffs), structured-output schemas to eliminate rambling, stop
sequences, effort parameter sweeps (low/medium/high/xhigh/max) — measured quality-vs-output-tokens
curves per task class, suppressing restated code in explanations, terse-by-instruction vs
terse-by-persona, and the limits: when does forced terseness start dropping load-bearing caveats
(negation loss, skipped warnings)?

**G. Model routing and tiered delegation** (`16-model-routing-and-delegation.md`)
Right model per subtask: Haiku/Sonnet subagents for grunt work under a Fable/Opus orchestrator,
draft-with-cheap / verify-with-expensive patterns, effort-level routing as intra-model tiering,
measured quality deltas per task class (when does Haiku exploration actually hurt?), cost math for
typical fan-outs, and routing products on the market (gateways, RouteLLM-style routers) with
evidence quality.

**H. Multi-agent protocols and subagent compression** (`17-multi-agent-protocols.md`)
Agent-to-agent communication doesn't need human-readable English: compressed report formats
(cavecrew as baseline), schema-constrained interchange, learned pidgins and emergent-communication
research (seed: DroidSpeak — LLM-to-LLM communication via KV-cache/activation exchange), orchestrator
context hygiene (what the main thread actually needs back), parallel-vs-serial token economics, and
where compressed interchange starts corrupting information transfer.

**I. Provider-level features, current and beta** (`18-provider-features.md`)
Sweep the current Anthropic feature surface for token-relevant features and quantify each: prompt
caching, context editing (server-side clearing of stale tool results/thinking), server-side
compaction, the memory tool, structured outputs, effort, task budgets, batch API, programmatic tool
calling (intermediate tool results that never enter context), tool search, fine-grained streaming.
Same sweep, briefer, for the obvious comparison points (OpenAI prompt caching, Gemini context
caching) to triangulate what's possible. Everything verified against live docs with dates.

**J. Infrastructure-level (self-hosted / gateway tier)** (`19-infrastructure-level.md`)
For the advanced operator: vLLM/SGLang automatic prefix caching and RadixAttention, KV-cache reuse
and cross-request sharing, LLMLingua/LongLLMLingua/LLMLingua-2 as a compression proxy in front of an
API (does a small-model compressor preserve coding-agent quality? evidence), local draft models,
gateway-level dedup/semantic caching, and an honest section on what is NOT user-accessible on a
hosted API (speculative decoding, distillation) so the operator stops chasing it.

**K. Frontier — "unrealistic but maybe real"** (`20-frontier-ideas.md`)
The crazy file. At least 10 ideas, each worked through mechanism → savings math → feasibility
verdict (`REAL-NOW` / `BUILDABLE` / `RESEARCH-STAGE` / `BLOCKED-BY-&lt;x&gt;` / `PHYSICS-SAYS-NO`).
Seeds — extend, don't stop at: learned session codebooks negotiated at boot; self-extending
abbreviation registries maintained in memory files; gist tokens / In-context Autoencoder /
AutoCompressors / activation beacons / 500xCompressor (research-stage context distillation — what
would it take to use today?); context distillation into fine-tunes of repeated instructions;
token-aligned identifier naming (choosing names that tokenize to 1 token) applied codebase-wide;
prompt-as-program (run-length encoding repeated structure); single-token semantic anchors replacing
recurring multi-token phrases; cache-keepalive daemons; "compression hygiene" linters that fail CI
when always-loaded context grows; an orchestrator (jackin') that bakes the entire stack into every
container it launches so optimization becomes infrastructure instead of discipline.

**L. Measurement tooling** (folded into `01-economics-and-measurement.md`)
How to see token spend at all: API usage fields, Claude Code `/cost` and OTel metrics, session
JSONL transcripts as a measurement source, ccusage-style analyzers, `count_tokens` as the
experiment instrument, building a personal token dashboard. The dossier's own experiments must be
reproducible with these tools.

## 9. Method [#9-method]

Run as phases. Use whatever parallel machinery is authorized for the run (deep-research skill,
workflows, parallel subagents) — fan out wide on search, verify adversarially, synthesize centrally.

* **Phase 0 — Baseline audit (local, no web needed).** Measure this very environment: token mass of
  the always-loaded context (CLAUDE.md/agent rule chain, memory files), connected MCP servers and
  their schema overhead, caveman plugin configuration. Run the §5 verification experiments
  (caveman/wenyan tokenizer check via `count_tokens`; thinking-vs-visible output decomposition from
  recent session transcripts if available). Produce `02-baseline-audit.md` with real numbers.
* **Phase 1 — Landscape sweep.** Parallel web research across areas A–K. Collect candidate
  techniques with sources. Breadth over depth; every claim tagged with its provisional evidence tier.
* **Phase 2 — Deep dives.** For the strongest candidates (at least 15), build the full per-technique
  record (§10 schema). Where a claim is locally checkable in minutes, check it locally instead of
  citing it.
* **Phase 3 — Adversarial validation.** For every headline number that will appear in the executive
  summary: attempt to refute it (independent skeptic pass — tokenize the actual example, find the
  contradicting source, re-run the arithmetic). Kill or downgrade anything that doesn't survive.
  Document the kills — a graveyard section of popular-but-false savings claims is itself a deliverable.
* **Phase 4 — Synthesis.** Build the three composed stacks (§11) with end-to-end dollar math on the
  modeled session profile. Identify the negative-cost set. Write the executive summary last.
* **Phase 5 — Completeness critic + roadmap.** A final pass asking "what's missing — an area not
  swept, a claim unverified, a file unwritten?" Whatever it finds becomes the last round of work.
  Then write `32-adoption-roadmap.md`.

## 10. Deliverables [#10-deliverables]

Create `token-optimization/` at the repository root:

```
token-optimization/
├── README.md                            # index, how to read, the tier list, headline numbers, research date
├── 00-executive-summary.md              # findings in 2 pages: the stack, the math, the verdict on 10x
├── 01-economics-and-measurement.md      # token economics, measurement tooling, the modeled session profile
├── 02-baseline-audit.md                 # Phase-0 measurements of this environment + caveman verification
├── 03-prior-art-and-market-scan.md      # shipped products, papers, tools — the landscape with evidence tiers
├── 10-style-and-language-compression.md
├── 11-tokenizer-arbitrage.md
├── 12-context-architecture.md
├── 13-caching-exploitation.md
├── 14-retrieval-memory-and-state.md
├── 15-output-discipline.md
├── 16-model-routing-and-delegation.md
├── 17-multi-agent-protocols.md
├── 18-provider-features.md
├── 19-infrastructure-level.md
├── 20-frontier-ideas.md                 # ≥10 ideas with feasibility verdicts
├── 30-composed-stacks.md                # Conservative / Aggressive / Unbelievable stacks with full math
├── 31-validation-harness.md             # the no-quality-loss proof protocol, runnable
└── 32-adoption-roadmap.md               # day-1 / week-1 / month-1; manual vs automatable (hooks, skills, jackin')
```

**Per-technique record schema** (every technique in files 10–20 uses exactly this structure):

* **Name** and one-line pitch
* **Layer:** input / output / cache / turn-structure / infra
* **Mechanism:** why it saves tokens, in plain language
* **Expected savings:** % on which token class, with arithmetic on the modeled session profile
* **Evidence tier:** T1–T4, with sources (URL + access date) or local reproduction method
* **Quality risk:** failure modes, how degradation would manifest, and the falsification experiment;
  verdict: `NEGATIVE-COST` / `NEUTRAL` / `RISKY` / `QUALITY-TRADE`
* **Availability:** `CLAUDE-CODE-TODAY` / `SDK` / `GATEWAY-OR-SELF-HOST` / `NOT-USER-ACCESSIBLE`
* **Effort to adopt:** minutes / hours / days / project
* **Composability:** which other techniques it stacks with, and any anti-synergies
* **Validation protocol:** the exact experiment proving no quality loss for this technique

**Writing rules:** every file opens with a TL;DR of ≤5 bullets including the headline numbers;
tables for enumerable comparisons; prose for reasoning; no claim without its tier; plain language —
the operator should never need to decode jargon.

## 11. Scoring and the stacks [#11-scoring-and-the-stacks]

Score every technique: `value = expected $ saving on the modeled profile × confidence (by evidence
tier) ÷ adoption effort`, then place it on a tier list (S/A/B/C/F) in the README. Build three stacks
in `30-composed-stacks.md`:

1. **Conservative** — T1/T2 only, `CLAUDE-CODE-TODAY` only, zero quality risk. The "do this
   tomorrow" stack.
2. **Aggressive** — adds T3 and SDK/gateway-tier techniques with validation plans attached.
3. **Unbelievable** — everything defensible including `BUILDABLE` frontier ideas; the stack that
   chases the full 10x. State its total honestly even if it falls short, and name the binding
   constraint.

For each stack: composed dollar math on the modeled profile (compose savings per token class, then
convert to dollars — never multiply percentages across different token classes), the validation plan
that proves quality is preserved, and the failure modes of the composition itself.

## 12. Validation harness (`31-validation-harness.md`) [#12-validation-harness-31-validation-harnessmd]

Design a runnable protocol the operator can use to verify any technique or stack:

* A paired-task benchmark: same tasks run with and without the technique, n ≥ 10 per arm, measuring
  task success, $ per task, tokens by class, and turns to completion.
* **Canary tasks** targeting known compression failure modes: negation preservation, ordering-
  sensitive instructions, numeric precision, "don't do X" constraints, multi-step sequences.
* Quality judging method (objective checks where possible — tests pass, diff correct; LLM-judge with
  rubric where not), and the statistical bar for declaring "no quality loss."
* Where to read the numbers: API usage fields, `/cost`, OTel, transcript JSONL.

## 13. Operating rules for the run [#13-operating-rules-for-the-run]

* **Autonomy:** do not ask the operator questions. When a judgment call arises, make it, and record
  it in an Assumptions section of [the research index](./).
* **Git discipline — commit and push on every finding:** all work happens on the
  `chore/token-optimization` branch. Never switch branches and never commit to `main`.
  Every time you complete or meaningfully update an artifact — a finished report file, a recorded
  measurement table, a phase milestone, a README/tier-list update — commit it and push to `origin`
  immediately, in the same step. Do not batch hours of work into one commit: the operator follows
  the run live through pushed commits, and an unpushed artifact is invisible progress. Do not run,
  wait for, or verify CI/CD on these pushes — commit and push is the entire loop. Follow
  <RepoFile path="CONTRIBUTING.md">CONTRIBUTING.md</RepoFile>: Conventional Commits subject (`docs(research): &lt;what landed&gt;`) and DCO sign-off via
  `git commit -s`. Do not modify any pre-existing repository file; your writes are limited to
  `token-optimization/`.
* **Citations:** every external claim → source URL + access date. Every local measurement → the
  command or method that produced it.
* **Live verification:** pricing, feature availability, and beta status checked against live
  provider documentation during the run, not recalled from training data.
* **Scope honesty:** if an area cannot be completed to the bar, ship the file with an explicit
  `INCOMPLETE` banner naming what's missing — never silently thin it out.

## 14. Definition of done [#14-definition-of-done]

* [ ] All 19 files in §10 exist and follow the writing rules; README carries the tier list,
  headline numbers, research date, and Assumptions section.
* [ ] ≥ 40 techniques cataloged across files 10–19; ≥ 15 with the complete per-technique record.
* [ ] ≥ 10 frontier ideas in `20-frontier-ideas.md`, each with a feasibility verdict and math.
* [ ] Phase-0 baseline audit contains real measured numbers from this environment, including the
  caveman/wenyan tokenizer verification table and the thinking-vs-visible output decomposition
  (or an explicit note on why a measurement was impossible and what was done instead).
* [ ] Every headline number in `00-executive-summary.md` survived the Phase-3 adversarial pass;
  the claim graveyard (popular-but-false savings claims) is included.
* [ ] Three composed stacks with end-to-end dollar math; explicit verdict on whether 10x is
  achievable, and the binding constraint if not.
* [ ] The negative-cost technique set is explicitly identified.
* [ ] `31-validation-harness.md` is concrete enough to run without further design work.
* [ ] `32-adoption-roadmap.md` separates automatic (hooks/skills/plugin/jackin'-baked) from
  discipline-dependent adoption, with a day-1 / week-1 / month-1 sequence.
* [ ] Every external claim carries a source + access date; every measurement carries its method.
* [ ] Every artifact landed as an incremental commit pushed to `origin` on
  `chore/token-optimization` — no batched end-of-run dump; the final state is pushed.
* [ ] A final self-audit against this checklist is appended to [the research index](./), with each box checked
  or honestly marked failed.