jackin'
ResearchToken Optimization Research

40 — Volume II extension: gap audit and blind-spot map

40 — Volume II extension: gap audit and blind-spot map

(Volume I froze). This is the first artifact of Volume II: an independent re-mapping of the token-optimization space, overlaid on the frozen Volume I dossier (files 00–32), to find the cells Volume I left blank or drew too thin. It is deliberately pushed before the deep dives so the gap map can be reviewed before research spends on it. Volume I is treated as finished and correct; nothing here edits files 00–32.

TL;DR

  • Eight seeded blind spots were audited by overlaying an independent six-axis taxonomy on all 19 Volume I files (a 14-agent coverage sweep plus a main-process grep). Verdict: five are genuinely thin or absent (quota economics, multimodal/vision/PDF, latency-as-a-cost- axis, budget governance, online quality detection), three are partially covered with specific open sub-questions (fleet/multi-tenant, cross-provider portability, fresh literature). All eight survive as real Volume II work; none turned out already-covered.
  • The optimization target may be the wrong unit for this operator. The local credential is a Max subscription (~/.claude/.credentials.jsonsubscriptionType: max). Volume I prices everything in dollars on API rates and explicitly flags but never solves the quota question; "quota" appears ≥5 times in file 13 and is modeled zero times. For a flat-rate subscriber the binding constraint is the weekly/session cap, where one measured account burned 1,310 cache-read tokens per productive I/O token (GitHub #24147) — so several Volume I "savings" are dollar-only and may vanish or invert against a cap. This is the strongest gap.
  • Multimodal is a near-total blank: 0 real hits in 4,303 dossier lines (the only adjacent fact is one "~125k tokens per 500 kB PDF" page-size estimate). Coding agents screenshot TUIs and browsers, read PDFs, and paste diagrams; image/PDF token cost is measurable today via count_tokens and is unmeasured anywhere in Volume I.
  • Live drift already found (rule #5): count_tokens now rejects claude-fable-5 ("not available — use Opus 4.8"), the model Volume I measured on directly. Volume I established Fable 5 and Opus 4.8 share one tokenizer (exact-family equality), so claude-opus-4-8 is the valid Fable-family proxy for all Volume II token counts; this substitution is noted wherever used. Pricing, plan limits, and betas are re-verified live in the area files, not recalled.
  • Volume II will ship area files 41–47 (one per pursued gap), a frontier file 48 (≥6 new ideas not duplicating Volume I's sixteen K-ideas), and a stacks-and-verdict file 49 (coverage-delta ledger, Corrections to Volume I, and whether any of this moves Volume I's ≈2.6x / ≈5–6.6x / no-true-10x picture). Preliminary verdict-delta below; the arithmetic lands in 49.

The pricing, modeled session profile, and instrument conventions are inherited from 01-economics-and-measurement.md; dollar arithmetic reuses Volume I's $22/day working profile (README Assumption 6) so Volume II numbers compose with Volume I's.


Gap-audit method

The brief forbids restating Volume I's own table of contents and asks instead for an independent taxonomy overlaid on it. The procedure, all:

  1. Independent taxonomy (no web, built first). The token-optimization space was decomposed along six axes chosen without reference to Volume I's A–L area letters (below). The axes are the dimensions of the problem (what you are billed in, which token class carries it, which surface emits it, which lever acts, at what scope, via what delivery), not a list of techniques.
  2. Overlay by coverage sweep. A 14-agent read-only workflow (vol1-coverage-map, one agent per Volume I file 01–32, model Explore, structured output) rated each file's depth on each of the eight seeded blind spots — absent / mention / partial / full — with a file:line citation and an evidence quote. Files 00, 03, 13, and 20 were read in full by the main process instead.
  3. Independent cross-check. A main-process grep -rniE over the eight blind-spot term clusters produced hit counts per file, used to confirm or challenge each agent's depth rating. Where the two disagreed (e.g. multimodal showed 14 raw grep hits), the hits were inspected by hand; all 14 were false positives (revision/decision) or one incidental PDF page-size line — confirming absent.
  4. Verdict. A blind spot is "confirmed thin" only if no file rated full on it and the partials leave the decision-relevant sub-question open. Each confirmed cell carries a one-line dollar-or-quota rationale for why closing it matters.

This overview is the map. The deep dives (E1 fresh web sweep, E2 per-technique records, E3 adversarial validation, E4 verdict delta, E5 self-audit) follow in later commits.

An independent taxonomy of the space

Six orthogonal axes. Any technique is a point in this space; Volume I's density is uneven across it.

AxisValuesWhere Volume I is denseWhere Volume I is thin/blank
A. Cost metric — what you are billed indollars · subscription quota/cap · wall-clock/latency · human attentiondollars (01, all)quota (named, unmodeled) · latency (scattered mentions, no model) · human-time (absent)
B. Token class carrying the costuncached-in · cache-write · cache-read · visible-output · thinking-output · image/vision · document/PDFall five text classes (02, 13, 15)image and document classes (absent)
C. Surface emitting tokenssystem/prefix · tools · messages/context · model output · tool-result media (screenshots, PDFs)first four (02, 12, 15)media tool-results (absent)
D. Lever classstyle · tokenizer · context-arch · caching · retrieval · output-discipline · routing · multi-agent · provider-features · infra · governance/guardrails · cross-agent portabilitythe ten Volume I areas 10–19runtime governance (max_tokens only as a rail) · portability (no matrix) · online quality guarding (offline only)
E. Scope of actionturn · session · cross-session · single-container · fleet/multi-tenant · orgturn→cross-session (12, 13, 14)hosted fleet sharing (self-host done in 19; hosted-subscription fleet thin) · org-cache (mention)
F. Delivery mechanismdiscipline · config · hooks/skills · orchestrator-baked · provider-actionall (32, 20 K16)— (well covered)

Volume I is essentially complete on axis D rows 10–19, axis B text classes, and axis F. The blind spots are concentrated in axis A (any metric other than dollars), axis B/C media classes, axis D governance/portability/online-quality, and axis E fleet scope. That is the shape Volume II fills.

The blind-spot map

Depth is the best rating any single Volume I file earned on that topic in the coverage sweep. "Stake" is the dollar-or-quota reason the cell matters. Citations are file:line in Volume I.

#Blind spotVol I best coverageVerdictStake (why it moves a number or decision)Vol II target
1Subscription & quota economics13:237 Gaps#1, 13:10/204/222 (#24147 1,310:1, "no formula"); 19:7 cost-split onlyTHIN — named ≥5×, modeled 0×For a Max subscriber the cap, not dollars, binds. Cache-read levers that look free in $ may dominate quota; some Vol I savings invert.41
2Multimodal / vision / PDFnone; 03:267/18:165 one "~125k tok/500 kB PDF" estimateABSENT — 0 real hits/4,303 linesA screenshot can be a token bomb or a bargain vs a DOM/AST dump; unpriced. Measurable now via count_tokens.42
3Latency / wall-clock / human-time17:110 "dollars-for-wallclock", 18:132 batch-latency, 19:147 TTFT; 01:52 "speed not savings"PARTIAL — mentioned widely, never a modelWhen finishing faster is worth more than the tokens it costs (fan-out, fast mode, proxy round-trips), Vol I gives no decision rule.43
4Fleet / team / multi-tenant cache19:116-187 self-host (full); 13 tech 7 excludeDynamicSections; 17:110 spawn waves; 30:115 U5PARTIAL — self-host done; hosted-fleet sub-questions openDynamic-section size is unmeasured (13 Gaps#6); hosted N-container prefix sharing and fleet×quota are unpriced.44
5Cross-agent / cross-provider portability14/15:45 scattered availability labels; 18:196 OpenAI/Gemini caching baselinesTHIN — no portability matrixA stack that dies on an agent switch is fragile; which levers survive Cursor/Codex/Gemini/Copilot/Aider/OpenCode is unstated.45
6Fresh literature & market deltastrong scan (10 SoT/TALE, 12 SWE-Pruner, 19 HiCache/LMCache/RadixAttention, 16 RouteLLM)PARTIAL — specific holesMissing entirely: KV-eviction/quant family (SnapKV/H2O/PyramidKV/KVQuant = 0 hits), CAG (0 hits), "context engineering" (0 hits), and any provider changelog drift since 06-12.46
7Vol I's own open questions, worked15:196-201, 18:49, 16:251, 11:192, 13 Gaps, 02:207OPEN — enumerated, unansweredEffort→thinking %, prior-turn-thinking billing, count_tokens-vs-billed drift, dynamic-section size — each now locally answerable; some change stack math.distributed → 49 ledger
8Meta: optimization cost, online quality, governance15:123 max_tokens-as-rail, 32:76 CI linter, 31 offline harness, 32:13 canary re-runsTHIN — no runtime governance, no online drift detectionMeasurement machinery has its own token cost (break-even of optimizing); production drift needs live canaries, not an offline suite; hard spend caps/circuit breakers are unbuilt.47

Volume II index

FileTitleMaps to blind spot(s)Status
40-extension-overview.mdThis gap audit and blind-spot mapmethodlanded
41-subscription-and-quota-economics.mdQuota-weighted cost model for a capped subscriber1 (+ fleet×quota of 4)pending
42-multimodal-token-economics.mdImage / screenshot / PDF token costs, measured2pending
43-latency-and-time-economics.mdWall-clock and human-time as a second cost axis3pending
44-fleet-and-multitenant-cache.mdHosted cross-container cache sharing and dedup4pending
45-cross-agent-portability.mdPortability matrix across coding agents5pending
46-fresh-literature-and-market-delta.mdClean-room re-sweep; KV-eviction family, CAG, changelog drift6 (+7)pending
47-meta-cost-governance-and-online-quality.mdCost of optimizing; budget governance; live quality guards8 (+7)pending
48-extension-frontier.md≥6 new frontier ideas (not duplicating K1–K16)pending
49-extension-stacks-and-verdict.mdCoverage-delta ledger, verdict delta, Corrections to Volume I7pending

If any pursued gap collapses on contact with research (turns out adequately covered, or its arithmetic proves it cannot move a number), its file ships with an INCOMPLETE banner saying so and the count of full area files stays at the brief's floor of five.

Preliminary verdict delta (hypotheses to be settled in 49)

Volume I's headline stands until 49's arithmetic says otherwise. The candidate movers, in order of how much they could change the picture:

  1. Metric replacement, not multiplier change (largest). If the operator is quota-bound (Max), the right denominator is "tasks per weekly cap," not "dollars per task." Under that metric the tier list re-sorts: cache-read-heavy levers and fleet fan-out can lose even where they win on dollars, because reads dominate quota ~1,310:1. Volume II's likeliest headline is a second cost model presented alongside Volume I's dollar model, not a new multiple on the same axis.
  2. No new dollar multiplier is expected to break the 10x wall. The binding constraints Volume I named (frontier thinking output; the cache-read floor) are structural; nothing in the eight gaps obviously removes them. Multimodal, latency, and governance change which choice is correct and what you risk, not the ceiling on dollar reduction at equal quality.
  3. A few gaps may add modest, real dollar levers (e.g. screenshot-vs-text substitution where a screenshot is genuinely cheaper; vision-token discipline). These will be costed honestly on the profile and slotted into the tier list in 49.

Self-audit mirror (Volume II definition-of-done — live)

  • Blind-spot map built by overlaying an independent taxonomy on Volume I with file:line evidence of thin/absent coverage. (this file)
  • ≥5 new area files (41–47), writing rules followed; ≥25 genuinely-new techniques with coverage-delta notes; ≥10 with the full record.
  • ≥6 new frontier ideas with feasibility verdicts + math. (48)
  • Subscription/quota cost model delivered, or INCOMPLETE banner naming what Anthropic does not publish and what was measured instead. (41)
  • Multimodal/vision/PDF token costs measured locally via count_tokens with method shown. (42)
  • Every Volume II headline number survived an adversarial pass; a Volume II graveyard included. (E3 → 49)
  • Verdict delta with arithmetic, even if "no change." (49)
  • Any Volume I contradiction reconciled into the relevant file. (49)
  • Every external claim has source; every measurement its method.
  • Every artifact committed and pushed to origin on chore/token-optimization as it lands. (in progress)
  • Volume II self-audit appended to README; judgment calls in the Volume II Assumptions section. (E5)

Instruments and conventions (Volume II)

  • count_tokens via OAuth (free, non-billable), rebuilt this run at /tmp/ct.py (the Volume I path; the prior container's copy did not persist). Reads ~/.claude/.credentials.jsonclaudeAiOauth.accessToken, posts to /v1/messages/count_tokens with the oauth-2025-04-20 beta header. Sanity check, "The quick brown fox jumps over the lazy dog.": Opus 4.8 = 24, Sonnet 4.6 = 18, Haiku 4.5 = 18 (+33% Fable-family premium — consistent with Volume I's ~30%).
  • Fable-family tokenizer = claude-opus-4-8 for Volume II (Fable 5 no longer accepts count_tokens; the two share a tokenizer per Volume I 00 §10). All "Fable 5" token counts in Volume II are Opus 4.8 counts and labeled as such.
  • Transcripts: ~/.claude/projects/**/*.jsonl (19 files this run) carry per-call message.usage; same source Volume I used for decomposition.
  • No image/PDF tooling on this box (no PIL/imagemagick/qpdf). Volume II generates test PNGs and PDFs from the Python standard library (zlib) so the image/document token curves can be measured at controlled dimensions; method shown in 42.
  • Dollar profile: Volume I's $22/day working figure (6 sessions, 55% thinking) for any $-arithmetic; the $17/day floor where a file explicitly uses it. Ratios are profile-invariant.

On this page