46 — Fresh literature & market delta (clean-room re-sweep)
46 — Fresh literature & market delta (clean-room re-sweep)
Volume II area file for blind spot 6: a clean-room pass for work Volume I's 03 market scan does not cite, plus everything token-relevant that shipped or published around/after Volume I's freeze. Each entry is tagged timing (published after the freeze) or oversight (existed before, uncited), with its evidence tier and a verdict on whether it changes a Volume I number..
TL;DR
- The single biggest market-delta for this operator: Fable 5's free subscription window ends . Anthropic made Fable 5 free on Pro/Max/Team/Enterprise "through June 22" and "will remove Fable 5 from those plans" on June 23 (credits required after). Volume I measured on Fable 5 inside that window. After 06-23 the cheapest same-tokenizer-family model for a subscriber flips to Opus 4.8 ($5/$25 vs Fable's $10/$50, identical tokenizer) — and this environment already runs Opus 4.8 on the main loop (measured: 465/560 calls). So Volume I's Fable-5-priced dollar figures should be read as ~half for an Opus-4.8 subscriber (carried to 49).
- The KV-cache compression family Volume I missed (SnapKV, H2O, PyramidKV, KVQuant) is real, NeurIPS-grade, and entirely self-host-only on a hosted API. Eviction (H2O 5×@20%, SnapKV 8.2× memory@16K, PyramidKV parity@12% retention) and quantization (KVQuant 3-bit, <0.1 ppl, 10M context) reduce server-side GPU memory, not API-billed tokens — they extend Volume I's self-host tier (19), not the hosted-subscription operator's options. Missed by oversight; verdict matches Volume I's K1/K2 hosted-API block.
- CAG (Cache-Augmented Generation) is the one genuinely new buildable-on-hosted-Claude idea here. The literal mechanism (export/persist KV tensors) is self-host-only, but its pattern — preload a stable corpus once and reuse it instead of retrieving — is approximable today via Anthropic prompt caching: a repo/spec that fits 200K context becomes a long-lived cached prefix read at 0.1×, composing with caching rather than fighting it (unlike LLMLingua). REAL-NOW.
- A new risk axis: prompt-compression proxies are an attack surface. CompressionAttack (arXiv 2510.22963) flips downstream agent behavior with up to 80% success by perturbing inputs that survive compression — adding a security reason to Volume I's cost-based "don't put a compressor in the hot path" verdict. Two new lossless-compression papers (LoPace at-rest; LTSC meta-tokens) do not help a hosted-API token bill (one compresses disk bytes, the other needs a fine-tuned model).
- No post-06-12 change to Anthropic's API cost primitives (caching multipliers, context-editing
beta, memory tool, effort ladder all unchanged, re-). The deltas are Claude Code
app features (
/cdpreserves cache; subagents now nest 5 deep;/usagegained a cache-miss line) and the 06-15 billing split (file 41). Routing research broadened past RouteLLM (LLMRouterBench, OrcaRouter) but none produces a Claude-cache-aware number that overrides Volume I's verdict.
Dollar/quota context inherited from Volume I and file 41; tiers per finding.
A. Provider changelog drift (re-verified live)
| Change | Date | Tier | Effect on a Volume I number |
|---|---|---|---|
| Fable 5 free window ends; removed from plans 06-23 | T1 | Big — model-tiering flips to Opus 4.8 for subscribers (49) | |
SDK/claude -p/GitHub-Actions usage off the subscription cap → separate API-rate credit | T1 | Bifurcates the cost model by surface (file 41 Q7) | |
| Claude Code: subagents nest up to 5 levels deep (2.1.172) | T1 | Compounds the ~7× team blowup geometrically (47 governance) | |
Claude Code /cd preserves the prompt cache across a dir change (2.1.169) | T1 | New negative-cost cache lever (technique FL2) | |
/usage adds cache-miss + subagent/long-context attribution (2.1.174) | T1 | Partially closes Volume I white-space #2 (cache tooling) | |
| OpenAI GPT-5.5 in API ($5/$30, $0.50 cached = 0.1×; effort none→xhigh; 1.05M ctx; >272K billed 2×/1.5×) | T1 | Portability (45); no Claude-side change | |
OpenAI prompt_cache_retention defaults to 24h (non-ZDR) | T1 | Strengthens OpenAI cross-provider cache case (45) | |
| OpenAI container sessions billed per-minute (5-min min) | T1 | Minor; rewards finishing fast (43) | |
| Gemini implicit caching default, 90% discount | live | T1 | Confirms Volume I baseline; no drift |
| Anthropic API cost primitives (cache 0.1×/1.25×/2×, context-editing, memory tool, effort) | unchanged | T1 | Volume I's API math stands as of 06-13 |
The verdict on the changelog: Anthropic's billable API primitives did not move; the operator-facing deltas are app features and the two billing changes (Fable promo, 06-15 split). Both billing changes hit the subscriber directly and are folded into files 41 and the verdict (49).
B. The KV-cache compression family (eviction + quantization) — self-host-only
Two branches, all NeurIPS-grade, all 0-hit in Volume I (oversight). A unified user-reachable implementation exists (KVCache-Factory, github.com/Zefan-Cai/KVCache-Factory) — but only for self-hosted open models.
| Method | Branch | Headline | Venue | Hosted Claude? |
|---|---|---|---|---|
| H2O (2306.14048) | eviction (heavy-hitter) | KV memory −5× @20% budget; throughput up to 29× | NeurIPS 2023 | No — self-host |
| SnapKV (2404.14469) | eviction (observation-window) | 8.2× memory, 3.6× gen speed @16K; "comparable" quality | NeurIPS 2024 | No — self-host |
| PyramidKV (2406.02069) | eviction (per-layer budget) | full-cache parity at 12% retention | preprint (T2) | No — self-host |
| KVQuant (2401.18079) | quantization (3-bit) | <0.1 ppl loss; 10M context on 8×A100 | NeurIPS 2024 | No — self-host |
Verdict: these reduce GPU memory and extend context inside the inference engine; the API boundary still bills the same prompt tokens. They are a self-host serving lever — the same scope as Volume I's vLLM/SGLang/HiCache section (19) and its K2 (LMCache/CacheGen) — and do not give a hosted-Claude subscriber a token or quota saving. Honest negative result that bounds the gap.
One sharpening for thinking-heavy Claude work: "Hold Onto That Thought" (arXiv 2512.12008) finds KV compression on reasoning can be perverse — eviction at low budgets lengthens reasoning traces (more output tokens), the same "compress-more-costs-more" direction Volume I found for LLMLingua. So even on self-host, KV-evicting a thinking model can raise output cost. (Exact accuracy-drop percentages need the full PDF — flagged.)
C. CAG vs RAG — and the buildable hosted-Claude pattern
CAG ("Don't Do RAG…", arXiv 2412.15605, WWW '25 short paper) preloads the whole knowledge base into context, precomputes and persists its KV cache, and answers queries with no retrieval step. It meets or beats RAG on BERTScore when the corpus fits the context window (small margins, single 8B model). The literal mechanism (persist KV tensors) is self-host-only. (The 2.33 s vs 94.35 s latency and exact BERTScore figures appear only in secondary sources — not cited as primary; flagged.)
The portable insight is the technique below (FL1): on hosted Claude the CAG pattern is approximable with prompt caching — preload the stable corpus as a cached prefix, reuse at 0.1×.
D. Prompt compression past LLMLingua / 500xCompressor
| Work | What | Verdict for a hosted-Claude token bill |
|---|---|---|
| LoPace (2602.13266, Feb 2026) | lossless compression at rest (Zstd/BPE+packing), 72.2% disk saving, 100% reconstruction | Irrelevant to tokens — compresses bytes on disk, not tokens sent; only helps a memory-store's footprint (Vol I 14). Category error to avoid. |
| LTSC meta-tokens (2506.00307, May 2025) | lossless token compression, 18–27% sequence reduction | Self-host-only — needs a model fine-tuned to decode meta-tokens; a frontier hosted model can't read them. Bounds the "lossless input compression on hosted APIs" hope as still empty. |
| CompactPrompt (secondary, 2026) | pruning + abbreviation + data-quantization, "up to 60%" | T4, do not propagate — single vendor guide, lossy NL-oriented; the LLMLingua code-degradation + cache-breaking caveats almost certainly hold. |
| CompressionAttack (2510.22963, Oct 2025) | security: perturbations surviving compression flip agent behavior (≤80% ASR) | New risk axis — adds a security reason to Volume I's cost-based "no compressor in the hot path." |
Net: the prompt-compression frontier since LLMLingua-2/500x produced no new lossy compressor that is both user-reachable on hosted Claude and safe for code — the two lossless advances are either at-rest (no token effect) or self-host (fine-tuned decoder), and the one fresh empirical result is an attack on compressors. Volume I's verdict (LLMLingua-class proxies are a QUALITY-TRADE/cost trap for coding) is reinforced, now on security grounds too.
E. Routers past RouteLLM / RouterArena
- LLMRouterBench (2601.07206, Jan 2026): 400K+ instances, 33 models, 10 baselines; top routers +4% accuracy or 31.7% cost reduction at baseline performance. The larger benchmark Volume I (16) did not have; a better instrument, but its cost numbers are benchmark-wide, not Claude-pair- or cache-aware, so it does not override Volume I's verdict that per-request gateway routing breaks Claude's model-scoped cache.
- OrcaRouter (2605.30736, May 2026): production LinUCB contextual-bandit router, RouterArena #2 at submission. Resolves part of Volume I 16's open "no current router eval" question — still a stateless-API metric, same cache caveat.
- RouterEval (EMNLP 2025): pointer (secondary-sourced); confirms the routing-eval field broadened.
Verdict: routing evaluation matured, but the Claude-Code-specific blocker Volume I identified (per-request routing forfeits the model-scoped prompt cache, and tokenizer-blind routers mis-price cross-tier moves) is unchanged. No new router produces a cache-aware Claude number.
F. "Context engineering" tooling — a rename, not new math
A "context engineering" platform category solidified by mid-2026 (Atlan Context Studio, Packmind, Tessl, Ruler; native features in Claude Code/Cursor/Copilot), framed as a governed 5-phase context layer with vendor accuracy claims (94–99% "with governed context" vs 10–31% "without"). The accuracy figures are vendor-interested and unmeasured (T4). Honest read: this renames techniques Volume I already covers (retrieval, memory, context architecture) under an enterprise-platform banner — it changes vocabulary, not the dollar or quota math. Logged so the dossier is not blind to the term.
Techniques
FL1. CAG pattern on hosted Claude — preload the stable corpus as a cached prefix, reuse at 0.1×
The buildable residue of CAG: when a knowledge base fits the 200K subscription context, cache it once instead of retrieving per turn.
- Coverage-delta: New. CAG is 0-hit in Volume I (confirmed 40-overview); Volume I 14 debates RAG-vs-agentic-search but never the preload-and-cache pattern.
- Layer: input / cache / retrieval.
- Mechanism: place a stable corpus (the repo's key files, a spec, an API reference) in the prompt
prefix behind a
cache_controlbreakpoint; the first turn writes it (1.25–2×), every later turn reads it at 0.1× with no retrieval step, no retrieval errors, and no per-query embedding cost. This is the hosted-API analogue of CAG's "preload + reuse KV" — it composes with prompt caching, unlike LLMLingua which fights it. - Expected savings: converts per-turn RAG retrieval+injection into a one-time cached-prefix write
- 0.1× reads. Wins when the corpus fits context and is queried many times within a TTL; on a subscription the cached reads are also ~0.1× against the cap (file 41). Bounded by the 200K subscription context (file 41) — large corpora still need retrieval.
- Evidence tier: T2 (CAG paper, WWW '25) for the pattern's quality-vs-RAG; T1 for the prompt-cache mechanics it rides.
- Quality risk: NEUTRAL-to-NEGATIVE-COST when the corpus fits (CAG meets/beats RAG on BERTScore; removes retrieval errors). RISKY if the corpus exceeds context (silent truncation) or goes stale in the cached prefix. Falsify by comparing answer quality preload-vs-retrieve on your corpus.
- Availability: CLAUDE-CODE-TODAY (a stable prefix / read-only files) / SDK (
cache_control). - Effort to adopt: hours (curate the corpus into a stable cached block).
- Composability: stacks with prompt caching (13) and the codebook/pointer ideas (Volume I 14/20); anti-synergy with anything that mutates the prefix (busts the corpus cache).
- Validation protocol: for a repeatedly-queried corpus, A/B preload-and-cache vs per-turn
retrieval; require equal answer quality and confirm later turns show
cache_readon the corpus.
FL2. Use /cd to change directory without busting the cache
A shipped post-06-12 cache-preservation lever Volume I's caching file predates.
- Coverage-delta: New (Claude Code 2.1.169). Volume I 13 enumerates cache-safe vs
busting operations but predates
/cd. - Layer: cache / session habits.
- Mechanism:
/cdmoves a session to a new working directory "without breaking the prompt cache mid-session"; previously a directory change churned the dynamic prefix (the six-field block, file
- and forced a full rewrite at 1.25–2×.
- Expected savings: one avoided full-history rewrite per directory change — the per-bust economics of Volume I 13 tech 1 / file 41 Q2 (≈150K × (1.25–2.0 − 0.1) tokens).
- Evidence tier: T1 (Claude Code changelog 2.1.169).
- Quality risk: NEGATIVE-COST (same context, cache preserved).
- Availability: CLAUDE-CODE-TODAY (v2.1.169+).
- Effort to adopt: zero (use
/cdinstead of restarting in a new dir). - Composability: extends Volume I 13's cache-safe operation list; pairs with prefix-stability (41 Q2).
- Validation protocol: change directory via
/cdvs a restart; confirm the/cdpath showscache_readcontinuity in JSONL.
FL3. Keep prompt-compressors out of the hot path — now for security too
Volume I disfavored LLMLingua-class proxies on cost; CompressionAttack adds an integrity reason.
- Coverage-delta: New risk axis. Volume I 19 covers the cost trap of compression proxies; the compression-as-attack-surface result is 0-hit.
- Layer: infra / security (meta).
- Mechanism: CompressionAttack (2510.22963) crafts inputs that survive a compression module but flip downstream agent behavior (≤80% ASR, 98% preference-flip, high stealth) — targeting both hard (LLMLingua/Selective-Context) and soft (ICAE/AutoCompressors) compressors. A compressor in the request path is therefore both a cost risk (Volume I) and an exploitable integrity boundary.
- Expected savings: none — loss-avoidance. Reinforces routing compression to ingestion-only seams (Volume I 20 K15) or avoiding it entirely for code.
- Evidence tier: T2 (peer-reviewed-style preprint, Oct 2025).
- Quality risk: QUALITY-TRADE / security — the proxy can be weaponized.
- Availability: n/a (a reason not to deploy a class of tool).
- Effort to adopt: minutes (policy: no compressor in the hot path; see file 47).
- Composability: strengthens Volume I's LLMLingua kill; relevant to governance (47) and infra (19).
- Validation protocol: if a compressor is unavoidable, red-team it with CompressionAttack-style perturbations before trusting its output.
FL4. Treat the KV-compression family as a self-host lever only — and beware it on reasoning
A clear verdict so the operator stops chasing SnapKV/H2O/PyramidKV/KVQuant on a hosted API.
- Coverage-delta: New (0-hit in Volume I). Extends Volume I 19 (self-host) and 20 K2; the reasoning-trace-lengthening danger is from "Hold Onto That Thought" (2512.12008).
- Layer: infra (self-host serving).
- Mechanism: eviction (H2O/SnapKV/PyramidKV) and quantization (KVQuant) shrink the GPU KV cache to extend context and cut serving memory/latency — but only inside an inference engine you control. On reasoning models, low-budget eviction can lengthen traces (more output), the same perverse direction Volume I found for input compression.
- Expected savings: $0 on hosted Claude. On self-host: 5–8× KV memory / extended context per the papers — a capacity lever, not a hosted-API token saver.
- Evidence tier: T2 (NeurIPS papers) for ratios; T1 for the hosted-API block.
- Quality risk: QUALITY-TRADE / RISKY on reasoning workloads (trace lengthening); NEUTRAL for memory-bound non-reasoning self-host.
- Availability: GATEWAY-OR-SELF-HOST (KVCache-Factory unifies SnapKV/H2O/PyramidKV/StreamingLLM).
- Effort to adopt: extreme (self-host stack) — irrelevant for the hosted-subscription operator.
- Composability: self-host stack only (with Volume I 19/K2); orthogonal to everything hosted.
- Validation protocol: if self-hosting, benchmark the chosen method on your reasoning workload for trace-length inflation before adopting at low KV budgets.
FL5. Re-decide model tiering after — Opus 4.8 becomes the cheapest Fable-family option
When the Fable 5 promo ends, the dominant-profile model choice flips for subscribers.
- Coverage-delta: Drift, not in Volume I (the promo window post-dates the freeze framing). Volume I's tiering (16) and profile assume Fable 5 as the main model.
- Layer: routing / model selection.
- Mechanism: from 06-23, Fable 5 on Pro/Max/Team/Enterprise requires usage credits (dollars); Opus 4.8 — the same tokenizer family (identical token counts) at half the sticker ($5/$25 vs $10/$50) — remains. So the cheapest model that preserves Volume I's tokenizer math is Opus 4.8, and this environment already runs it (465/560 calls measured).
- Expected savings: for a subscriber who was on Fable 5, moving the main loop to Opus 4.8 halves the per-token dollar cost at identical token counts (and avoids spending credits on Fable post-promo). Volume I's Fable-priced dollar figures are ~2× an Opus-4.8 subscriber's (carried to 49).
- Evidence tier: T1 (Fable promo announcement; pricing).
- Quality risk: QUALITY-TRADE (mild) — Fable 5 is the more capable model; the hardest tasks may regress on Opus 4.8. Falsify on your hardest task class (Volume I's open quality-parity question).
- Availability: CLAUDE-CODE-TODAY (
/model). - Effort to adopt: minutes.
- Composability: the tokenizer family is unchanged (Opus 4.8 ≡ Fable 5 tokenizer), so all of Volume I's tokenizer-arbitrage and file-42 image findings transfer directly.
- Validation protocol: run the hardest task class on Fable 5 vs Opus 4.8 before 06-23; if parity holds, default to Opus 4.8.
Surprising findings
- The whole fresh KV-compression literature (four NeurIPS-grade methods + a unified library) lands on the same verdict as Volume I's gist/LMCache frontier: real, impressive, and $0 for a hosted-Claude user. The hosted/self-host wall is the field's defining constraint, not any single technique.
- The most actionable new idea is the oldest one reframed: CAG is just "cache the corpus," which hosted prompt caching already does — the paper's contribution to a hosted operator is the framing, not the mechanism.
- The compression frontier's freshest empirical result is an attack on compressors, not a better compressor — the category Volume I disfavored got more dangerous, not more attractive.
- Anthropic's billable primitives didn't move in the months around the freeze; the action was all in billing policy (Fable promo, 06-15 SDK split) — which is exactly the quota axis (41) Volume I under-weighted.
Verification ledger
| # | Claim | Source (access) |
|---|---|---|
| 1 | Fable 5 free through 06-22, removed 06-23 (credits after); Fable $10/$50, Opus 4.8 $5/$25 | anthropic.com/news/claude-fable-5-mythos-5 |
| 2 | 06-15 SDK/headless/CI off subscription cap → separate API-rate credit | support.claude.com/.../use-the-claude-agent-sdk-with-your-claude-plan |
| 3 | Claude Code: nested subagents 5-deep (2.1.172); /cd preserves cache (2.1.169); /usage cache-miss + attribution (2.1.174) | code.claude.com/docs/en/changelog |
| 4 | SnapKV 2404.14469 (NeurIPS'24, 8.2×@16K); H2O 2306.14048 (NeurIPS'23, 5×@20%); PyramidKV 2406.02069 (parity@12%); KVQuant 2401.18079 (NeurIPS'24, 3-bit <0.1 ppl, 10M ctx); KVCache-Factory | arxiv.org + proceedings.neurips.cc + github.com/Zefan-Cai/KVCache-Factory |
| 5 | KV compression on reasoning lengthens traces at low budgets | arxiv.org/abs/2512.12008 (excerpt; full numbers need PDF) |
| 6 | CAG 2412.15605 (WWW'25): preload+persist KV, meets/beats RAG when corpus fits context; literal mechanism self-host | arxiv.org/abs/2412.15605 (latency/BERTScore figures secondary-sourced, flagged) |
| 7 | LoPace 2602.13266 (lossless at-rest, 72.2%); LTSC 2506.00307 (lossless tokens, needs fine-tuned model); CompactPrompt (T4 secondary) | arxiv.org/html/2506.00307; morphllm.com/prompt-compression |
| 8 | CompressionAttack 2510.22963: ≤80% ASR on prompt-compression modules | arxiv.org/html/2510.22963v2 |
| 9 | LLMRouterBench 2601.07206 (+4% acc / 31.7% cost); OrcaRouter 2605.30736 (RouterArena #2); RouterEval (EMNLP'25, pointer) | arxiv.org/html/2601.07206; arxiv.org/abs/2605.30736 |
| 10 | OpenAI GPT-5.5 ($5/$30, $0.50 cached, 1.05M ctx); cache retention default 24h (05-29); container per-min billing (06-02) | developers.openai.com/api/docs/changelog |
| 11 | Gemini implicit cache default, 90% discount ; no Anthropic API primitive change post-06-12 | ai.google.dev/gemini-api/docs/caching; platform.claude.com prompt-caching |
| 12 | "context engineering" tooling category (Atlan/Packmind/Tessl/Ruler); vendor accuracy claims T4 | atlan.com/know/context-engineering/... |