19 — Infrastructure-level (self-hosted / gateway tier)
TL;DR
- Two regimes, opposite strategies. On hosted Claude you cannot touch the serving stack (the only KV lever in the live Messages API is cache_control: 4 breakpoints, 5m/1h TTL); native prompt caching already removes ~74% of what a heavy session would cost uncached (ESTIMATE, arithmetic below). The infra job there is protective: a gateway that silently breaks the cache makes the session ~3.8x more expensive (ESTIMATE).
- On self-hosted open-weights backends, the real levers are all KV-cache-hit-rate engineering: vLLM V1 prefix caching (default-on, <1% overhead at 0% hit rate, T1), SGLang HiCache (production coding agent: hit rate 40%→80%, TTFT −56%, throughput 2x, T1-vendor), and cache-aware routing (20%→75% hit rate, 1.9x throughput vs round-robin, T1). The router can matter more than the engine.
- The big trap is the prompt-compression proxy (LLMLingua-style): a cache-breaking compressor must exceed ~5.5x compression just to break even against Anthropic's 0.1x cache reads (ESTIMATE); code tolerates only ~10% prompt reduction (T2); a 2026 pre-registered RCT on Claude Sonnet 4.5 found keep-20% compression increased total cost 1.8% (T2). QUALITY-TRADE at best.
- Killed folklore (measured locally): base64 "compression" costs 4.33x MORE tokens; gzip+base64 cuts characters 17.8% while costing 2.68x MORE tokens; semantic-cache "~68% savings" numbers are consumer-QA transplants with no coding-agent evidence.
- Cheapest real wins today: gateway hygiene checklist (forward anthropic-beta/anthropic-version, one workspace per traffic lane, CLAUDE_CODE_ATTRIBUTION_HEADER=0 for body-keyed caches, assert cache_read_input_tokens > 0), and deterministic JSON minification of verbose MCP schemas (−22.9% on the rewritten segment, local measurement; deferred tool loading saves more, see 12-context-architecture.md).
The two regimes
| | Hosted Claude (jackin' / Claude Code today) | Self-hosted open weights behind Claude Code |
|---|---|---|
| Billing unit | Tokens (Fable 5 $10/$50 per MTok; cache read 0.1x, write 1.25x/2x) | GPU-seconds; "tokens saved" = prefill compute saved |
| KV access | cache_control only (verified against live Messages API) | Full: prefix cache, offload tiers, routing, eviction |
| Strategy | Protect the 0.1x cache through any proxy; route models; batch | Maximize KV hit rate across sessions and replicas |
| Anti-pattern | Compression/rewriting proxies that destabilize the prefix | Round-robin load balancing (collapses hit rate toward 0%) |
All dollar arithmetic below uses the modeled session profile from 01-economics-and-measurement.md (local Phase-0 measurement): per heavy session ~19 API calls, 5.5k uncached input / 85k cache-write / 1.17M cache-read / 27k output; dollar split cache reads 32 / writes 29 / thinking 20 / visible output 17 / uncached input 2 (normalized to 100). Key derived quantity: full-price-equivalent input = 32/0.1 + 29/1.25 + 2 = 345.2 units, so the uncached-equivalent session is 345.2 + 37 = 382.2 units vs 100 actual → native caching removes 73.8% (~74%) of uncached-equivalent cost, and a fully broken cache costs 3.82x (~3.8x). ESTIMATE — n=1 session mix; the protective conclusion is robust to mix variation, the exact multipliers move with the cache-read share.
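The derived quantities above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes every cache write in the mix was billed at the 1.25x (5m TTL) rate, as the paragraph's own formula does; the function and constant names are illustrative.

```python
# Dollar split of the modeled heavy session, normalized to 100 (from the text above).
SPLIT = {
    "cache_read": 32.0,
    "cache_write": 29.0,
    "thinking": 20.0,
    "visible_output": 17.0,
    "uncached_input": 2.0,
}
READ_MULT = 0.1    # cache reads bill at 0.1x the base input price
WRITE_MULT = 1.25  # assumption: all writes are 5m-TTL writes at 1.25x


def full_price_input_units(split=SPLIT):
    """Input-side cost if every prompt token were billed at 1x."""
    return (split["cache_read"] / READ_MULT
            + split["cache_write"] / WRITE_MULT
            + split["uncached_input"])


def uncached_equivalent_units(split=SPLIT):
    """Whole-session cost with no caching; output units are unchanged."""
    output_units = split["thinking"] + split["visible_output"]
    return full_price_input_units(split) + output_units


actual = sum(SPLIT.values())                      # 100 by construction
uncached = uncached_equivalent_units()
print(round(full_price_input_units(), 1))         # full-price-equivalent input
print(round(uncached, 1))                         # uncached-equivalent session
print(round(1 - actual / uncached, 3))            # share removed by caching
print(round(uncached / actual, 2))                # broken-cache multiplier
```

Changing the SPLIT values shows how the multipliers move with the cache-read share, which is the sensitivity the ESTIMATE label warns about.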
Local measurements
Method: text piped to /tmp/ct.py <model> → live count_tokens API. Fresh n=1 samples, independently reproducing the Phase-0 numbers (which used different samples).
| Variant | chars | tokens (claude-fable-5) | vs plain |
|---|---|---|---|
| English prose sample | 642 | 189 | — |
| base64 of same bytes | 856 | 819 | 4.33x MORE (Phase-0: 4.3x) |
| gzip+base64 of same bytes | 528 | 506 | 2.68x MORE despite 17.8% FEWER chars (Phase-0: 2.8x) |
| 5-param tool schema, pretty-printed (indent=2) | 938 | 310 | — |
| Same schema, JSON-minified | 712 | 239 | −22.9% tokens (Phase-0 schema: −25.7%) |
| Prose sample on claude-sonnet-4-6 tokenizer | 642 | 125 | Fable 5 tokenizer +51% on this jargon-dense sample (Phase-0 prose: +38%; direction confirmed, magnitude sample-dependent) |
The gzip row is the cleanest demonstration in this dossier that character-level "compression %" claims are not token claims: characters went DOWN, tokens went UP 2.7x.
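The character side of the table is easy to reproduce offline; the token side is not, because it needs the live count_tokens call. The sketch below covers only the character counts, and the helper name is hypothetical. The original 642-character sample is not reproduced here, so any input text is a stand-in.

```python
import base64
import gzip


def encoded_char_counts(text: str) -> dict:
    """Character counts for plain text, base64, and gzip+base64 of the same bytes."""
    raw = text.encode("utf-8")
    return {
        "plain": len(text),
        "base64": len(base64.b64encode(raw)),
        "gzip+base64": len(base64.b64encode(gzip.compress(raw))),
    }


# base64 output length depends only on byte count: 4 * ceil(n / 3).
# For 642 bytes that is 4 * 214 = 856, matching the base64 row above.
print(encoded_char_counts("a" * 642)["base64"])
```

The gzip+base64 character count depends on how compressible the sample is, so it will differ from 528 on other text; the point of the table is that neither character count predicts the token count.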
Hosted tier (protective)
Claude Code gateway hygiene: preserve the 0.1x cache through the proxy
One-line pitch: the highest-leverage infra action for a hosted Claude Code fleet is purely defensive — a gateway that strips beta headers, sprays one team across workspaces, or perturbs request bytes silently converts 0.1x cache reads back to full-price input.
- Layer: infra (gateway config, hosted Claude).
- Mechanism: Cache hits require "100% identical prompt segments" (official docs). Three documented ways a gateway breaks this invisibly: (1) not forwarding anthropic-beta/anthropic-version headers (Messages) or not preserving anthropic_beta/anthropic_version body fields (Bedrock): "Failure to forward headers or preserve body fields may result in reduced functionality"; (2) cache isolation became per-WORKSPACE (Claude API, Claude Platform on AWS, Microsoft Foundry beta; Bedrock and Vertex remain org-level), so key rotation or failover across workspaces/providers splits the cache; (3) Claude Code "prepends a short attribution block to the system prompt containing the client version and a fingerprint derived from the conversation". The Anthropic API strips it ("does not affect first-party prompt caching"), but any gateway caching on the raw body, or any non-Anthropic backend, sees it as prompt bytes; the official kill switch is CLAUDE_CODE_ATTRIBUTION_HEADER=0. The gateway must also expose /v1/messages/count_tokens.
- Expected savings: protective, not additive: keeps 92.83% of prompt-side tokens billed at 0.1x (local measurement; cache reads+writes = 61% of session dollars). Cost of failure: 345.2 input units instead of 63 → broken-cache session ≈ 3.8x (ESTIMATE, arithmetic above). On the modeled day ($22), a silently broken cache ≈ $84/day.
- Evidence tier: T1 — all mechanism claims from official Claude Code / Claude API docs, fetched.
- Quality risk: NEGATIVE-COST in the protective sense — zero model-visible change; pure plumbing correctness. Degradation manifests as a cost-dashboard jump with identical outputs. Falsification: none needed for quality; the failure mode IS the bill.
- Availability: CLAUDE-CODE-TODAY (ANTHROPIC_BASE_URL, ANTHROPIC_AUTH_TOKEN, CLAUDE_CODE_ATTRIBUTION_HEADER).
- Effort to adopt: minutes (env vars, one workspace pinned per traffic lane, one monitoring assertion).
- Composability: precondition for every other hosted-tier technique in this dossier and for the cache work in 13-caching-exploitation.md; the X-Claude-Code-Session-Id / X-Claude-Code-Agent-Id / X-Claude-Code-Parent-Agent-Id headers give per-subagent cost attribution without body parsing. Anti-synergy: any request-rewriting middleware (see minification record below) unless provably byte-stable.
- Validation protocol: run one fixed 10-turn Claude Code session direct-to-API and one through the gateway; diff per-call usage: assert cache_read_input_tokens > 0 from call 2 onward through the proxy and that the uncached input_tokens share matches the direct run within noise. Re-run after every gateway upgrade. Alert on session-level cache-read share dropping below ~85% of prompt tokens.
- Supply-chain note found in the official gateway doc: LiteLLM PyPI versions 1.82.7 and 1.82.8 were compromised with credential-stealing malware; Anthropic explicitly does not endorse or audit LiteLLM. Pin and verify gateway versions.
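The monitoring assertion in the validation protocol can be sketched as a pure function over recorded per-call usage objects. This is an illustration, not a gateway feature: the function names and the list-of-dicts input shape are assumptions, and cache_creation_input_tokens is assumed to be the usage field that carries cache writes.

```python
def prompt_tokens(usage: dict) -> int:
    """All prompt-side tokens for one call: uncached + cache writes + cache reads."""
    return (usage.get("input_tokens", 0)
            + usage.get("cache_creation_input_tokens", 0)
            + usage.get("cache_read_input_tokens", 0))


def check_cache_health(usages: list, min_read_share: float = 0.85) -> list:
    """Return human-readable problems; an empty list means the session looks healthy."""
    problems = []
    # Call 1 is expected to write the cache, so reads are checked from call 2 onward.
    for call_no, usage in enumerate(usages[1:], start=2):
        if usage.get("cache_read_input_tokens", 0) <= 0:
            problems.append(f"call {call_no}: cache_read_input_tokens is 0")
    total = sum(prompt_tokens(u) for u in usages)
    reads = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    if total and reads / total < min_read_share:
        problems.append(
            f"session cache-read share {reads / total:.1%} is below {min_read_share:.0%}"
        )
    return problems
```

Running the same check on the direct-to-API session and on the proxied session gives the diff the protocol asks for; the 0.85 default mirrors the alert threshold above.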
The not-user-accessible ledger (hosted API) — stop chasing these
One-line pitch: verified against the live Messages API reference: no logprobs, no draft-model/speculative parameter, no predicted-outputs, no KV export beyond cache_control, and no train-your-own-distillate path. Blogs promising these on hosted Claude are selling self-host techniques in hosted clothing.
- Layer: infra (hosted API boundary).
- Mechanism: complete top-level Messages API parameter surface: max_tokens, messages, model, cache_control, container, inference_geo, metadata, output_config, service_tier, stop_sequences, stream, system, temperature, thinking, tool_choice, tools, top_k, top_p. Speculative decoding is lossless by construction (same tokens, faster), so even server-side it cannot reduce a per-token bill, and there is no API to submit draft tokens. Anthropic Consumer Terms §3 prohibit using the service "to develop or train any artificial intelligence or machine learning algorithms or models" (commercial-terms wording not independently fetched; flagged). What you DO get: prompt caching, Batch API at 50% off, model routing (Haiku 4.5 $1/$5 vs Fable 5 $10/$50), service_tier, inference_geo.
- Expected savings: zero by definition; the saving is operator time and avoided spend on middleware claiming these levers. (No arithmetic applies.)
- Evidence tier: T1 — absence verified in live primary documentation.
- Quality risk: NEGATIVE-COST — knowing the boundary prevents quality-degrading workarounds (e.g., sampling-based logprob estimation in production paths).
- Availability: NOT-USER-ACCESSIBLE (the listed levers); the substitutes are CLAUDE-CODE-TODAY/SDK.
- Effort to adopt: none.
- Composability: frames the area: hosted = protect caching + route models + batch; self-host = the four KV techniques below.
- Validation protocol: none needed (negative result); re-verify the parameter list against the live API reference quarterly, since new parameters ship without changelogs reaching gateway vendors.
Deterministic request rewriting at the gateway (tool-schema JSON minification)
One-line pitch: a gateway may rewrite requests losslessly-for-the-model — locally measured −22.9% tokens on a pretty-printed tool schema — but on hosted Claude the rewritten bytes live in the cached prefix, so realized savings are small, and one nondeterminism bug costs more than the rewrite ever saves.
- Layer: input (via infra: gateway request transformation).
- Mechanism: proxy transforms the body identically on every call (idempotent, byte-stable), so the prefix stays cache-consistent while containing fewer tokens. Local measurement (table above): 5-param schema pretty-printed 310 tokens → minified 239 (−22.9%; Phase-0's different schema: −25.7%). Claude Code's built-in tool schemas are already compact; this applies to verbose MCP servers and custom SDK tools.
- Expected savings: ~23-26% of the rewritten segment only (n=2 schemas, local). On the modeled profile: the 11 local MCP schemas = 1,420 tokens ≈ 2.3% of the ~61.6k-token per-call prefix; minifying them saves ~0.5% of prompt-side spend ≈ 0.3% of session dollars (ESTIMATE). Deferred tool loading (1,420 → ~60 tokens, local Phase-0) saves ~4x more on the same block. On self-host backends WITHOUT prefix caching configured, the full ~23% of the segment recurs every call, so it matters more there.
- Evidence tier: T1 — locally reproduced, method shown; no third-party product publishes numbers for this.
- Quality risk: NEUTRAL semantically (JSON minification is meaning-preserving) but RISKY operationally: an unstable serializer (key reordering, timestamp injection) violates the 100%-identical-segment rule and silently disables caching, for a strongly negative net effect (~3.8x, see hygiene record). Degradation manifests as cache_read_input_tokens collapsing to 0. Falsification: byte-diff two consecutive rewrites of the same request.
- Availability: GATEWAY-OR-SELF-HOST.
- Effort to adopt: hours, plus a regression test asserting byte-stability of the rewrite.
- Composability: compatible with prompt caching ONLY if deterministic from the session's first request. Pairs with (and is dominated by) deferred tool loading — see 12-context-architecture.md. Anti-synergy with any per-request dynamic content injection.
- Validation protocol: (1) replay 20 identical requests through the rewriter, assert byte-identical outputs; (2) run a fixed session through the proxy and assert cache-read share unchanged vs direct; (3) A/B 20 tool-use tasks (minified vs pretty schemas) and assert equal tool-call success rate — minification should be output-identical, so any diff is a rewriter bug.
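Step (1) of the protocol, together with the byte-stability requirement, can be sketched with the standard library alone. This is a minimal illustration under the assumption that the schema reaches the rewriter as JSON text; minify_json is a hypothetical helper, not part of any gateway product.

```python
import json


def minify_json(text: str) -> str:
    """Re-serialize JSON text without insignificant whitespace.

    Key order is preserved (no sorting), nothing request-specific is injected,
    and the output is a fixed point: minifying it again changes nothing.
    """
    obj = json.loads(text)
    return json.dumps(obj, separators=(",", ":"), ensure_ascii=False)


schema = {
    "name": "read_file",  # illustrative tool, not one of the measured schemas
    "description": "Read a file from the workspace",
    "input_schema": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}
pretty = json.dumps(schema, indent=2)
minified = minify_json(pretty)

# Replay check: 20 rewrites of the same input must be byte-identical.
assert len({minify_json(pretty) for _ in range(20)}) == 1
# Idempotence and meaning preservation.
assert minify_json(minified) == minified
assert json.loads(minified) == schema
print(len(pretty), len(minified))
```

One caveat of this round-trip approach: duplicate keys collapse and some number spellings are normalized (for example an exponent form may be rewritten), so inputs that rely on either should be byte-diffed before the rewriter is trusted.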
Gateway response caching and dedup (LiteLLM exact-match + semantic)
One-line pitch: a cache hit is a 100% token save, but every published hit rate comes from repetitive consumer QA — no vendor publishes a coding-agent hit rate, and interactive coding traffic will exact-match near zero.
- Layer: infra (gateway, whole-call short-circuit).
- Mechanism: exact-match: hash of normalized request body → cached response (LiteLLM backends: in-memory, disk, Redis, S3, GCS; x-litellm-cache-key header on hits; TTL via cache_params, per-request override). Semantic: embed the prompt, vector-search past prompts, serve a cached answer above similarity_threshold (LiteLLM Redis/Qdrant, threshold ~0.8 default-style).
- Expected savings: on a hit, 100% of that call's tokens. The only quantified hit rates come from the GPT Semantic Cache paper: "reduces API calls by up to 68.8%", hit rates 61.6-68.8%, positive-hit (correct-reuse) 92.5-97.3%, measured on 500-query consumer categories ("Order and Shipping", "Customer Shopping QA"). LiteLLM/Portkey/Helicone publish NO hit rates (absence verified in LiteLLM docs). For interactive Claude Code traffic, expect ~0% exact-match (every turn embeds novel transcript bytes); honest scope = idempotent lanes: CI re-runs, eval suites, fan-out subagents re-asking identical questions. ESTIMATE: a CI lane that re-runs an unchanged eval suite nightly saves ~100% of that lane's spend on unchanged days, which on the modeled $22/day interactive profile is additive, not a % of it.
- Evidence tier: T1 that the features exist (docs); T2 for the consumer-QA hit rates (arXiv 2411.05276, via search digest — flagged, PDF tables not independently read); coding-agent applicability is T4.
- Quality risk: RISKY for coding agents — 92.5-97.3% positive-hit means 3-7% of semantic hits serve a WRONG answer confidently; same error message + different repo state = stale cached fix. Degradation manifests as confidently wrong answers that reference outdated repo state. Falsification: log every semantic hit with both prompts and have a judge model rate answer transferability; any sub-99% transfer on code traffic kills it. Verdict: RISKY (semantic) / NEUTRAL (exact-match scoped to idempotent lanes).
- Availability: GATEWAY-OR-SELF-HOST.
- Effort to adopt: minutes-to-hours in LiteLLM config; the hard part is scoping cache keys so only genuinely idempotent traffic is cacheable.
- Composability: orthogonal to KV/prompt caching (this saves whole calls; KV saves prefill). Per-request TTL lets CI lanes cache while interactive lanes stay uncached. Anti-synergy: semantic caching on top of agent traffic with repo state in prompts.
- Validation protocol: enable exact-match on a CI lane only; for 2 weeks compare CI job outcomes (pass/fail parity with uncached re-runs on a 10% holdout). Keep semantic mode off for code unless the transferability audit above passes.
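The scoping problem named under effort to adopt can be illustrated with a lane-gated cache key. This is a sketch of the idea only: it is not how LiteLLM derives x-litellm-cache-key, and the lane labels and function name are hypothetical.

```python
import hashlib
import json
from typing import Optional

# Hypothetical lane labels; only idempotent traffic is allowed to be cached.
CACHEABLE_LANES = {"ci", "eval"}


def exact_match_cache_key(lane: str, body: dict) -> Optional[str]:
    """Return a cache key for idempotent lanes, or None to bypass the cache."""
    if lane not in CACHEABLE_LANES:
        return None  # interactive traffic is never served from the response cache
    # Canonical form: sorted keys and fixed separators, so logically equal
    # bodies hash identically regardless of client serialization order.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the key covers the whole body, any change to the model, system prompt, or messages yields a new key, which is the behaviour that makes exact-match safe on CI lanes and useless on interactive ones.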
Prompt-compression proxy (LLMLingua / LLMLingua-2; productized as Kong AI Prompt Compressor) — QUALITY-TRADE
One-line pitch: the most-marketed infra trick of the area and the worst fit for coding agents — paper-true 20x on math CoT does not survive code's ~10% compression tolerance, output-expansion economics, or Anthropic's 0.1x cache reads, which a recompressing proxy destroys.
- Layer: input (via infra: compression sidecar at the gateway).
- Mechanism: a small model deletes low-information tokens before the request reaches the big model (LLMLingua: small-LM perplexity ranking; LLMLingua-2: token-classification encoder distilled from GPT-4 decisions). Productized as Kong AI Prompt Compressor (Enterprise, Gateway >=3.11, sidecar running microsoft/llmlingua-2-xlm-roberta-large-meetingbank; ratio/token-target modes; no performance numbers published by Kong, verified absent).
- Expected savings: claimed: up to 20x compression (GSM8K 77.33 EM vs 78.85 baseline at 20x), LLMLingua-2 2-5x task-agnostic. Measured reality for code/agents: (a) ~10% prompt-reduction ceiling before significant code-generation degradation (ACM TOSEM "Less Is More"); (b) "Prompt Compression in the Wild" (arXiv 2604.02985): compression "adds latency without quality benefits for code generation", only 1.4x achievable on LongBench LCC, speedups >1.3x only on non-optimized stacks, "compression overhead negates gains" with vLLM or commercial APIs; (c) pre-registered RCT (arXiv 2603.23525): Claude Sonnet 4.5, 59-61 runs/arm (~358 total); with r as the keep ratio, r=0.5 saved 27.9% total cost, but r=0.2 INCREASED total cost 1.8% via 1.03x output expansion: "'compress more' is not a reliable production heuristic". Cache arithmetic on the modeled profile (ESTIMATE), with r now the compression factor (original tokens / compressed tokens): a cache-breaking compressor costs 345.2/r + 37 vs 100 baseline → break-even r ≈ 5.5x; at r=2 the session costs ~2.1x MORE; at r=5, +6%; even r=10 saves only ~28%, and code quality is gone long before r=5.
- Evidence tier: T2 for the original claims (EMNLP'23/ACL'24); T2 AGAINST coding/agent use (2026 RCT + 2026 measurement study + TOSEM); marketing numbers T4. Flagged unverified: TOSEM and 2411-era figures read via search digests, not full PDFs; whether the RCT's arms were cache-eligible is unknown (the RCT did not model provider cache economics — my arithmetic suggests that term dominates).
- Quality risk: QUALITY-TRADE shading into negative-value for coding agents: documented code-structure destruction ("zero accuracy on some code-related tasks" at high rates), edit-similarity drops even when surface accuracy holds, plus the cost arithmetic. Degradation manifests as compile errors, hallucinated identifiers from elided context, and longer outputs. Defensible niche: ONE-TIME deterministic compression of bulky static prose (docs/RAG chunks) so the compressed text itself becomes a stable cached prefix — that is prompt hygiene (see 12-context-architecture.md), not a proxy.
- Availability: GATEWAY-OR-SELF-HOST.
- Effort to adopt: high (sidecar GPU/CPU service; Kong Enterprise license) — effort misallocated per the evidence.
- Composability: ANTI-composes with prompt caching (recompression destabilizes the prefix every turn) and with cache-aware routing (every turn looks novel). Composes only with truly cache-cold, one-shot, prose-heavy traffic — which a Claude Code session is not.
- Validation protocol: if attempted anyway: pre-register r and the metric; run >=50 real repo tasks per arm (control vs compressed) measuring TOTAL cost from usage (input+output+cache classes, not input alone), task pass rate, and output length; abort if cache-read share drops or output expands. The RCT above is the template, and its r=0.2 arm is the expected failure result.
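The cache arithmetic in this record reduces to one formula, sketched below. It assumes the compressor breaks the cache on every call and that output length does not change (the RCT's 1.03x output expansion would make the compressed side slightly worse); here r is the compression factor, not the keep ratio, and the names are illustrative.

```python
FULL_PRICE_INPUT = 345.2  # input units if nothing is cached (modeled profile)
OUTPUT_UNITS = 37.0       # thinking + visible output, unaffected by compression
BASELINE = 100.0          # actual session cost with native caching intact


def session_cost_with_compressor(r: float) -> float:
    """Session cost in baseline units when a cache-breaking compressor shrinks input r-fold."""
    return FULL_PRICE_INPUT / r + OUTPUT_UNITS


def break_even_factor() -> float:
    """Compression factor at which the compressed, uncached session costs exactly BASELINE."""
    return FULL_PRICE_INPUT / (BASELINE - OUTPUT_UNITS)


print(round(break_even_factor(), 2))
for r in (2, 5, 10):
    cost = session_cost_with_compressor(r)
    print(r, round(cost, 1), f"{cost / BASELINE - 1:+.0%}")
```

The break-even moves with the cache-read share of the session mix, so rerunning this with a different FULL_PRICE_INPUT is the quickest way to test the conclusion against another workload.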
Self-hosted tier (KV-hit-rate engineering)
Engine-level automatic prefix caching (vLLM V1 APC)
One-line pitch: on a self-hosted backend, append-only agent transcripts get Anthropic-style caching economics for free — vLLM V1 enables prefix caching by default with near-zero overhead.
- Layer: cache (self-hosted serving engine).
- Mechanism: KV blocks are content-hashed (parent hash + block tokens + extras like LoRA ID and cache salt; sha256 default as of v0.11) and reused when a request shares a block-aligned prefix; only prefill is skipped (docs: it "does not reduce the time of generating new tokens"). Coding-agent transcripts are append-only, so within-session hit rates are naturally high IF the client keeps the prefix byte-stable (set CLAUDE_CODE_ATTRIBUTION_HEADER=0; see hygiene record; the per-conversation fingerprint block is otherwise prompt bytes to vLLM).
- Expected savings: GPU-seconds, not billed tokens (on your own hardware tokens are not the billing unit). vLLM V1 blog: "V1's prefix caching causes less than 1% decrease in throughput even when the cache hit rate is 0%" and "improves the performance several times when the cache hit rate is high"; "we now enable prefix caching by default in V1". Translated to the modeled profile: without it, ~7.56M prompt-side tokens would be re-prefilled per day; at a Novita-like 80% hit rate only ~20% is recomputed (ESTIMATE).
- Evidence tier: T1 — shipped, default-on, published measurements. Flagged unverified: a SqueezeBits blog claim (prefix-share 0.1→0.9 → +32% throughput) and the default block size (16 tokens) were not confirmed in primary docs.
- Quality risk: NEGATIVE-COST — lossless: identical outputs, strictly less compute; residual hash-collision risk addressed by sha256 default. Degradation would manifest only as a serving bug, not model behavior.
- Availability: GATEWAY-OR-SELF-HOST.
- Effort to adopt: zero once you self-host (default-on); the real cost is operating a serving stack at all.
- Composability: the foundation: LMCache extends it across tiers/instances; cache-aware routing extends it across replicas. Defeated by any nondeterministic upstream mutation (gateway header injection into prompts, attribution block, unstable serializers).
- Validation protocol: replay one recorded Claude Code transcript turn-by-turn against the backend; scrape vLLM's prefix-cache hit-rate metric; assert hit rate >90% of the theoretical append-only maximum and byte-identical completions with APC on vs off (temperature 0).
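The chained block hashing described under Mechanism can be illustrated in a few lines. This is a conceptual sketch, not vLLM's implementation: the block size of 16 is the unconfirmed value flagged above, the byte encoding of a block is invented for the example, and extras such as LoRA ID are reduced to a single salt argument.

```python
import hashlib

BLOCK_SIZE = 16  # assumption: flagged above as not confirmed in primary docs


def block_hashes(token_ids, block_size=BLOCK_SIZE, salt=b""):
    """Hash each full block together with its parent hash (sha256 chain)."""
    hashes, parent = [], b""
    full = len(token_ids) - len(token_ids) % block_size  # partial tail is not cached
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        h = hashlib.sha256()
        h.update(parent)                                  # binds the block to its prefix
        h.update(salt)
        h.update(",".join(map(str, block)).encode())      # illustrative encoding
        parent = h.digest()
        hashes.append(parent)
    return hashes


def cached_prefix_tokens(cache, token_ids, block_size=BLOCK_SIZE):
    """Count leading tokens whose blocks are already in the cache."""
    hit = 0
    for digest in block_hashes(token_ids, block_size):
        if digest not in cache:
            break
        hit += block_size
    return hit


turn1 = list(range(40))
cache = set(block_hashes(turn1))            # two full blocks, 32 tokens
turn2 = turn1 + list(range(100, 120))       # append-only next turn
print(cached_prefix_tokens(cache, turn2))   # 32
mutated = [999] + turn2[1:]                 # one changed token at the front
print(cached_prefix_tokens(cache, mutated)) # 0
```

The second print shows why upstream mutation is fatal: because each hash includes its parent, a single changed token near the start of the prompt invalidates every block after it.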
RadixAttention (SGLang)
One-line pitch: the published-paper version of prefix reuse — "up to 6.4x higher throughput" on prefix-heavy agentic workloads — but the headline is against late-2023 baselines, so read it as "prefix reuse matters", not "SGLang is 6.4x cheaper than vLLM today".
- Layer: cache (self-hosted serving engine).
- Mechanism: retains KV for prompts AND generations in a radix tree with LRU eviction, enabling cross-request prefix sharing at arbitrary granularity (not just block-aligned) — suits tree-structured agent work (subagents forking one transcript, parallel tool branches).
- Expected savings: throughput/$ on own hardware. Abstract: "up to 6.4x higher throughput compared to state-of-the-art inference systems" on agent control, reasoning, JSON decoding, RAG, multi-turn chat. Baselines are stale: vllm v0.2.5 / guidance v0.1.8 / TGI v1.3.0 (original LMSYS blog: "up to 5x"), all pre-dating vLLM's default APC.
- Evidence tier: T2: arXiv 2312.07104 (v2) with reproducible benchmarks; no venue listed on the arXiv page. Flagged unverified: the "cache hit rates 50-99%" table is from secondary writeups, not read in the PDF.
- Quality risk: NEGATIVE-COST — lossless KV reuse.
- Availability: GATEWAY-OR-SELF-HOST (OpenAI-compatible API; sits behind a LiteLLM-style translation for Claude Code).
- Effort to adopt: same as standing up any inference server.
- Composability: composes with HiCache and sgl-router below; degrades sharply if clients reorder prompt segments between turns (radix match falls back to full prefill).
- Validation protocol: same transcript-replay harness as the vLLM record; additionally fork two subagents from one parent transcript and verify the shared-prefix portion is not recomputed (radix-tree hit metric).
Hierarchical KV cache offload (SGLang HiCache) — the one with real coding-agent numbers
One-line pitch: GPU memory caps the prefix-cache working set; spilling KV to CPU RAM and remote storage is what makes multi-session coding-agent caching actually hit — the only production coding-agent measurement found in this area: Novita AI, Qwen3-Coder-480B, hit rate 40%→80%, TTFT −56%, throughput 2x.
- Layer: cache (self-hosted engine + storage).
- Mechanism: extends RadixAttention with a three-tier cache — GPU HBM (layer-first layout) → CPU DRAM (page-first) → external storage (Mooncake/RDMA, DeepSeek 3FS, NIXL, local file) — with prefetch and write-back, so evicted session prefixes survive between turns and across sessions instead of being recomputed.
- Expected savings: LMSYS HiCache blog: internal "up to 6x throughput improvement and up to 80% reduction in TTFT". Novita AI coding-agent workload (Qwen3-Coder-480B, 25K+-token dialogues, ~8 turns, 3FS backend): "average TTFT dropped by 56%, inference throughput doubled, and the cache hit rate jumped from 40% to 80%". Ant Group (DeepSeek-R1-671B, Mooncake, PD-disaggregated): "cache hits achieved an 84% reduction in TTFT compared to full re-computation". Arithmetic: 40%→80% hit rate means recomputed prefill drops 60%→20% of tokens = 3x less prefill compute (consistent with "throughput doubled" once decode is counted; ESTIMATE).
- Evidence tier: T1 for vendor-published production deployments (named users, named models, concrete configs: 8xH20/8xH800, --hicache-ratio 2) / T3 for the aggregate ranges. Flagged: Novita/Ant numbers are vendor-reported in the LMSYS blog, not independently reproduced.
- Quality risk: NEGATIVE-COST: lossless; KV bytes are moved between tiers, not approximated. Degradation would manifest as latency (prefetch misses), never as different outputs. Falsification: byte-diff outputs with HiCache on/off at temperature 0.
- Availability: GATEWAY-OR-SELF-HOST.
- Effort to adopt: days — CPU-RAM tier is easy; 3FS/Mooncake are real infrastructure; ratio tuning (2:1 published).
- Composability: multiplies APC/RadixAttention — turns within-session caching into across-session and across-replica caching. The 40%→80% delta IS the composability story: GPU-only cache was evicting half the reusable prefixes of a coding-agent fleet. Requires session-affinity routing (next record) to be useful at fleet scale.
- Validation protocol: replay a recorded multi-session jackin' day (6 sessions, interleaved) against SGLang with and without HiCache; report hit rate, recomputed prefill tokens, TTFT p50/p99, and assert temperature-0 output equality. This is also gap #1 below — nobody has published exactly this.
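The hit-rate arithmetic in this record, and the "recomputed prefill tokens" metric the replay should report, follow from one relation: recomputed share = 1 − hit rate. A minimal sketch, using the ~7.56M prompt-side tokens/day figure from the vLLM record as the illustrative daily volume:

```python
def recomputed_prefill_tokens(prompt_tokens: float, hit_rate: float) -> float:
    """Prompt tokens that still need prefill at a given prefix-cache hit rate."""
    return prompt_tokens * (1.0 - hit_rate)


def prefill_reduction_factor(hit_before: float, hit_after: float) -> float:
    """How many times less prefill compute is needed after the hit rate improves."""
    return (1.0 - hit_before) / (1.0 - hit_after)


DAILY_PROMPT_TOKENS = 7.56e6  # modeled profile, from the vLLM record above

print(round(prefill_reduction_factor(0.40, 0.80), 2))               # 40% -> 80% hit rate
print(round(recomputed_prefill_tokens(DAILY_PROMPT_TOKENS, 0.40)))  # GPU-only cache
print(round(recomputed_prefill_tokens(DAILY_PROMPT_TOKENS, 0.80)))  # with HiCache-like hit rate
```

This counts prefill only; decode compute is unchanged, which is why a 3x prefill reduction is compatible with the reported 2x throughput.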
LMCache (KV-cache layer for vLLM: offload + cross-engine sharing)
One-line pitch: the vLLM-ecosystem equivalent of HiCache — an open-source KV layer storing and sharing caches across GPU/CPU/disk/network and across engine instances; abstract claims "up to 15x improvement in throughput" on multi-round QA and doc analysis.
- Layer: cache (self-hosted engine + storage).
- Mechanism: extracts KV out of GPU memory into a managed multi-tier store; prefix-reuse offloading plus prefill/decode disaggregation (cross-GPU cache transfer); integrates with vLLM production-stack (helm charts), remote backends (Redis/Momento-style).
- Expected savings: paper abstract (arXiv 2510.09665): "up to 15x improvement in throughput across workloads such as multi-round question answering and document analysis" with vLLM. Vendor blogs add: ~13x average TTFT reduction for MoE serving (0.29s vs 3.98s; p99 1.30s vs 13.55s) and >50% cold-start TTFT cut with Momento. GPU-time, not billed tokens.
- Evidence tier: T2 (paper, rev) + shipped OSS; vendor blogs are T3 at best. Flagged unverified: evaluation-section figures circulating in summaries (TTFT 4.4-6.6x, 1.9-8.1x query rate) come from search digests, not the PDF; the MoE 13x is LMCache's own blog.
- Quality risk: NEGATIVE-COST — lossless.
- Availability: GATEWAY-OR-SELF-HOST.
- Effort to adopt: days; the most packaged path if your fleet is vLLM rather than SGLang.
- Composability: alternative/complement to HiCache by engine choice; pairs with session-affinity routing; cross-instance sharing is what lets a restarted or rescheduled agent session keep its prefix cache.
- Validation protocol: same multi-session replay harness as HiCache, plus a kill-and-reschedule test: restart the serving pod mid-session and assert the resumed session's hit rate recovers via the remote tier instead of re-prefilling 1.17M tokens.
Cache-aware / session-affinity routing across replicas (sgl-router; X-Claude-Code-Session-Id)
One-line pitch: the silent killer of self-host prefix caching is the load balancer — round-robin sends each turn to a worker that has never seen the session; SGLang's cache-aware router took hit rate 20%→75% and throughput 1.9x, and Claude Code already emits the affinity key any proxy needs.
- Layer: infra (gateway / fleet routing).
- Mechanism: sgl-router maintains a lazily updated "approximate radix tree of the actual radix tree on the workers" (communication-free, "no worker synchronization required") and routes to the highest-prefix-overlap worker unless load exceeds a balance threshold. Poor-man's version in any proxy: consistent-hash on X-Claude-Code-Session-Id (officially documented header; X-Claude-Code-Agent-Id/-Parent-Agent-Id for subagents), no body parsing needed.
- Expected savings: SGLang v0.4 blog: "up to 1.9x throughput increase and 3.8x hit rate improvement" vs round-robin; benchmark detail: 82,665 → 158,596 tok/s, hit rate 20% → 75%, on "a workload that has multiple long prefix groups, and each group is perfectly balanced"; source caveat: "performance might vary based on the characteristics of the workload". On a 4-replica fleet, naive round-robin makes a given session hit a warm worker ~25% of turns, so routing is the difference between paying prefill 1x and ~4x per fleet (ESTIMATE).
- Evidence tier: T1 — shipped (sglang-router on PyPI) with published benchmark.
- Quality risk: NEGATIVE-COST — pure routing, outputs unchanged. Failure mode is operational only: hot-spotting one replica if affinity ignores load (sgl-router's balance threshold exists for this). Falsification: per-worker load variance under affinity vs round-robin.
- Availability: GATEWAY-OR-SELF-HOST (the session-ID headers are CLAUDE-CODE-TODAY as inputs any proxy can use).
- Effort to adopt: hours: deploy sgl-router, or add one consistent-hash rule on X-Claude-Code-Session-Id in an existing proxy.
- Composability: mandatory companion to every cache technique above once you run >1 replica; also the mechanism for per-subagent cost attribution at the gateway. Anti-synergy: none, except naive consistent-hash without load shedding.
- Validation protocol: run the multi-session replay against a 2+-replica fleet twice — round-robin vs affinity — and report per-worker hit rates, aggregate recomputed-prefill tokens, and p99 TTFT; accept if hit rate gain holds without >20% load skew.
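The poor-man's affinity rule can be sketched with rendezvous (highest-random-weight) hashing, one common way to get consistent-hash behaviour without a ring. This is not sgl-router's algorithm, which routes on prefix overlap and load; the worker names are hypothetical, and the session ID is assumed to be read from the X-Claude-Code-Session-Id header by the proxy.

```python
import hashlib


def pick_worker(session_id: str, workers: list) -> str:
    """Map a session to a worker; the mapping is stable while that worker stays in the pool."""
    def score(worker: str) -> int:
        digest = hashlib.sha256(f"{worker}|{session_id}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")
    # The worker with the highest score wins; removing any other worker
    # cannot change the winner, so warm caches survive unrelated scale-downs.
    return max(workers, key=score)


workers = ["replica-0", "replica-1", "replica-2", "replica-3"]  # hypothetical names
chosen = pick_worker("session-abc", workers)
print(chosen == pick_worker("session-abc", workers))  # True: every turn lands on the same worker
```

This sketch ignores load entirely, which is exactly the hot-spotting failure mode named under quality risk; a production rule needs a spill-over path when the chosen worker is saturated.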
Claims to kill
| Claim | Verdict | Evidence |
|---|---|---|
| "Compress prompts with gzip/base64 before sending" | KILLED (local) | (table above): base64 = 4.33x MORE tokens; gzip+base64 = 2.68x MORE despite 17.8% fewer chars. Tokenizers have no merges for high-entropy encodings. Any gateway feature marketed this way is negative-value. (Phase-0 sample agreed: 4.3x / 2.8x.) |
| "LLMLingua proxy in front of Claude Code cuts input cost 5-20x" | KILLED (arithmetic + T2) | Break-even vs 0.1x cache reads ≈ 5.5x compression (ESTIMATE, modeled profile); at r=2 the session costs ~2.1x MORE. Code tolerates ~10% reduction (TOSEM); RCT: r=0.2 INCREASED total cost 1.8% (arXiv 2603.23525); "adds latency without quality benefits for code generation" (arXiv 2604.02985). |
| TokenMix "$42K/mo → $2.1K with LLMLingua, zero model change" | KILL (unverifiable marketing) | No methodology or workload disclosure (tokenmix.ai, sweep-); contradicts the RCT, the code ceiling, and cache arithmetic. Its 20x anchor is GSM8K math CoT (77.33 vs 78.85 EM), not code. |
| "Semantic caching cuts LLM costs ~68%" applied to coding agents | KILL (number transplant) | 61.6-68.8% hit / 92.5-97.3% positive-hit rates are from consumer-QA categories (arXiv 2411.05276). No vendor publishes a coding-agent hit rate (absence verified, LiteLLM docs). 3-7% wrong-answers-served is a correctness hazard. |
| "Speculative decoding / local draft model cuts your hosted Claude bill" | KILL | No draft/speculative parameter in the live Messages API (full list); speculative decoding emits the SAME tokens faster — it can never reduce a per-token bill anywhere; on self-host it is a latency tool. |
| "SGLang is 6.4x faster, so switching engines saves 6.4x" | MOSTLY-KILL (stale baseline) | 6.4x is vs vllm v0.2.5 / guidance v0.1.8 / TGI v1.3.0 (late 2023, pre-default-APC). Realistic production delta is cache engineering: HiCache 2x / 40%→80%, router 20%→75% — router choice can matter more than engine choice. |
| Caveman plugin "~75% savings" (restated for infra) | CORRECTION | Measured token cut is 58.5% (ultra register); wenyan-full 80.9% CHAR cut = 56.6% TOKEN cut (local Phase-0). Any gateway "compression %" quoted in characters overstates token savings; CJK runs ~1.47 chars/tok vs ~3.35 for English. The gzip+base64 row above is the same fallacy in the extreme: 17.8% fewer characters, 2.68x more tokens. |
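
The character-versus-token gap in the last row is simple arithmetic: a character cut only becomes a token cut after correcting for how many characters each token covers before and after the rewrite. The sketch below is illustrative only; the function name is hypothetical, and the inputs are the figures quoted in this document (80.9% char cut, 3.35 and 1.47 chars/tok, and the gzip+base64 sample of 642→528 chars and 189→506 tokens).

```python
def token_cut_from_char_cut(char_cut: float,
                            chars_per_tok_before: float,
                            chars_per_tok_after: float) -> float:
    """Convert a character reduction into the implied token reduction.

    tokens = chars / chars_per_token, so the token ratio is the char ratio
    scaled by how much denser (or sparser) the tokenization became.
    A negative result means the token count went UP.
    """
    token_ratio = (1.0 - char_cut) * chars_per_tok_before / chars_per_tok_after
    return 1.0 - token_ratio


# wenyan-full: 80.9% fewer characters, English ~3.35 chars/tok -> CJK ~1.47 chars/tok
wenyan_token_cut = token_cut_from_char_cut(0.809, 3.35, 1.47)

# gzip+base64 sample: 642 -> 528 chars, 189 -> 506 tokens
gzip_char_cut = 1.0 - 528 / 642
gzip_token_cut = token_cut_from_char_cut(gzip_char_cut, 642 / 189, 528 / 506)

print(f"wenyan token cut: {wenyan_token_cut:.1%}")
print(f"gzip char cut: {gzip_char_cut:.1%}, token multiplier: {1.0 - gzip_token_cut:.2f}x")
```

With the rounded inputs the wenyan result lands within a fraction of a point of the measured 56.6%, and the gzip case reproduces the 2.68x token increase despite the 17.8% character cut.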
Gaps
- No published end-to-end measurement of Claude Code (or any mainstream coding agent) against a self-hosted backend reporting prefix-cache hit rates, $/session, or quality deltas over a realistic multi-session day. Closest: Novita's vendor-reported HiCache numbers on a proprietary agent. The experiment to publish: Claude Code → LiteLLM → SGLang/vLLM, ± HiCache/LMCache, round-robin vs affinity, reporting hit rate and recomputed tokens.
- No gateway vendor (LiteLLM, Portkey, Helicone) publishes semantic- or exact-cache hit rates at all; every cited hit rate traces to consumer-QA papers.
- No study of code-AWARE compressors (AST-pruning, import-eliding) as a gateway stage, and no study of compression × provider-cache interaction — the RCT did not model cache economics, which the arithmetic here suggests dominates.
- The attribution fingerprint's behavior against self-hosted prefix caches (does it change per turn and re-key the system prompt on vLLM/SGLang?) is documented only as a gateway-cache concern — a 10-minute empirical check with local vLLM + transcript replay would settle it.
- Cache pre-warm break-even: `max_tokens: 0` pre-warm requests are officially documented (request shape; rejected if streaming/thinking/structured-outputs/forced-tool_choice are set). ESTIMATE: keeping a 5m cache warm costs 0.1x per ping vs +0.75x once for the 1h TTL, so the 1h write wins when a >35-40min idle gap would need ≥8 warm pings; nobody has validated refresh-on-read end-to-end through gateways.
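The pre-warm break-even above can be reproduced with a few lines of arithmetic. This is a rough sketch under stated assumptions: costs are expressed as multiples of the base input price for the same cached prefix, the 4.5-minute ping interval is a hypothetical choice (anything safely under the 5m TTL), and the comparison only applies to idle gaps shorter than the 1h TTL.

```python
import math

WRITE_5M = 1.25  # cache-write multiplier, 5m TTL
WRITE_1H = 2.00  # cache-write multiplier, 1h TTL
READ = 0.10      # cache-read multiplier (cost of one warm ping over the prefix)


def breakeven_pings() -> int:
    """Smallest number of warm pings that costs more than the 1h-TTL premium."""
    premium = WRITE_1H - WRITE_5M  # +0.75x, paid once
    return math.ceil(round(premium / READ, 9))


def pings_for_gap(gap_min: float, interval_min: float = 4.5) -> int:
    """Pings sent at interval, 2*interval, ... strictly before the next real request.

    interval_min is an assumed setting, not a documented value.
    """
    return max(0, math.ceil(gap_min / interval_min) - 1)


def cheaper_option(gap_min: float, interval_min: float = 4.5) -> str:
    """Compare keep-warm pings against paying the 1h write premium once (gap < 60 min)."""
    ping_cost = pings_for_gap(gap_min, interval_min) * READ
    return "1h-ttl" if ping_cost > (WRITE_1H - WRITE_5M) else "5m-warm-pings"


print(breakeven_pings())
for gap in (15, 30, 40, 55):
    print(gap, pings_for_gap(gap), cheaper_option(gap))
```

Under these assumptions the crossover sits between a 36- and a 40-minute idle gap, consistent with the ">35-40min" estimate; a shorter ping interval moves the crossover earlier.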
Confidence note: all dollar arithmetic rests on one locally measured session mix (n=1) plus pricing. The protective conclusions are robust to mix variation; the exact 5.5x break-even and 3.8x broken-cache multipliers move with the cache-read share.
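How far the break-even moves with the cache-read share can be checked directly. The sketch below is an illustration, not the dossier's own script: it models the prompt side only, assumes every cache write is billed at the 5m rate (1.25x), and assumes a cache-breaking compressor sends all remaining prompt tokens at full price. The alternative mixes are hypothetical; only the first is the measured session mix.

```python
def cached_prompt_cost(uncached: float, write: float, read: float,
                       write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Prompt-side cost in full-price-equivalent units for a given token mix."""
    return uncached * 1.0 + write * write_mult + read * read_mult


def compressor_break_even(uncached: float, write: float, read: float) -> float:
    """Compression factor a cache-breaking compressor needs just to match cached cost.

    Without the cache, all prompt tokens are billed at 1.0x, so the compressor
    must shrink them by total / cached_cost to break even.
    """
    total = uncached + write + read
    return total / cached_prompt_cost(uncached, write, read)


# Measured session mix (prompt-side %): uncached / cache write / cache read.
measured = compressor_break_even(0.44, 6.73, 92.83)

# Hypothetical mixes: hold uncached fixed, shift share from reads to writes.
sensitivity = {
    read_share: compressor_break_even(0.44, 100.0 - 0.44 - read_share, read_share)
    for read_share in (95.0, 92.83, 85.0, 75.0)
}

print(f"measured mix break-even: {measured:.2f}x")
for share, factor in sensitivity.items():
    print(f"cache-read share {share:5.2f}% -> break-even {factor:.2f}x")
```

The measured mix gives roughly 5.5x; lowering the cache-read share lowers the bar a compressor must clear, but it stays well above the ~10% reduction that code is reported to tolerate.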
Verification ledger
| # | Number / claim | Source or method | Accessed / run |
|---|---|---|---|
| 1 | Cache read 0.1x; write 1.25x (5m) / 2x (1h); 4 breakpoints; "100% identical prompt segments"; invalidation hierarchy tools→system→messages | https://platform.claude.com/docs/en/build-with-claude/prompt-caching (refetched) | |
| 2 | Workspace-level cache isolation "as of February 5, 2026" (Claude API / Claude Platform on AWS / Microsoft Foundry beta); Bedrock & Vertex org-level | same page (exact quote) | |
| 3 | Min cacheable 512 tok (Fable 5/Mythos 5; 1,024 on Bedrock); 1,024-4,096 older models | same page | |
| 4 | max_tokens: 0 pre-warm request shape + rejection conditions (stream/thinking/structured outputs/forced tool_choice) | same page (curl example present) | |
| 5 | Header-forwarding requirement + "reduced functionality" quote; attribution block + API strips it + CLAUDE_CODE_ATTRIBUTION_HEADER=0; X-Claude-Code-Session-Id/-Agent-Id/-Parent-Agent-Id; /v1/messages/count_tokens required; LiteLLM 1.82.7/1.82.8 malware warning | https://code.claude.com/docs/en/llm-gateway (refetched) | |
| 6 | vLLM V1: "less than 1% decrease in throughput even when the cache hit rate is 0%"; default-on; "several times" at high hit rate | https://vllm.ai/blog/2025-01-27-v1-alpha-release (refetched; source typo "perfix" sic) | |
| 7 | vLLM APC prefill-only scope; block-hash design, sha256 default v0.11 | https://docs.vllm.ai/en/stable/features/automatic_prefix_caching/ , /en/latest/design/prefix_caching/ (sweep-fetched, not refetched) | |
| 8 | SGLang "up to 6.4x higher throughput"; v2; no venue listed | https://arxiv.org/abs/2312.07104 (refetched) | |
| 9 | "Up to 5x" + baselines vllm v0.2.5 / guidance v0.1.8 / TGI v1.3.0 | https://lmsys.org/blog/2024-01-17-sglang/ (sweep-fetched) | |
| 10 | HiCache internal "up to 6x throughput / up to 80% TTFT cut"; Novita Qwen3-Coder-480B 25K-tok ~8-turn: TTFT −56%, throughput 2x, hit 40%→80%; Ant DeepSeek-R1-671B: −84% TTFT; tiers + --hicache-ratio 2 | https://lmsys.org/blog/2025-09-10-sglang-hicache/ (refetched) | |
| 11 | Router "up to 1.9x throughput / 3.8x hit rate"; 82,665→158,596 tok/s; 20%→75%; approximate radix tree, communication-free; workload caveat | https://lmsys.org/blog/2024-12-04-sglang-v0-4/ (refetched) | |
| 12 | LMCache "up to 15x improvement in throughput" (multi-round QA, doc analysis) | https://arxiv.org/abs/2510.09665 (sweep-fetched) | |
| 13 | LMCache MoE ~13x TTFT (0.29s vs 3.98s; p99 1.30 vs 13.55); Momento >50% cold-start | vendor blogs blog.lmcache.ai, gomomento.com (sweep-fetched, vendor-reported) | |
| 14 | Semantic cache 61.6-68.8% hit, 92.5-97.3% positive-hit, consumer-QA categories | https://arxiv.org/abs/2411.05276 (sweep, via search digest — tables not independently read) | |
| 15 | LiteLLM cache features (7 backends, threshold 0.8, TTL, x-litellm-cache-key); NO published hit rates | https://docs.litellm.ai/docs/proxy/caching (sweep-fetched; absence claim) | |
| 16 | LLMLingua up to 20x; GSM8K 77.33 vs 78.85; BBH 7x; 1.7-5.7x latency; LLMLingua-2 2-5x / 1.6-2.9x / 3-6x | https://llmlingua.com/llmlingua.html (sweep-fetched) | |
| 17 | ~10% compression ceiling for code generation | ACM TOSEM https://dl.acm.org/doi/10.1145/3735636 (sweep, via search digest) | |
| 18 | "Adds latency without quality benefits for code generation"; 1.4x on LongBench LCC; >1.3x speedups only on non-optimized stacks; overhead negates gains on vLLM/commercial APIs | https://arxiv.org/html/2604.02985 (sweep-fetched) | |
| 19 | RCT: pre-registered, Claude Sonnet 4.5, 59-61 runs/arm (~358), r=0.5 → −27.9% cost, r=0.2 → +1.8% cost, 1.03x output expansion, recency-weighted −23.5% | https://arxiv.org/abs/2603.23525 (refetched) | |
| 20 | Kong AI Prompt Compressor: Enterprise, Gateway >=3.11, LLMLingua-2 sidecar, no published performance numbers | https://developer.konghq.com/plugins/ai-prompt-compressor/ (sweep-fetched; absence claim) | |
| 21 | Messages API full parameter list; no logprobs/draft/predicted-outputs/KV params | https://platform.claude.com/docs/en/api/messages (sweep-fetched) | |
| 22 | Consumer Terms §3 no-train clause (commercial wording NOT independently verified — flagged) | https://www.anthropic.com/legal/consumer-terms (sweep-fetched) | |
| 23 | Fable 5 $10/$50, Haiku 4.5 $1/$5 per MTok; batch 50% off | reference pricing confirmed against caching/pricing docs in #1 | |
| 24 | base64 4.33x (189→819 tok); gzip+base64 2.68x (189→506; chars 642→528); minify −22.9% (310→239); Fable-vs-Sonnet tokenizer +51% on sample | local measurement, /tmp/ct.py → live count_tokens API, script in dossier workdir (method shown above) | |
| 25 | Session mix 0.44%/6.73%/92.83% prompt-side; dollar split 32/29/20/17/2; ~$22/day; thinking 54.8% of output; MCP schemas 1,420 tok vs ~60 deferred; caveman 58.5%/56.6%/74.5% token cuts | local Phase-0 measurements (see 01-economics-and-measurement.md, 02-baseline-audit.md) | |
| 26 | Derived: 345.2 full-price-equivalent units; ~74% saved by native caching; 3.8x broken-cache; 5.5x compressor break-even; 2.1x at r=2; +6% at r=5; pre-warm ≥8 pings/hr break-even; minify ≈0.3% of session dollars | ESTIMATE: arithmetic shown inline from rows 1, 24, 25 | |