13 — Caching exploitation
13 — Caching exploitation
TL;DR
- Caching is on by default in Claude Code and is already the single biggest cost lever: in the live session measured here (claude-fable-5, 300 main-thread API calls), caching cut input-side cost 86.3% — $71.59 paid vs $524.23 uncached-equivalent (local measurement). Prompt mix at that snapshot: 0.05% uncached / 1.90% cache-write / 98.05% cache-read.
- The economics : cache reads 0.1x base input, 5-min-TTL writes 1.25x, 1-hour writes 2x, and "the cache is refreshed for no additional cost each time the cached content is used" — an active session never expires its own cache.
- The exploitation frontier is not turning caching on but (a) not busting the prefix — the official Claude Code docs enumerate exactly 8 invalidators; one avoided bust at a 200k-token history saves $2.30 (5m) / $3.80 (1h); (b) TTL strategy — locally measured: 320/320 main-thread calls wrote 1-hour cache, 1,128/1,128 subagent calls wrote 5-minute cache; (c) structural moves: subagent fan-out ($0.10–0.42 spawn vs 0.1x/turn rent), staggered parallel spawns (~80% off wave input), batch+cache stacking (reads at 0.05x), /rewind over /compact.
- Dollar trap vs quota trap: cache reads are cheap per token but dominate volume — GitHub issue #24147 measured 5,092,500,074 cache-read tokens vs 3,887,759 I/O tokens in 30 days (1,310:1, 99.93%) counting against subscription limits; no official quota-weighting formula exists.
- Nine pieces of caching folklore die on contact with primary sources (see Claims killed): mid-session CLAUDE.md edits do NOT bust the cache, count_tokens does NOT warm it, the 1h TTL needs 2 reads (not 3) to beat uncached, and keepalive pingers solve a problem subscriptions don't have.
Pricing context and the modeled heavy-day profile are in 01-economics-and-measurement.md; the session baseline is in 02-baseline-audit.md.
Price sheet (Fable 5 $10/MTok base input)
| Token class | Multiplier | Fable 5 $/MTok | Source |
|---|---|---|---|
| Uncached input | 1x | $10.00 | platform.claude.com prompt-caching docs |
| Cache read (and TTL refresh) | 0.1x | $1.00 | same; refresh itself is free |
| Cache write, 5-min TTL | 1.25x | $12.50 | same |
| Cache write, 1-hour TTL | 2x | $20.00 | same |
| Batch uncached input | 0.5x | $5.00 | batch docs (50% off stacks with cache) |
| Batch cache read | 0.05x | $0.50 | stacking, batch docs |
| Batch 5-min cache write | 0.625x | $6.25 | stacking, batch docs |
Cache key facts (platform docs): exact prefix match over tools → system → messages; a change anywhere recomputes everything after it. Up to 4 explicit breakpoints; the system checks at most 20 positions per breakpoint (counting itself) when looking back for a prior write. Verification rule: total prompt = input_tokens + cache_creation_input_tokens + cache_read_input_tokens.
What this live session actually paid (local measurement)
Method: parse message.usage from ~/.claude/projects/<project>/<session>.jsonl and its subagents/ directory (script: /tmp/cache_analysis.py plus a TTL summarizer). Two snapshots of the same running session, minutes apart:
| Snapshot | Calls | Uncached in | Cache write | Cache read | Mix (un/wr/rd) | Write bill (1h, 2x) | Read bill (0.1x) |
|---|---|---|---|---|---|---|---|
| Earlier today (sweep) | 95 | — | 723,341 | — | 0.15 / 8.93 / 90.92 % | $14.47 | $7.36 |
| This re-run | 300 | 25,406 | 996,636 | 51,401,035 | 0.05 / 1.90 / 98.05 % | $19.93 | $51.40 |
Two lessons. First, read share climbs over a session's life (90.9% → 98.05%): reads grow roughly quadratically with turn count while writes grow linearly, so "writes exceed reads" (true at 95 calls) flips hard by 300 calls. Optimize writes early in a session's life, reads late. Second, the first turn wrote a 28,669-token fixed prefix (system prompt + tools + project context for this repo) at the 1-hour TTL — that is the quantity every technique below is protecting or shrinking.
Techniques
1. Stable-prefix discipline: avoid the eight documented Claude Code cache busters
One avoided mid-session invalidation = one avoided full-history rewrite at write price instead of read price.
- Layer: turn-structure / session habits
- Mechanism: The API caches by exact prefix; the official Claude Code caching page (code.claude.com) enumerates the invalidators: (1) /model switch — each model has its own cache;
opusplanmakes every plan-mode toggle a model switch; Fable 5's safety-classifier fallback to Opus is also a model switch; (2) /effort change — the cache is keyed by effort level too; CC shows a confirmation dialog before applying; (3) enabling fast mode — header joins the cache key, costed once per conversation (v2.1.86+ keeps the header across toggles; earlier versions invalidate every toggle); (4) MCP server connect/disconnect — only when tools load into the prefix; stdio crashes and HTTP session expiry invalidate with no user action; (5) plugin enable/disable only if it ships MCP servers (v2.1.163+/reload-pluginswarns and refuses a cache-busting reload without--force); (6) bare tool deny rules (Bashremoves the tool def; scopedBash(rm *)is safe); (7) /compact (by design; the summarization call itself reads the old cache); (8) Claude Code upgrade — resuming a long session after one "reprocesses the entire conversation history with no cache hits… can be the most expensive request you send." Cache-SAFE per the same page: editing repo files, editing root/user CLAUDE.md mid-session, output-style changes, permission-mode switches (except opusplan), skills/commands, /recap, /rewind, spawning subagents. - Expected savings: Per-bust delta = (1.25−0.1)=1.15x of the history at 5m TTL, (2−0.1)=1.9x at 1h. At a 200k-token history on Fable 5: $2.30 / $3.80 per avoided bust (gross busted turn $2.50 / $4.00 vs $0.20 warm). ESTIMATE on the modeled day ($22, writes 29%): avoiding two casual mid-task busts at ~150k average history ≈ $3.45/day ≈ 16% of the day.
- Evidence tier: T1 — code.claude.com/docs/en/prompt-caching and platform.claude.com prompt-caching docs; multipliers verified live.
- Quality risk: NEGATIVE-COST. Identical prompts, lower cost and latency; the only constraint is deferring config changes to task boundaries. Degradation impossible by construction; failure mode is human (forgetting and toggling anyway).
- Availability: CLAUDE-CODE-TODAY
- Effort to adopt: Minutes — pick model+effort at session start; don't toggle fast mode deep in; scoped deny rules; schedule /compact and upgrades at task boundaries; don't resume long sessions right after upgrading.
- Composability: Precondition for every other cache technique; pairs with technique 12 to detect violations.
- Validation protocol: Run the same 20-turn task twice; in run B insert one
/modelround-trip mid-task. Diff transcriptusage: run B shows one turn withcache_read≈0andcache_creation≈history size. Assert outputs equivalent (they are — same prompts) and cost delta ≈ history × 1.15 × rate. Watchcache_creation_input_tokensin a statusline thereafter.
2. TTL strategy: CC already splits 1h (main) / 5m (subagents); API keys opt in via ENABLE_PROMPT_CACHING_1H
On subscriptions the 1-hour TTL is free idle insurance, already on. On API keys it costs +60% on writes and pays off only across idle gaps.
- Layer: infra / harness config
- Mechanism: Locally measured (this session's transcripts): main thread 320/320 calls wrote
ephemeral_1h_input_tokens(1,012,031 tok, zero 5m); 1,128/1,128 subagent calls across 21 transcripts wroteephemeral_5m_input_tokens(10,090,230 tok, zero 1h). Official confirmation (code.claude.com): "On a Claude subscription, Claude Code requests the one-hour TTL automatically… Subagents use the five-minute TTL even on a subscription." Drawing on overage credits drops to 5m. API key/Bedrock/Vertex/Foundry default to 5m;ENABLE_PROMPT_CACHING_1H=1opts in,FORCE_PROMPT_CACHING_5M=1forces 5m. Reads refresh the TTL free, so TTL only matters across idle gaps. - Expected savings: A >5-min gap under 5m TTL costs a 1.25x re-write; under 1h it costs a 0.1x read — saving 1.15x of the prefix per gap. Premium paid: +0.75x per token written. This session's 1h premium (had it been API-key billing): 1,012,031 × 0.75 × $10/M = $7.59 with zero >5-min gaps observed — pure loss on gap-free days. One 30-min break at a 400k prefix claws back 400k × 1.15 × $10/M = $4.60. Opt-in rule: Σ(prefix at each 5–60-min gap) × 1.15 must exceed (total tokens written) × 0.75; e.g. one large-prefix break per ~600k written.
- Evidence tier: T1 — local JSONL measurement (method above) + code.claude.com TTL section + platform docs 2x multiplier.
- Quality risk: NEGATIVE-COST on subscription (no per-token billing, only warmer caches); NEUTRAL on API key — pure cost arithmetic, zero output change. Unknown: how 1h writes weight against subscription plan limits (not published).
- Availability: CLAUDE-CODE-TODAY (subscription: already on; API key: one env var)
- Effort to adopt: Minutes (env var) or zero (subscription).
- Composability: Makes keepalive (technique 3) pointless for the CC main thread; gaps >1h still pay full rewrite. Subagents' 5m TTL is fine — they live minutes (see technique 4).
- Validation protocol: On an API key, run a fixed 2-hour workload with one scripted 20-min pause at a known prefix size, once per TTL setting. Compare total
cache_creationbilling: 5m run shows a second full write after the pause; 1h run shows a read. Outputs identical; only billing differs.
3. max_tokens:0 pre-warming and keepalive pings (raw API/SDK)
A documented, GA way to write or refresh a cache without paying for any output.
- Layer: cache / API call pattern
- Mechanism: Platform docs : set
max_tokens: 0; the API "writes the cache at any cache_control breakpoint" and returns an emptycontentarray,stop_reason: "max_tokens", populatedusage; zero output tokens billed; cache-write charge only if the prefix isn't already cached. Docs state it is preferred over and replaces the oldermax_tokens: 1workaround. Rejected with:stream: true, thinking enabled, structured outputs, forcedtool_choice, and inside Batches. Because TTL refresh on use is free, a warm ping costs only the 0.1x read and re-arms the 5-min clock. - Expected savings: For a planned gap of G minutes over prefix P (base-input units of P): ping strategy = 1.25 + 0.1·⌈G/5⌉; up-front 1h TTL = 2.1 (2x write + one 0.1x read); expire-and-rewrite = 2.5. Pings win for G ≤ ~40 min (G=20: 1.65 vs 2.1 vs 2.5); 1h TTL wins for ~40–60 min; past 60 min both expire — eat the rewrite. Each avoided rewrite at a 200k prefix on Fable 5 saves 200k × 1.15 × $10/M = $2.30. Also kills first-request latency for pre-staged prompts.
- Evidence tier: T1 — platform prompt-caching docs. One unverified line item: docs don't explicitly quote the warm-ping read charge (0.1x assumed from the read rule) — see Gaps.
- Quality risk: NEUTRAL — pure pre-computation; only risk is paying for pings nothing ever reads.
- Availability: SDK (not exposed in Claude Code; unnecessary there on subscriptions — main thread is already 1h).
- Effort to adopt: Hours — a timer firing one request with the byte-identical prefix.
- Composability: For 5m-TTL fleets and batch pre-staging (warm the cache synchronously before submitting the batch); mutually exclusive with thinking-enabled requests.
- Validation protocol: Fire a max_tokens:0 request on a cold 10k prefix, assert
cache_creation_input_tokens≈10k,output_tokens=0; repeat within 5 min, assertcache_read_input_tokens≈10kand record the billed amount (closes the 0.1x question); then send the real request and assert full cache read.
4. Subagent fan-out as cache arbitrage: pay $0.10–0.42 once instead of 0.1x rent forever
Every token parked in the main thread is re-read at 0.1x on every later turn; a subagent quarantines exploration in a disposable 5m cache.
- Layer: turn-structure / orchestration
- Mechanism: Official (code.claude.com): a subagent "starts its own conversation with its own system prompt and tool set… no cache hits on its first call… The parent's cache is unaffected." A fork, by contrast, inherits the parent prefix and reads its cache (compaction's summarization call works the same way) — fork when the child needs parent context, spawn when it doesn't. Local measurement (21 subagent transcripts): first-call cache writes ranged 4,859–30,495 tok by agent type (common shape ~9.7–10.2k) plus ~3,910 uncached input → spawn cost $0.10–0.42, typical ~$0.16 on Fable 5. Two subagents each quarantined ~1.25M and ~1.30M tokens of cache writes that never touched the parent prefix.
- Expected savings: Inline, a 1.25M-token exploration costs 1.25M × 0.1 × $10/M = $1.25 in reads per subsequent main-thread turn — $12.50 over the next 10 turns vs $0.16 spawn overhead. Break-even (spawn 16,410 base-units vs inline R×(1.25 + 0.1×T turns remaining)): T=5 → R≥9.4k tok; T=10 → 7.3k; T=20 → 5.0k; T=50 → 2.6k. Fan-out loses money below the break-even — a single grep is cheaper inline.
- Evidence tier: T1 — local transcript measurement (method above) + official subagent/fork semantics.
- Quality risk: NEUTRAL to RISKY — isolation prevents main-context rot (quality positive; see 12-context-architecture.md) but the returned summary is lossy; under-specified subagent prompts cause rework that erases savings. Degradation shows as repeated re-delegation or the main thread re-reading what the subagent already read.
- Availability: CLAUDE-CODE-TODAY (Task tool / custom agents)
- Effort to adopt: Minutes — discipline of choosing spawn vs inline vs fork by expected context volume.
- Composability: Stacks with compressed subagent output (caveman cavecrew claims ~60% smaller tool-results re-entering main context — that shrinks the part of subagent output that does pay rent; measured numbers in 15-output-discipline.md). Anti-synergy: many tiny spawns below break-even.
- Validation protocol: Same research task twice: inline vs delegated. Compare (a) total cost from transcripts, (b) main-thread context size at task end, (c) blind-judged answer quality. Expect ≥30% cost cut and no quality delta for >10k-token explorations with ≥10 turns of session remaining.
5. /rewind instead of /compact to abandon a dead end (cache-warm time travel)
Rewinding truncates to a prefix that is already cached — and still warm even past the TTL — while /compact deliberately invalidates the conversation layer.
- Layer: turn-structure / session habits
- Mechanism: Official (code.claude.com): "/rewind truncates your conversation back to an earlier turn. The remaining history is the same content the cache was built from… so the next request hits the earlier cache entry. Every turn since then has read through that prefix, which kept the entry warm even if the original turn was longer ago than the TTL." Compaction invalidates the conversation layer by design (its summarization call shares the old prefix and reads cache, so generating the summary is cheap input-side; the post-compaction turn rebuilds only the short summary). /recap appends rather than replaces — the cache-safe status check.
- Expected savings: ESTIMATE: rewinding past a 150k-token dead end makes the next turn a $0.15 read (150k × 0.1 × $10/M at the rewind point's prefix) with zero rebuild; compacting instead pays the summary's output tokens (billed at 5x input — e.g. 3k summary ≈ $0.15 on Fable 5), re-writes summary+project context (~30k × 2x × $10/M ≈ $0.60), and keeps a summary of the dead end in context. Tens of cents to low dollars per use; the larger win is structural (no dead-end residue).
- Evidence tier: T1 for mechanism (official quotes); ESTIMATE for dollar figures (arithmetic shown).
- Quality risk: NEUTRAL — rewind discards work by design; used only for genuinely abandoned paths it has no quality cost and removes misleading context. /compact remains right at natural boundaries when old context is genuinely stale.
- Availability: CLAUDE-CODE-TODAY
- Effort to adopt: Zero — habit change.
- Composability: Pairs with checkpointing; orthogonal to TTL; reduces the /compact events that technique 1 schedules.
- Validation protocol: Drive a session into a 50k-token wrong path; branch A /rewind, branch B /compact, then issue the identical next task. Compare next-turn
cache_readvscache_creation, total cost, and whether branch B's answer references the dead end. Expect A: near-pure read; B: write spike + summary cost.
6. Deferred MCP tools make tool churn cache-free (and shrink the prefix)
With tool search (default on supported models), MCP server flap "only appends new content and doesn't disturb anything already cached."
- Layer: input / tool configuration
- Mechanism: Tool definitions render at position 0 (
tools → system → messages), so any in-prefix change invalidates everything. Deferral loads schemas on demand, appended. Deferral is unavailable/disabled on Haiku models, Vertex AI, customANTHROPIC_BASE_URLgateways, and bypassed byalwaysLoador threshold-based loading — those paths still invalidate on any server flap, including silent stdio crashes and HTTP session expiry (code.claude.com). - Expected savings: Two parts: (a) avoided full-history rewrite per server flap — same per-event economics as technique 1 ($2.30–3.80 at 200k); (b) permanent prefix cut — phase-0 local measurement: 11 local MCP tool schemas = 1,420 tok loaded vs ~60 tok as a deferred name list (~96% of schema footprint), which is also 1,360 fewer tokens re-read at 0.1x every turn.
- Evidence tier: T1 — code.claude.com deferral rules + phase-0 local measurement .
- Quality risk: NEGATIVE-COST in cost terms; small retrieval risk that the model fails to search for a deferred tool when needed — manifests as "tool not found/used" misses on niche tools.
- Availability: CLAUDE-CODE-TODAY (default on supported models; audit for
alwaysLoadand gateway setups that silently disable tool search). - Effort to adopt: Minutes (audit) — zero if defaults untouched.
- Composability: The advisor tool sits after the cache breakpoint (toggling /advisor is cache-safe). Plugins shipping MCP servers inherit these rules. Compare the deferral-vs-loaded context numbers in 02-baseline-audit.md.
- Validation protocol: With one flaky stdio server configured
alwaysLoad, kill its process mid-session and observe the next turn'scache_creationspike; repeat with deferral on and observe no spike. Then verify the deferred tool is still discovered and invoked on a task that needs it.
7. Fleet cache sharing via excludeDynamicSections (Agent SDK)
Claude Code has a local file cache scoped to machine/directory/git state; the hosted server prompt cache is workspace/org/model/prefix-scoped. One SDK flag makes a hosted fleet share ONE cached system prompt.
- Layer: infra / SDK fleet architecture
- Mechanism: The
claude_codepreset embeds working directory, git-repo flag, platform, shell, OS version, and auto-memory paths in the system prompt; any difference is a server prompt-cache miss.excludeDynamicSections: true(TS SDK ≥ v0.2.98) /exclude_dynamic_sections: True(Python ≥ v0.1.58) / CLI--exclude-dynamic-system-prompt-sectionsmoves that context into the first user message "so identical configurations share a cache entry across users and machines" (code.claude.com agent-sdk/modifying-system-prompts). Scope correction : the hosted server prompt cache is org/workspace/model/exact-prefix scoped; the machine+directory, startup-git-snapshot, and worktree rules describe Claude Code's local file cache layer, not the billed server cache. CLAUDE.md content is exempt — the SDK injects it into the conversation, not the system prompt. - Expected savings: ESTIMATE: each cold worker otherwise pays its own first-turn write of the fixed prefix — locally measured 28,669 tok for this repo → 28.7k × 1.25 × $10/M = $0.36 per cold worker at 5m TTL, plus re-writes at every TTL expiry across a bursty fleet. For N short-lived CI jobs, the shared entry converts N writes into 1 write + (N−1) reads, saving 1.15 × prefix × (N−1) base-units.
- Evidence tier: T1 for mechanism (official docs); ESTIMATE for fleet dollars (no published figures; dynamic-section size itself unmeasured — see Gaps).
- Quality risk: QUALITY-TRADE (mild, documented): environment context "carr[ies] marginally less weight" in the user message; enable when cross-session reuse outweighs maximally authoritative env context. Degradation would show as the agent mis-trusting cwd or auto-memory paths.
- Availability: SDK
- Effort to adopt: Hours — one flag plus byte-aligning model, effort, tool set, and append text across the fleet.
- Composability: Pairs with 1h TTL for bursty CI and with technique 8 for wave starts. Anti-synergy: any per-worker prompt customization forks the cache.
- Validation protocol: Launch two SDK workers in different directories with the flag on; assert worker 2's first call shows
cache_read>0on the system-prompt segment. Then run an env-sensitive task ("which directory are you in?") and assert correctness despite relocation of the context.
8. Stagger concurrent fan-out: await the first streamed token before firing the rest
N parallel identical-prefix requests all pay write price; staggering converts N−1 writes into reads.
- Layer: cache / API call pattern
- Mechanism: Documented (platform docs): "a cache entry only becomes available after the first response begins. If you need cache hits for parallel requests, wait for the first response before sending subsequent requests." Local corroboration both ways (21 subagent transcripts): the concurrently-spawned first wave showed first-call cache_read=0 across the board, while 9 subagents spawned later — after siblings' responses existed — read 4,802–5,055 tokens of shared prefix on their first call. Staggering happens naturally for sequential spawns; Claude Code does not appear to stagger parallel Task waves.
- Expected savings: For N workers on shared prefix P: naive N × 1.25P; staggered 1.25P + (N−1) × 0.1P. At N=8, P=10k (the measured CC subagent prefix), Fable 5: $1.00 naive vs $0.20 staggered (~80% off the wave's input). Scales with P: a wave of 8 over a 100k shared document: $12.50 vs $1.95. (Corrects the sweep's $0.21/$2.13 figures — arithmetic: (1.25+0.7)×P×$10/M.)
- Evidence tier: T1 — official concurrency rule + local partial-hit evidence.
- Quality risk: NEUTRAL — adds one TTFT of latency to the wave start; outputs unchanged.
- Availability: SDK (one await in your orchestrator); not user-controllable inside Claude Code's Task scheduling today.
- Effort to adopt: Hours — await first streamed token (not completion) of request 1, then fire the rest.
- Composability: Essential for technique 7 fleets, eval harnesses, batch-adjacent fan-out; pair with technique 3 (pre-warm, then fire the whole wave).
- Validation protocol: Spawn two identical-prompt workers 10 s apart and two simultaneously; diff first-call
cache_read(expect ~prefix vs 0). This also settles whether CC's residual 0-read spawns were the race or prompt diffs (see Gaps).
9. Batch API + prompt caching stack for offline/overnight agent work
Batch halves everything including cache prices; official guidance is one shared cache_control prefix per batch and the 1-hour TTL.
- Layer: infra / workload scheduling
- Mechanism: Batch docs (platform.claude.com): "The pricing discounts from prompt caching and Message Batches can stack… cache hits are provided on a best-effort basis. Users typically experience cache hit rates ranging from 30% to 98%, depending on their traffic patterns." Official Tip: "Since batches can take longer than 5 minutes to process, consider using the 1-hour cache duration." Constraints: each batched request needs
max_tokens ≥ 1—max_tokens: 0is rejected "since an ephemeral cache entry written during batch processing would likely expire before the follow-up request runs"; 100,000 requests or 256 MB per batch; most complete <1 h, 24 h expiry; 50% off all token usage.cache_hint/context_hintare listed as sync-only "routing hints" (see Gaps). - Expected savings: ESTIMATE (arithmetic on documented multipliers): overnight 1,000-task eval sharing a 50k prefix on Sonnet 4.6 ($3/MTok): at 90% hit rate, prefix cost = 45M × $0.15/M + 5M × $1.875/M = $6.75 + $9.38 = $16.13 vs $150 sync-uncached (~89% off). At the documented 30% floor: 15M × $0.15/M + 35M × $1.875/M = $67.88 (~55% off). NOTE: this corrects the sweep's "$33.0 at 30%" figure, which doesn't survive the arithmetic. The 30–98% spread makes batch-cache economics ~4x uncertain — measure your own hit rate from result usage fields.
- Evidence tier: T1 for multipliers/hit-rate range (official docs); ESTIMATE for workload dollars.
- Quality risk: NEUTRAL — same models and prompts, async (up to 24 h); expired requests are unbilled.
- Availability: SDK
- Effort to adopt: Days — restructure offline work (evals, repo sweeps, nightly review queues) around an identical leading prefix.
- Composability: Stacks with cheaper models; cannot combine with max_tokens:0 pre-warming inside the batch (warm synchronously beforehand instead) or fast mode.
- Validation protocol: Submit a 100-request pilot batch with one cache_control'd 50k prefix at 1h TTL; compute the realized hit rate from per-result
cache_read_input_tokens; spot-check 10 outputs against sync runs of identical requests for equivalence.
10. Mid-conversation system messages (beta): change operator instructions without burning the prefix
Append {role: "system"} to messages[] instead of editing the top-level system field.
- Layer: input / API call pattern
- Mechanism: Platform docs : "On Claude Opus 4.8, you can add a new system instruction partway through a conversation without invalidating the system or message caches. Append a
{"role": "system"}message tomessagesinstead of editing the top-levelsystemfield, so the cached prefix stays unchanged." Claude Code uses the same append-only shape internally: plan mode and skill loading "append their instructions as conversation messages, so the cached prefix stays intact" (code.claude.com). Fallback on other models:<system-reminder>text in a user turn (same cache profile, weaker authority). - Expected savings: One avoided full-prefix invalidation per mid-session instruction change — per-event economics of technique 1 ($2.30–3.80 at a 200k history on Fable 5).
- Evidence tier: T1 for the documented behavior; model support beyond Opus 4.8 unverified (docs name only Opus 4.8 — see Gaps).
- Quality risk: NEUTRAL — docs advise phrasing as context, not override commands; instruction adherence from message position is marginally weaker than top-level system.
- Availability: SDK (beta header
mid-conversation-system-2026-04-07); already CC's internal pattern, not user-invocable. - Effort to adopt: Hours for SDK harness authors.
- Composability: The designated escape hatch for frozen-system-prompt discipline (no timestamps/UUIDs/flags interpolated into system); pairs with techniques 1 and 7.
- Validation protocol: 50-turn synthetic session; at turn 25 inject a policy change via appended system message vs edited top-level system. Assert (a) cache_read continuity in variant A, full re-write in variant B; (b) equal instruction-following rates on the new policy across the remaining 25 turns.
11. Know the minimum cacheable prefix — model-specific, 512 to 4,096 tokens
Below the per-model minimum, cache_control silently does nothing; no error is returned.
- Layer: cache / API mechanics
- Mechanism: Live table (platform docs): Fable 5 / Mythos 5 = 512 (but 1,024 on Bedrock); Mythos Preview = 2,048; Opus 4.8 = 1,024; Opus 4.7 = 2,048; Opus 4.6 / 4.5 = 4,096; Opus 4.1 / 4 = 1,024; Sonnet 4.6 / 4.5 / 4 = 1,024; Haiku 4.5 = 4,096; Haiku 3.5 = 2,048. "Any requests to cache fewer than this number of tokens will be processed without caching, and no error is returned." Detection: both
cache_creation_input_tokensandcache_read_input_tokensstay 0. The counterintuitive ordering: Haiku 4.5 — the model picked for cheap high-frequency calls — has the HIGHEST floor; a 3.5k classifier prompt is uncacheable on Haiku 4.5 but caches fine on Sonnet 4.6. - Expected savings: Avoids silent 100% cache-miss on small prompts. ESTIMATE: a 3.5k-token prompt called 1,000x/day pays 3.5M × $1/M = $3.50/day uncached on Haiku 4.5; padded ≥4,096 (or moved to Sonnet 4.6) it pays ~0.1x on hits — order of $0.35–1.05/day equivalent. Workload-specific; verify against your shapes. Local relevance: this repo's always-on
AGENTS.mdalone = 2,744 tok on Fable 5 (measured via /tmp/ct.py; 1,919 on Sonnet 4.6 — the +43% tokenizer gap compounds cache-write premiums) — above Fable/Sonnet floors, below Haiku 4.5's; CC's full 28.7k fixed prefix clears every minimum. - Evidence tier: T1 — live table, two consistent fetches; local token counts via /tmp/ct.py.
- Quality risk: NEGATIVE-COST — pure accounting awareness; padding a prompt to cross a floor is the only act with any content effect (keep padding semantically inert).
- Availability: SDK (Claude Code prompts always clear the minimums)
- Effort to adopt: Minutes — a CI assertion that
cache_creation>0on prompts you believe cache. - Composability: Gate check for techniques 3, 7, 8, 9 on small prompts. These minimums have already moved between doc revisions — never quote from memory.
- Validation protocol: Send a ~600-token cache_control'd request to Fable 5 (above its 512 floor) and the same to Haiku 4.5 (below its 4,096): assert cache fields populate for the former and stay 0 for the latter; also settles whether the minimum counts tools+system+messages cumulatively.
12. Measure before optimizing: statusline, transcript JSONL, OTel
Cache problems are diagnosable from fields you already have, free.
- Layer: infra / observability
- Mechanism: Statusline scripts read
current_usagewithcache_creation_input_tokens/cache_read_input_tokens; official heuristic: "A high read-to-creation ratio means caching is working well. If creation stays high turn after turn, something is changing in your prefix" (code.claude.com). Org-wide: OTel exporter reports cache read/creation per user and session. Locally,~/.claude/projects/**/*.jsonlcarries per-callmessage.usageincluding theephemeral_5m/1hsplit — that is how every local number in this file was produced, with a 30-line parser. API-side: ifcache_readis 0 across identical-prefix requests, hunt the silent invalidator (timestamps, unsorted JSON, varying tool sets). - Expected savings: Enabler, not a saver. Healthy reference points (local): this session 98.05% reads / 1.90% writes / 0.05% uncached at 300 calls; phase-0 heavy session 92.83 / 6.73 / 0.44. Key diagnostic insight from the two snapshots: write-vs-read dollar dominance flips with session age ($14.47 writes > $7.36 reads at 95 calls; $19.93 writes < $51.40 reads at 300) — target write hygiene early, read volume (context size, see 12-context-architecture.md) late. Quota angle: issue #24147 measured 1,310 cache-read tokens per productive I/O token over 30 days — in quota terms reads are the whole game even when dollars say writes.
- Evidence tier: T1 (official fields + local reproduction); the #24147 quota-weighting claim itself is unconfirmed (no Anthropic response on the issue) — the ratio is real, the weighting is not published.
- Quality risk: NEGATIVE-COST.
- Availability: CLAUDE-CODE-TODAY
- Effort to adopt: Minutes — statusline script or JSONL parser (/tmp/cache_analysis.py pattern); validation harness integration in 31-validation-harness.md.
- Composability: Feeds every technique above; before quoting community CLI percentages (ccusage etc.), check their hit-rate formula (reads/(reads+writes+uncached) vs per-turn ratios differ).
- Validation protocol: Self-validating — recompute one session's cost from JSONL and reconcile against console billing / usage page to ±2%.
Claims killed (each verified against primary sources)
- "Editing CLAUDE.md mid-session busts your cache." Opposite is documented: root/user CLAUDE.md "are read once at session start… Editing them mid-session does not invalidate the cache, but the edit also doesn't apply" until /clear, /compact, or restart (code.claude.com). Community posts blaming CLAUDE.md edits for cache spikes are wrong for current CC root-level files; only nested CLAUDE.md /
paths:-frontmatter rules load lazily. - "The 1-hour cache needs ≥3 reads to pay off." Vs no caching: 2x write + n×0.1 reads < (1+n)×1 at n=2 (2.2 < 3.0); n=1 narrowly loses (2.1 > 2.0). Vs 5m TTL in gap-prone traffic: a single >5-min-gap reuse wins (2.1 < 2.5). The "3" is a garbled "at least three requests" (= 1 write + 2 reads).
- "Ping count_tokens to keep the cache warm — it's free." Free, yes; cache-inert, explicitly: "No, token counting provides an estimate without using caching logic… prompt caching only occurs during actual message creation" (token-counting docs FAQ). Separate RPM pool: 100 / 2,000 / 4,000 / 8,000 by tier. The real warmer is max_tokens:0 (technique 3).
- "max_tokens must be ≥1, so warmers waste an output token." Outdated — max_tokens:0 is GA, bills zero output, and docs say it replaces the max_tokens:1 workaround. Still true inside batches only (max_tokens ≥ 1 there, with a documented reason).
- "Switching permission modes busts the cache." "Mode changes are cache-safe" — sole exception is plan mode under
opusplan, where the cost is the MODEL switch (code.claude.com). Also safe: output-style changes (don't apply mid-session), scoped deny rules, skills, /recap. - "Adding/removing an MCP server mid-session always invalidates." Only when tool definitions load into the prefix; with deferred tools (the default on supported models) a server flap "only appends new content." Survives only for Haiku models, Vertex AI, custom gateways,
alwaysLoad, threshold-loaded definitions. - "Cache reads are basically free, ignore them." Dollars: $51.40 (largest input-side line, 72% of this session's input bill) at 300 calls (local). Quota: issue #24147 — 5,092,500,074 reads vs 3,887,759 I/O tokens in 30 days (1,310:1, 99.93%), open, no official response on weighting.
- "Minimum cacheable prompt is 1,024 tokens." Stale — live table is per-model, 512–4,096 (technique 11), differs by platform (Bedrock), and has changed between doc revisions. Even the bundled claude-api skill's copy disagrees with the live page.
- "Set up a keepalive pinger for Claude Code sessions." Subscriptions already write the main thread at 1h TTL (locally: 320/320 calls); sub-hour breaks cost a 0.1x re-read, and pure-API pinging isn't reachable from inside CC anyway. Keepalive is rational only for raw-SDK 5m-TTL callers with planned gaps under ~40 min (technique 3 arithmetic).
Surprising findings
- The split-TTL strategy (1h main / 5m subagents) is real, automatic, and measured here at 100%/100% — and at 2x write price it made writes the dominant input cost early in the session before reads took over.
- The cache is keyed by effort level, not just model — /effort mid-session re-reads the whole history uncached, which is why CC now interposes a confirmation dialog.
- /rewind hits cache entries older than the TTL because every intervening turn refreshed the shared prefix for free.
- Your git state affects Claude Code's local file-cache reuse: sequential sessions share that local layer only if the startup branch+commits snapshot matches; worktrees of one repo never share. The billed server prompt cache is separately scoped by workspace/org/model/exact prefix (see technique 7).
- MCP servers can bust the cache with no user action (stdio crash, HTTP session expiry, auto-reconnect) when their tools are in-prefix.
- Sequentially-spawned subagents got 4,802–5,055-token first-call cache hits where the concurrent wave got zero — the documented concurrency race, visible in the wild (local).
Gaps / unresolved
- Quota weighting of cache reads on subscriptions — #24147 demonstrates the 1,310:1 ratio; Anthropic publishes no formula. "Reads are cheap" is true in dollars, unprovable in quota.
cache_hint/context_hint— surfaced in the batch docs' not-supported table as sync-only "routing hints"; no dedicated docs page found. Potentially a whole technique (cache-aware request routing); watch for documentation.- max_tokens:0 warm-ping billing — write-charge-if-cold and zero output are documented; the 0.1x read line item on a warm ping is inferred, one live request away from closed (protocol in technique 3).
- 20-position lookback in CC practice — API rule verified; whether CC's breakpoint placement can miss it on >20-block tool-heavy turns is unobserved harness internals.
- Mixed-TTL strategies (5m normally, one 1h write before a planned break) — docs have a mixing section not fully verified here; technique 3's table assumes one TTL per prefix.
- Dynamic-section token size for technique 7 — measurable by diffing count_tokens over two cwd variants of the preset; not yet done.
- Model support list for mid-conversation system messages beyond Opus 4.8 (Fable 5 implied by family, not quoted).
- The community "99.5% sustained hit rate" post (vsits.co) 403'd on direct fetch — search-snippet only, excluded from evidence (unverified T3).
Verification ledger
| # | Number / claim | Source or method (all accessed/) |
|---|---|---|
| 1 | 0.1x read / 1.25x 5m write / 2x 1h write; free refresh on use | https://platform.claude.com/docs/en/build-with-claude/prompt-caching.md (live fetch) |
| 2 | max_tokens:0 semantics, zero output billed, rejection list, replaces max_tokens:1 | same page |
| 3 | Min cacheable prefix table: Fable 5=512 (Bedrock 1,024), Opus 4.8=1,024, Opus 4.7=2,048, Opus 4.6/4.5=4,096, Sonnet 4.6/4.5=1,024, Haiku 4.5=4,096, Haiku 3.5=2,048; silent no-error failure | same page |
| 4 | Concurrency rule (entry available after first response begins) | same page |
| 5 | 4 breakpoints; 20-position lookback per breakpoint | same page |
| 6 | Usage sum rule (input + cache_creation + cache_read) | same page |
| 7 | Opus 4.8 mid-conversation {role:"system"} exception | same page |
| 8 | 8 invalidators incl. effort-level keying, opusplan, fast-mode header (v2.1.86+), /reload-plugins guard (v2.1.163+), upgrade-resume warning | https://code.claude.com/docs/en/prompt-caching (live fetch) |
| 9 | Cache-safe list; CLAUDE.md "read once… edit doesn't apply" quote | same page |
| 10 | TTL policy: subscription 1h auto, subagents 5m, ENABLE_PROMPT_CACHING_1H, FORCE_PROMPT_CACHING_5M, credits→5m; DISABLE_PROMPT_CACHING vars | same page |
| 11 | /rewind past-TTL warmth quote; /recap appends; compaction cost anatomy | same page |
| 12 | Subagent own-cache/5m; fork inherits parent cache | same page |
| 13 | Cache scope correction: hosted server cache = workspace/org/model/exact prefix; Claude Code local file cache = machine+directory/git snapshot/worktree rules; workspace isolation | code.claude.com prompt-caching + platform prompt-caching; corrected |
| 14 | excludeDynamicSections: TS ≥v0.2.98 / Py ≥v0.1.58, CLI flag, "share a cache entry across users and machines", "marginally less weight" tradeoff; CLAUDE.md injected into conversation | https://code.claude.com/docs/en/agent-sdk/modifying-system-prompts (live fetch) |
| 15 | Batch+cache stacking; 30–98% best-effort hit rates; 1h-TTL tip; max_tokens≥1 in batches with quoted reason; cache_hint/context_hint sync-only; 100k req / 256 MB / <1h typical / 24h expiry; 50% discount | https://platform.claude.com/docs/en/build-with-claude/batch-processing.md (live fetch, persisted to tool-results) |
| 16 | count_tokens cache-inert FAQ; free; RPM 100/2,000/4,000/8,000 | https://platform.claude.com/docs/en/build-with-claude/token-counting.md (live fetch) |
| 17 | Fable 5/Mythos 5 tokenizer "roughly 30% more tokens" than pre-Opus-4.7 models | same page (consistent with phase-0 local +15% code / +38% prose) |
| 18 | #24147: 5,092,500,074 cache reads vs 3,887,759 I/O over 30 days (Jan 9–Feb 8 2026); 1,310:1; 99.93%; open; no official response | https://github.com/anthropics/claude-code/issues/24147 (live fetch) |
| 19 | Main thread 320/320 calls 1h TTL (1,012,031 tok, zero 5m); subagents 1,128/1,128 calls 5m (10,090,230 tok, zero 1h) | Local: TTL summarizer over ~/.claude/projects/.../2e1d1f1f*.jsonl + subagents/ |
| 20 | 300-call snapshot: 25,406 uncached / 996,636 write / 51,401,035 read / 377,159 out; mix 0.05/1.90/98.05%; $0.25+$19.93+$51.40=$71.59 vs $524.23 uncached-equivalent = 86.3% saved | Local: /tmp/cache_analysis.py over same transcripts |
| 21 | Earlier same-session snapshot: 95 calls, mix 0.15/8.93/90.92%, writes $14.47 > reads $7.36, 723,341 1h tokens | Sweep-agent local measurement earlier (same session, superseded totals; retained for the write/read flip) |
| 22 | First-turn fixed-prefix write 28,669 tok (1h) | Local: first assistant message usage in main transcript |
| 23 | Subagent first calls: 11/21 cache_read=0; 9/21 read 4,802–5,055; first-call writes 4,859–30,495 + ~3,910 uncached → spawn $0.10–0.42 | Local: per-transcript first-call usage |
| 24 | AGENTS.md = 2,744 tok (claude-fable-5) vs 1,919 (claude-sonnet-4-6), +43% | Local: count_tokens run on AGENTS.md |
| 25 | 11 MCP tool schemas = 1,420 tok loaded vs ~60 deferred; modeled day $22 (reads 32%/writes 29%/thinking 20%/visible 17%/uncached 2%) | Phase-0 local measurements (given as ground truth) |
| 26 | Per-bust $2.30/$3.80 at 200k; ping table (1.25+0.1·⌈G/5⌉ vs 2.1 vs 2.5, ~40/60-min crossovers); break-even table (9.4k/7.3k/5.0k/2.6k @ T=5/10/20/50); stagger $1.00→$0.20 (N=8, P=10k); batch eval $16.13 @90% / $67.88 @30% vs $150; 1h premium $7.59 this session; $0.36/cold worker; n=2 break-even for 1h vs uncached | ESTIMATE — arithmetic shown inline from rows 1, 15, 19–23 |
| 27 | vsits.co "99.5% hit rate" | UNVERIFIED (T3, fetch 403'd) — excluded from evidence-bearing claims |