13 — Caching exploitation

TL;DR

Caching is on by default in Claude Code and is already the single biggest cost lever: in the live session measured here (claude-fable-5, 300 main-thread API calls), caching cut input-side cost 86.3% — $71.59 paid vs $524.23 uncached-equivalent (local measurement). Prompt mix at that snapshot: 0.05% uncached / 1.90% cache-write / 98.05% cache-read.
The economics : cache reads 0.1x base input, 5-min-TTL writes 1.25x, 1-hour writes 2x, and "the cache is refreshed for no additional cost each time the cached content is used" — an active session never expires its own cache.
The exploitation frontier is not turning caching on but (a) not busting the prefix — the official Claude Code docs enumerate exactly 8 invalidators; one avoided bust at a 200k-token history saves $2.30 (5m) / $3.80 (1h); (b) TTL strategy — locally measured: 320/320 main-thread calls wrote 1-hour cache, 1,128/1,128 subagent calls wrote 5-minute cache; (c) structural moves: subagent fan-out ($0.10–0.42 spawn vs 0.1x/turn rent), staggered parallel spawns (~80% off wave input), batch+cache stacking (reads at 0.05x), /rewind over /compact.
Dollar trap vs quota trap: cache reads are cheap per token but dominate volume — GitHub issue #24147 measured 5,092,500,074 cache-read tokens vs 3,887,759 I/O tokens in 30 days (1,310:1, 99.93%) counting against subscription limits; no official quota-weighting formula exists.
Nine pieces of caching folklore die on contact with primary sources (see Claims killed): mid-session CLAUDE.md edits do NOT bust the cache, count_tokens does NOT warm it, the 1h TTL needs 2 reads (not 3) to beat uncached, and keepalive pingers solve a problem subscriptions don't have.

Pricing context and the modeled heavy-day profile are in 01-economics-and-measurement.md; the session baseline is in 02-baseline-audit.md.

Price sheet (Fable 5 $10/MTok base input)

Token class	Multiplier	Fable 5 $/MTok	Source
Uncached input	1x	$10.00	platform.claude.com prompt-caching docs
Cache read (and TTL refresh)	0.1x	$1.00	same; refresh itself is free
Cache write, 5-min TTL	1.25x	$12.50	same
Cache write, 1-hour TTL	2x	$20.00	same
Batch uncached input	0.5x	$5.00	batch docs (50% off stacks with cache)
Batch cache read	0.05x	$0.50	stacking, batch docs
Batch 5-min cache write	0.625x	$6.25	stacking, batch docs

Cache key facts (platform docs): exact prefix match over tools → system → messages; a change anywhere recomputes everything after it. Up to 4 explicit breakpoints; the system checks at most 20 positions per breakpoint (counting itself) when looking back for a prior write. Verification rule: total prompt = input_tokens + cache_creation_input_tokens + cache_read_input_tokens.

What this live session actually paid (local measurement)

Method: parse message.usage from ~/.claude/projects/<project>/<session>.jsonl and its subagents/ directory (script: /tmp/cache_analysis.py plus a TTL summarizer). Two snapshots of the same running session, minutes apart:

Snapshot	Calls	Uncached in	Cache write	Cache read	Mix (un/wr/rd)	Write bill (1h, 2x)	Read bill (0.1x)
Earlier today (sweep)	95	—	723,341	—	0.15 / 8.93 / 90.92 %	$14.47	$7.36
This re-run	300	25,406	996,636	51,401,035	0.05 / 1.90 / 98.05 %	$19.93	$51.40

Two lessons. First, read share climbs over a session's life (90.9% → 98.05%): reads grow roughly quadratically with turn count while writes grow linearly, so "writes exceed reads" (true at 95 calls) flips hard by 300 calls. Optimize writes early in a session's life, reads late. Second, the first turn wrote a 28,669-token fixed prefix (system prompt + tools + project context for this repo) at the 1-hour TTL — that is the quantity every technique below is protecting or shrinking.

Techniques

1. Stable-prefix discipline: avoid the eight documented Claude Code cache busters

One avoided mid-session invalidation = one avoided full-history rewrite at write price instead of read price.

Layer: turn-structure / session habits
Mechanism: The API caches by exact prefix; the official Claude Code caching page (code.claude.com) enumerates the invalidators: (1) /model switch — each model has its own cache; opusplan makes every plan-mode toggle a model switch; Fable 5's safety-classifier fallback to Opus is also a model switch; (2) /effort change — the cache is keyed by effort level too; CC shows a confirmation dialog before applying; (3) enabling fast mode — header joins the cache key, costed once per conversation (v2.1.86+ keeps the header across toggles; earlier versions invalidate every toggle); (4) MCP server connect/disconnect — only when tools load into the prefix; stdio crashes and HTTP session expiry invalidate with no user action; (5) plugin enable/disable only if it ships MCP servers (v2.1.163+ /reload-plugins warns and refuses a cache-busting reload without --force); (6) bare tool deny rules (Bash removes the tool def; scoped Bash(rm *) is safe); (7) /compact (by design; the summarization call itself reads the old cache); (8) Claude Code upgrade — resuming a long session after one "reprocesses the entire conversation history with no cache hits… can be the most expensive request you send." Cache-SAFE per the same page: editing repo files, editing root/user CLAUDE.md mid-session, output-style changes, permission-mode switches (except opusplan), skills/commands, /recap, /rewind, spawning subagents.
Expected savings: Per-bust delta = (1.25−0.1)=1.15x of the history at 5m TTL, (2−0.1)=1.9x at 1h. At a 200k-token history on Fable 5: $2.30 / $3.80 per avoided bust (gross busted turn $2.50 / $4.00 vs $0.20 warm). ESTIMATE on the modeled day ($22, writes 29%): avoiding two casual mid-task busts at ~150k average history ≈ $3.45/day ≈ 16% of the day.
Evidence tier: T1 — code.claude.com/docs/en/prompt-caching and platform.claude.com prompt-caching docs; multipliers verified live.
Quality risk: NEGATIVE-COST. Identical prompts, lower cost and latency; the only constraint is deferring config changes to task boundaries. Degradation impossible by construction; failure mode is human (forgetting and toggling anyway).
Availability: CLAUDE-CODE-TODAY
Effort to adopt: Minutes — pick model+effort at session start; don't toggle fast mode deep in; scoped deny rules; schedule /compact and upgrades at task boundaries; don't resume long sessions right after upgrading.
Composability: Precondition for every other cache technique; pairs with technique 12 to detect violations.
Validation protocol: Run the same 20-turn task twice; in run B insert one /model round-trip mid-task. Diff transcript usage: run B shows one turn with cache_read≈0 and cache_creation≈history size. Assert outputs equivalent (they are — same prompts) and cost delta ≈ history × 1.15 × rate. Watch cache_creation_input_tokens in a statusline thereafter.

2. TTL strategy: CC already splits 1h (main) / 5m (subagents); API keys opt in via ENABLE_PROMPT_CACHING_1H

On subscriptions the 1-hour TTL is free idle insurance, already on. On API keys it costs +60% on writes and pays off only across idle gaps.

Layer: infra / harness config
Mechanism: Locally measured (this session's transcripts): main thread 320/320 calls wrote ephemeral_1h_input_tokens (1,012,031 tok, zero 5m); 1,128/1,128 subagent calls across 21 transcripts wrote ephemeral_5m_input_tokens (10,090,230 tok, zero 1h). Official confirmation (code.claude.com): "On a Claude subscription, Claude Code requests the one-hour TTL automatically… Subagents use the five-minute TTL even on a subscription." Drawing on overage credits drops to 5m. API key/Bedrock/Vertex/Foundry default to 5m; ENABLE_PROMPT_CACHING_1H=1 opts in, FORCE_PROMPT_CACHING_5M=1 forces 5m. Reads refresh the TTL free, so TTL only matters across idle gaps.
Expected savings: A >5-min gap under 5m TTL costs a 1.25x re-write; under 1h it costs a 0.1x read — saving 1.15x of the prefix per gap. Premium paid: +0.75x per token written. This session's 1h premium (had it been API-key billing): 1,012,031 × 0.75 × $10/M = $7.59 with zero >5-min gaps observed — pure loss on gap-free days. One 30-min break at a 400k prefix claws back 400k × 1.15 × $10/M = $4.60. Opt-in rule: Σ(prefix at each 5–60-min gap) × 1.15 must exceed (total tokens written) × 0.75; e.g. one large-prefix break per ~600k written.
Evidence tier: T1 — local JSONL measurement (method above) + code.claude.com TTL section + platform docs 2x multiplier.
Quality risk: NEGATIVE-COST on subscription (no per-token billing, only warmer caches); NEUTRAL on API key — pure cost arithmetic, zero output change. Unknown: how 1h writes weight against subscription plan limits (not published).
Availability: CLAUDE-CODE-TODAY (subscription: already on; API key: one env var)
Effort to adopt: Minutes (env var) or zero (subscription).
Composability: Makes keepalive (technique 3) pointless for the CC main thread; gaps >1h still pay full rewrite. Subagents' 5m TTL is fine — they live minutes (see technique 4).
Validation protocol: On an API key, run a fixed 2-hour workload with one scripted 20-min pause at a known prefix size, once per TTL setting. Compare total cache_creation billing: 5m run shows a second full write after the pause; 1h run shows a read. Outputs identical; only billing differs.

3. max_tokens:0 pre-warming and keepalive pings (raw API/SDK)

A documented, GA way to write or refresh a cache without paying for any output.

Layer: cache / API call pattern
Mechanism: Platform docs : set max_tokens: 0; the API "writes the cache at any cache_control breakpoint" and returns an empty content array, stop_reason: "max_tokens", populated usage; zero output tokens billed; cache-write charge only if the prefix isn't already cached. Docs state it is preferred over and replaces the older max_tokens: 1 workaround. Rejected with: stream: true, thinking enabled, structured outputs, forced tool_choice, and inside Batches. Because TTL refresh on use is free, a warm ping costs only the 0.1x read and re-arms the 5-min clock.
Expected savings: For a planned gap of G minutes over prefix P (base-input units of P): ping strategy = 1.25 + 0.1·⌈G/5⌉; up-front 1h TTL = 2.1 (2x write + one 0.1x read); expire-and-rewrite = 2.5. Pings win for G ≤ ~40 min (G=20: 1.65 vs 2.1 vs 2.5); 1h TTL wins for ~40–60 min; past 60 min both expire — eat the rewrite. Each avoided rewrite at a 200k prefix on Fable 5 saves 200k × 1.15 × $10/M = $2.30. Also kills first-request latency for pre-staged prompts.
Evidence tier: T1 — platform prompt-caching docs. One unverified line item: docs don't explicitly quote the warm-ping read charge (0.1x assumed from the read rule) — see Gaps.
Quality risk: NEUTRAL — pure pre-computation; only risk is paying for pings nothing ever reads.
Availability: SDK (not exposed in Claude Code; unnecessary there on subscriptions — main thread is already 1h).
Effort to adopt: Hours — a timer firing one request with the byte-identical prefix.
Composability: For 5m-TTL fleets and batch pre-staging (warm the cache synchronously before submitting the batch); mutually exclusive with thinking-enabled requests.
Validation protocol: Fire a max_tokens:0 request on a cold 10k prefix, assert cache_creation_input_tokens≈10k, output_tokens=0; repeat within 5 min, assert cache_read_input_tokens≈10k and record the billed amount (closes the 0.1x question); then send the real request and assert full cache read.

4. Subagent fan-out as cache arbitrage: pay $0.10–0.42 once instead of 0.1x rent forever

Every token parked in the main thread is re-read at 0.1x on every later turn; a subagent quarantines exploration in a disposable 5m cache.

Layer: turn-structure / orchestration
Mechanism: Official (code.claude.com): a subagent "starts its own conversation with its own system prompt and tool set… no cache hits on its first call… The parent's cache is unaffected." A fork, by contrast, inherits the parent prefix and reads its cache (compaction's summarization call works the same way) — fork when the child needs parent context, spawn when it doesn't. Local measurement (21 subagent transcripts): first-call cache writes ranged 4,859–30,495 tok by agent type (common shape ~9.7–10.2k) plus ~3,910 uncached input → spawn cost $0.10–0.42, typical ~$0.16 on Fable 5. Two subagents each quarantined ~1.25M and ~1.30M tokens of cache writes that never touched the parent prefix.
Expected savings: Inline, a 1.25M-token exploration costs 1.25M × 0.1 × $10/M = $1.25 in reads per subsequent main-thread turn — $12.50 over the next 10 turns vs $0.16 spawn overhead. Break-even (spawn 16,410 base-units vs inline R×(1.25 + 0.1×T turns remaining)): T=5 → R≥9.4k tok; T=10 → 7.3k; T=20 → 5.0k; T=50 → 2.6k. Fan-out loses money below the break-even — a single grep is cheaper inline.
Evidence tier: T1 — local transcript measurement (method above) + official subagent/fork semantics.
Quality risk: NEUTRAL to RISKY — isolation prevents main-context rot (quality positive; see 12-context-architecture.md) but the returned summary is lossy; under-specified subagent prompts cause rework that erases savings. Degradation shows as repeated re-delegation or the main thread re-reading what the subagent already read.
Availability: CLAUDE-CODE-TODAY (Task tool / custom agents)
Effort to adopt: Minutes — discipline of choosing spawn vs inline vs fork by expected context volume.
Composability: Stacks with compressed subagent output (caveman cavecrew claims ~60% smaller tool-results re-entering main context — that shrinks the part of subagent output that does pay rent; measured numbers in 15-output-discipline.md). Anti-synergy: many tiny spawns below break-even.
Validation protocol: Same research task twice: inline vs delegated. Compare (a) total cost from transcripts, (b) main-thread context size at task end, (c) blind-judged answer quality. Expect ≥30% cost cut and no quality delta for >10k-token explorations with ≥10 turns of session remaining.

5. /rewind instead of /compact to abandon a dead end (cache-warm time travel)

Rewinding truncates to a prefix that is already cached — and still warm even past the TTL — while /compact deliberately invalidates the conversation layer.

Layer: turn-structure / session habits
Mechanism: Official (code.claude.com): "/rewind truncates your conversation back to an earlier turn. The remaining history is the same content the cache was built from… so the next request hits the earlier cache entry. Every turn since then has read through that prefix, which kept the entry warm even if the original turn was longer ago than the TTL." Compaction invalidates the conversation layer by design (its summarization call shares the old prefix and reads cache, so generating the summary is cheap input-side; the post-compaction turn rebuilds only the short summary). /recap appends rather than replaces — the cache-safe status check.
Expected savings: ESTIMATE: rewinding past a 150k-token dead end makes the next turn a $0.15 read (150k × 0.1 × $10/M at the rewind point's prefix) with zero rebuild; compacting instead pays the summary's output tokens (billed at 5x input — e.g. 3k summary ≈ $0.15 on Fable 5), re-writes summary+project context (~30k × 2x × $10/M ≈ $0.60), and keeps a summary of the dead end in context. Tens of cents to low dollars per use; the larger win is structural (no dead-end residue).
Evidence tier: T1 for mechanism (official quotes); ESTIMATE for dollar figures (arithmetic shown).
Quality risk: NEUTRAL — rewind discards work by design; used only for genuinely abandoned paths it has no quality cost and removes misleading context. /compact remains right at natural boundaries when old context is genuinely stale.
Availability: CLAUDE-CODE-TODAY
Effort to adopt: Zero — habit change.
Composability: Pairs with checkpointing; orthogonal to TTL; reduces the /compact events that technique 1 schedules.
Validation protocol: Drive a session into a 50k-token wrong path; branch A /rewind, branch B /compact, then issue the identical next task. Compare next-turn cache_read vs cache_creation, total cost, and whether branch B's answer references the dead end. Expect A: near-pure read; B: write spike + summary cost.

6. Deferred MCP tools make tool churn cache-free (and shrink the prefix)

With tool search (default on supported models), MCP server flap "only appends new content and doesn't disturb anything already cached."

Layer: input / tool configuration
Mechanism: Tool definitions render at position 0 (tools → system → messages), so any in-prefix change invalidates everything. Deferral loads schemas on demand, appended. Deferral is unavailable/disabled on Haiku models, Vertex AI, custom ANTHROPIC_BASE_URL gateways, and bypassed by alwaysLoad or threshold-based loading — those paths still invalidate on any server flap, including silent stdio crashes and HTTP session expiry (code.claude.com).
Expected savings: Two parts: (a) avoided full-history rewrite per server flap — same per-event economics as technique 1 ($2.30–3.80 at 200k); (b) permanent prefix cut — phase-0 local measurement: 11 local MCP tool schemas = 1,420 tok loaded vs ~60 tok as a deferred name list (~96% of schema footprint), which is also 1,360 fewer tokens re-read at 0.1x every turn.
Evidence tier: T1 — code.claude.com deferral rules + phase-0 local measurement .
Quality risk: NEGATIVE-COST in cost terms; small retrieval risk that the model fails to search for a deferred tool when needed — manifests as "tool not found/used" misses on niche tools.
Availability: CLAUDE-CODE-TODAY (default on supported models; audit for alwaysLoad and gateway setups that silently disable tool search).
Effort to adopt: Minutes (audit) — zero if defaults untouched.
Composability: The advisor tool sits after the cache breakpoint (toggling /advisor is cache-safe). Plugins shipping MCP servers inherit these rules. Compare the deferral-vs-loaded context numbers in 02-baseline-audit.md.
Validation protocol: With one flaky stdio server configured alwaysLoad, kill its process mid-session and observe the next turn's cache_creation spike; repeat with deferral on and observe no spike. Then verify the deferred tool is still discovered and invoked on a task that needs it.

Claude Code has a local file cache scoped to machine/directory/git state; the hosted server prompt cache is workspace/org/model/prefix-scoped. One SDK flag makes a hosted fleet share ONE cached system prompt.

Layer: infra / SDK fleet architecture
Mechanism: The claude_code preset embeds working directory, git-repo flag, platform, shell, OS version, and auto-memory paths in the system prompt; any difference is a server prompt-cache miss. excludeDynamicSections: true (TS SDK ≥ v0.2.98) / exclude_dynamic_sections: True (Python ≥ v0.1.58) / CLI --exclude-dynamic-system-prompt-sections moves that context into the first user message "so identical configurations share a cache entry across users and machines" (code.claude.com agent-sdk/modifying-system-prompts). Scope correction : the hosted server prompt cache is org/workspace/model/exact-prefix scoped; the machine+directory, startup-git-snapshot, and worktree rules describe Claude Code's local file cache layer, not the billed server cache. CLAUDE.md content is exempt — the SDK injects it into the conversation, not the system prompt.
Expected savings: ESTIMATE: each cold worker otherwise pays its own first-turn write of the fixed prefix — locally measured 28,669 tok for this repo → 28.7k × 1.25 × $10/M = $0.36 per cold worker at 5m TTL, plus re-writes at every TTL expiry across a bursty fleet. For N short-lived CI jobs, the shared entry converts N writes into 1 write + (N−1) reads, saving 1.15 × prefix × (N−1) base-units.
Evidence tier: T1 for mechanism (official docs); ESTIMATE for fleet dollars (no published figures; dynamic-section size itself unmeasured — see Gaps).
Quality risk: QUALITY-TRADE (mild, documented): environment context "carr[ies] marginally less weight" in the user message; enable when cross-session reuse outweighs maximally authoritative env context. Degradation would show as the agent mis-trusting cwd or auto-memory paths.
Availability: SDK
Effort to adopt: Hours — one flag plus byte-aligning model, effort, tool set, and append text across the fleet.
Composability: Pairs with 1h TTL for bursty CI and with technique 8 for wave starts. Anti-synergy: any per-worker prompt customization forks the cache.
Validation protocol: Launch two SDK workers in different directories with the flag on; assert worker 2's first call shows cache_read>0 on the system-prompt segment. Then run an env-sensitive task ("which directory are you in?") and assert correctness despite relocation of the context.

8. Stagger concurrent fan-out: await the first streamed token before firing the rest

N parallel identical-prefix requests all pay write price; staggering converts N−1 writes into reads.

Layer: cache / API call pattern
Mechanism: Documented (platform docs): "a cache entry only becomes available after the first response begins. If you need cache hits for parallel requests, wait for the first response before sending subsequent requests." Local corroboration both ways (21 subagent transcripts): the concurrently-spawned first wave showed first-call cache_read=0 across the board, while 9 subagents spawned later — after siblings' responses existed — read 4,802–5,055 tokens of shared prefix on their first call. Staggering happens naturally for sequential spawns; Claude Code does not appear to stagger parallel Task waves.
Expected savings: For N workers on shared prefix P: naive N × 1.25P; staggered 1.25P + (N−1) × 0.1P. At N=8, P=10k (the measured CC subagent prefix), Fable 5: $1.00 naive vs $0.20 staggered (~80% off the wave's input). Scales with P: a wave of 8 over a 100k shared document: $12.50 vs $1.95. (Corrects the sweep's $0.21/$2.13 figures — arithmetic: (1.25+0.7)×P×$10/M.)
Evidence tier: T1 — official concurrency rule + local partial-hit evidence.
Quality risk: NEUTRAL — adds one TTFT of latency to the wave start; outputs unchanged.
Availability: SDK (one await in your orchestrator); not user-controllable inside Claude Code's Task scheduling today.
Effort to adopt: Hours — await first streamed token (not completion) of request 1, then fire the rest.
Composability: Essential for technique 7 fleets, eval harnesses, batch-adjacent fan-out; pair with technique 3 (pre-warm, then fire the whole wave).
Validation protocol: Spawn two identical-prompt workers 10 s apart and two simultaneously; diff first-call cache_read (expect ~prefix vs 0). This also settles whether CC's residual 0-read spawns were the race or prompt diffs (see Gaps).

9. Batch API + prompt caching stack for offline/overnight agent work

Batch halves everything including cache prices; official guidance is one shared cache_control prefix per batch and the 1-hour TTL.

Layer: infra / workload scheduling
Mechanism: Batch docs (platform.claude.com): "The pricing discounts from prompt caching and Message Batches can stack… cache hits are provided on a best-effort basis. Users typically experience cache hit rates ranging from 30% to 98%, depending on their traffic patterns." Official Tip: "Since batches can take longer than 5 minutes to process, consider using the 1-hour cache duration." Constraints: each batched request needs max_tokens ≥ 1 — max_tokens: 0 is rejected "since an ephemeral cache entry written during batch processing would likely expire before the follow-up request runs"; 100,000 requests or 256 MB per batch; most complete <1 h, 24 h expiry; 50% off all token usage. cache_hint/context_hint are listed as sync-only "routing hints" (see Gaps).
Expected savings: ESTIMATE (arithmetic on documented multipliers): overnight 1,000-task eval sharing a 50k prefix on Sonnet 4.6 ($3/MTok): at 90% hit rate, prefix cost = 45M × $0.15/M + 5M × $1.875/M = $6.75 + $9.38 = $16.13 vs $150 sync-uncached (~89% off). At the documented 30% floor: 15M × $0.15/M + 35M × $1.875/M = $67.88 (~55% off). NOTE: this corrects the sweep's "$33.0 at 30%" figure, which doesn't survive the arithmetic. The 30–98% spread makes batch-cache economics ~4x uncertain — measure your own hit rate from result usage fields.
Evidence tier: T1 for multipliers/hit-rate range (official docs); ESTIMATE for workload dollars.
Quality risk: NEUTRAL — same models and prompts, async (up to 24 h); expired requests are unbilled.
Availability: SDK
Effort to adopt: Days — restructure offline work (evals, repo sweeps, nightly review queues) around an identical leading prefix.
Composability: Stacks with cheaper models; cannot combine with max_tokens:0 pre-warming inside the batch (warm synchronously beforehand instead) or fast mode.
Validation protocol: Submit a 100-request pilot batch with one cache_control'd 50k prefix at 1h TTL; compute the realized hit rate from per-result cache_read_input_tokens; spot-check 10 outputs against sync runs of identical requests for equivalence.

10. Mid-conversation system messages (beta): change operator instructions without burning the prefix

Append {role: "system"} to messages[] instead of editing the top-level system field.

Layer: input / API call pattern
Mechanism: Platform docs : "On Claude Opus 4.8, you can add a new system instruction partway through a conversation without invalidating the system or message caches. Append a {"role": "system"} message to messages instead of editing the top-level system field, so the cached prefix stays unchanged." Claude Code uses the same append-only shape internally: plan mode and skill loading "append their instructions as conversation messages, so the cached prefix stays intact" (code.claude.com). Fallback on other models: <system-reminder> text in a user turn (same cache profile, weaker authority).
Expected savings: One avoided full-prefix invalidation per mid-session instruction change — per-event economics of technique 1 ($2.30–3.80 at a 200k history on Fable 5).
Evidence tier: T1 for the documented behavior; model support beyond Opus 4.8 unverified (docs name only Opus 4.8 — see Gaps).
Quality risk: NEUTRAL — docs advise phrasing as context, not override commands; instruction adherence from message position is marginally weaker than top-level system.
Availability: SDK (beta header mid-conversation-system-2026-04-07); already CC's internal pattern, not user-invocable.
Effort to adopt: Hours for SDK harness authors.
Composability: The designated escape hatch for frozen-system-prompt discipline (no timestamps/UUIDs/flags interpolated into system); pairs with techniques 1 and 7.
Validation protocol: 50-turn synthetic session; at turn 25 inject a policy change via appended system message vs edited top-level system. Assert (a) cache_read continuity in variant A, full re-write in variant B; (b) equal instruction-following rates on the new policy across the remaining 25 turns.

11. Know the minimum cacheable prefix — model-specific, 512 to 4,096 tokens

Below the per-model minimum, cache_control silently does nothing; no error is returned.

Layer: cache / API mechanics
Mechanism: Live table (platform docs): Fable 5 / Mythos 5 = 512 (but 1,024 on Bedrock); Mythos Preview = 2,048; Opus 4.8 = 1,024; Opus 4.7 = 2,048; Opus 4.6 / 4.5 = 4,096; Opus 4.1 / 4 = 1,024; Sonnet 4.6 / 4.5 / 4 = 1,024; Haiku 4.5 = 4,096; Haiku 3.5 = 2,048. "Any requests to cache fewer than this number of tokens will be processed without caching, and no error is returned." Detection: both cache_creation_input_tokens and cache_read_input_tokens stay 0. The counterintuitive ordering: Haiku 4.5 — the model picked for cheap high-frequency calls — has the HIGHEST floor; a 3.5k classifier prompt is uncacheable on Haiku 4.5 but caches fine on Sonnet 4.6.
Expected savings: Avoids silent 100% cache-miss on small prompts. ESTIMATE: a 3.5k-token prompt called 1,000x/day pays 3.5M × $1/M = $3.50/day uncached on Haiku 4.5; padded ≥4,096 (or moved to Sonnet 4.6) it pays ~0.1x on hits — order of $0.35–1.05/day equivalent. Workload-specific; verify against your shapes. Local relevance: this repo's always-on AGENTS.md alone = 2,744 tok on Fable 5 (measured via /tmp/ct.py; 1,919 on Sonnet 4.6 — the +43% tokenizer gap compounds cache-write premiums) — above Fable/Sonnet floors, below Haiku 4.5's; CC's full 28.7k fixed prefix clears every minimum.
Evidence tier: T1 — live table, two consistent fetches; local token counts via /tmp/ct.py.
Quality risk: NEGATIVE-COST — pure accounting awareness; padding a prompt to cross a floor is the only act with any content effect (keep padding semantically inert).
Availability: SDK (Claude Code prompts always clear the minimums)
Effort to adopt: Minutes — a CI assertion that cache_creation>0 on prompts you believe cache.
Composability: Gate check for techniques 3, 7, 8, 9 on small prompts. These minimums have already moved between doc revisions — never quote from memory.
Validation protocol: Send a ~600-token cache_control'd request to Fable 5 (above its 512 floor) and the same to Haiku 4.5 (below its 4,096): assert cache fields populate for the former and stay 0 for the latter; also settles whether the minimum counts tools+system+messages cumulatively.

12. Measure before optimizing: statusline, transcript JSONL, OTel

Cache problems are diagnosable from fields you already have, free.

Layer: infra / observability
Mechanism: Statusline scripts read current_usage with cache_creation_input_tokens / cache_read_input_tokens; official heuristic: "A high read-to-creation ratio means caching is working well. If creation stays high turn after turn, something is changing in your prefix" (code.claude.com). Org-wide: OTel exporter reports cache read/creation per user and session. Locally, ~/.claude/projects/**/*.jsonl carries per-call message.usage including the ephemeral_5m/1h split — that is how every local number in this file was produced, with a 30-line parser. API-side: if cache_read is 0 across identical-prefix requests, hunt the silent invalidator (timestamps, unsorted JSON, varying tool sets).
Expected savings: Enabler, not a saver. Healthy reference points (local): this session 98.05% reads / 1.90% writes / 0.05% uncached at 300 calls; phase-0 heavy session 92.83 / 6.73 / 0.44. Key diagnostic insight from the two snapshots: write-vs-read dollar dominance flips with session age ($14.47 writes > $7.36 reads at 95 calls; $19.93 writes < $51.40 reads at 300) — target write hygiene early, read volume (context size, see 12-context-architecture.md) late. Quota angle: issue #24147 measured 1,310 cache-read tokens per productive I/O token over 30 days — in quota terms reads are the whole game even when dollars say writes.
Evidence tier: T1 (official fields + local reproduction); the #24147 quota-weighting claim itself is unconfirmed (no Anthropic response on the issue) — the ratio is real, the weighting is not published.
Quality risk: NEGATIVE-COST.
Availability: CLAUDE-CODE-TODAY
Effort to adopt: Minutes — statusline script or JSONL parser (/tmp/cache_analysis.py pattern); validation harness integration in 31-validation-harness.md.
Composability: Feeds every technique above; before quoting community CLI percentages (ccusage etc.), check their hit-rate formula (reads/(reads+writes+uncached) vs per-turn ratios differ).
Validation protocol: Self-validating — recompute one session's cost from JSONL and reconcile against console billing / usage page to ±2%.

Claims killed (each verified against primary sources)

"Editing CLAUDE.md mid-session busts your cache." Opposite is documented: root/user CLAUDE.md "are read once at session start… Editing them mid-session does not invalidate the cache, but the edit also doesn't apply" until /clear, /compact, or restart (code.claude.com). Community posts blaming CLAUDE.md edits for cache spikes are wrong for current CC root-level files; only nested CLAUDE.md / paths:-frontmatter rules load lazily.
"The 1-hour cache needs ≥3 reads to pay off." Vs no caching: 2x write + n×0.1 reads < (1+n)×1 at n=2 (2.2 < 3.0); n=1 narrowly loses (2.1 > 2.0). Vs 5m TTL in gap-prone traffic: a single >5-min-gap reuse wins (2.1 < 2.5). The "3" is a garbled "at least three requests" (= 1 write + 2 reads).
"Ping count_tokens to keep the cache warm — it's free." Free, yes; cache-inert, explicitly: "No, token counting provides an estimate without using caching logic… prompt caching only occurs during actual message creation" (token-counting docs FAQ). Separate RPM pool: 100 / 2,000 / 4,000 / 8,000 by tier. The real warmer is max_tokens:0 (technique 3).
"max_tokens must be ≥1, so warmers waste an output token." Outdated — max_tokens:0 is GA, bills zero output, and docs say it replaces the max_tokens:1 workaround. Still true inside batches only (max_tokens ≥ 1 there, with a documented reason).
"Switching permission modes busts the cache." "Mode changes are cache-safe" — sole exception is plan mode under opusplan, where the cost is the MODEL switch (code.claude.com). Also safe: output-style changes (don't apply mid-session), scoped deny rules, skills, /recap.
"Adding/removing an MCP server mid-session always invalidates." Only when tool definitions load into the prefix; with deferred tools (the default on supported models) a server flap "only appends new content." Survives only for Haiku models, Vertex AI, custom gateways, alwaysLoad, threshold-loaded definitions.
"Cache reads are basically free, ignore them." Dollars: $51.40 (largest input-side line, 72% of this session's input bill) at 300 calls (local). Quota: issue #24147 — 5,092,500,074 reads vs 3,887,759 I/O tokens in 30 days (1,310:1, 99.93%), open, no official response on weighting.
"Minimum cacheable prompt is 1,024 tokens." Stale — live table is per-model, 512–4,096 (technique 11), differs by platform (Bedrock), and has changed between doc revisions. Even the bundled claude-api skill's copy disagrees with the live page.
"Set up a keepalive pinger for Claude Code sessions." Subscriptions already write the main thread at 1h TTL (locally: 320/320 calls); sub-hour breaks cost a 0.1x re-read, and pure-API pinging isn't reachable from inside CC anyway. Keepalive is rational only for raw-SDK 5m-TTL callers with planned gaps under ~40 min (technique 3 arithmetic).

Surprising findings

The split-TTL strategy (1h main / 5m subagents) is real, automatic, and measured here at 100%/100% — and at 2x write price it made writes the dominant input cost early in the session before reads took over.
The cache is keyed by effort level, not just model — /effort mid-session re-reads the whole history uncached, which is why CC now interposes a confirmation dialog.
/rewind hits cache entries older than the TTL because every intervening turn refreshed the shared prefix for free.
Your git state affects Claude Code's local file-cache reuse: sequential sessions share that local layer only if the startup branch+commits snapshot matches; worktrees of one repo never share. The billed server prompt cache is separately scoped by workspace/org/model/exact prefix (see technique 7).
MCP servers can bust the cache with no user action (stdio crash, HTTP session expiry, auto-reconnect) when their tools are in-prefix.
Sequentially-spawned subagents got 4,802–5,055-token first-call cache hits where the concurrent wave got zero — the documented concurrency race, visible in the wild (local).

Gaps / unresolved

Quota weighting of cache reads on subscriptions — #24147 demonstrates the 1,310:1 ratio; Anthropic publishes no formula. "Reads are cheap" is true in dollars, unprovable in quota.
cache_hint / context_hint — surfaced in the batch docs' not-supported table as sync-only "routing hints"; no dedicated docs page found. Potentially a whole technique (cache-aware request routing); watch for documentation.
max_tokens:0 warm-ping billing — write-charge-if-cold and zero output are documented; the 0.1x read line item on a warm ping is inferred, one live request away from closed (protocol in technique 3).
20-position lookback in CC practice — API rule verified; whether CC's breakpoint placement can miss it on >20-block tool-heavy turns is unobserved harness internals.
Mixed-TTL strategies (5m normally, one 1h write before a planned break) — docs have a mixing section not fully verified here; technique 3's table assumes one TTL per prefix.
Dynamic-section token size for technique 7 — measurable by diffing count_tokens over two cwd variants of the preset; not yet done.
Model support list for mid-conversation system messages beyond Opus 4.8 (Fable 5 implied by family, not quoted).
The community "99.5% sustained hit rate" post (vsits.co) 403'd on direct fetch — search-snippet only, excluded from evidence (unverified T3).

Verification ledger

#	Number / claim	Source or method (all accessed/)
1	0.1x read / 1.25x 5m write / 2x 1h write; free refresh on use	https://platform.claude.com/docs/en/build-with-claude/prompt-caching.md (live fetch)
2	max_tokens:0 semantics, zero output billed, rejection list, replaces max_tokens:1	same page
3	Min cacheable prefix table: Fable 5=512 (Bedrock 1,024), Opus 4.8=1,024, Opus 4.7=2,048, Opus 4.6/4.5=4,096, Sonnet 4.6/4.5=1,024, Haiku 4.5=4,096, Haiku 3.5=2,048; silent no-error failure	same page
4	Concurrency rule (entry available after first response begins)	same page
5	4 breakpoints; 20-position lookback per breakpoint	same page
6	Usage sum rule (input + cache_creation + cache_read)	same page
7	Opus 4.8 mid-conversation `{role:"system"}` exception	same page
8	8 invalidators incl. effort-level keying, opusplan, fast-mode header (v2.1.86+), /reload-plugins guard (v2.1.163+), upgrade-resume warning	https://code.claude.com/docs/en/prompt-caching (live fetch)
9	Cache-safe list; CLAUDE.md "read once… edit doesn't apply" quote	same page
10	TTL policy: subscription 1h auto, subagents 5m, ENABLE_PROMPT_CACHING_1H, FORCE_PROMPT_CACHING_5M, credits→5m; DISABLE_PROMPT_CACHING vars	same page
11	/rewind past-TTL warmth quote; /recap appends; compaction cost anatomy	same page
12	Subagent own-cache/5m; fork inherits parent cache	same page
13	Cache scope correction: hosted server cache = workspace/org/model/exact prefix; Claude Code local file cache = machine+directory/git snapshot/worktree rules; workspace isolation	code.claude.com prompt-caching + platform prompt-caching; corrected
14	excludeDynamicSections: TS ≥v0.2.98 / Py ≥v0.1.58, CLI flag, "share a cache entry across users and machines", "marginally less weight" tradeoff; CLAUDE.md injected into conversation	https://code.claude.com/docs/en/agent-sdk/modifying-system-prompts (live fetch)
15	Batch+cache stacking; 30–98% best-effort hit rates; 1h-TTL tip; max_tokens≥1 in batches with quoted reason; cache_hint/context_hint sync-only; 100k req / 256 MB / <1h typical / 24h expiry; 50% discount	https://platform.claude.com/docs/en/build-with-claude/batch-processing.md (live fetch, persisted to tool-results)
16	count_tokens cache-inert FAQ; free; RPM 100/2,000/4,000/8,000	https://platform.claude.com/docs/en/build-with-claude/token-counting.md (live fetch)
17	Fable 5/Mythos 5 tokenizer "roughly 30% more tokens" than pre-Opus-4.7 models	same page (consistent with phase-0 local +15% code / +38% prose)
18	#24147: 5,092,500,074 cache reads vs 3,887,759 I/O over 30 days (Jan 9–Feb 8 2026); 1,310:1; 99.93%; open; no official response	https://github.com/anthropics/claude-code/issues/24147 (live fetch)
19	Main thread 320/320 calls 1h TTL (1,012,031 tok, zero 5m); subagents 1,128/1,128 calls 5m (10,090,230 tok, zero 1h)	Local: TTL summarizer over `~/.claude/projects/.../2e1d1f1f*.jsonl` + subagents/
20	300-call snapshot: 25,406 uncached / 996,636 write / 51,401,035 read / 377,159 out; mix 0.05/1.90/98.05%; $0.25+$19.93+$51.40=$71.59 vs $524.23 uncached-equivalent = 86.3% saved	Local: /tmp/cache_analysis.py over same transcripts
21	Earlier same-session snapshot: 95 calls, mix 0.15/8.93/90.92%, writes $14.47 > reads $7.36, 723,341 1h tokens	Sweep-agent local measurement earlier (same session, superseded totals; retained for the write/read flip)
22	First-turn fixed-prefix write 28,669 tok (1h)	Local: first assistant message usage in main transcript
23	Subagent first calls: 11/21 cache_read=0; 9/21 read 4,802–5,055; first-call writes 4,859–30,495 + ~3,910 uncached → spawn $0.10–0.42	Local: per-transcript first-call usage
24	`AGENTS.md` = 2,744 tok (claude-fable-5) vs 1,919 (claude-sonnet-4-6), +43%	Local: count_tokens run on `AGENTS.md`
25	11 MCP tool schemas = 1,420 tok loaded vs ~60 deferred; modeled day $22 (reads 32%/writes 29%/thinking 20%/visible 17%/uncached 2%)	Phase-0 local measurements (given as ground truth)
26	Per-bust $2.30/$3.80 at 200k; ping table (1.25+0.1·⌈G/5⌉ vs 2.1 vs 2.5, ~40/60-min crossovers); break-even table (9.4k/7.3k/5.0k/2.6k @ T=5/10/20/50); stagger $1.00→$0.20 (N=8, P=10k); batch eval $16.13 @90% / $67.88 @30% vs $150; 1h premium $7.59 this session; $0.36/cold worker; n=2 break-even for 1h vs uncached	ESTIMATE — arithmetic shown inline from rows 1, 15, 19–23
27	vsits.co "99.5% hit rate"	UNVERIFIED (T3, fetch 403'd) — excluded from evidence-bearing claims

13 — Caching exploitation

On this page