18 — Provider-level features, current and beta

TL;DR

Prompt caching is the lever already running: 0.1x reads / 1.25x (5m) or 2x (1h) writes, free TTL refresh on every hit, and Fable 5 now has the LOWEST minimum cacheable prefix ever shipped (512 tokens vs 4,096 on Opus 4.6/Haiku 4.5). Without the 0.1x read rate, the modeled heavy day (~$22) would cost ~$85 (ESTIMATE, arithmetic below). All T1, verified live.
The two biggest published-number levers for agent workloads: context editing (vendor: 84% token cut AND +29% task performance; with memory tool +39%) and tool search/defer_loading (vendor: >85% tool-context cut, 77K→8.7K tokens, accuracy 49%→74% on Opus 4; locally reproduced at ~96% on 11 MCP schemas). Both are SDK-side; neither is a Claude Code knob.
The 1M-context premium is dead: Fable 5/Opus 4.8-4.6/Sonnet 4.6 bill flat per-token across the window ("A 900k-token request is billed at the same per-token rate as a 9k-token request"), and 1M is the DEFAULT on Fable 5 — every 2025 cost model with a 200k pricing cliff is wrong.
Thinking spend (54.8% of output tokens, ~20% of dollars in the local max-effort session) has exactly one API control on Fable 5: output_config.effort. Anthropic publishes NO per-level percentages — only qualitative guidance — so the magnitude is the dossier's biggest open measurement (T4).
Two of Anthropic's own live pages still say prior-turn thinking is "automatically stripped"; the adaptive-thinking and context-editing pages say Opus 4.5+/Sonnet 4.6+/Fable 5 keep ALL prior thinking turns in context and bill them as input. The contradiction is live and worth real money on long sessions.

Method: every doc claim re-fetched live from platform.claude.com (docs.claude.com redirects there) / claude.com / anthropic.com / github.com; cheap-to-check token claims reproduced locally via the free count_tokens harness (/tmp/ct.py, OAuth, exact-model counts). Modeled session profile for arithmetic (local Phase-0 measurement): heavy day ≈ 6 sessions × (19 calls, 5.5k uncached in, 85k cache-write, 1.17M cache-read, 27k out @ 55% thinking) ≈ $22/day at Fable 5 list prices — internally consistent with the live price table to within 1%.

Where the local dollars go, and which provider lever attacks each bucket

Bucket (local split of ~$22/day)	$/day	Provider levers
Cache reads 32%	~$7.0	compaction (caps re-read context), context editing, tool search (smaller prefix), 1M discipline
Cache writes 29%	~$6.4	compaction (shorter prefixes re-written), avoiding cache busts (defer_loading, stable thinking mode)
Thinking output 20%	~$4.4	effort (only lever on Fable 5), per-message steering
Visible output 17%	~$3.7	effort (affects ALL response tokens), see 15-output-discipline.md
Uncached input 2%	~$0.4	nothing needed — caching already absorbs it

Local measurements (count_tokens API, method: /tmp/ct.py)

Tokenizer comparison — identical text counted per model (samples: 2.0kB markdown prose from this repo, 2.0kB Rust, 2.9kB JSON tool schemas):

Sample	chars	Fable 5	Opus 4.8	Sonnet 4.6	Haiku 4.5	Fable vs Sonnet	chars/tok (Fable / Sonnet)
prose	1,985	706	706	512	512	+37.9%	2.81 / 3.88
code (Rust)	2,000	806	806	631	631	+27.7%	2.48 / 3.17
JSON schema	2,898	969	969	890	890	+8.9%	2.99 / 3.26

Fable 5 ≡ Opus 4.8 and Sonnet 4.6 ≡ Haiku 4.5 byte-for-byte on every sample — two tokenizer families confirmed. The +27.7% on this Rust sample is above the Phase-0 ~15% code figure (different sample) but inside the vendor's "up to 35%" band; structured JSON pays the smallest penalty.

Tool-use overhead — docs' reference request (get_weather tool, "What's the weather like in San Francisco?"):

Model	with tool	no tool	delta	documented system tax (auto/none)
Fable 5	403	19	384	absent from pricing table
Opus 4.8	403	19	384	290
Opus 4.7	793	24	769	675
Sonnet 4.6	587	16	571	497
Haiku 4.5	586	16	570	496

The 403 is an EXACT reproduction of the token-counting page's published example output, and Fable 5 matches Opus 4.8 identically — pinning Fable 5's undocumented tool-use system tax at the lean 290-token Opus 4.8 prompt, not 4.7's 675 (a 57% quiet drop between 4.7 and 4.8).

One negative result: count_tokens REJECTS fabricated thinking-block signatures (400: Invalid signature in thinking block, reproduced on Fable 5 and Sonnet 4.6) — so the prior-turn-thinking billing question cannot be settled with the free harness alone; it needs a real signed transcript (see record 5 and Gaps).

Technique records

1. Prompt caching (incl. automatic mode) — the lever already paying for itself

Layer: cache / input.
Mechanism: Opt-in cache_control. Automatic mode: one top-level field, system places/moves a single breakpoint on the last cacheable block (Claude API, Claude Platform on AWS, MS Foundry beta — NOT Bedrock/Vertex). Explicit mode: up to 4 breakpoints, 20-block lookback per breakpoint. Writes 1.25x (5m TTL) / 2x (1h); reads 0.1x; "The cache is refreshed for no additional cost each time the cached content is used." Min cacheable prefix (live table): Fable 5/Mythos 5 = 512 (1,024 on Bedrock), Opus 4.8 = 1,024, Opus 4.7 = 2,048, Opus 4.6/4.5 = 4,096, Sonnet 4.6/4.5 = 1,024, Haiku 4.5 = 4,096. Invalidation hierarchy tools→system→messages: tool-definition edits bust everything; tool_choice/images/thinking-parameter changes bust messages only.
Expected savings: already realized in the baseline: local heavy session prompt mix is 0.44% uncached / 6.73% cache-write / 92.83% cache-read; reads+writes = 61% of dollars. ESTIMATE counterfactual: 7.02M cache-read tok/day at 1.0x instead of 0.1x = $70.2 vs $7.0 → day total ~$85 vs ~$22 (~3.9x). Pays off after 1 read (5m) or 2 reads (1h) — pricing page's own arithmetic.
Evidence tier: T1 (live caching + pricing pages) + local baseline measurement.
Quality risk: NEUTRAL — identical model output; pure price arbitrage. Falsification: none needed; verify via usage.cache_read_input_tokens > 0.
Availability: CLAUDE-CODE-TODAY (applied automatically; the 92.83% read share is the proof) + SDK for breakpoint/TTL control.
Effort to adopt: zero in Claude Code; minutes in API code. 1h TTL only wins when request gaps exceed 5 min — free refresh makes 5m TTL sufficient for active sessions.
Composability: multipliers stack with Batch 50% and data-residency 1.1x ("These multipliers stack with other pricing modifiers"). Tension: context editing/compaction invalidate cache at the edit point; defer_loading and stable adaptive thinking preserve it. See 13-caching-exploitation.md.
Validation protocol: for any session, assert cache_read_input_tokens / (input + cache_creation + cache_read) > 0.85 across calls 2..N; alert on zero reads with identical prefixes (silent invalidator). Distinguish 5m vs 1h writes via the per-TTL cache_creation breakdown fields in usage.

2. Server-side compaction (`compact_20260112`) — API-native auto-summarization

Layer: turn-structure / context lifecycle.
Mechanism: Beta anthropic-beta: compact-2026-01-12, edit {"type": "compact_20260112"} under context_management. Default trigger 150,000 input tokens (minimum 50,000). API emits a compaction block; on later requests everything before it is dropped server-side. Optional pause_after_compaction and custom instructions (REPLACE the default prompt entirely). Billing: summarization is a separate iteration — doc example: compaction 180,000 in + 3,500 out, message 23,000 in + 1,000 out → 203k in + 4.5k out total, while top-level usage shows only 23k/1k (~88% of the turn invisible to naive dashboards). Re-applying an existing compaction block costs nothing. Models: Fable 5, Mythos 5/Preview, Opus 4.8/4.7/4.6, Sonnet 4.6. Summarization runs on the SAME model — "There is no option to use a different (for example, cheaper) model"; with tools defined the model may call a tool instead of summarizing (content: null) unless instructions forbid it.
Expected savings: no vendor savings/quality numbers published (T4 for magnitude). ESTIMATE: capping per-turn re-read context at 150k instead of growing toward 1M caps the cache-read line at ~15% of worst case ($0.15 vs $1.00 per Fable 5 turn at 0.1x); on the local profile the cache-read bucket (32% of dollars) scales with retained context.
Evidence tier: T1 mechanics/billing (live compaction page); T4 net savings.
Quality risk: QUALITY-TRADE — summary is lossy by construction; zero published evals. Degradation: forgotten constraints/decisions after the boundary. Falsification: continue 20 real coding tasks from compacted vs full history; diff task success and re-asked-question rate.
Availability: SDK (Messages API beta). NOT a Claude Code knob — Claude Code ships its own client-side auto-compact, a separate implementation with at least one live mis-trigger regression: auto-compact firing at 84k/1M (8% usage) every 1-2 min in v2.1.112 (github.com/anthropics/claude-code/issues/52981).
Effort to adopt: hours (one edits entry + append full response.content so compaction blocks persist; track cost via usage.iterations[], populated only on turns that trigger a new compaction).
Composability: pairs with memory tool (state survives the boundary); put a cache breakpoint at end of system prompt so compactions don't bust the system prefix; compaction blocks accept cache_control. Each compaction = one cache re-write of the new shorter prefix.
Validation protocol: A/B a 500k-token scripted agent run, compaction-at-150k vs raw growth: compare total $ (summed over iterations[]), task completion, and factual-recall probes referencing pre-boundary content.

3. Context editing: `clear_tool_uses` + `clear_thinking` — the rare negative-cost lever (vendor-claimed)

Layer: turn-structure / context lifecycle.
Mechanism: Beta context-management-2025-06-27. clear_tool_uses_20250919: clears oldest tool results past trigger (defaults: 100,000-token trigger, keep 3 most-recent tool uses; optional clear_at_least / exclude_tools / clear_tool_inputs). clear_thinking_20251015: keep last N thinking turns; must be listed FIRST when combined. Applied server-side pre-inference; client keeps full history. count_tokens previews the effect (context_management.original_input_tokens vs post-edit input_tokens).
Expected savings: vendor (claude.com/blog/context-management, verbatim): "Context editing alone delivered a 29% improvement"; "combining the memory tool with context editing improved performance by 39%"; 100-turn web-search eval "reducing token consumption by 84%". On the local profile an 84%-class cut applies to the 61%-of-dollars cache block → ESTIMATE up to ~$11/day ceiling if it transferred fully to coding (unproven).
Evidence tier: T1 shipped + vendor-published numbers; NO independent replication found — treat magnitude as vendor-internal (web-search eval, not coding).
Quality risk: NEGATIVE-COST per vendor (tokens down AND +29% performance — stale tool results are anti-signal); treat as RISKY on coding until replicated, since cleared results are unrecoverable without memory. Falsification: replicate on Read/Grep-heavy coding tasks; check for re-reading of cleared files.
Availability: SDK (also Bedrock/Vertex per launch post). Not a Claude Code knob.
Effort to adopt: hours — one context_management block; tune clear_at_least so each cache bust buys a big clear.
Composability: DIRECT CACHE TENSION (live page): clearing "invalidates cached prompt prefixes"; kept thinking preserves cache. Integrates with memory tool. Anti-synergy with compaction if both fire on the same window — pick one primary strategy (docs position compaction as primary).
Validation protocol: fixed 30-task coding suite, edits on/off: compare success rate, $/task (cache-write churn included), and count_tokens previews vs realized billing.

4. Effort parameter + adaptive thinking — the only thinking-spend control on Fable 5

Layer: output.
Mechanism: output_config: {effort: low|medium|high|xhigh|max}, GA, no beta header; high default ("exactly the same behavior as omitting the parameter"). Affects ALL response tokens: thinking, tool-call count, preamble ("lower effort would mean Claude makes fewer tool calls"). max on Fable 5/Mythos/Opus 4.8/4.7/4.6/Sonnet 4.6; xhigh only Fable 5/Mythos 5/Opus 4.8/4.7. On Fable 5 thinking is always-on ({type:"disabled"} rejected); budget_tokens 400s on Opus 4.8/4.7, deprecated on 4.6/Sonnet 4.6. max_tokens is the hard cap (docs: start 64k at xhigh/max). Claude Code: default high (effort page: Opus 4.8 "default is high on all surfaces, including the Claude API and Claude Code"); "ultracode" is NOT an API level — it is xhigh + standing multi-agent permission injected via mid-conversation system messages. Observability: usage.output_tokens_details.thinking_tokens splits billed reasoning from response (strictly better than the Phase-0 subtraction method). display: omitted/summarized never changes billing.
Expected savings: NO published per-level percentages anywhere (confirmed by full-page read) — only qualitative ("low: Significant token savings with some capability reduction"; Sonnet 4.6 recommended default medium). Local anchor: thinking = 54.8% of output tokens at max effort (n=1) and output = 37% of session dollars → ESTIMATE a 50% thinking cut ≈ 10% of session cost (0.5 × 20%), plus second-order input savings from fewer tool calls.
Evidence tier: T1 mechanics; T4 magnitude per level (vendor publishes none — measure locally).
Quality risk: QUALITY-TRADE below high (docs: "accepting some reduction in capability"; at low/medium Opus 4.7+ "scopes its work to what was asked"). Conversely dropping max→high can be NEGATIVE-COST: docs warn max "can lead to overthinking" on structured-output / less intelligence-sensitive tasks.
Availability: CLAUDE-CODE-TODAY (effort menu) + SDK.
Effort to adopt: minutes (menu item / one field); the real work is the eval.
Composability: consecutive adaptive-mode requests preserve cache; SWITCHING thinking mode busts message cache (system/tools survive). Per-message steering ("Answer directly without deliberating.") works with no parameter change at all → zero cache impact.
Validation protocol: fixed Claude Code task set at low/medium/high/xhigh; read output_tokens_details.thinking_tokens per run; report tokens + pass-rate per level. This closes the dossier's biggest missing number and re-bases the 54.8% n=1. See 31-validation-harness.md.

5. Prior-turn thinking retention — the hidden input-side cost of reasoning models

Layer: input / context lifecycle.
Mechanism: Adaptive-thinking page (verbatim): "whether the API keeps or strips them depends on the model: Opus 4.5+ and Sonnet 4.6+ keep them in context by default; earlier Opus/Sonnet models and all Haiku models strip them"; pricing section bills "Thinking blocks from prior assistant turns kept in context: ... all turns by default on Opus 4.5+ and Sonnet 4.6+ (input tokens)". Reclaim via clear_thinking_20251015 (keep: N). LIVE CONTRADICTION (both pages): context-windows page still says "previous thinking blocks are automatically stripped from the context window calculation by the Claude API"; token-counting page says prior-turn thinking "do not count toward your input tokens" (plausibly true for the counting endpoint specifically, while create keeps and bills — but the context-windows text is flatly stale).
Expected savings: ESTIMATE: at the local 54.8% thinking share, kept thinking re-enters every later prompt; at 0.1x cache-read it is cheap per token but compounds the 61%-of-dollars cache block over 50-turn sessions. clear_thinking keep:1-2 reclaims most of it at the cost of one cache bust.
Evidence tier: T1 for retention behavior (two live pages agree, two disagree); T4 dollar magnitude. Local probe blocked: count_tokens validates signatures (400 on fabricated blocks, reproduced) — needs a real signed transcript.
Quality risk: retention presumably aids reasoning continuity; aggressive clearing is RISKY for long multi-step chains, NEUTRAL for fresh-task-per-turn usage. Exception that can never be cleared: the thinking block of an in-flight tool_use cycle must be returned.
Availability: behavior affects CLAUDE-CODE-TODAY sessions implicitly; the clear_thinking control is SDK-only.
Effort to adopt: hours (one edits entry) once on the SDK.
Composability: kept thinking = stable cache-friendly prefix; clearing busts cache at the clear point. Interacts with effort (less thinking generated → less retained).
Validation protocol: capture a real multi-turn Fable 5 transcript (signed blocks); count_tokens it with and without prior thinking passed back AND diff usage.input_tokens on live create calls — resolves the docs contradiction empirically and prices the retained share.

6. Tool search tool + defer_loading — stop paying for unused tool schemas

Layer: input (prefix).
Mechanism: Two server-side variants, no beta header: tool_search_tool_regex_20251119 (Python re.search, ≤200 chars) and tool_search_tool_bm25_20251119 (natural language). defer_loading: true tools are excluded from the system-prompt prefix; on discovery the API appends a tool_reference block and expands it — verbatim: "The prefix is untouched, so prompt caching is preserved." 3-5 references per search; ≤10,000 tools; expanded references persist across turns. MCP defers via mcp_toolset. Incompatible with tool-use examples; strict-mode grammar builds from the full toolset (no recompilation).
Expected savings: docs: typical 5-server MCP setup "~55k tokens in definitions"; tool search "typically reduces this by over 85%"; accuracy "degrades significantly once you exceed 30-50 available tools". Engineering post: 77K→8.7K upfront ("85% reduction... preserving 95% of context window"); MCP-eval accuracy Opus 4 49%→74%, Opus 4.5 79.5%→88.1%; internal worst case 134K tokens of definitions. Local ground truth: 11 MCP schemas = 1,420 tok loaded vs ~60 tok deferred ≈ 96% cut (Phase-0) — direction and magnitude consistent with vendor.
Evidence tier: T1 (vendor-published + locally reproduced).
Quality risk: NEGATIVE-COST for big catalogs (tokens down, accuracy up); NEUTRAL-to-negative under 10 tools / <100 schema tokens (docs' own threshold) where search round-trips just add latency.
Availability: CLAUDE-CODE-TODAY (this session's harness runs 37 deferred tools behind a ToolSearch tool) + SDK.
Effort to adopt: minutes-hours: flag per tool + include the search tool; keep the top 3-5 tools non-deferred.
Composability: the standout property is cache preservation — most context surgery busts cache, this does not. Works in Batch at normal pricing. Composes with strict mode without grammar recompilation.
Validation protocol: measure per-search overhead (server_tool_use block + expanded definitions) vs schemas saved on your catalog: count_tokens the request with all tools loaded vs deferred + N searches; assert net saving and unchanged task pass-rate.

7. Programmatic tool calling (PTC) — tool results that never enter context

Layer: input (tool-result volume).
Mechanism: Requires code_execution_20260120; tools opt in via allowed_callers: ["code_execution_20260120"] (guidance, "not a security boundary"). Claude writes a script in the sandbox; each in-script tool call pauses execution, client supplies tool_result, script resumes. Verbatim: "Tool results from programmatic invocations do not count toward your input/output token usage. Only the final code execution result and Claude's response count." Platforms: Claude API, Claude Platform on AWS, MS Foundry — NOT Bedrock/Vertex. Exclusions: MCP-connector tools, strict: true tools.
Expected savings: vendor (all verbatim): BrowseComp+DeepSearchQA "improved performance by an average of 11% while using 24% fewer input tokens"; 75-tool benchmark "reduced billed input tokens by roughly 38% with no change in task accuracy"; production traffic 10-49 tools "typical token savings of 20% to 40%". NEGATIVE RESULT, same page: τ²-bench "left scores unchanged and cost roughly 8% more. Sequential single-call workflows do not benefit." Engineering post: 43,588→27,297 tok (37%) on complex research.
Evidence tier: T1 (multiple vendor benchmarks incl. an honest negative).
Quality risk: NEGATIVE-COST on fan-out/filter shapes; RISKY (+8% cost, no gain) on sequential reason-per-call shapes — which is exactly the Read/Edit/Bash profile of most coding-agent turns. Do not assume the 20-40% band transfers to Claude Code-like traffic.
Availability: SDK (server-side containers). Not how Claude Code runs tools; Claude Code subagents achieve a similar result-isolation effect client-side.
Effort to adopt: days — enable code execution, tag tools, implement the pause/resume protocol.
Composability: batch-compatible at batch pricing. Container costs: 1,550 free hours/org/month, then $0.05/hr (5-min minimum), FREE when web_search_20260209/web_fetch_20260209 is in the request. Conflicts with strict tools and MCP connector.
Validation protocol: replay a week of real tool traffic shapes through PTC vs direct; segment by fan-out width; adopt only for shapes where billed input drops with flat accuracy (expect the sequential segment to reproduce the τ²-bench negative).

8. Batch API — 50% off everything, stacks with caching

Layer: infra / pricing modifier.
Mechanism: Async Messages Batches; live table is exactly 50% on input AND output for every active model (Fable 5 $5/$25, Opus 4.8 $2.50/$12.50, Sonnet 4.6 $1.50/$7.50, Haiku 4.5 $0.50/$2.50). FAQ verbatim: "Batch API and prompt caching discounts can be combined"; multipliers stack → batched cache read = 0.5 × 0.1 = 0.05x base input (ESTIMATE arithmetic from stated stacking). 1M requests batch at the same rates; tool search batches at normal pricing. Excluded: fast mode, Managed Agents sessions.
Expected savings: exactly 50% of whatever moves to batch — the only lever touching all five local dollar buckets at once, including the 37%-of-dollars output side.
Evidence tier: T1 (pricing page). SDK docs: most batches < 1h, max 24h (cached skill reference) — re-verify turnaround on your own workload.
Quality risk: NEUTRAL on quality; cost is latency (results within 24h).
Availability: SDK only — useless for interactive Claude Code; ideal for nightly review sweeps, eval runs, offline subagent fan-outs.
Effort to adopt: days (restructure offline workloads into batch submissions).
Composability: stacks with caching and 1M context; mutually exclusive with fast mode and interactivity.
Validation protocol: move one nightly workload (e.g. repo-wide code-review sweep) to batch; confirm 0.5x on the invoice line and measure realized turnaround distribution.

9. 1M context at standard pricing — a cost-model correction, not a saving

Layer: infra / pricing model.
Mechanism: Pricing page verbatim: Fable 5, Mythos 5/Preview, Opus 4.8/4.7/4.6, Sonnet 4.6 "include the full 1M token context window at standard pricing. (A 900k-token request is billed at the same per-token rate as a 9k-token request.)" Caching and batch apply at standard rates across the window. Context-windows page: 1M is the DEFAULT on Fable 5; 128k max output; Opus 4.8 is 200k on MS Foundry only; Sonnet 4.5 and older stay 200k. Overflow on 4.5+: request accepted, generation stops with stop_reason: "model_context_window_exceeded".
Expected savings: none directly — kills the 2025-era >200k premium (2x in / 1.5x out) from every old cost model. Flat pricing means bloat is LINEARLY expensive with no cliff: a full 1M Fable 5 prompt = $10 uncached / $1 cached per request (ESTIMATE from list prices).
Evidence tier: T1.
Quality risk: RISKY if treated as free headroom — the same page warns "more context isn't automatically better. As token count grows, accuracy and recall degrade, a phenomenon known as context rot." Compaction/editing stay worthwhile below the hard limit.
Availability: CLAUDE-CODE-TODAY (the [1m] variant / Fable 5 default) + SDK. Bedrock/Vertex 1M listed for Opus 4.8/4.7/4.6/Sonnet 4.6; Fable 5 wording covers the Claude API.
Effort to adopt: none (update spreadsheets, not code).
Composability: see records 2-3 for the quality counterweight; see 12-context-architecture.md.
Validation protocol: n/a for pricing; for quality, recall probes at 100k/300k/600k context on your tasks before relying on deep-window placement.

10. Tokenizer-aware model offload — cheap models also count fewer tokens

Layer: infra / model routing.
Mechanism: Pricing page: "Opus 4.7 and later use a new tokenizer... may use up to 35% more tokens for the same fixed text"; token-counting page: "roughly 30% more tokens than models before Claude Opus 4.7", with the dual-count migration method. Sonnet 4.6/Haiku 4.5 keep the pre-4.7 tokenizer — locally confirmed identical counts (table above).
Expected savings: offloading to Sonnet/Haiku saves twice: lower $/tok AND fewer tokens billed. Measured: +37.9% prose / +27.7% code / +8.9% JSON on Fable 5 for identical text → ESTIMATE effective input-price ratio Fable 5:Sonnet 4.6 ≈ ($10 × 1.28-1.38)/$3 ≈ 4.3-4.6x on prose/code (sticker says 3.3x); Fable 5:Haiku ≈ 13x. Same multiplier applies to context-window occupancy.
Evidence tier: T1 (vendor-stated band + locally reproduced).
Quality risk: QUALITY-TRADE per task — standard downgrade risk; route only eval-passed task classes.
Availability: CLAUDE-CODE-TODAY (Task subagent model field) + SDK routing.
Effort to adopt: minutes (subagent model selection) to hours (router rule).
Composability: compounds with batch (0.5x) and caching (0.1x) on the cheap model. Caveat: compaction CANNOT delegate summarization to a cheaper model (record 2) — the lever is unavailable exactly where it would shine. Model switch busts cache (caches are model-scoped) — offload at task boundaries, not mid-session.
Validation protocol: dual-count your real workload per content type (the docs' own method); then per-task-class A/B pass-rates Sonnet vs Fable before routing.

11. Tool-definition overhead accounting — know the fixed taxes

Layer: input (fixed overhead) / observability.
Mechanism: Pricing page itemizes: tool-use system prompt tax (auto/none vs any/tool): Opus 4.8 = 290/410, Opus 4.7 = 675/804, Opus 4.6 & Sonnet 4.6 = 497/589, Opus 4.5/Sonnet 4.5/Haiku 4.5 = 496/588 — Fable 5 ABSENT (local 403 reproduction pins it ≈ Opus 4.8's 290). Per-tool adders: bash +245; text_editor_20250429 +700; computer use +735/tool & +466-499 system. Web search $10/1k searches with results billed as input in all later turns; web fetch $0 beyond tokens (use max_content_tokens; docs: ~2,500 tok avg page, ~25k big doc page, ~125k per 500kB PDF).
Expected savings: micro per request (all prefix → cached after first write); the real lever is knowing web search has a per-use fee + persistent token tail while web fetch is token-only, and that schema bloat multiplies by every cache re-write. ESTIMATE: a Claude Code-like setup (290 system + 245 bash + 700 editor + ~15 schemas) ≈ 3-6k fixed prefix tokens.
Evidence tier: T1 (published table) + local exact reproduction (403).
Quality risk: NEUTRAL (accounting, not behavior).
Availability: CLAUDE-CODE-TODAY / SDK.
Effort to adopt: none — measurement only.
Composability: caching absorbs all of it after first write; defer_loading removes the schema share but not the system tax.
Validation protocol: count_tokens with/without each tool on your exact toolset (script in this file's method section reproduces it in ~10 API calls, $0).

12. count_tokens — free measurement infrastructure

Layer: infra / observability.
Mechanism: POST /v1/messages/count_tokens. "Token counting is free to use"; RPM by tier: 1→100, 2→2,000, 3→4,000, 4→8,000; "Token counting and message creation have separate and independent rate limits." Counts are estimates; "You are not billed for system-added tokens." Supports system, tools, images, PDFs, thinking and context_management previews (record 3) — you can price a context-editing config before running it. Caveat found locally: it validates thinking-block signatures (400 on fabricated ones) and per its FAQ never exercises caching logic.
Expected savings: zero direct; enables every other measurement. This file verified three doc claims with it at $0.
Evidence tier: T1 + local use.
Quality risk: NEUTRAL.
Availability: SDK (and trivially scriptable from Claude Code, as done here — /tmp/ct.py).
Effort to adopt: minutes.
Composability: pairs with context_management previews and dual-model tokenizer audits.
Validation protocol: n/a — it IS the validation tool; sanity-check against usage on one live call.

13. Structured outputs / strict tool use — retry elimination, not a discount

Layer: output (reliability).
Mechanism: output_config.format (json_schema) + strict: true tools (old beta header deprecated). Grammar compiles on first request, cached 24h from last use; only schema STRUCTURE changes invalidate it. Limits: 20 strict tools/request; adds a small schema-explanation system injection. Incompatible with PTC (strict tools can't be code-called) and citations.
Expected savings: no vendor numbers. ESTIMATE: retry-on-malformed-JSON pipelines spend ≈ 1/(1-p) of base; eliminating p=10% saves ~9% of pipeline spend in 5x-priced output tokens (T4, workload-dependent).
Evidence tier: T1 mechanics; T4 magnitude.
Quality risk: NEUTRAL-to-positive (guaranteed parseable output).
Availability: SDK (Claude Code's own StructuredOutput tool is harness-level, not this feature).
Effort to adopt: hours.
Composability: composes with defer_loading without grammar recompilation; conflicts with PTC.
Validation protocol: measure malformed-output rate and output token count, prompted-JSON vs constrained, on 200 extraction calls.

14. Cross-provider triangulation — OpenAI and Gemini caching baselines

Layer: competitive baseline (GATEWAY-OR-SELF-HOST from a Claude Code perspective).
Mechanism: OpenAI (developers.openai.com): caching automatic at ≥1,024 tokens, writes free, "can reduce latency by up to 80% and input token costs by up to 90%", in-memory retention 5-10 min up to 1h, extended retention up to 24h at no extra charge, optional prompt_cache_key routing — best-effort, no guaranteed hit rate. Gemini (ai.google.dev): implicit caching default on 2.5+ (min 2,048-4,096 tok), explicit caches billed at ~0.1x reads (3.5 Flash $0.15 vs $1.50; 2.5 Pro $0.125 vs $1.25) PLUS storage $1.00 (Flash) / $4.50 (Pro) per MTok per HOUR; batch 50% — matching Anthropic exactly.
Expected savings: No direct Claude Code saving; this is a comparative baseline. Anthropic's model (paid writes, guaranteed 0.1x reads, free TTL refresh, $0 storage) wins for bursty interactive sessions; Gemini's storage-rent model only wins at sustained traffic (ESTIMATE: a 500k-token explicit cache on a Pro-class model costs $2.25/hr in rent before any read); OpenAI is cheapest when its best-effort hit rate cooperates. Convergence: 0.1x-class reads and 50% batch discounts look like settled industry constants as of mid-2026.
Evidence tier: T1.
Quality risk: NEUTRAL.
Availability: different providers.
Effort to adopt: n/a.
Composability: n/a — context for negotiation/architecture choices.
Validation protocol: n/a (baseline only; not a recommendation to switch).

Claims to kill (folklore checked against live sources)

KILL "1M context costs a premium above 200k" — every current 1M model bills flat ("900k... same per-token rate as a 9k-token request"). The premium was real only for the 2025 context-1m-2025-08-07 beta era.
KILL "token-efficient-tools-2025-02-19 header cuts tool output up to 70%" — migration guide: the header "has no effect" on Claude 4+; built in.
KILL "hide/omit thinking to save money" — "Omitting reduces latency, not cost"; "Under every setting, thinking happens and is billed the same." Corollary: visible thinking tokens ≠ billed thinking tokens; use output_tokens_details.thinking_tokens.
KILL (nuanced) "prior-turn thinking is always stripped, so reasoning never costs input" — false on Opus 4.5+/Sonnet 4.6+/Fable 5 (kept, billed as input). Two Anthropic pages (context-windows, token-counting) still propagate the dead claim.
KILL "cap Fable 5 thinking with budget_tokens / disable thinking" — disabled rejected on Fable 5; budget_tokens 400s on Opus 4.8/4.7. Only effort, prompt steering, max_tokens remain.
KILL "batch doesn't combine with caching" — FAQ: "Batch API and prompt caching discounts can be combined" (≈0.05x batched cache read).
KILL "PTC always saves tokens" — vendor's own τ²-bench result: "cost roughly 8% more" on sequential workflows.
KILL "Anthropic caching needs manual breakpoints" — automatic top-level mode exists (Claude API / Claude Platform on AWS / MS Foundry beta; still NOT Bedrock/Vertex).
KILL "Anthropic caching needs 1,024+ token prefixes" — wrong in both directions: Fable 5 = 512, but Haiku 4.5 and Opus 4.6/4.5 = 4,096.
KILL "OpenAI caching = 50% off" — current guide says up to 90%; 50% is the GPT-4o-era figure.
FLAG (contradicts local measurement): pricing-FAQ heuristic "1 token ≈ 4 characters" of English — measured 2.81 chars/tok on Fable 5 markdown prose (3.88 on Sonnet 4.6). Roughly right for the old tokenizer, undercounts Fable 5 token volume by ~30-40% on prose.

Gaps — ranked next measurements

Per-effort-level token deltas (the only lever on 20% of dollars; vendor publishes nothing). Cheap: fixed task set × 4 levels, read output_tokens_details.thinking_tokens.
Prior-turn thinking retention billing — docs contradict; local probe blocked by signature validation. Needs one real signed transcript diffed via count_tokens + live usage.input_tokens.
Compaction quality/savings — zero published numbers, same-model-only summarization, and Claude Code's separate client-side auto-compact has a live mis-trigger regression (issue #52981). Nobody has benchmarked server-compact-at-150k vs Claude Code auto-compact.
Official Fable 5 row in the tool-overhead table — vendor omission; local 403 exact-match is the best current evidence.
Claude Code exposure gap — effort and deferred MCP tools are surfaced; context_editing, clear_thinking, compact_20260112, memory have no Claude Code knobs, so the strongest published-number levers (84%) are SDK-only today.
Independent replication — ALL headline percentages (84/39/29, >85, 38/24, 20-40) are vendor self-reported on internal evals; the τ²-bench negative result raises confidence in honesty, but no third-party replication surfaced in this sweep (T1-vendor, not T3).
Fast mode as a spend-more lever (Opus 4.8 at $10/$50 — Fable 5 sticker for an Opus model; Opus 4.6/4.7 at $30/$150 = 6x; research preview, no batch, stacks with caching) deserves a quality-per-dollar comparison vs Fable 5 at high effort.

Verification ledger

Number / claim	Source or local method
Fable 5 $10/$50; Opus 4.8 $5/$25; Sonnet 4.6 $3/$15; Haiku 4.5 $1/$5; cache write $12.50/5m, $20/1h, read $1 (Fable 5)	platform.claude.com/docs/en/about-claude/pricing
Cache multipliers 1.25x/2x/0.1x; pays off after 1 read (5m) / 2 reads (1h); "multipliers stack"; batch+cache "can be combined"	same pricing page
Batch = exactly 50% per model (Fable 5 $5/$25 ... Haiku $0.50/$2.50); fast-mode and Managed-Agents exclusions	same pricing page
1M standard pricing + "900k-token request ... same per-token rate as a 9k-token request"	same pricing page (Long context section)
Fast mode: Opus 4.8 $10/$50; Opus 4.6/4.7 $30/$150; research preview; stacks with caching/data-residency; no batch	same pricing page
Tool-use system tax table (290/410, 675/804, 497/589, 496/588; Fable 5 absent); bash +245; text editor +700; computer use 735 & 466-499; web search $10/1k; web fetch $0, ~2,500/~25k/~125k tok examples	same pricing page
Code execution 1,550 free hrs/org/mo, $0.05/hr, 5-min minimum, free with web search/fetch; data-residency 1.1x; "1 token ≈ 4 characters" FAQ	same pricing page
Min cacheable prefix: Fable 5/Mythos 5 = 512 (Bedrock 1,024); Opus 4.8 = 1,024; 4.7 = 2,048; 4.6/4.5 = 4,096; Sonnet 4.6/4.5 = 1,024; Haiku 4.5 = 4,096; 4 breakpoints; 20-block lookback; automatic mode platforms; free TTL refresh; invalidation hierarchy	platform.claude.com/docs/en/build-with-claude/prompt-caching
Compaction: beta compact-2026-01-12, type compact_20260112, 150k default / 50k min, iterations example 180k+3.5k & 23k+1k (203k in / 4.5k out vs top-level 23k/1k), same-model-only, content:null tool failure, re-apply free	platform.claude.com/docs/en/build-with-claude/compaction
Context editing: types/dates, beta context-management-2025-06-27, 100k trigger / keep 3, cache invalidation warning, count_tokens preview (original_input_tokens), clear_thinking-first ordering, retention defaults	platform.claude.com/docs/en/build-with-claude/context-editing
29% / 39% / 84%	claude.com/blog/context-management (verbatim)
Thinking kept+billed on Opus 4.5+/Sonnet 4.6+ ("keep them in context by default", "(input tokens)"); "Omitting reduces latency, not cost"; "billed the same"; output_tokens_details.thinking_tokens; adaptive-mode cache preservation; Fable 5 always-on thinking; budget_tokens 400s	platform.claude.com/docs/en/build-with-claude/adaptive-thinking
Stale "automatically stripped" text; 1M default on Fable 5; 128k max output; Foundry 200k for Opus 4.8; context-rot warning; overflow stop_reason; compaction-as-primary positioning	platform.claude.com/docs/en/build-with-claude/context-windows
count_tokens free; RPM 100/2,000/4,000/8,000; separate limits; estimates + system-added tokens unbilled; prior-thinking "do not count" note; "roughly 30% more tokens" + dual-count method; 403 reference output	platform.claude.com/docs/en/build-with-claude/token-counting
Tool search: ~55k example, ">over 85%" reduction, 30-50 tool degradation, "prefix is untouched, so prompt caching is preserved", 10,000 max, 3-5 refs, <10-tools guidance, no beta header	platform.claude.com/docs/en/agents-and-tools/tool-use/tool-search-tool
PTC: +11%/-24% (BrowseComp+DeepSearchQA), -38% (75-tool), 20-40% production band, τ²-bench "roughly 8% more", billing exclusion quote, platform matrix, strict/MCP exclusions	platform.claude.com/docs/en/agents-and-tools/tool-use/programmatic-tool-calling
77K→8.7K (85%, 95% of context preserved), 49%→74%, 79.5%→88.1%, 134K internal worst case, 43,588→27,297 (37%)	anthropic.com/engineering/advanced-tool-use
Effort: levels, high default = omitting, low "Significant token savings with some capability reduction", affects all tokens/tool calls, Sonnet 4.6 medium recommendation, Opus 4.8 "high on all surfaces", ultracode = xhigh + multi-agent permission, max overthinking warning, 64k max_tokens guidance, NO per-level percentages	platform.claude.com/docs/en/build-with-claude/effort
Structured outputs: output_config.format, 24h grammar cache, structure-only invalidation, 20 strict tools, deprecated beta header	platform.claude.com/docs/en/build-with-claude/structured-outputs (sweep verification)
token-efficient-tools / output-128k headers "have no effect" on Claude 4+	platform.claude.com migration guide (sweep verification)
Claude Code auto-compact misfire at 84k/1M, every 1-2 min, regression v2.1.112	github.com/anthropics/claude-code/issues/52981
OpenAI: automatic ≥1,024 tok, free writes, "up to 90%" input cost / "up to 80%" latency, 5-10min→1h + 24h extended retention free, prompt_cache_key	developers.openai.com/api/docs/guides/prompt-caching (sweep verification)
Gemini: implicit default on 2.5+ (2,048/4,096 min), 0.1x explicit reads ($0.125-$0.20 vs $1.25-$2.00), storage $1.00/$4.50 per MTok/hr, batch 50%	ai.google.dev/gemini-api/docs/caching + /pricing (sweep verification)
Tokenizer deltas +37.9% prose / +27.7% Rust / +8.9% JSON; Fable 5 ≡ Opus 4.8; Sonnet 4.6 ≡ Haiku 4.5; 2.81 vs 3.88 chars/tok	local measurement, /tmp/ct.py vs count_tokens API (samples: repo README/stories.rs/8x get_weather JSON)
Tool overhead: 403/19 (Fable 5 & Opus 4.8), 793/24 (4.7), 587/16 (Sonnet 4.6), 586/16 (Haiku 4.5)	local measurement, /tmp/ct.py with --tools (exact match to docs' 403 example)
count_tokens rejects fabricated thinking signatures (400)	local measurement, /tmp/ct.py
54.8% thinking share; 0.44/6.73/92.83% prompt mix; 32/29/20/17/2% dollar split; ~$22/day; 1,420 vs ~60 tok MCP schemas; caveman/wenyan cuts	local Phase-0 measurements (see 01-economics-and-measurement.md, 02-baseline-audit.md)
~$85 vs ~$22 no-cache counterfactual; 0.05x batched cache read; 4.3-4.6x effective Fable:Sonnet ratio; 150k-vs-1M ≈ 85% cap; 10% session cut per 50% thinking reduction	ESTIMATE — arithmetic shown inline on the modeled session profile

18 — Provider-level features, current and beta

On this page