18 — Provider-level features, current and beta
TL;DR
- Prompt caching is the lever already running: 0.1x reads / 1.25x (5m) or 2x (1h) writes, free TTL refresh on every hit, and Fable 5 now has the LOWEST minimum cacheable prefix ever shipped (512 tokens vs 4,096 on Opus 4.6/Haiku 4.5). Without the 0.1x read rate, the modeled heavy day (~$22) would cost ~$85 (ESTIMATE, arithmetic below). All T1, verified live.
- The two biggest published-number levers for agent workloads: context editing (vendor: 84% token cut AND +29% task performance; with memory tool +39%) and tool search/defer_loading (vendor: >85% tool-context cut, 77K→8.7K tokens, accuracy 49%→74% on Opus 4; locally reproduced at ~96% on 11 MCP schemas). Both are SDK-side; neither is a Claude Code knob.
- The 1M-context premium is dead: Fable 5/Opus 4.8-4.6/Sonnet 4.6 bill flat per-token across the window ("A 900k-token request is billed at the same per-token rate as a 9k-token request"), and 1M is the DEFAULT on Fable 5 — every 2025 cost model with a 200k pricing cliff is wrong.
- Thinking spend (54.8% of output tokens, ~20% of dollars in the local max-effort session) has exactly one API control on Fable 5: `output_config.effort`. Anthropic publishes NO per-level percentages, only qualitative guidance, so the magnitude is the dossier's biggest open measurement (T4).
- Two of Anthropic's own live pages still say prior-turn thinking is "automatically stripped"; the adaptive-thinking and context-editing pages say Opus 4.5+/Sonnet 4.6+/Fable 5 keep ALL prior thinking turns in context and bill them as input. The contradiction is live and worth real money on long sessions.
Method: every doc claim re-fetched live from platform.claude.com (docs.claude.com redirects there) / claude.com / anthropic.com / github.com; cheap-to-check token claims reproduced locally via the free count_tokens harness (/tmp/ct.py, OAuth, exact-model counts). Modeled session profile for arithmetic (local Phase-0 measurement): heavy day ≈ 6 sessions × (19 calls, 5.5k uncached in, 85k cache-write, 1.17M cache-read, 27k out @ 55% thinking) ≈ $22/day at Fable 5 list prices — internally consistent with the live price table to within 1%.
Where the local dollars go, and which provider lever attacks each bucket
| Bucket (local split of ~$22/day) | $/day | Provider levers |
|---|---|---|
| Cache reads 32% | ~$7.0 | compaction (caps re-read context), context editing, tool search (smaller prefix), 1M discipline |
| Cache writes 29% | ~$6.4 | compaction (shorter prefixes re-written), avoiding cache busts (defer_loading, stable thinking mode) |
| Thinking output 20% | ~$4.4 | effort (only lever on Fable 5), per-message steering |
| Visible output 17% | ~$3.7 | effort (affects ALL response tokens), see 15-output-discipline.md |
| Uncached input 2% | ~$0.4 | nothing needed — caching already absorbs it |
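The bucket split above can be re-derived from the modeled session profile in the Method paragraph. The following is a minimal sketch, assuming Fable 5 list prices of $10 in / $50 out per MTok (twice the batch rates quoted later in this file), all cache writes at the 5m rate, and a flat 55% thinking share of output; the variable and function names are illustrative only.

```python
# Re-derive the heavy-day dollar split from the modeled session profile.
# Assumptions: Fable 5 list prices of $10 in / $50 out per MTok, all cache
# writes at the 5m rate (1.25x), cache reads at 0.1x, 55% of output is thinking.
IN_PER_MTOK, OUT_PER_MTOK = 10.0, 50.0
SESSIONS = 6
PROFILE = {"uncached_in": 5_500, "cache_write": 85_000,
           "cache_read": 1_170_000, "out": 27_000}
THINKING_SHARE = 0.55


def day_cost(read_multiplier: float = 0.1) -> dict:
    """Return dollars per day for each bucket of the modeled heavy day."""
    mtok = {k: v * SESSIONS / 1e6 for k, v in PROFILE.items()}
    out_dollars = mtok["out"] * OUT_PER_MTOK
    return {
        "cache_reads": mtok["cache_read"] * IN_PER_MTOK * read_multiplier,
        "cache_writes": mtok["cache_write"] * IN_PER_MTOK * 1.25,
        "thinking_out": out_dollars * THINKING_SHARE,
        "visible_out": out_dollars * (1 - THINKING_SHARE),
        "uncached_in": mtok["uncached_in"] * IN_PER_MTOK,
    }


baseline = day_cost()
total = sum(baseline.values())
for bucket, dollars in baseline.items():
    print(f"{bucket:13s} ${dollars:5.2f}  {dollars / total:5.1%}")
print(f"total         ${total:5.2f}")
# Counterfactual: the same cache-read tokens billed at the full 1.0x input rate.
print(f"no-read-discount total ${sum(day_cost(1.0).values()):.2f}")
```

Under these assumptions the total comes to about $21.8 and the counterfactual to about $85.0, in line with the ~$22 and ~$85 figures. The uncached line comes out nearer $0.33 than the table's rounded ~$0.4.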
Local measurements (count_tokens API, method: /tmp/ct.py)
Tokenizer comparison — identical text counted per model (samples: 2.0kB markdown prose from this repo, 2.0kB Rust, 2.9kB JSON tool schemas):
| Sample | chars | Fable 5 | Opus 4.8 | Sonnet 4.6 | Haiku 4.5 | Fable vs Sonnet | chars/tok (Fable / Sonnet) |
|---|---|---|---|---|---|---|---|
| prose | 1,985 | 706 | 706 | 512 | 512 | +37.9% | 2.81 / 3.88 |
| code (Rust) | 2,000 | 806 | 806 | 631 | 631 | +27.7% | 2.48 / 3.17 |
| JSON schema | 2,898 | 969 | 969 | 890 | 890 | +8.9% | 2.99 / 3.26 |
Fable 5 ≡ Opus 4.8 and Sonnet 4.6 ≡ Haiku 4.5: the token counts are identical on every sample, which is what two tokenizer families would produce. The +27.7% on this Rust sample is above the Phase-0 ~15% code figure (different sample) but inside the vendor's "up to 35%" band; structured JSON pays the smallest penalty.
Tool-use overhead — docs' reference request (get_weather tool, "What's the weather like in San Francisco?"):
| Model | with tool | no tool | delta | documented system tax (auto/none) |
|---|---|---|---|---|
| Fable 5 | 403 | 19 | 384 | absent from pricing table |
| Opus 4.8 | 403 | 19 | 384 | 290 |
| Opus 4.7 | 793 | 24 | 769 | 675 |
| Sonnet 4.6 | 587 | 16 | 571 | 497 |
| Haiku 4.5 | 586 | 16 | 570 | 496 |
The 403 is an EXACT reproduction of the token-counting page's published example output, and Fable 5 matches Opus 4.8 identically — pinning Fable 5's undocumented tool-use system tax at the lean 290-token Opus 4.8 prompt, not 4.7's 675 (a 57% quiet drop between 4.7 and 4.8).
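The inference above rests on a subtraction that is easy to check: the measured with-tool delta minus the documented system tax should leave the cost of the get_weather schema itself. The sketch below uses only the numbers in the table; the function name and the assumption that Fable 5 tokenizes the schema exactly like Opus 4.8 are illustrative.

```python
from typing import Optional

# Split each measured with-tool delta into the documented system tax and the
# remainder attributable to the get_weather schema itself.
# Numbers are copied from the table above; Fable 5 has no documented tax.
MEASURED = {  # model: (with_tool, no_tool, documented_tax or None)
    "Fable 5": (403, 19, None),
    "Opus 4.8": (403, 19, 290),
    "Opus 4.7": (793, 24, 675),
    "Sonnet 4.6": (587, 16, 497),
    "Haiku 4.5": (586, 16, 496),
}


def schema_tokens(model: str) -> Optional[int]:
    """Delta minus documented tax, or None when no tax is published."""
    with_tool, no_tool, tax = MEASURED[model]
    return None if tax is None else (with_tool - no_tool) - tax


for name in MEASURED:
    print(name, schema_tokens(name))

# Assumption: Fable 5 tokenizes the schema like Opus 4.8 (the tokenizer table
# shows identical counts), so its implied tax is delta minus that schema cost.
fable_delta = MEASURED["Fable 5"][0] - MEASURED["Fable 5"][1]
implied_fable_tax = fable_delta - schema_tokens("Opus 4.8")
print("implied Fable 5 tax:", implied_fable_tax)
```

The remainder is 94 tokens on both Opus models and 74 on Sonnet 4.6 and Haiku 4.5, so the schema cost is stable within each tokenizer family and the implied Fable 5 tax is 290.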
One negative result: count_tokens REJECTS fabricated thinking-block signatures (400: Invalid signature in thinking block, reproduced on Fable 5 and Sonnet 4.6) — so the prior-turn-thinking billing question cannot be settled with the free harness alone; it needs a real signed transcript (see record 5 and Gaps).
Technique records
1. Prompt caching (incl. automatic mode) — the lever already paying for itself
- Layer: cache / input.
- Mechanism: Opt-in `cache_control`. Automatic mode: one top-level field; the system places and moves a single breakpoint on the last cacheable block (Claude API, Claude Platform on AWS, MS Foundry beta; NOT Bedrock/Vertex). Explicit mode: up to 4 breakpoints, 20-block lookback per breakpoint. Writes 1.25x (5m TTL) / 2x (1h); reads 0.1x; "The cache is refreshed for no additional cost each time the cached content is used." Min cacheable prefix (live table): Fable 5/Mythos 5 = 512 (1,024 on Bedrock), Opus 4.8 = 1,024, Opus 4.7 = 2,048, Opus 4.6/4.5 = 4,096, Sonnet 4.6/4.5 = 1,024, Haiku 4.5 = 4,096. Invalidation hierarchy tools→system→messages: tool-definition edits bust everything; tool_choice/images/thinking-parameter changes bust messages only.
- Expected savings: already realized in the baseline: local heavy session prompt mix is 0.44% uncached / 6.73% cache-write / 92.83% cache-read; reads+writes = 61% of dollars. ESTIMATE counterfactual: 7.02M cache-read tok/day at 1.0x instead of 0.1x = $70.2 vs $7.0 → day total ~$85 vs ~$22 (~3.9x). Pays off after 1 read (5m) or 2 reads (1h), per the pricing page's own arithmetic.
- Evidence tier: T1 (live caching + pricing pages) + local baseline measurement.
- Quality risk: NEUTRAL (identical model output; pure price arbitrage). Falsification: none needed; verify via `usage.cache_read_input_tokens > 0`.
- Availability: CLAUDE-CODE-TODAY (applied automatically; the 92.83% read share is the proof) + SDK for breakpoint/TTL control.
- Effort to adopt: zero in Claude Code; minutes in API code. 1h TTL only wins when request gaps exceed 5 min — free refresh makes 5m TTL sufficient for active sessions.
- Composability: multipliers stack with Batch 50% and data-residency 1.1x ("These multipliers stack with other pricing modifiers"). Tension: context editing/compaction invalidate cache at the edit point; defer_loading and stable adaptive thinking preserve it. See 13-caching-exploitation.md.
- Validation protocol: for any session, assert `cache_read_input_tokens / (input + cache_creation + cache_read) > 0.85` across calls 2..N; alert on zero reads with identical prefixes (silent invalidator). Distinguish 5m vs 1h writes via the per-TTL cache_creation breakdown fields in usage.
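The validation protocol for this record can be expressed as a small check over captured `usage` objects. This is a minimal sketch that treats each usage as a plain dict with the standard `input_tokens`, `cache_creation_input_tokens` and `cache_read_input_tokens` fields. Applying the 0.85 threshold per call (rather than to a session aggregate) is an assumption, and the function names are illustrative.

```python
def cache_read_share(usage: dict) -> float:
    """Fraction of prompt tokens served from cache for one response's usage."""
    read = usage.get("cache_read_input_tokens", 0)
    total = (usage.get("input_tokens", 0)
             + usage.get("cache_creation_input_tokens", 0) + read)
    return read / total if total else 0.0


def check_session(usages: list, threshold: float = 0.85) -> list:
    """Return human-readable alerts for calls 2..N of one session."""
    alerts = []
    # Call 1 is skipped: it is expected to write the cache, not read it.
    for i, usage in enumerate(usages[1:], start=2):
        share = cache_read_share(usage)
        if usage.get("cache_read_input_tokens", 0) == 0:
            alerts.append(f"call {i}: zero cache reads (possible silent invalidator)")
        elif share <= threshold:
            alerts.append(f"call {i}: cache-read share {share:.1%} <= {threshold:.0%}")
    return alerts
```

Feeding it the per-call usage dicts from one session returns an empty list when every call after the first clears the threshold. A zero-read call is reported separately because it usually points at a changed prefix rather than a merely short one.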
2. Server-side compaction (compact_20260112) — API-native auto-summarization
- Layer: turn-structure / context lifecycle.
- Mechanism: Beta `anthropic-beta: compact-2026-01-12`, edit `{"type": "compact_20260112"}` under context_management. Default trigger 150,000 input tokens (minimum 50,000). API emits a `compaction` block; on later requests everything before it is dropped server-side. Optional `pause_after_compaction` and custom `instructions` (REPLACE the default prompt entirely). Billing: summarization is a separate iteration. Doc example: compaction 180,000 in + 3,500 out, message 23,000 in + 1,000 out → 203k in + 4.5k out total, while top-level usage shows only 23k/1k (~88% of the turn invisible to naive dashboards). Re-applying an existing compaction block costs nothing. Models: Fable 5, Mythos 5/Preview, Opus 4.8/4.7/4.6, Sonnet 4.6. Summarization runs on the SAME model: "There is no option to use a different (for example, cheaper) model"; with tools defined the model may call a tool instead of summarizing (content: null) unless instructions forbid it.
- Expected savings: no vendor savings/quality numbers published (T4 for magnitude). ESTIMATE: capping per-turn re-read context at 150k instead of growing toward 1M caps the cache-read line at ~15% of worst case ($0.15 vs $1.00 per Fable 5 turn at 0.1x); on the local profile the cache-read bucket (32% of dollars) scales with retained context.
- Evidence tier: T1 mechanics/billing (live compaction page); T4 net savings.
- Quality risk: QUALITY-TRADE — summary is lossy by construction; zero published evals. Degradation: forgotten constraints/decisions after the boundary. Falsification: continue 20 real coding tasks from compacted vs full history; diff task success and re-asked-question rate.
- Availability: SDK (Messages API beta). NOT a Claude Code knob — Claude Code ships its own client-side auto-compact, a separate implementation with at least one live mis-trigger regression: auto-compact firing at 84k/1M (8% usage) every 1-2 min in v2.1.112 (github.com/anthropics/claude-code/issues/52981).
- Effort to adopt: hours (one edits entry + append full `response.content` so compaction blocks persist; track cost via `usage.iterations[]`, populated only on turns that trigger a new compaction).
- Composability: pairs with memory tool (state survives the boundary); put a cache breakpoint at end of system prompt so compactions don't bust the system prefix; compaction blocks accept cache_control. Each compaction = one cache re-write of the new shorter prefix.
- Validation protocol: A/B a 500k-token scripted agent run, compaction-at-150k vs raw growth: compare total $ (summed over iterations[]), task completion, and factual-recall probes referencing pre-boundary content.
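The billing gap in the doc example is easiest to see by summing the iterations rather than reading top-level usage. The sketch below assumes each entry of `usage.iterations[]` is a dict carrying `input_tokens` and `output_tokens`; that entry shape, the `type` labels, and the function name are assumptions for illustration and should be checked against a live response.

```python
# Sum billed tokens across iterations for a turn that triggered compaction.
# The entry shape (dicts with input_tokens/output_tokens) is an assumption.
def billed_totals(iterations: list) -> tuple:
    total_in = sum(it["input_tokens"] for it in iterations)
    total_out = sum(it["output_tokens"] for it in iterations)
    return total_in, total_out


# Doc example: one compaction iteration plus the message iteration.
iterations = [
    {"type": "compaction", "input_tokens": 180_000, "output_tokens": 3_500},
    {"type": "message", "input_tokens": 23_000, "output_tokens": 1_000},
]
top_level = {"input_tokens": 23_000, "output_tokens": 1_000}

total_in, total_out = billed_totals(iterations)
visible = top_level["input_tokens"] + top_level["output_tokens"]
hidden = 1 - visible / (total_in + total_out)
print(total_in, total_out, f"{hidden:.0%} of tokens missing from top-level usage")
```

On the doc example this gives 203,000 in and 4,500 out, with about 88% of the turn's tokens absent from the top-level fields, so a dashboard that only reads top-level usage would under-report compaction turns by roughly that share.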
3. Context editing: clear_tool_uses + clear_thinking — the rare negative-cost lever (vendor-claimed)
- Layer: turn-structure / context lifecycle.
- Mechanism: Beta `context-management-2025-06-27`. `clear_tool_uses_20250919`: clears oldest tool results past trigger (defaults: 100,000-token trigger, keep 3 most-recent tool uses; optional clear_at_least / exclude_tools / clear_tool_inputs). `clear_thinking_20251015`: keep last N thinking turns; must be listed FIRST when combined. Applied server-side pre-inference; client keeps full history. count_tokens previews the effect (`context_management.original_input_tokens` vs post-edit `input_tokens`).
- Expected savings: vendor (claude.com/blog/context-management, verbatim): "Context editing alone delivered a 29% improvement"; "combining the memory tool with context editing improved performance by 39%"; 100-turn web-search eval "reducing token consumption by 84%". On the local profile an 84%-class cut applies to the 61%-of-dollars cache block → ESTIMATE up to ~$11/day ceiling if it transferred fully to coding (unproven).
- Evidence tier: T1 shipped + vendor-published numbers; NO independent replication found — treat magnitude as vendor-internal (web-search eval, not coding).
- Quality risk: NEGATIVE-COST per vendor (tokens down AND +29% performance — stale tool results are anti-signal); treat as RISKY on coding until replicated, since cleared results are unrecoverable without memory. Falsification: replicate on Read/Grep-heavy coding tasks; check for re-reading of cleared files.
- Availability: SDK (also Bedrock/Vertex per launch post). Not a Claude Code knob.
- Effort to adopt: hours (one context_management block); tune `clear_at_least` so each cache bust buys a big clear.
- Composability: DIRECT CACHE TENSION (live page): clearing "invalidates cached prompt prefixes"; kept thinking preserves cache. Integrates with memory tool. Anti-synergy with compaction if both fire on the same window, so pick one primary strategy (docs position compaction as primary).
- Validation protocol: fixed 30-task coding suite, edits on/off: compare success rate, $/task (cache-write churn included), and count_tokens previews vs realized billing.
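A small builder makes the ordering rule from the Mechanism bullet hard to get wrong. This is a sketch only: the edit type strings and defaults come from the text above, while the nested `{"type": ..., "value": ...}` shapes for `trigger`, `keep` and `clear_at_least` are my recollection of the request format and should be treated as assumptions to verify against the live context-editing page. The function name is hypothetical.

```python
from typing import Optional


def context_edits(keep_thinking_turns: Optional[int] = None,
                  trigger_tokens: int = 100_000,
                  keep_tool_uses: int = 3,
                  clear_at_least: Optional[int] = None) -> dict:
    """Build a context_management block; clear_thinking goes first if present."""
    edits = []
    if keep_thinking_turns is not None:
        # Must be listed FIRST when combined with clear_tool_uses.
        edits.append({"type": "clear_thinking_20251015",
                      "keep": {"type": "thinking_turns",
                               "value": keep_thinking_turns}})
    tool_edit = {"type": "clear_tool_uses_20250919",
                 "trigger": {"type": "input_tokens", "value": trigger_tokens},
                 "keep": {"type": "tool_uses", "value": keep_tool_uses}}
    if clear_at_least is not None:
        # A larger minimum clear means each cache bust reclaims more tokens.
        tool_edit["clear_at_least"] = {"type": "input_tokens",
                                       "value": clear_at_least}
    edits.append(tool_edit)
    return {"edits": edits}
```

Called with no arguments it reproduces the documented defaults (100,000-token trigger, keep 3 tool uses); passing `keep_thinking_turns=2` prepends the thinking edit so the combined list is in the required order.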
4. Effort parameter + adaptive thinking — the only thinking-spend control on Fable 5
- Layer: output.
- Mechanism: `output_config: {effort: low|medium|high|xhigh|max}`, GA, no beta header; `high` default ("exactly the same behavior as omitting the parameter"). Affects ALL response tokens: thinking, tool-call count, preamble ("lower effort would mean Claude makes fewer tool calls"). `max` on Fable 5/Mythos/Opus 4.8/4.7/4.6/Sonnet 4.6; `xhigh` only Fable 5/Mythos 5/Opus 4.8/4.7. On Fable 5 thinking is always-on (`{type:"disabled"}` rejected); `budget_tokens` 400s on Opus 4.8/4.7, deprecated on 4.6/Sonnet 4.6. `max_tokens` is the hard cap (docs: start 64k at xhigh/max). Claude Code: default high (effort page: Opus 4.8 "default is `high` on all surfaces, including the Claude API and Claude Code"); "ultracode" is NOT an API level: it is `xhigh` + standing multi-agent permission injected via mid-conversation system messages. Observability: `usage.output_tokens_details.thinking_tokens` splits billed reasoning from response (strictly better than the Phase-0 subtraction method). `display: omitted/summarized` never changes billing.
- Expected savings: NO published per-level percentages anywhere (confirmed by full-page read), only qualitative ("low: Significant token savings with some capability reduction"; Sonnet 4.6 recommended default medium). Local anchor: thinking = 54.8% of output tokens at max effort (n=1) and output = 37% of session dollars → ESTIMATE a 50% thinking cut ≈ 10% of session cost (0.5 × 20%), plus second-order input savings from fewer tool calls.
- Evidence tier: T1 mechanics; T4 magnitude per level (vendor publishes none — measure locally).
- Quality risk: QUALITY-TRADE below high (docs: "accepting some reduction in capability"; at low/medium Opus 4.7+ "scopes its work to what was asked"). Conversely dropping max→high can be NEGATIVE-COST: docs warn `max` "can lead to overthinking" on structured-output / less intelligence-sensitive tasks.
- Availability: CLAUDE-CODE-TODAY (effort menu) + SDK.
- Effort to adopt: minutes (menu item / one field); the real work is the eval.
- Composability: consecutive adaptive-mode requests preserve cache; SWITCHING thinking mode busts message cache (system/tools survive). Per-message steering ("Answer directly without deliberating.") works with no parameter change at all → zero cache impact.
- Validation protocol: fixed Claude Code task set at low/medium/high/xhigh; read `output_tokens_details.thinking_tokens` per run; report tokens + pass-rate per level. This closes the dossier's biggest missing number and re-bases the 54.8% n=1. See 31-validation-harness.md.
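The reporting step of this protocol reduces to a group-by over run records. The sketch below assumes each run has already been captured as a dict with `effort`, `thinking_tokens` (read from `usage.output_tokens_details.thinking_tokens`) and a boolean `passed`; that record shape and the function name are hypothetical, and no measured values are implied.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_effort(runs: list) -> dict:
    """Group runs by effort level; report mean thinking tokens and pass rate."""
    grouped = defaultdict(list)
    for run in runs:
        grouped[run["effort"]].append(run)
    return {
        level: {
            "n": len(rs),
            "mean_thinking_tokens": mean(r["thinking_tokens"] for r in rs),
            # True counts as 1, so the sum is the number of passing runs.
            "pass_rate": sum(r["passed"] for r in rs) / len(rs),
        }
        for level, rs in grouped.items()
    }
```

Reporting `n` alongside each level matters here because the existing 54.8% anchor is a single run; a per-level table without sample sizes would invite the same over-reading.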
5. Prior-turn thinking retention — the hidden input-side cost of reasoning models
- Layer: input / context lifecycle.
- Mechanism: Adaptive-thinking page (verbatim): "whether the API keeps or strips them depends on the model: Opus 4.5+ and Sonnet 4.6+ keep them in context by default; earlier Opus/Sonnet models and all Haiku models strip them"; pricing section bills "Thinking blocks from prior assistant turns kept in context: ... all turns by default on Opus 4.5+ and Sonnet 4.6+ (input tokens)". Reclaim via `clear_thinking_20251015` (keep: N). LIVE CONTRADICTION (both pages): context-windows page still says "previous thinking blocks are automatically stripped from the context window calculation by the Claude API"; token-counting page says prior-turn thinking "do not count toward your input tokens" (plausibly true for the counting endpoint specifically, while create keeps and bills; the context-windows text is flatly stale).
- Expected savings: ESTIMATE: at the local 54.8% thinking share, kept thinking re-enters every later prompt; at 0.1x cache-read it is cheap per token but compounds the 61%-of-dollars cache block over 50-turn sessions. `clear_thinking keep:1-2` reclaims most of it at the cost of one cache bust.
- Evidence tier: T1 for retention behavior (two live pages agree, two disagree); T4 dollar magnitude. Local probe blocked: count_tokens validates signatures (400 on fabricated blocks, reproduced), so it needs a real signed transcript.
- Quality risk: retention presumably aids reasoning continuity; aggressive clearing is RISKY for long multi-step chains, NEUTRAL for fresh-task-per-turn usage. Exception that can never be cleared: the thinking block of an in-flight tool_use cycle must be returned.
- Availability: behavior affects CLAUDE-CODE-TODAY sessions implicitly; the clear_thinking control is SDK-only.
- Effort to adopt: hours (one edits entry) once on the SDK.
- Composability: kept thinking = stable cache-friendly prefix; clearing busts cache at the clear point. Interacts with effort (less thinking generated → less retained).
- Validation protocol: capture a real multi-turn Fable 5 transcript (signed blocks); count_tokens it with and without prior thinking passed back AND diff `usage.input_tokens` on live create calls. This resolves the docs contradiction empirically and prices the retained share.
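Why retention "compounds" can be made concrete with a simple count: if every turn's thinking stays in context, turn k re-sends the thinking of the k-1 turns before it. The sketch below is a simplified model under stated assumptions (a constant number of thinking tokens per turn, no compaction, and an optional cap that mimics a clear_thinking-style `keep`); the inputs are hypothetical, not measurements, and it deliberately ignores the cache re-write that clearing triggers.

```python
from typing import Optional


def retained_thinking_reads(thinking_per_turn: int, turns: int,
                            keep: Optional[int] = None) -> int:
    """Total prior-turn thinking tokens re-sent as input across a session.

    Turn k carries the thinking of up to (k - 1) earlier turns, or at most
    `keep` of them when a clear_thinking-style cap applies.
    """
    total = 0
    for k in range(1, turns + 1):
        carried = k - 1 if keep is None else min(k - 1, keep)
        total += carried * thinking_per_turn
    return total
```

Without a cap the total grows as thinking_per_turn × N(N-1)/2, so it is quadratic in session length, while a small `keep` makes it roughly linear. Pricing the difference still needs the real per-turn thinking sizes and the cost of each cache bust, which this sketch leaves out.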
6. Tool search tool + defer_loading — stop paying for unused tool schemas
- Layer: input (prefix).
- Mechanism: Two server-side variants, no beta header: `tool_search_tool_regex_20251119` (Python re.search, ≤200 chars) and `tool_search_tool_bm25_20251119` (natural language). `defer_loading: true` tools are excluded from the system-prompt prefix; on discovery the API appends a `tool_reference` block and expands it. Verbatim: "The prefix is untouched, so prompt caching is preserved." 3-5 references per search; ≤10,000 tools; expanded references persist across turns. MCP defers via `mcp_toolset`. Incompatible with tool-use examples; strict-mode grammar builds from the full toolset (no recompilation).
- Expected savings: docs: typical 5-server MCP setup "~55k tokens in definitions"; tool search "typically reduces this by over 85%"; accuracy "degrades significantly once you exceed 30-50 available tools". Engineering post: 77K→8.7K upfront ("85% reduction... preserving 95% of context window"); MCP-eval accuracy Opus 4 49%→74%, Opus 4.5 79.5%→88.1%; internal worst case 134K tokens of definitions. Local ground truth: 11 MCP schemas = 1,420 tok loaded vs ~60 tok deferred ≈ 96% cut (Phase-0); direction and magnitude consistent with vendor.
- Evidence tier: T1 (vendor-published + locally reproduced).
- Quality risk: NEGATIVE-COST for big catalogs (tokens down, accuracy up); NEUTRAL-to-negative under 10 tools / <100 schema tokens (docs' own threshold) where search round-trips just add latency.
- Availability: CLAUDE-CODE-TODAY (this session's harness runs 37 deferred tools behind a ToolSearch tool) + SDK.
- Effort to adopt: minutes-hours: flag per tool + include the search tool; keep the top 3-5 tools non-deferred.
- Composability: the standout property is cache preservation — most context surgery busts cache, this does not. Works in Batch at normal pricing. Composes with strict mode without grammar recompilation.
- Validation protocol: measure per-search overhead (server_tool_use block + expanded definitions) vs schemas saved on your catalog: count_tokens the request with all tools loaded vs deferred + N searches; assert net saving and unchanged task pass-rate.
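Adoption and validation for this record can both be sketched in a few lines: mark everything outside a short always-loaded set as deferred, then compute the net prefix saving once per-search overhead has been measured. The search tool's `type` string and the `defer_loading` flag come from the text above; the `name` value given to the search tool and both helper names are assumptions for illustration.

```python
def with_deferred_tools(tools: list, always_loaded: set) -> list:
    """Prepend a tool-search tool and mark every other tool as deferred."""
    search_tool = {"type": "tool_search_tool_regex_20251119",
                   "name": "tool_search_tool_regex"}  # name is an assumption
    marked = [
        tool if tool["name"] in always_loaded else {**tool, "defer_loading": True}
        for tool in tools
    ]
    return [search_tool, *marked]


def net_saving(loaded_tokens: int, deferred_tokens: int,
               search_overhead: int = 0) -> float:
    """Fractional prefix saving after charging any measured search overhead."""
    return 1 - (deferred_tokens + search_overhead) / loaded_tokens


# Local Phase-0 figures from this record: 1,420 tok loaded vs ~60 tok deferred.
print(f"{net_saving(1420, 60):.0%}")
```

With zero search overhead the local figures give about 96%. Passing the measured `server_tool_use` and expanded-definition tokens as `search_overhead` shows how quickly that margin shrinks on a small catalog.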
7. Programmatic tool calling (PTC) — tool results that never enter context
- Layer: input (tool-result volume).
- Mechanism: Requires `code_execution_20260120`; tools opt in via `allowed_callers: ["code_execution_20260120"]` (guidance, "not a security boundary"). Claude writes a script in the sandbox; each in-script tool call pauses execution, client supplies tool_result, script resumes. Verbatim: "Tool results from programmatic invocations do not count toward your input/output token usage. Only the final code execution result and Claude's response count." Platforms: Claude API, Claude Platform on AWS, MS Foundry; NOT Bedrock/Vertex. Exclusions: MCP-connector tools, `strict: true` tools.
- Expected savings: vendor (all verbatim): BrowseComp+DeepSearchQA "improved performance by an average of 11% while using 24% fewer input tokens"; 75-tool benchmark "reduced billed input tokens by roughly 38% with no change in task accuracy"; production traffic 10-49 tools "typical token savings of 20% to 40%". NEGATIVE RESULT, same page: τ²-bench "left scores unchanged and cost roughly 8% more. Sequential single-call workflows do not benefit." Engineering post: 43,588→27,297 tok (37%) on complex research.
- Evidence tier: T1 (multiple vendor benchmarks incl. an honest negative).
- Quality risk: NEGATIVE-COST on fan-out/filter shapes; RISKY (+8% cost, no gain) on sequential reason-per-call shapes — which is exactly the Read/Edit/Bash profile of most coding-agent turns. Do not assume the 20-40% band transfers to Claude Code-like traffic.
- Availability: SDK (server-side containers). Not how Claude Code runs tools; Claude Code subagents achieve a similar result-isolation effect client-side.
- Effort to adopt: days — enable code execution, tag tools, implement the pause/resume protocol.
- Composability: batch-compatible at batch pricing. Container costs: 1,550 free hours/org/month, then $0.05/hr (5-min minimum), FREE when `web_search_20260209`/`web_fetch_20260209` is in the request. Conflicts with strict tools and MCP connector.
- Validation protocol: replay a week of real tool traffic shapes through PTC vs direct; segment by fan-out width; adopt only for shapes where billed input drops with flat accuracy (expect the sequential segment to reproduce the τ²-bench negative).
8. Batch API — 50% off everything, stacks with caching
- Layer: infra / pricing modifier.
- Mechanism: Async Messages Batches; live table is exactly 50% on input AND output for every active model (Fable 5 $5/$25, Opus 4.8 $2.50/$12.50, Sonnet 4.6 $1.50/$7.50, Haiku 4.5 $0.50/$2.50). FAQ verbatim: "Batch API and prompt caching discounts can be combined"; multipliers stack → batched cache read = 0.5 × 0.1 = 0.05x base input (ESTIMATE arithmetic from stated stacking). 1M requests batch at the same rates; tool search batches at normal pricing. Excluded: fast mode, Managed Agents sessions.
- Expected savings: exactly 50% of whatever moves to batch — the only lever touching all five local dollar buckets at once, including the 37%-of-dollars output side.
- Evidence tier: T1 (pricing page). SDK docs: most batches < 1h, max 24h (cached skill reference) — re-verify turnaround on your own workload.
- Quality risk: NEUTRAL on quality; cost is latency (results within 24h).
- Availability: SDK only — useless for interactive Claude Code; ideal for nightly review sweeps, eval runs, offline subagent fan-outs.
- Effort to adopt: days (restructure offline workloads into batch submissions).
- Composability: stacks with caching and 1M context; mutually exclusive with fast mode and interactivity.
- Validation protocol: move one nightly workload (e.g. repo-wide code-review sweep) to batch; confirm 0.5x on the invoice line and measure realized turnaround distribution.
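The stacking arithmetic used in this record (and for the data-residency modifier mentioned under prompt caching) is plain multiplication. The sketch below encodes only the multipliers quoted in this file; the function name and argument names are hypothetical, and whether every combination is actually offered on a given platform is not something the code checks.

```python
def effective_input_multiplier(cache: str = "none", batch: bool = False,
                               data_residency: bool = False) -> float:
    """Multiply the stated pricing modifiers into one factor on base input price."""
    cache_factor = {"none": 1.0, "read": 0.1,
                    "write_5m": 1.25, "write_1h": 2.0}[cache]
    factor = cache_factor
    if batch:
        factor *= 0.5   # Batch API: 50% off
    if data_residency:
        factor *= 1.1   # data-residency modifier quoted in record 1
    return factor


print(effective_input_multiplier("read", batch=True))
```

A batched cache read comes out at 0.05x of base input, matching the estimate in the Mechanism bullet. The same function shows a batched 5m cache write at 0.625x, which is why offline workloads with a shared prefix benefit on both the write and the read side.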
9. 1M context at standard pricing — a cost-model correction, not a saving
- Layer: infra / pricing model.
- Mechanism: Pricing page verbatim: Fable 5, Mythos 5/Preview, Opus 4.8/4.7/4.6, Sonnet 4.6 "include the full 1M token context window at standard pricing. (A 900k-token request is billed at the same per-token rate as a 9k-token request.)" Caching and batch apply at standard rates across the window. Context-windows page: 1M is the DEFAULT on Fable 5; 128k max output; Opus 4.8 is 200k on MS Foundry only; Sonnet 4.5 and older stay 200k. Overflow on 4.5+: request accepted, generation stops with `stop_reason: "model_context_window_exceeded"`.
- Expected savings: none directly; it kills the 2025-era >200k premium (2x in / 1.5x out) from every old cost model. Flat pricing means bloat is LINEARLY expensive with no cliff: a full 1M Fable 5 prompt = $10 uncached / $1 cached per request (ESTIMATE from list prices).
- Evidence tier: T1.
- Quality risk: RISKY if treated as free headroom — the same page warns "more context isn't automatically better. As token count grows, accuracy and recall degrade, a phenomenon known as context rot." Compaction/editing stay worthwhile below the hard limit.
- Availability: CLAUDE-CODE-TODAY (the [1m] variant / Fable 5 default) + SDK. Bedrock/Vertex 1M listed for Opus 4.8/4.7/4.6/Sonnet 4.6; Fable 5 wording covers the Claude API.
- Effort to adopt: none (update spreadsheets, not code).
- Composability: see records 2-3 for the quality counterweight; see 12-context-architecture.md.
- Validation protocol: n/a for pricing; for quality, recall probes at 100k/300k/600k context on your tasks before relying on deep-window placement.
10. Tokenizer-aware model offload — cheap models also count fewer tokens
- Layer: infra / model routing.
- Mechanism: Pricing page: "Opus 4.7 and later use a new tokenizer... may use up to 35% more tokens for the same fixed text"; token-counting page: "roughly 30% more tokens than models before Claude Opus 4.7", with the dual-count migration method. Sonnet 4.6/Haiku 4.5 keep the pre-4.7 tokenizer — locally confirmed identical counts (table above).
- Expected savings: offloading to Sonnet/Haiku saves twice: lower $/tok AND fewer tokens billed. Measured: +37.9% prose / +27.7% code / +8.9% JSON on Fable 5 for identical text → ESTIMATE effective input-price ratio Fable 5:Sonnet 4.6 ≈ ($10 × 1.28-1.38)/$3 ≈ 4.3-4.6x on prose/code (sticker says 3.3x); Fable 5:Haiku ≈ 13x. Same multiplier applies to context-window occupancy.
- Evidence tier: T1 (vendor-stated band + locally reproduced).
- Quality risk: QUALITY-TRADE per task — standard downgrade risk; route only eval-passed task classes.
- Availability: CLAUDE-CODE-TODAY (Task subagent model field) + SDK routing.
- Effort to adopt: minutes (subagent model selection) to hours (router rule).
- Composability: compounds with batch (0.5x) and caching (0.1x) on the cheap model. Caveat: compaction CANNOT delegate summarization to a cheaper model (record 2) — the lever is unavailable exactly where it would shine. Model switch busts cache (caches are model-scoped) — offload at task boundaries, not mid-session.
- Validation protocol: dual-count your real workload per content type (the docs' own method); then per-task-class A/B pass-rates Sonnet vs Fable before routing.
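The effective-price estimate in this record can be recomputed directly from the measured token counts instead of the rounded percentages. The sketch below uses the counts from the tokenizer table and assumes list input prices of $10 (Fable 5) and $3 (Sonnet 4.6) per MTok; the function name is illustrative.

```python
def effective_price_ratio(big_price: float, small_price: float,
                          big_tokens: int, small_tokens: int) -> float:
    """Price ratio for the same text, adjusted for tokenizer inflation."""
    return (big_price * big_tokens) / (small_price * small_tokens)


# (Fable 5 tokens, Sonnet 4.6 tokens) for identical text, from the table above.
SAMPLES = {"prose": (706, 512), "code": (806, 631), "json": (969, 890)}
for name, (fable, sonnet) in SAMPLES.items():
    ratio = effective_price_ratio(10.0, 3.0, fable, sonnet)
    print(f"{name}: {ratio:.2f}x effective vs {10.0 / 3.0:.1f}x sticker")
```

This gives roughly 4.60x on prose, 4.26x on code and 3.63x on JSON schemas, so the gap over the 3.3x sticker ratio is largest on prose and smallest on structured content.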
11. Tool-definition overhead accounting — know the fixed taxes
- Layer: input (fixed overhead) / observability.
- Mechanism: Pricing page itemizes: tool-use system prompt tax (auto/none vs any/tool): Opus 4.8 = 290/410, Opus 4.7 = 675/804, Opus 4.6 & Sonnet 4.6 = 497/589, Opus 4.5/Sonnet 4.5/Haiku 4.5 = 496/588; Fable 5 ABSENT (local 403 reproduction pins it ≈ Opus 4.8's 290). Per-tool adders: bash +245; text_editor_20250429 +700; computer use +735/tool & +466-499 system. Web search $10/1k searches with results billed as input in all later turns; web fetch $0 beyond tokens (use `max_content_tokens`; docs: ~2,500 tok avg page, ~25k big doc page, ~125k per 500kB PDF).
- Expected savings: micro per request (all prefix → cached after first write); the real lever is knowing web search has a per-use fee + persistent token tail while web fetch is token-only, and that schema bloat multiplies by every cache re-write. ESTIMATE: a Claude Code-like setup (290 system + 245 bash + 700 editor + ~15 schemas) ≈ 3-6k fixed prefix tokens.
- Evidence tier: T1 (published table) + local exact reproduction (403).
- Quality risk: NEUTRAL (accounting, not behavior).
- Availability: CLAUDE-CODE-TODAY / SDK.
- Effort to adopt: none — measurement only.
- Composability: caching absorbs all of it after first write; defer_loading removes the schema share but not the system tax.
- Validation protocol: count_tokens with/without each tool on your exact toolset (script in this file's method section reproduces it in ~10 API calls, $0).
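The with/without comparison can be written against any counting function, which keeps the logic testable offline. The sketch below is not the `/tmp/ct.py` harness (whose contents are not reproduced in this file); it is a hypothetical helper that takes a `count(messages, tools)` callable, which in practice could wrap the token-counting endpoint for a fixed model.

```python
from typing import Callable, Optional


def per_tool_overhead(count: Callable[[list, Optional[list]], int],
                      messages: list, tools: list) -> dict:
    """Token delta of each tool alone, plus the full toolset, vs no tools."""
    base = count(messages, None)
    # Each single-tool delta includes the tool-use system tax, so the
    # per-tool figures are not additive; compare them to "<all tools>".
    report = {tool["name"]: count(messages, [tool]) - base for tool in tools}
    report["<all tools>"] = count(messages, tools) - base
    return report
```

With the Anthropic Python SDK, `count` could be a thin wrapper that calls `client.messages.count_tokens(...)` for a chosen model and returns `.input_tokens`, omitting the `tools` argument when it is `None`. Subtracting the documented system tax from a single-tool delta then isolates that tool's schema cost.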
12. count_tokens — free measurement infrastructure
- Layer: infra / observability.
- Mechanism: POST /v1/messages/count_tokens. "Token counting is free to use"; RPM by tier: 1→100, 2→2,000, 3→4,000, 4→8,000; "Token counting and message creation have separate and independent rate limits." Counts are estimates; "You are not billed for system-added tokens." Supports system, tools, images, PDFs, thinking and context_management previews (record 3) — you can price a context-editing config before running it. Caveat found locally: it validates thinking-block signatures (400 on fabricated ones) and per its FAQ never exercises caching logic.
- Expected savings: zero direct; enables every other measurement. This file verified three doc claims with it at $0.
- Evidence tier: T1 + local use.
- Quality risk: NEUTRAL.
- Availability: SDK (and trivially scriptable from Claude Code, as done here — /tmp/ct.py).
- Effort to adopt: minutes.
- Composability: pairs with context_management previews and dual-model tokenizer audits.
- Validation protocol: n/a (it IS the validation tool); sanity-check against `usage` on one live call.
13. Structured outputs / strict tool use — retry elimination, not a discount
- Layer: output (reliability).
- Mechanism: `output_config.format` (json_schema) + `strict: true` tools (old beta header deprecated). Grammar compiles on first request, cached 24h from last use; only schema STRUCTURE changes invalidate it. Limits: 20 strict tools/request; adds a small schema-explanation system injection. Incompatible with PTC (strict tools can't be code-called) and citations.
- Expected savings: no vendor numbers. ESTIMATE: retry-on-malformed-JSON pipelines spend ≈ 1/(1-p) of base; eliminating p=10% saves ~10% of pipeline spend in 5x-priced output tokens (T4, workload-dependent).
- Evidence tier: T1 mechanics; T4 magnitude.
- Quality risk: NEUTRAL-to-positive (guaranteed parseable output).
- Availability: SDK (Claude Code's own StructuredOutput tool is harness-level, not this feature).
- Effort to adopt: hours.
- Composability: composes with defer_loading without grammar recompilation; conflicts with PTC.
- Validation protocol: measure malformed-output rate and output token count, prompted-JSON vs constrained, on 200 extraction calls.
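The retry estimate in this record depends on which retry model is assumed, so it is worth writing down. The sketch below assumes each attempt fails independently with probability p and is retried until it parses (geometric retries, which is what 1/(1-p) describes); the function name is hypothetical and p would come from the malformed-output measurement above.

```python
def retry_overhead(p_malformed: float) -> dict:
    """Expected spend when each attempt fails independently with probability p
    and is retried until it parses (geometric retries)."""
    attempts = 1 / (1 - p_malformed)
    return {
        "expected_attempts": attempts,
        # Share of today's (with-retries) spend removed by eliminating retries.
        "saving_vs_current_spend": 1 - 1 / attempts,
        # Extra spend expressed relative to a retry-free baseline.
        "saving_vs_base_spend": attempts - 1,
    }


print(retry_overhead(0.10))
```

At p = 10% this model gives about 1.11 expected attempts, which is 10% of current spend or about 11% of the retry-free base. A pipeline that retries at most once spends 1 + p instead, which works out to roughly 9% of current spend, so the retry policy should be stated alongside any savings figure.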
14. Cross-provider triangulation — OpenAI and Gemini caching baselines
- Layer: competitive baseline (GATEWAY-OR-SELF-HOST from a Claude Code perspective).
- Mechanism: OpenAI (developers.openai.com): caching automatic at ≥1,024 tokens, writes free, "can reduce latency by up to 80% and input token costs by up to 90%", in-memory retention 5-10 min up to 1h, extended retention up to 24h at no extra charge, optional `prompt_cache_key` routing; best-effort, no guaranteed hit rate. Gemini (ai.google.dev): implicit caching default on 2.5+ (min 2,048-4,096 tok), explicit caches billed at ~0.1x reads (3.5 Flash $0.15 vs $1.50; 2.5 Pro $0.125 vs $1.25) PLUS storage $1.00 (Flash) / $4.50 (Pro) per MTok per HOUR; batch 50%, matching Anthropic exactly.
- Expected savings: No direct Claude Code saving; this is a comparative baseline. Anthropic's model (paid writes, guaranteed 0.1x reads, free TTL refresh, $0 storage) wins for bursty interactive sessions; Gemini's storage-rent model only wins at sustained traffic (ESTIMATE: a 500k-token explicit cache on a Pro-class model costs $2.25/hr in rent before any read); OpenAI is cheapest when its best-effort hit rate cooperates. Convergence: 0.1x-class reads and 50% batch discounts look like settled industry constants as of mid-2026.
- Evidence tier: T1.
- Quality risk: NEUTRAL.
- Availability: different providers.
- Effort to adopt: n/a.
- Composability: n/a — context for negotiation/architecture choices.
- Validation protocol: n/a (baseline only; not a recommendation to switch).
Claims to kill (folklore checked against live sources)
- KILL "1M context costs a premium above 200k" — every current 1M model bills flat ("900k... same per-token rate as a 9k-token request"). The premium was real only for the 2025
context-1m-2025-08-07beta era. - KILL "token-efficient-tools-2025-02-19 header cuts tool output up to 70%" — migration guide: the header "has no effect" on Claude 4+; built in.
- KILL "hide/omit thinking to save money" — "Omitting reduces latency, not cost"; "Under every setting, thinking happens and is billed the same." Corollary: visible thinking tokens ≠ billed thinking tokens; use output_tokens_details.thinking_tokens.
- KILL (nuanced) "prior-turn thinking is always stripped, so reasoning never costs input" — false on Opus 4.5+/Sonnet 4.6+/Fable 5 (kept, billed as input). Two Anthropic pages (context-windows, token-counting) still propagate the dead claim.
- KILL "cap Fable 5 thinking with budget_tokens / disable thinking" — disabled is rejected on Fable 5; budget_tokens returns a 400 on Opus 4.8/4.7. Only effort, prompt steering, and max_tokens remain.
- KILL "batch doesn't combine with caching" — FAQ: "Batch API and prompt caching discounts can be combined" (≈0.05x batched cache read).
- KILL "PTC always saves tokens" — vendor's own τ²-bench result: "cost roughly 8% more" on sequential workflows.
- KILL "Anthropic caching needs manual breakpoints" — automatic top-level mode exists (Claude API / Claude Platform on AWS / MS Foundry beta; still NOT Bedrock/Vertex).
- KILL "Anthropic caching needs 1,024+ token prefixes" — wrong in both directions: Fable 5 = 512, but Haiku 4.5 and Opus 4.6/4.5 = 4,096.
- KILL "OpenAI caching = 50% off" — current guide says up to 90%; 50% is the GPT-4o-era figure.
- FLAG (contradicts local measurement): pricing-FAQ heuristic "1 token ≈ 4 characters" of English — measured 2.81 chars/tok on Fable 5 markdown prose (3.88 on Sonnet 4.6). The heuristic is roughly right for the old tokenizer but undercounts Fable 5 token volume by ~30-40% on prose.
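Two of the items above reduce to one-line arithmetic: the ≈0.05x batched cache read and the size of the "4 characters per token" undercount. A minimal sketch follows. It assumes the batch and cache-read discounts multiply (the pricing page's "multipliers stack" wording suggests this, but the sketch does not verify it), and the function names are hypothetical.

```python
BATCH_MULTIPLIER = 0.5       # batch = 50% of list price, per the pricing page
CACHE_READ_MULTIPLIER = 0.1  # cache read = 0.1x base input, per the pricing page


def batched_cache_read_multiplier() -> float:
    """Effective input multiplier for a cache read inside a batch job.

    Assumption: the two discounts combine multiplicatively.
    """
    return BATCH_MULTIPLIER * CACHE_READ_MULTIPLIER


def heuristic_undercount(measured_cpt: float, heuristic_cpt: float = 4.0) -> float:
    """Fraction of the true token count that a chars-per-token heuristic misses."""
    return 1.0 - measured_cpt / heuristic_cpt


def true_over_estimate(measured_cpt: float, heuristic_cpt: float = 4.0) -> float:
    """How much larger the true token count is than the heuristic estimate."""
    return heuristic_cpt / measured_cpt - 1.0


print(f"batched cache read: {batched_cache_read_multiplier():.2f}x")

# Chars-per-token values are the local measurements quoted in the FLAG item.
for model, cpt in [("Fable 5", 2.81), ("Sonnet 4.6", 3.88)]:
    missed = heuristic_undercount(cpt)
    extra = true_over_estimate(cpt)
    print(f"{model}: heuristic misses {missed:.1%} of tokens; "
          f"true volume is {extra:.1%} above the estimate")
```

The two framings bracket the ~30-40% band: on Fable 5 prose the heuristic misses roughly 30% of the real tokens, which is the same as saying the real volume runs roughly 42% above the heuristic's estimate.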
Gaps — ranked next measurements
- Per-effort-level token deltas (the only lever on 20% of dollars; vendor publishes nothing). Cheap: fixed task set × 4 levels, read output_tokens_details.thinking_tokens.
- Prior-turn thinking retention billing — docs contradict; local probe blocked by signature validation. Needs one real signed transcript diffed via count_tokens + live usage.input_tokens.
- Compaction quality/savings — zero published numbers, same-model-only summarization, and Claude Code's separate client-side auto-compact has a live mis-trigger regression (issue #52981). Nobody has benchmarked server-compact-at-150k vs Claude Code auto-compact.
- Official Fable 5 row in the tool-overhead table — vendor omission; local 403 exact-match is the best current evidence.
- Claude Code exposure gap — effort and deferred MCP tools are surfaced; context_editing, clear_thinking, compact_20260112, memory have no Claude Code knobs, so the strongest published-number levers (84%) are SDK-only today.
- Independent replication — ALL headline percentages (84/39/29, >85, 38/24, 20-40) are vendor self-reported on internal evals; the τ²-bench negative result raises confidence in honesty, but no third-party replication surfaced in this sweep (T1-vendor, not T3).
- Fast mode as a spend-more lever (Opus 4.8 at $10/$50 — Fable 5 sticker for an Opus model; Opus 4.6/4.7 at $30/$150 = 6x; research preview, no batch, stacks with caching) deserves a quality-per-dollar comparison vs Fable 5 at high effort.
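The first gap (per-effort-level token deltas) needs only a small aggregation step once the runs exist. The sketch below covers that step alone: it assumes each run has already been captured as a plain dict holding the effort label, a task id, and a usage object shaped like the API's JSON, with the output_tokens_details.thinking_tokens field named above. The record layout, function names, and the choice of "high" as baseline (the documented default) are illustrative assumptions; no API call is made here.

```python
from collections import defaultdict
from statistics import mean


def thinking_tokens(usage: dict) -> int:
    """Billed thinking tokens from a usage dict; 0 if the field is absent."""
    return usage.get("output_tokens_details", {}).get("thinking_tokens", 0)


def per_effort_deltas(runs: list[dict], baseline: str = "high") -> dict[str, dict[str, float]]:
    """Mean thinking tokens per effort level and relative delta vs the baseline.

    Each run is a hypothetical record: {"effort": str, "task": str, "usage": dict}.
    """
    by_level: dict[str, list[int]] = defaultdict(list)
    for run in runs:
        by_level[run["effort"]].append(thinking_tokens(run["usage"]))

    if baseline not in by_level:
        raise ValueError(f"no runs recorded at baseline effort {baseline!r}")

    base = mean(by_level[baseline])
    summary: dict[str, dict[str, float]] = {}
    for level, values in by_level.items():
        avg = mean(values)
        summary[level] = {
            "mean_thinking_tokens": avg,
            # Relative change vs baseline; undefined if the baseline mean is 0.
            "delta_vs_baseline": (avg - base) / base if base else float("nan"),
        }
    return summary
```

Running the same fixed task set at every level keeps the comparison paired; a delta of -0.6 for a level would mean 60% fewer thinking tokens than the baseline on those tasks.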
Verification ledger
| Number / claim | Source or local method |
|---|---|
| Fable 5 $10/$50; Opus 4.8 $5/$25; Sonnet 4.6 $3/$15; Haiku 4.5 $1/$5; cache write $12.50/5m, $20/1h, read $1 (Fable 5) | platform.claude.com/docs/en/about-claude/pricing |
| Cache multipliers 1.25x/2x/0.1x; pays off after 1 read (5m) / 2 reads (1h); "multipliers stack"; batch+cache "can be combined" | same pricing page |
| Batch = exactly 50% per model (Fable 5 $5/$25 ... Haiku $0.50/$2.50); fast-mode and Managed-Agents exclusions | same pricing page |
| 1M standard pricing + "900k-token request ... same per-token rate as a 9k-token request" | same pricing page (Long context section) |
| Fast mode: Opus 4.8 $10/$50; Opus 4.6/4.7 $30/$150; research preview; stacks with caching/data-residency; no batch | same pricing page |
| Tool-use system tax table (290/410, 675/804, 497/589, 496/588; Fable 5 absent); bash +245; text editor +700; computer use 735 & 466-499; web search $10/1k; web fetch $0, ~2,500/~25k/~125k tok examples | same pricing page |
| Code execution 1,550 free hrs/org/mo, $0.05/hr, 5-min minimum, free with web search/fetch; data-residency 1.1x; "1 token ≈ 4 characters" FAQ | same pricing page |
| Min cacheable prefix: Fable 5/Mythos 5 = 512 (Bedrock 1,024); Opus 4.8 = 1,024; 4.7 = 2,048; 4.6/4.5 = 4,096; Sonnet 4.6/4.5 = 1,024; Haiku 4.5 = 4,096; 4 breakpoints; 20-block lookback; automatic mode platforms; free TTL refresh; invalidation hierarchy | platform.claude.com/docs/en/build-with-claude/prompt-caching |
| Compaction: beta compact-2026-01-12, type compact_20260112, 150k default / 50k min, iterations example 180k+3.5k & 23k+1k (203k in / 4.5k out vs top-level 23k/1k), same-model-only, content:null tool failure, re-apply free | platform.claude.com/docs/en/build-with-claude/compaction |
| Context editing: types/dates, beta context-management-2025-06-27, 100k trigger / keep 3, cache invalidation warning, count_tokens preview (original_input_tokens), clear_thinking-first ordering, retention defaults | platform.claude.com/docs/en/build-with-claude/context-editing |
| 29% / 39% / 84% | claude.com/blog/context-management (verbatim) |
| Thinking kept+billed on Opus 4.5+/Sonnet 4.6+ ("keep them in context by default", "(input tokens)"); "Omitting reduces latency, not cost"; "billed the same"; output_tokens_details.thinking_tokens; adaptive-mode cache preservation; Fable 5 always-on thinking; budget_tokens 400s | platform.claude.com/docs/en/build-with-claude/adaptive-thinking |
| Stale "automatically stripped" text; 1M default on Fable 5; 128k max output; Foundry 200k for Opus 4.8; context-rot warning; overflow stop_reason; compaction-as-primary positioning | platform.claude.com/docs/en/build-with-claude/context-windows |
| count_tokens free; RPM 100/2,000/4,000/8,000; separate limits; estimates + system-added tokens unbilled; prior-thinking "do not count" note; "roughly 30% more tokens" + dual-count method; 403 reference output | platform.claude.com/docs/en/build-with-claude/token-counting |
| Tool search: ~55k example, "over 85%" reduction, 30-50 tool degradation, "prefix is untouched, so prompt caching is preserved", 10,000 max, 3-5 refs, <10-tools guidance, no beta header | platform.claude.com/docs/en/agents-and-tools/tool-use/tool-search-tool |
| PTC: +11%/-24% (BrowseComp+DeepSearchQA), -38% (75-tool), 20-40% production band, τ²-bench "roughly 8% more", billing exclusion quote, platform matrix, strict/MCP exclusions | platform.claude.com/docs/en/agents-and-tools/tool-use/programmatic-tool-calling |
| 77K→8.7K (85%, 95% of context preserved), 49%→74%, 79.5%→88.1%, 134K internal worst case, 43,588→27,297 (37%) | anthropic.com/engineering/advanced-tool-use |
| Effort: levels, high default = omitting, low "Significant token savings with some capability reduction", affects all tokens/tool calls, Sonnet 4.6 medium recommendation, Opus 4.8 "high on all surfaces", ultracode = xhigh + multi-agent permission, max overthinking warning, 64k max_tokens guidance, NO per-level percentages | platform.claude.com/docs/en/build-with-claude/effort |
| Structured outputs: output_config.format, 24h grammar cache, structure-only invalidation, 20 strict tools, deprecated beta header | platform.claude.com/docs/en/build-with-claude/structured-outputs (sweep verification) |
| token-efficient-tools / output-128k headers "have no effect" on Claude 4+ | platform.claude.com migration guide (sweep verification) |
| Claude Code auto-compact misfire at 84k/1M, every 1-2 min, regression v2.1.112 | github.com/anthropics/claude-code/issues/52981 |
| OpenAI: automatic ≥1,024 tok, free writes, "up to 90%" input cost / "up to 80%" latency, 5-10min→1h + 24h extended retention free, prompt_cache_key | developers.openai.com/api/docs/guides/prompt-caching (sweep verification) |
| Gemini: implicit default on 2.5+ (2,048/4,096 min), 0.1x explicit reads ($0.125-$0.20 vs $1.25-$2.00), storage $1.00/$4.50 per MTok/hr, batch 50% | ai.google.dev/gemini-api/docs/caching + /pricing (sweep verification) |
| Tokenizer deltas +37.9% prose / +27.7% Rust / +8.9% JSON; Fable 5 ≡ Opus 4.8; Sonnet 4.6 ≡ Haiku 4.5; 2.81 vs 3.88 chars/tok | local measurement, /tmp/ct.py vs count_tokens API (samples: repo README/stories.rs/8x get_weather JSON) |
| Tool overhead: 403/19 (Fable 5 & Opus 4.8), 793/24 (4.7), 587/16 (Sonnet 4.6), 586/16 (Haiku 4.5) | local measurement, /tmp/ct.py with --tools (exact match to docs' 403 example) |
| count_tokens rejects fabricated thinking signatures (400) | local measurement, /tmp/ct.py |
| 54.8% thinking share; 0.44/6.73/92.83% prompt mix; 32/29/20/17/2% dollar split; ~$22/day; 1,420 vs ~60 tok MCP schemas; caveman/wenyan cuts | local Phase-0 measurements (see 01-economics-and-measurement.md, 02-baseline-audit.md) |
| ~$85 vs ~$22 no-cache counterfactual; 0.05x batched cache read; 4.3-4.6x effective Fable:Sonnet ratio; 150k-vs-1M ≈ 85% cap; 10% session cut per 50% thinking reduction | ESTIMATE — arithmetic shown inline on the modeled session profile |
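The ESTIMATE row can be re-derived from the modeled session profile in the Method paragraph and the Fable 5 prices in the first ledger row. A minimal sketch, under three assumptions that are not stated in the source: the profile figures are per-session totals, all cache writes use the 5-minute rate, and the no-cache counterfactual bills every input token at the base input rate. Class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """USD per million tokens (Fable 5 list prices from the ledger)."""
    input: float = 10.0
    output: float = 50.0
    cache_write_5m: float = 12.50
    cache_read: float = 1.0


@dataclass(frozen=True)
class Session:
    """Modeled per-session token totals from the Method paragraph."""
    uncached_in: int = 5_500
    cache_write: int = 85_000
    cache_read: int = 1_170_000
    output: int = 27_000


def usd(tokens: int, price_per_mtok: float) -> float:
    return tokens / 1_000_000 * price_per_mtok


def cost_with_cache(s: Session, p: Prices) -> float:
    return (usd(s.uncached_in, p.input) + usd(s.cache_write, p.cache_write_5m)
            + usd(s.cache_read, p.cache_read) + usd(s.output, p.output))


def cost_without_cache(s: Session, p: Prices) -> float:
    # Counterfactual: every input token is billed at the base input rate.
    all_input = s.uncached_in + s.cache_write + s.cache_read
    return usd(all_input, p.input) + usd(s.output, p.output)


SESSIONS_PER_DAY = 6
THINKING_SHARE_OF_OUTPUT = 0.55  # "27k out @ 55% thinking"

s, p = Session(), Prices()
day_with_cache = SESSIONS_PER_DAY * cost_with_cache(s, p)
day_without_cache = SESSIONS_PER_DAY * cost_without_cache(s, p)
thinking_dollar_share = (THINKING_SHARE_OF_OUTPUT * usd(s.output, p.output)
                         / cost_with_cache(s, p))

print(f"with cache:    ${day_with_cache:.1f}/day")
print(f"without cache: ${day_without_cache:.1f}/day")
print(f"thinking share of dollars: {thinking_dollar_share:.1%}")
print(f"session cut from halving thinking: {thinking_dollar_share / 2:.1%}")
```

Under these assumptions the sketch lands at roughly $21.8/day with caching and roughly $83.7/day without, close to the ~$22 and ~$85 figures, and puts thinking at about 20% of dollars, so halving thinking tokens trims about 10% of session cost.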