jackin'
ResearchToken Optimization Research

18 — Provider-level features, current and beta

18 — Provider-level features, current and beta

TL;DR

  • Prompt caching is the lever already running: 0.1x reads / 1.25x (5m) or 2x (1h) writes, free TTL refresh on every hit, and Fable 5 now has the LOWEST minimum cacheable prefix ever shipped (512 tokens vs 4,096 on Opus 4.6/Haiku 4.5). Without the 0.1x read rate, the modeled heavy day (~$22) would cost ~$85 (ESTIMATE, arithmetic below). All T1, verified live.
  • The two biggest published-number levers for agent workloads: context editing (vendor: 84% token cut AND +29% task performance; with memory tool +39%) and tool search/defer_loading (vendor: >85% tool-context cut, 77K→8.7K tokens, accuracy 49%→74% on Opus 4; locally reproduced at ~96% on 11 MCP schemas). Both are SDK-side; neither is a Claude Code knob.
  • The 1M-context premium is dead: Fable 5/Opus 4.8-4.6/Sonnet 4.6 bill flat per-token across the window ("A 900k-token request is billed at the same per-token rate as a 9k-token request"), and 1M is the DEFAULT on Fable 5 — every 2025 cost model with a 200k pricing cliff is wrong.
  • Thinking spend (54.8% of output tokens, ~20% of dollars in the local max-effort session) has exactly one API control on Fable 5: output_config.effort. Anthropic publishes NO per-level percentages — only qualitative guidance — so the magnitude is the dossier's biggest open measurement (T4).
  • Two of Anthropic's own live pages still say prior-turn thinking is "automatically stripped"; the adaptive-thinking and context-editing pages say Opus 4.5+/Sonnet 4.6+/Fable 5 keep ALL prior thinking turns in context and bill them as input. The contradiction is live and worth real money on long sessions.

Method: every doc claim re-fetched live from platform.claude.com (docs.claude.com redirects there) / claude.com / anthropic.com / github.com; cheap-to-check token claims reproduced locally via the free count_tokens harness (/tmp/ct.py, OAuth, exact-model counts). Modeled session profile for arithmetic (local Phase-0 measurement): heavy day ≈ 6 sessions × (19 calls, 5.5k uncached in, 85k cache-write, 1.17M cache-read, 27k out @ 55% thinking) ≈ $22/day at Fable 5 list prices — internally consistent with the live price table to within 1%.

Where the local dollars go, and which provider lever attacks each bucket

Bucket (local split of ~$22/day)$/dayProvider levers
Cache reads 32%~$7.0compaction (caps re-read context), context editing, tool search (smaller prefix), 1M discipline
Cache writes 29%~$6.4compaction (shorter prefixes re-written), avoiding cache busts (defer_loading, stable thinking mode)
Thinking output 20%~$4.4effort (only lever on Fable 5), per-message steering
Visible output 17%~$3.7effort (affects ALL response tokens), see 15-output-discipline.md
Uncached input 2%~$0.4nothing needed — caching already absorbs it

Local measurements (count_tokens API, method: /tmp/ct.py)

Tokenizer comparison — identical text counted per model (samples: 2.0kB markdown prose from this repo, 2.0kB Rust, 2.9kB JSON tool schemas):

SamplecharsFable 5Opus 4.8Sonnet 4.6Haiku 4.5Fable vs Sonnetchars/tok (Fable / Sonnet)
prose1,985706706512512+37.9%2.81 / 3.88
code (Rust)2,000806806631631+27.7%2.48 / 3.17
JSON schema2,898969969890890+8.9%2.99 / 3.26

Fable 5 ≡ Opus 4.8 and Sonnet 4.6 ≡ Haiku 4.5 byte-for-byte on every sample — two tokenizer families confirmed. The +27.7% on this Rust sample is above the Phase-0 ~15% code figure (different sample) but inside the vendor's "up to 35%" band; structured JSON pays the smallest penalty.

Tool-use overhead — docs' reference request (get_weather tool, "What's the weather like in San Francisco?"):

Modelwith toolno tooldeltadocumented system tax (auto/none)
Fable 540319384absent from pricing table
Opus 4.840319384290
Opus 4.779324769675
Sonnet 4.658716571497
Haiku 4.558616570496

The 403 is an EXACT reproduction of the token-counting page's published example output, and Fable 5 matches Opus 4.8 identically — pinning Fable 5's undocumented tool-use system tax at the lean 290-token Opus 4.8 prompt, not 4.7's 675 (a 57% quiet drop between 4.7 and 4.8).

One negative result: count_tokens REJECTS fabricated thinking-block signatures (400: Invalid signature in thinking block, reproduced on Fable 5 and Sonnet 4.6) — so the prior-turn-thinking billing question cannot be settled with the free harness alone; it needs a real signed transcript (see record 5 and Gaps).

Technique records

1. Prompt caching (incl. automatic mode) — the lever already paying for itself

  • Layer: cache / input.
  • Mechanism: Opt-in cache_control. Automatic mode: one top-level field, system places/moves a single breakpoint on the last cacheable block (Claude API, Claude Platform on AWS, MS Foundry beta — NOT Bedrock/Vertex). Explicit mode: up to 4 breakpoints, 20-block lookback per breakpoint. Writes 1.25x (5m TTL) / 2x (1h); reads 0.1x; "The cache is refreshed for no additional cost each time the cached content is used." Min cacheable prefix (live table): Fable 5/Mythos 5 = 512 (1,024 on Bedrock), Opus 4.8 = 1,024, Opus 4.7 = 2,048, Opus 4.6/4.5 = 4,096, Sonnet 4.6/4.5 = 1,024, Haiku 4.5 = 4,096. Invalidation hierarchy tools→system→messages: tool-definition edits bust everything; tool_choice/images/thinking-parameter changes bust messages only.
  • Expected savings: already realized in the baseline: local heavy session prompt mix is 0.44% uncached / 6.73% cache-write / 92.83% cache-read; reads+writes = 61% of dollars. ESTIMATE counterfactual: 7.02M cache-read tok/day at 1.0x instead of 0.1x = $70.2 vs $7.0 → day total ~$85 vs ~$22 (~3.9x). Pays off after 1 read (5m) or 2 reads (1h) — pricing page's own arithmetic.
  • Evidence tier: T1 (live caching + pricing pages) + local baseline measurement.
  • Quality risk: NEUTRAL — identical model output; pure price arbitrage. Falsification: none needed; verify via usage.cache_read_input_tokens > 0.
  • Availability: CLAUDE-CODE-TODAY (applied automatically; the 92.83% read share is the proof) + SDK for breakpoint/TTL control.
  • Effort to adopt: zero in Claude Code; minutes in API code. 1h TTL only wins when request gaps exceed 5 min — free refresh makes 5m TTL sufficient for active sessions.
  • Composability: multipliers stack with Batch 50% and data-residency 1.1x ("These multipliers stack with other pricing modifiers"). Tension: context editing/compaction invalidate cache at the edit point; defer_loading and stable adaptive thinking preserve it. See 13-caching-exploitation.md.
  • Validation protocol: for any session, assert cache_read_input_tokens / (input + cache_creation + cache_read) > 0.85 across calls 2..N; alert on zero reads with identical prefixes (silent invalidator). Distinguish 5m vs 1h writes via the per-TTL cache_creation breakdown fields in usage.

2. Server-side compaction (compact_20260112) — API-native auto-summarization

  • Layer: turn-structure / context lifecycle.
  • Mechanism: Beta anthropic-beta: compact-2026-01-12, edit {"type": "compact_20260112"} under context_management. Default trigger 150,000 input tokens (minimum 50,000). API emits a compaction block; on later requests everything before it is dropped server-side. Optional pause_after_compaction and custom instructions (REPLACE the default prompt entirely). Billing: summarization is a separate iteration — doc example: compaction 180,000 in + 3,500 out, message 23,000 in + 1,000 out → 203k in + 4.5k out total, while top-level usage shows only 23k/1k (~88% of the turn invisible to naive dashboards). Re-applying an existing compaction block costs nothing. Models: Fable 5, Mythos 5/Preview, Opus 4.8/4.7/4.6, Sonnet 4.6. Summarization runs on the SAME model — "There is no option to use a different (for example, cheaper) model"; with tools defined the model may call a tool instead of summarizing (content: null) unless instructions forbid it.
  • Expected savings: no vendor savings/quality numbers published (T4 for magnitude). ESTIMATE: capping per-turn re-read context at 150k instead of growing toward 1M caps the cache-read line at ~15% of worst case ($0.15 vs $1.00 per Fable 5 turn at 0.1x); on the local profile the cache-read bucket (32% of dollars) scales with retained context.
  • Evidence tier: T1 mechanics/billing (live compaction page); T4 net savings.
  • Quality risk: QUALITY-TRADE — summary is lossy by construction; zero published evals. Degradation: forgotten constraints/decisions after the boundary. Falsification: continue 20 real coding tasks from compacted vs full history; diff task success and re-asked-question rate.
  • Availability: SDK (Messages API beta). NOT a Claude Code knob — Claude Code ships its own client-side auto-compact, a separate implementation with at least one live mis-trigger regression: auto-compact firing at 84k/1M (8% usage) every 1-2 min in v2.1.112 (github.com/anthropics/claude-code/issues/52981).
  • Effort to adopt: hours (one edits entry + append full response.content so compaction blocks persist; track cost via usage.iterations[], populated only on turns that trigger a new compaction).
  • Composability: pairs with memory tool (state survives the boundary); put a cache breakpoint at end of system prompt so compactions don't bust the system prefix; compaction blocks accept cache_control. Each compaction = one cache re-write of the new shorter prefix.
  • Validation protocol: A/B a 500k-token scripted agent run, compaction-at-150k vs raw growth: compare total $ (summed over iterations[]), task completion, and factual-recall probes referencing pre-boundary content.

3. Context editing: clear_tool_uses + clear_thinking — the rare negative-cost lever (vendor-claimed)

  • Layer: turn-structure / context lifecycle.
  • Mechanism: Beta context-management-2025-06-27. clear_tool_uses_20250919: clears oldest tool results past trigger (defaults: 100,000-token trigger, keep 3 most-recent tool uses; optional clear_at_least / exclude_tools / clear_tool_inputs). clear_thinking_20251015: keep last N thinking turns; must be listed FIRST when combined. Applied server-side pre-inference; client keeps full history. count_tokens previews the effect (context_management.original_input_tokens vs post-edit input_tokens).
  • Expected savings: vendor (claude.com/blog/context-management, verbatim): "Context editing alone delivered a 29% improvement"; "combining the memory tool with context editing improved performance by 39%"; 100-turn web-search eval "reducing token consumption by 84%". On the local profile an 84%-class cut applies to the 61%-of-dollars cache block → ESTIMATE up to ~$11/day ceiling if it transferred fully to coding (unproven).
  • Evidence tier: T1 shipped + vendor-published numbers; NO independent replication found — treat magnitude as vendor-internal (web-search eval, not coding).
  • Quality risk: NEGATIVE-COST per vendor (tokens down AND +29% performance — stale tool results are anti-signal); treat as RISKY on coding until replicated, since cleared results are unrecoverable without memory. Falsification: replicate on Read/Grep-heavy coding tasks; check for re-reading of cleared files.
  • Availability: SDK (also Bedrock/Vertex per launch post). Not a Claude Code knob.
  • Effort to adopt: hours — one context_management block; tune clear_at_least so each cache bust buys a big clear.
  • Composability: DIRECT CACHE TENSION (live page): clearing "invalidates cached prompt prefixes"; kept thinking preserves cache. Integrates with memory tool. Anti-synergy with compaction if both fire on the same window — pick one primary strategy (docs position compaction as primary).
  • Validation protocol: fixed 30-task coding suite, edits on/off: compare success rate, $/task (cache-write churn included), and count_tokens previews vs realized billing.

4. Effort parameter + adaptive thinking — the only thinking-spend control on Fable 5

  • Layer: output.
  • Mechanism: output_config: {effort: low|medium|high|xhigh|max}, GA, no beta header; high default ("exactly the same behavior as omitting the parameter"). Affects ALL response tokens: thinking, tool-call count, preamble ("lower effort would mean Claude makes fewer tool calls"). max on Fable 5/Mythos/Opus 4.8/4.7/4.6/Sonnet 4.6; xhigh only Fable 5/Mythos 5/Opus 4.8/4.7. On Fable 5 thinking is always-on ({type:"disabled"} rejected); budget_tokens 400s on Opus 4.8/4.7, deprecated on 4.6/Sonnet 4.6. max_tokens is the hard cap (docs: start 64k at xhigh/max). Claude Code: default high (effort page: Opus 4.8 "default is high on all surfaces, including the Claude API and Claude Code"); "ultracode" is NOT an API level — it is xhigh + standing multi-agent permission injected via mid-conversation system messages. Observability: usage.output_tokens_details.thinking_tokens splits billed reasoning from response (strictly better than the Phase-0 subtraction method). display: omitted/summarized never changes billing.
  • Expected savings: NO published per-level percentages anywhere (confirmed by full-page read) — only qualitative ("low: Significant token savings with some capability reduction"; Sonnet 4.6 recommended default medium). Local anchor: thinking = 54.8% of output tokens at max effort (n=1) and output = 37% of session dollars → ESTIMATE a 50% thinking cut ≈ 10% of session cost (0.5 × 20%), plus second-order input savings from fewer tool calls.
  • Evidence tier: T1 mechanics; T4 magnitude per level (vendor publishes none — measure locally).
  • Quality risk: QUALITY-TRADE below high (docs: "accepting some reduction in capability"; at low/medium Opus 4.7+ "scopes its work to what was asked"). Conversely dropping max→high can be NEGATIVE-COST: docs warn max "can lead to overthinking" on structured-output / less intelligence-sensitive tasks.
  • Availability: CLAUDE-CODE-TODAY (effort menu) + SDK.
  • Effort to adopt: minutes (menu item / one field); the real work is the eval.
  • Composability: consecutive adaptive-mode requests preserve cache; SWITCHING thinking mode busts message cache (system/tools survive). Per-message steering ("Answer directly without deliberating.") works with no parameter change at all → zero cache impact.
  • Validation protocol: fixed Claude Code task set at low/medium/high/xhigh; read output_tokens_details.thinking_tokens per run; report tokens + pass-rate per level. This closes the dossier's biggest missing number and re-bases the 54.8% n=1. See 31-validation-harness.md.

5. Prior-turn thinking retention — the hidden input-side cost of reasoning models

  • Layer: input / context lifecycle.
  • Mechanism: Adaptive-thinking page (verbatim): "whether the API keeps or strips them depends on the model: Opus 4.5+ and Sonnet 4.6+ keep them in context by default; earlier Opus/Sonnet models and all Haiku models strip them"; pricing section bills "Thinking blocks from prior assistant turns kept in context: ... all turns by default on Opus 4.5+ and Sonnet 4.6+ (input tokens)". Reclaim via clear_thinking_20251015 (keep: N). LIVE CONTRADICTION (both pages): context-windows page still says "previous thinking blocks are automatically stripped from the context window calculation by the Claude API"; token-counting page says prior-turn thinking "do not count toward your input tokens" (plausibly true for the counting endpoint specifically, while create keeps and bills — but the context-windows text is flatly stale).
  • Expected savings: ESTIMATE: at the local 54.8% thinking share, kept thinking re-enters every later prompt; at 0.1x cache-read it is cheap per token but compounds the 61%-of-dollars cache block over 50-turn sessions. clear_thinking keep:1-2 reclaims most of it at the cost of one cache bust.
  • Evidence tier: T1 for retention behavior (two live pages agree, two disagree); T4 dollar magnitude. Local probe blocked: count_tokens validates signatures (400 on fabricated blocks, reproduced) — needs a real signed transcript.
  • Quality risk: retention presumably aids reasoning continuity; aggressive clearing is RISKY for long multi-step chains, NEUTRAL for fresh-task-per-turn usage. Exception that can never be cleared: the thinking block of an in-flight tool_use cycle must be returned.
  • Availability: behavior affects CLAUDE-CODE-TODAY sessions implicitly; the clear_thinking control is SDK-only.
  • Effort to adopt: hours (one edits entry) once on the SDK.
  • Composability: kept thinking = stable cache-friendly prefix; clearing busts cache at the clear point. Interacts with effort (less thinking generated → less retained).
  • Validation protocol: capture a real multi-turn Fable 5 transcript (signed blocks); count_tokens it with and without prior thinking passed back AND diff usage.input_tokens on live create calls — resolves the docs contradiction empirically and prices the retained share.

6. Tool search tool + defer_loading — stop paying for unused tool schemas

  • Layer: input (prefix).
  • Mechanism: Two server-side variants, no beta header: tool_search_tool_regex_20251119 (Python re.search, ≤200 chars) and tool_search_tool_bm25_20251119 (natural language). defer_loading: true tools are excluded from the system-prompt prefix; on discovery the API appends a tool_reference block and expands it — verbatim: "The prefix is untouched, so prompt caching is preserved." 3-5 references per search; ≤10,000 tools; expanded references persist across turns. MCP defers via mcp_toolset. Incompatible with tool-use examples; strict-mode grammar builds from the full toolset (no recompilation).
  • Expected savings: docs: typical 5-server MCP setup "~55k tokens in definitions"; tool search "typically reduces this by over 85%"; accuracy "degrades significantly once you exceed 30-50 available tools". Engineering post: 77K→8.7K upfront ("85% reduction... preserving 95% of context window"); MCP-eval accuracy Opus 4 49%→74%, Opus 4.5 79.5%→88.1%; internal worst case 134K tokens of definitions. Local ground truth: 11 MCP schemas = 1,420 tok loaded vs ~60 tok deferred ≈ 96% cut (Phase-0) — direction and magnitude consistent with vendor.
  • Evidence tier: T1 (vendor-published + locally reproduced).
  • Quality risk: NEGATIVE-COST for big catalogs (tokens down, accuracy up); NEUTRAL-to-negative under 10 tools / <100 schema tokens (docs' own threshold) where search round-trips just add latency.
  • Availability: CLAUDE-CODE-TODAY (this session's harness runs 37 deferred tools behind a ToolSearch tool) + SDK.
  • Effort to adopt: minutes-hours: flag per tool + include the search tool; keep the top 3-5 tools non-deferred.
  • Composability: the standout property is cache preservation — most context surgery busts cache, this does not. Works in Batch at normal pricing. Composes with strict mode without grammar recompilation.
  • Validation protocol: measure per-search overhead (server_tool_use block + expanded definitions) vs schemas saved on your catalog: count_tokens the request with all tools loaded vs deferred + N searches; assert net saving and unchanged task pass-rate.

7. Programmatic tool calling (PTC) — tool results that never enter context

  • Layer: input (tool-result volume).
  • Mechanism: Requires code_execution_20260120; tools opt in via allowed_callers: ["code_execution_20260120"] (guidance, "not a security boundary"). Claude writes a script in the sandbox; each in-script tool call pauses execution, client supplies tool_result, script resumes. Verbatim: "Tool results from programmatic invocations do not count toward your input/output token usage. Only the final code execution result and Claude's response count." Platforms: Claude API, Claude Platform on AWS, MS Foundry — NOT Bedrock/Vertex. Exclusions: MCP-connector tools, strict: true tools.
  • Expected savings: vendor (all verbatim): BrowseComp+DeepSearchQA "improved performance by an average of 11% while using 24% fewer input tokens"; 75-tool benchmark "reduced billed input tokens by roughly 38% with no change in task accuracy"; production traffic 10-49 tools "typical token savings of 20% to 40%". NEGATIVE RESULT, same page: τ²-bench "left scores unchanged and cost roughly 8% more. Sequential single-call workflows do not benefit." Engineering post: 43,588→27,297 tok (37%) on complex research.
  • Evidence tier: T1 (multiple vendor benchmarks incl. an honest negative).
  • Quality risk: NEGATIVE-COST on fan-out/filter shapes; RISKY (+8% cost, no gain) on sequential reason-per-call shapes — which is exactly the Read/Edit/Bash profile of most coding-agent turns. Do not assume the 20-40% band transfers to Claude Code-like traffic.
  • Availability: SDK (server-side containers). Not how Claude Code runs tools; Claude Code subagents achieve a similar result-isolation effect client-side.
  • Effort to adopt: days — enable code execution, tag tools, implement the pause/resume protocol.
  • Composability: batch-compatible at batch pricing. Container costs: 1,550 free hours/org/month, then $0.05/hr (5-min minimum), FREE when web_search_20260209/web_fetch_20260209 is in the request. Conflicts with strict tools and MCP connector.
  • Validation protocol: replay a week of real tool traffic shapes through PTC vs direct; segment by fan-out width; adopt only for shapes where billed input drops with flat accuracy (expect the sequential segment to reproduce the τ²-bench negative).

8. Batch API — 50% off everything, stacks with caching

  • Layer: infra / pricing modifier.
  • Mechanism: Async Messages Batches; live table is exactly 50% on input AND output for every active model (Fable 5 $5/$25, Opus 4.8 $2.50/$12.50, Sonnet 4.6 $1.50/$7.50, Haiku 4.5 $0.50/$2.50). FAQ verbatim: "Batch API and prompt caching discounts can be combined"; multipliers stack → batched cache read = 0.5 × 0.1 = 0.05x base input (ESTIMATE arithmetic from stated stacking). 1M requests batch at the same rates; tool search batches at normal pricing. Excluded: fast mode, Managed Agents sessions.
  • Expected savings: exactly 50% of whatever moves to batch — the only lever touching all five local dollar buckets at once, including the 37%-of-dollars output side.
  • Evidence tier: T1 (pricing page). SDK docs: most batches < 1h, max 24h (cached skill reference) — re-verify turnaround on your own workload.
  • Quality risk: NEUTRAL on quality; cost is latency (results within 24h).
  • Availability: SDK only — useless for interactive Claude Code; ideal for nightly review sweeps, eval runs, offline subagent fan-outs.
  • Effort to adopt: days (restructure offline workloads into batch submissions).
  • Composability: stacks with caching and 1M context; mutually exclusive with fast mode and interactivity.
  • Validation protocol: move one nightly workload (e.g. repo-wide code-review sweep) to batch; confirm 0.5x on the invoice line and measure realized turnaround distribution.

9. 1M context at standard pricing — a cost-model correction, not a saving

  • Layer: infra / pricing model.
  • Mechanism: Pricing page verbatim: Fable 5, Mythos 5/Preview, Opus 4.8/4.7/4.6, Sonnet 4.6 "include the full 1M token context window at standard pricing. (A 900k-token request is billed at the same per-token rate as a 9k-token request.)" Caching and batch apply at standard rates across the window. Context-windows page: 1M is the DEFAULT on Fable 5; 128k max output; Opus 4.8 is 200k on MS Foundry only; Sonnet 4.5 and older stay 200k. Overflow on 4.5+: request accepted, generation stops with stop_reason: "model_context_window_exceeded".
  • Expected savings: none directly — kills the 2025-era >200k premium (2x in / 1.5x out) from every old cost model. Flat pricing means bloat is LINEARLY expensive with no cliff: a full 1M Fable 5 prompt = $10 uncached / $1 cached per request (ESTIMATE from list prices).
  • Evidence tier: T1.
  • Quality risk: RISKY if treated as free headroom — the same page warns "more context isn't automatically better. As token count grows, accuracy and recall degrade, a phenomenon known as context rot." Compaction/editing stay worthwhile below the hard limit.
  • Availability: CLAUDE-CODE-TODAY (the [1m] variant / Fable 5 default) + SDK. Bedrock/Vertex 1M listed for Opus 4.8/4.7/4.6/Sonnet 4.6; Fable 5 wording covers the Claude API.
  • Effort to adopt: none (update spreadsheets, not code).
  • Composability: see records 2-3 for the quality counterweight; see 12-context-architecture.md.
  • Validation protocol: n/a for pricing; for quality, recall probes at 100k/300k/600k context on your tasks before relying on deep-window placement.

10. Tokenizer-aware model offload — cheap models also count fewer tokens

  • Layer: infra / model routing.
  • Mechanism: Pricing page: "Opus 4.7 and later use a new tokenizer... may use up to 35% more tokens for the same fixed text"; token-counting page: "roughly 30% more tokens than models before Claude Opus 4.7", with the dual-count migration method. Sonnet 4.6/Haiku 4.5 keep the pre-4.7 tokenizer — locally confirmed identical counts (table above).
  • Expected savings: offloading to Sonnet/Haiku saves twice: lower $/tok AND fewer tokens billed. Measured: +37.9% prose / +27.7% code / +8.9% JSON on Fable 5 for identical text → ESTIMATE effective input-price ratio Fable 5:Sonnet 4.6 ≈ ($10 × 1.28-1.38)/$3 ≈ 4.3-4.6x on prose/code (sticker says 3.3x); Fable 5:Haiku ≈ 13x. Same multiplier applies to context-window occupancy.
  • Evidence tier: T1 (vendor-stated band + locally reproduced).
  • Quality risk: QUALITY-TRADE per task — standard downgrade risk; route only eval-passed task classes.
  • Availability: CLAUDE-CODE-TODAY (Task subagent model field) + SDK routing.
  • Effort to adopt: minutes (subagent model selection) to hours (router rule).
  • Composability: compounds with batch (0.5x) and caching (0.1x) on the cheap model. Caveat: compaction CANNOT delegate summarization to a cheaper model (record 2) — the lever is unavailable exactly where it would shine. Model switch busts cache (caches are model-scoped) — offload at task boundaries, not mid-session.
  • Validation protocol: dual-count your real workload per content type (the docs' own method); then per-task-class A/B pass-rates Sonnet vs Fable before routing.

11. Tool-definition overhead accounting — know the fixed taxes

  • Layer: input (fixed overhead) / observability.
  • Mechanism: Pricing page itemizes: tool-use system prompt tax (auto/none vs any/tool): Opus 4.8 = 290/410, Opus 4.7 = 675/804, Opus 4.6 & Sonnet 4.6 = 497/589, Opus 4.5/Sonnet 4.5/Haiku 4.5 = 496/588 — Fable 5 ABSENT (local 403 reproduction pins it ≈ Opus 4.8's 290). Per-tool adders: bash +245; text_editor_20250429 +700; computer use +735/tool & +466-499 system. Web search $10/1k searches with results billed as input in all later turns; web fetch $0 beyond tokens (use max_content_tokens; docs: ~2,500 tok avg page, ~25k big doc page, ~125k per 500kB PDF).
  • Expected savings: micro per request (all prefix → cached after first write); the real lever is knowing web search has a per-use fee + persistent token tail while web fetch is token-only, and that schema bloat multiplies by every cache re-write. ESTIMATE: a Claude Code-like setup (290 system + 245 bash + 700 editor + ~15 schemas) ≈ 3-6k fixed prefix tokens.
  • Evidence tier: T1 (published table) + local exact reproduction (403).
  • Quality risk: NEUTRAL (accounting, not behavior).
  • Availability: CLAUDE-CODE-TODAY / SDK.
  • Effort to adopt: none — measurement only.
  • Composability: caching absorbs all of it after first write; defer_loading removes the schema share but not the system tax.
  • Validation protocol: count_tokens with/without each tool on your exact toolset (script in this file's method section reproduces it in ~10 API calls, $0).

12. count_tokens — free measurement infrastructure

  • Layer: infra / observability.
  • Mechanism: POST /v1/messages/count_tokens. "Token counting is free to use"; RPM by tier: 1→100, 2→2,000, 3→4,000, 4→8,000; "Token counting and message creation have separate and independent rate limits." Counts are estimates; "You are not billed for system-added tokens." Supports system, tools, images, PDFs, thinking and context_management previews (record 3) — you can price a context-editing config before running it. Caveat found locally: it validates thinking-block signatures (400 on fabricated ones) and per its FAQ never exercises caching logic.
  • Expected savings: zero direct; enables every other measurement. This file verified three doc claims with it at $0.
  • Evidence tier: T1 + local use.
  • Quality risk: NEUTRAL.
  • Availability: SDK (and trivially scriptable from Claude Code, as done here — /tmp/ct.py).
  • Effort to adopt: minutes.
  • Composability: pairs with context_management previews and dual-model tokenizer audits.
  • Validation protocol: n/a — it IS the validation tool; sanity-check against usage on one live call.

13. Structured outputs / strict tool use — retry elimination, not a discount

  • Layer: output (reliability).
  • Mechanism: output_config.format (json_schema) + strict: true tools (old beta header deprecated). Grammar compiles on first request, cached 24h from last use; only schema STRUCTURE changes invalidate it. Limits: 20 strict tools/request; adds a small schema-explanation system injection. Incompatible with PTC (strict tools can't be code-called) and citations.
  • Expected savings: no vendor numbers. ESTIMATE: retry-on-malformed-JSON pipelines spend ≈ 1/(1-p) of base; eliminating p=10% saves ~9% of pipeline spend in 5x-priced output tokens (T4, workload-dependent).
  • Evidence tier: T1 mechanics; T4 magnitude.
  • Quality risk: NEUTRAL-to-positive (guaranteed parseable output).
  • Availability: SDK (Claude Code's own StructuredOutput tool is harness-level, not this feature).
  • Effort to adopt: hours.
  • Composability: composes with defer_loading without grammar recompilation; conflicts with PTC.
  • Validation protocol: measure malformed-output rate and output token count, prompted-JSON vs constrained, on 200 extraction calls.

14. Cross-provider triangulation — OpenAI and Gemini caching baselines

  • Layer: competitive baseline (GATEWAY-OR-SELF-HOST from a Claude Code perspective).
  • Mechanism: OpenAI (developers.openai.com): caching automatic at ≥1,024 tokens, writes free, "can reduce latency by up to 80% and input token costs by up to 90%", in-memory retention 5-10 min up to 1h, extended retention up to 24h at no extra charge, optional prompt_cache_key routing — best-effort, no guaranteed hit rate. Gemini (ai.google.dev): implicit caching default on 2.5+ (min 2,048-4,096 tok), explicit caches billed at ~0.1x reads (3.5 Flash $0.15 vs $1.50; 2.5 Pro $0.125 vs $1.25) PLUS storage $1.00 (Flash) / $4.50 (Pro) per MTok per HOUR; batch 50% — matching Anthropic exactly.
  • Expected savings: No direct Claude Code saving; this is a comparative baseline. Anthropic's model (paid writes, guaranteed 0.1x reads, free TTL refresh, $0 storage) wins for bursty interactive sessions; Gemini's storage-rent model only wins at sustained traffic (ESTIMATE: a 500k-token explicit cache on a Pro-class model costs $2.25/hr in rent before any read); OpenAI is cheapest when its best-effort hit rate cooperates. Convergence: 0.1x-class reads and 50% batch discounts look like settled industry constants as of mid-2026.
  • Evidence tier: T1.
  • Quality risk: NEUTRAL.
  • Availability: different providers.
  • Effort to adopt: n/a.
  • Composability: n/a — context for negotiation/architecture choices.
  • Validation protocol: n/a (baseline only; not a recommendation to switch).

Claims to kill (folklore checked against live sources)

  • KILL "1M context costs a premium above 200k" — every current 1M model bills flat ("900k... same per-token rate as a 9k-token request"). The premium was real only for the 2025 context-1m-2025-08-07 beta era.
  • KILL "token-efficient-tools-2025-02-19 header cuts tool output up to 70%" — migration guide: the header "has no effect" on Claude 4+; built in.
  • KILL "hide/omit thinking to save money" — "Omitting reduces latency, not cost"; "Under every setting, thinking happens and is billed the same." Corollary: visible thinking tokens ≠ billed thinking tokens; use output_tokens_details.thinking_tokens.
  • KILL (nuanced) "prior-turn thinking is always stripped, so reasoning never costs input" — false on Opus 4.5+/Sonnet 4.6+/Fable 5 (kept, billed as input). Two Anthropic pages (context-windows, token-counting) still propagate the dead claim.
  • KILL "cap Fable 5 thinking with budget_tokens / disable thinking" — disabled rejected on Fable 5; budget_tokens 400s on Opus 4.8/4.7. Only effort, prompt steering, max_tokens remain.
  • KILL "batch doesn't combine with caching" — FAQ: "Batch API and prompt caching discounts can be combined" (≈0.05x batched cache read).
  • KILL "PTC always saves tokens" — vendor's own τ²-bench result: "cost roughly 8% more" on sequential workflows.
  • KILL "Anthropic caching needs manual breakpoints" — automatic top-level mode exists (Claude API / Claude Platform on AWS / MS Foundry beta; still NOT Bedrock/Vertex).
  • KILL "Anthropic caching needs 1,024+ token prefixes" — wrong in both directions: Fable 5 = 512, but Haiku 4.5 and Opus 4.6/4.5 = 4,096.
  • KILL "OpenAI caching = 50% off" — current guide says up to 90%; 50% is the GPT-4o-era figure.
  • FLAG (contradicts local measurement): pricing-FAQ heuristic "1 token ≈ 4 characters" of English — measured 2.81 chars/tok on Fable 5 markdown prose (3.88 on Sonnet 4.6). Roughly right for the old tokenizer, undercounts Fable 5 token volume by ~30-40% on prose.

Gaps — ranked next measurements

  1. Per-effort-level token deltas (the only lever on 20% of dollars; vendor publishes nothing). Cheap: fixed task set × 4 levels, read output_tokens_details.thinking_tokens.
  2. Prior-turn thinking retention billing — docs contradict; local probe blocked by signature validation. Needs one real signed transcript diffed via count_tokens + live usage.input_tokens.
  3. Compaction quality/savings — zero published numbers, same-model-only summarization, and Claude Code's separate client-side auto-compact has a live mis-trigger regression (issue #52981). Nobody has benchmarked server-compact-at-150k vs Claude Code auto-compact.
  4. Official Fable 5 row in the tool-overhead table — vendor omission; local 403 exact-match is the best current evidence.
  5. Claude Code exposure gap — effort and deferred MCP tools are surfaced; context_editing, clear_thinking, compact_20260112, memory have no Claude Code knobs, so the strongest published-number levers (84%) are SDK-only today.
  6. Independent replication — ALL headline percentages (84/39/29, >85, 38/24, 20-40) are vendor self-reported on internal evals; the τ²-bench negative result raises confidence in honesty, but no third-party replication surfaced in this sweep (T1-vendor, not T3).
  7. Fast mode as a spend-more lever (Opus 4.8 at $10/$50 — Fable 5 sticker for an Opus model; Opus 4.6/4.7 at $30/$150 = 6x; research preview, no batch, stacks with caching) deserves a quality-per-dollar comparison vs Fable 5 at high effort.

Verification ledger

Number / claimSource or local method
Fable 5 $10/$50; Opus 4.8 $5/$25; Sonnet 4.6 $3/$15; Haiku 4.5 $1/$5; cache write $12.50/5m, $20/1h, read $1 (Fable 5)platform.claude.com/docs/en/about-claude/pricing
Cache multipliers 1.25x/2x/0.1x; pays off after 1 read (5m) / 2 reads (1h); "multipliers stack"; batch+cache "can be combined"same pricing page
Batch = exactly 50% per model (Fable 5 $5/$25 ... Haiku $0.50/$2.50); fast-mode and Managed-Agents exclusionssame pricing page
1M standard pricing + "900k-token request ... same per-token rate as a 9k-token request"same pricing page (Long context section)
Fast mode: Opus 4.8 $10/$50; Opus 4.6/4.7 $30/$150; research preview; stacks with caching/data-residency; no batchsame pricing page
Tool-use system tax table (290/410, 675/804, 497/589, 496/588; Fable 5 absent); bash +245; text editor +700; computer use 735 & 466-499; web search $10/1k; web fetch $0, ~2,500/~25k/~125k tok examplessame pricing page
Code execution 1,550 free hrs/org/mo, $0.05/hr, 5-min minimum, free with web search/fetch; data-residency 1.1x; "1 token ≈ 4 characters" FAQsame pricing page
Min cacheable prefix: Fable 5/Mythos 5 = 512 (Bedrock 1,024); Opus 4.8 = 1,024; 4.7 = 2,048; 4.6/4.5 = 4,096; Sonnet 4.6/4.5 = 1,024; Haiku 4.5 = 4,096; 4 breakpoints; 20-block lookback; automatic mode platforms; free TTL refresh; invalidation hierarchyplatform.claude.com/docs/en/build-with-claude/prompt-caching
Compaction: beta compact-2026-01-12, type compact_20260112, 150k default / 50k min, iterations example 180k+3.5k & 23k+1k (203k in / 4.5k out vs top-level 23k/1k), same-model-only, content:null tool failure, re-apply freeplatform.claude.com/docs/en/build-with-claude/compaction
Context editing: types/dates, beta context-management-2025-06-27, 100k trigger / keep 3, cache invalidation warning, count_tokens preview (original_input_tokens), clear_thinking-first ordering, retention defaultsplatform.claude.com/docs/en/build-with-claude/context-editing
29% / 39% / 84%claude.com/blog/context-management (verbatim)
Thinking kept+billed on Opus 4.5+/Sonnet 4.6+ ("keep them in context by default", "(input tokens)"); "Omitting reduces latency, not cost"; "billed the same"; output_tokens_details.thinking_tokens; adaptive-mode cache preservation; Fable 5 always-on thinking; budget_tokens 400splatform.claude.com/docs/en/build-with-claude/adaptive-thinking
Stale "automatically stripped" text; 1M default on Fable 5; 128k max output; Foundry 200k for Opus 4.8; context-rot warning; overflow stop_reason; compaction-as-primary positioningplatform.claude.com/docs/en/build-with-claude/context-windows
count_tokens free; RPM 100/2,000/4,000/8,000; separate limits; estimates + system-added tokens unbilled; prior-thinking "do not count" note; "roughly 30% more tokens" + dual-count method; 403 reference outputplatform.claude.com/docs/en/build-with-claude/token-counting
Tool search: ~55k example, ">over 85%" reduction, 30-50 tool degradation, "prefix is untouched, so prompt caching is preserved", 10,000 max, 3-5 refs, <10-tools guidance, no beta headerplatform.claude.com/docs/en/agents-and-tools/tool-use/tool-search-tool
PTC: +11%/-24% (BrowseComp+DeepSearchQA), -38% (75-tool), 20-40% production band, τ²-bench "roughly 8% more", billing exclusion quote, platform matrix, strict/MCP exclusionsplatform.claude.com/docs/en/agents-and-tools/tool-use/programmatic-tool-calling
77K→8.7K (85%, 95% of context preserved), 49%→74%, 79.5%→88.1%, 134K internal worst case, 43,588→27,297 (37%)anthropic.com/engineering/advanced-tool-use
Effort: levels, high default = omitting, low "Significant token savings with some capability reduction", affects all tokens/tool calls, Sonnet 4.6 medium recommendation, Opus 4.8 "high on all surfaces", ultracode = xhigh + multi-agent permission, max overthinking warning, 64k max_tokens guidance, NO per-level percentagesplatform.claude.com/docs/en/build-with-claude/effort
Structured outputs: output_config.format, 24h grammar cache, structure-only invalidation, 20 strict tools, deprecated beta headerplatform.claude.com/docs/en/build-with-claude/structured-outputs (sweep verification)
token-efficient-tools / output-128k headers "have no effect" on Claude 4+platform.claude.com migration guide (sweep verification)
Claude Code auto-compact misfire at 84k/1M, every 1-2 min, regression v2.1.112github.com/anthropics/claude-code/issues/52981
OpenAI: automatic ≥1,024 tok, free writes, "up to 90%" input cost / "up to 80%" latency, 5-10min→1h + 24h extended retention free, prompt_cache_keydevelopers.openai.com/api/docs/guides/prompt-caching (sweep verification)
Gemini: implicit default on 2.5+ (2,048/4,096 min), 0.1x explicit reads ($0.125-$0.20 vs $1.25-$2.00), storage $1.00/$4.50 per MTok/hr, batch 50%ai.google.dev/gemini-api/docs/caching + /pricing (sweep verification)
Tokenizer deltas +37.9% prose / +27.7% Rust / +8.9% JSON; Fable 5 ≡ Opus 4.8; Sonnet 4.6 ≡ Haiku 4.5; 2.81 vs 3.88 chars/toklocal measurement, /tmp/ct.py vs count_tokens API (samples: repo README/stories.rs/8x get_weather JSON)
Tool overhead: 403/19 (Fable 5 & Opus 4.8), 793/24 (4.7), 587/16 (Sonnet 4.6), 586/16 (Haiku 4.5)local measurement, /tmp/ct.py with --tools (exact match to docs' 403 example)
count_tokens rejects fabricated thinking signatures (400)local measurement, /tmp/ct.py
54.8% thinking share; 0.44/6.73/92.83% prompt mix; 32/29/20/17/2% dollar split; ~$22/day; 1,420 vs ~60 tok MCP schemas; caveman/wenyan cutslocal Phase-0 measurements (see 01-economics-and-measurement.md, 02-baseline-audit.md)
~$85 vs ~$22 no-cache counterfactual; 0.05x batched cache read; 4.3-4.6x effective Fable:Sonnet ratio; 150k-vs-1M ≈ 85% cap; 10% session cut per 50% thinking reductionESTIMATE — arithmetic shown inline on the modeled session profile

On this page