# 16 — Model routing and tiered delegation (https://jackin.tailrocks.com/research/token-optimization/16-model-routing-and-delegation/)



# 16 — Model routing and tiered delegation [#16--model-routing-and-tiered-delegation]

**TL;DR**

* Routing is a shipped, in-product Claude Code surface with four mechanisms: per-subagent `model:` override, the built-in Explore subagent (already pinned to Haiku), the `opusplan` plan/execute alias, and the experimental advisor tool. Only the advisor has Anthropic-published end-to-end numbers: **Sonnet-main + Opus-advisor = +2.7pp SWE-bench Multilingual AND −11.9% cost per agentic task vs Sonnet alone** (T1) — escalation measured as NEGATIVE-COST.
* Tokenizer divergence multiplies downtiers only for content that actually has the premium. Fable 5/Opus 4.8 bill "roughly 30% more tokens" on English/ASCII-heavy text, but probes found code/CJK near-neutral. Fable→Haiku is **10x on code/CJK** and **\~13–14.5x on prose/markdown-heavy text**; Sonnet 4.6 and Haiku 4.5 tokenize **identically**, so that hop is the 3x price ratio only.
* Subagents are NOT automatically cheap: the documented default is `model: inherit`. One frontmatter line (`model: haiku`) cuts a 5-worker exploration fan-out from $2.75 (all-Fable) to \~$0.19–0.28 depending on content mix, a **90–93% cut** (ESTIMATE, arithmetic in technique 1).
* Route at task boundaries: a mid-session `/model` or `/effort` change invalidates the prompt cache and "re-reads the full history without cached context" (T1). ESTIMATE: on a 150K-token history the switch costs \~$0.43 vs $0.075 staying cached, so it takes **\~9 turns to break even**. Cache-safe channels: subagents, advisor, forks, opusplan-at-the-plan-boundary.
* Cost-control side: agent teams use **\~7x** the tokens of a standard session (plan-mode teammates, T1); use Sonnet — not Haiku — for teammates. External routers are weaker than they sound: RouteLLM's famous "85%" is MT-Bench-only (45% MMLU / 35% GSM8K), Martian's "20–97%" has no public methodology, and per-request gateway routing breaks Claude's model-scoped prompt cache.

## Verified price and capability baseline (all fetched) [#verified-price-and-capability-baseline-all-fetched]

| Model      | API ID              | $/MTok in / out | Context           | Max output | Tokenizer       | Quality anchor                                                                                                                      |
| ---------- | ------------------- | --------------- | ----------------- | ---------- | --------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
| Fable 5    | `claude-fable-5`    | $10 / $50       | 1M                | 128k       | new (Opus 4.7+) | "tasks larger than a single sitting" (T1 docs framing); "95% SWE-bench Verified" is a T3 aggregator headline, UNVERIFIED vs primary |
| Opus 4.8   | `claude-opus-4-8`   | $5 / $25        | 1M (\~555k words) | 128k       | new (Opus 4.7+) | Opus 4.6 = 80.8% SWE-bench Verified (T3 aggregator)                                                                                 |
| Sonnet 4.6 | `claude-sonnet-4-6` | $3 / $15        | 1M (\~750k words) | 64k        | old             | 79.6% SWE-bench Verified (T3); preferred over Opus 4.5 by 59% of Claude Code users (T1, Anthropic)                                  |
| Haiku 4.5  | `claude-haiku-4-5`  | $1 / $5         | **200k**          | 64k        | old             | 73.3% SWE-bench Verified (T1, Anthropic, 50-trial avg, 128K thinking budget)                                                        |

Pricing matches the dossier reference sheet exactly (platform.claude.com models overview). The 80.8/79.6 pair is a third-party transcription of Anthropic's announcement table (the table renders as an image) — T3 until checked against the model card. Note Haiku's 200k context: you cannot downtier a subagent whose corpus exceeds \~200k Haiku tokens.

## Local measurement: tokenizer divergence across routing tiers [#local-measurement-tokenizer-divergence-across-routing-tiers]

Method: deterministic samples piped through the free count\_tokens harness (`python3 /tmp/ct.py <model> < sample`), one user message per call. Samples: first 2,000 B of `crates/jackin-core/src/agent.rs`; first 4,096 B and full 6,366 B of this repo's <RepoFile path="AGENTS.md">AGENTS.md</RepoFile>; a fixed 528 B English paragraph; a 558 B JSON tool schema. Counts include a few tokens of message-wrapper overhead (ratios are marginally understated).

| Sample                                                      | Fable 5 | Opus 4.8 | Sonnet 4.6 | Haiku 4.5 | Fable/Sonnet ratio |
| ----------------------------------------------------------- | ------- | -------- | ---------- | --------- | ------------------ |
| Rust code, 2.0 KB                                           | 830     | 830      | 629        | 629       | 1.320x (+32.0%)    |
| <RepoFile path="AGENTS.md">AGENTS.md</RepoFile> first 4 KB  | 1,724   | 1,724    | 1,205      | 1,205     | 1.431x (+43.1%)    |
| <RepoFile path="AGENTS.md">AGENTS.md</RepoFile> full 6.4 KB | 2,744   | 2,744    | 1,919      | 1,919     | 1.430x (+43.0%)    |
| English prose, 528 B                                        | 155     | 155      | 107        | 107       | 1.449x (+44.9%)    |
| JSON tool schema, 558 B                                     | 183     | 183      | 126        | 126       | 1.452x (+45.2%)    |

Findings (all local measurement):

* Fable 5 ≡ Opus 4.8 tokenizer (5/5 identical) and Sonnet 4.6 ≡ Haiku 4.5 (5/5 identical). The divergence exists only across the Opus-4.7 tokenizer boundary.
* **The premium is not universal:** prose measures +35%, but code −3% and CJK −4%; file 11's wider battery found Python +15.6%, minified JSON +34%, English prose +59%, and SCREAMING\_SNAKE +132%. Treat the Rust rows above as sample-specific, not a code-wide rule. Use per-model `count_tokens` on your own corpus; never reuse one token count across tiers.
* Phase-0's root <RepoFile path="AGENTS.md">AGENTS.md</RepoFile> measurement (2,738 tok) reproduces as 2,744 on Fable (within 0.2%; wrapper overhead/file drift). The same always-on file is only 1,919 tokens on Sonnet/Haiku — repo instructions are 43% more expensive in tokens on Fable-tier models before the price ratio even applies.
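
The ratio column can be re-derived from the raw counts. The snippet below is a minimal sketch that uses only the numbers in the table above; the dictionary keys and the `token_ratio` helper are illustrative names, not part of the `/tmp/ct.py` harness.

```python
# Token counts copied from the measurement table above:
# (Fable 5 / Opus 4.8 count, Sonnet 4.6 / Haiku 4.5 count) for the same bytes.
MEASURED = {
    "rust_2kb": (830, 629),
    "agents_md_4kb": (1724, 1205),
    "agents_md_full": (2744, 1919),
    "english_prose": (155, 107),
    "json_tool_schema": (183, 126),
}


def token_ratio(new_tokens: int, old_tokens: int) -> float:
    """New-tokenizer tokens divided by old-tokenizer tokens for one sample."""
    return new_tokens / old_tokens


# Rounded to three decimals, matching the precision used in the table.
RATIOS = {name: round(token_ratio(new, old), 3) for name, (new, old) in MEASURED.items()}

if __name__ == "__main__":
    for name, ratio in RATIOS.items():
        print(f"{name:18s} {ratio:.3f}x")
```

Swapping in counts from your own corpus is the point: the ratios above describe these five samples only.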

Effective cost multiplier when moving a corpus DOWN a tier = price ratio × token ratio:

| Route                  | Price ratio | × measured token ratio               | Effective (code → prose)                 |
| ---------------------- | ----------- | ------------------------------------ | ---------------------------------------- |
| Fable 5 → Haiku 4.5    | 10x         | 1.0 code/CJK; \~1.3–1.45 prose/ASCII | **10x code/CJK; \~13–14.5x prose/ASCII** |
| Fable 5 → Sonnet 4.6   | 3.33x       | 1.0 code/CJK; \~1.3 prose/ASCII      | **3.33x code/CJK; \~4.3x prose/ASCII**   |
| Opus 4.8 → Haiku 4.5   | 5x          | 1.0 code/CJK; \~1.3–1.45 prose/ASCII | **5x code/CJK; \~6.5–7.3x prose/ASCII**  |
| Sonnet 4.6 → Haiku 4.5 | 3x          | 1.00 (identical)                     | 3.0x                                     |
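
As a rough sketch of the rule "price ratio × token ratio", the function below reproduces the table rows. The model labels are shorthand keys chosen for this example (not API IDs), and the uptier branch is an inference from the same rule rather than something the table states.

```python
# $/MTok input prices from the baseline table. Output prices have the same
# ratios between tiers (50/25/15/5), so input prices are enough for a ratio.
INPUT_PRICE = {"fable-5": 10.0, "opus-4.8": 5.0, "sonnet-4.6": 3.0, "haiku-4.5": 1.0}
# Models that the local measurement found on the newer tokenizer.
NEW_TOKENIZER = {"fable-5", "opus-4.8"}


def effective_multiplier(src: str, dst: str, token_ratio: float) -> float:
    """Cost of a corpus on `src` divided by its cost on `dst`.

    token_ratio is new-tokenizer tokens / old-tokenizer tokens for that corpus:
    1.0 for content with no premium, up to about 1.45 in the samples above.
    """
    price_ratio = INPUT_PRICE[src] / INPUT_PRICE[dst]
    src_new = src in NEW_TOKENIZER
    dst_new = dst in NEW_TOKENIZER
    if src_new and not dst_new:
        return price_ratio * token_ratio  # downtier across the tokenizer boundary
    if dst_new and not src_new:
        return price_ratio / token_ratio  # uptier across the boundary (inferred)
    return price_ratio  # same tokenizer on both sides: price ratio only


if __name__ == "__main__":
    routes = [("fable-5", "haiku-4.5"), ("fable-5", "sonnet-4.6"),
              ("opus-4.8", "haiku-4.5"), ("sonnet-4.6", "haiku-4.5")]
    for src, dst in routes:
        low = effective_multiplier(src, dst, 1.0)
        high = effective_multiplier(src, dst, 1.45)
        print(f"{src} -> {dst}: {low:.2f}x to {high:.2f}x")
```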

## Routing surface map (what ships today) [#routing-surface-map-what-ships-today]

| Mechanism                                                                     | Direction                       | Cache impact                         | Availability                           |
| ----------------------------------------------------------------------------- | ------------------------------- | ------------------------------------ | -------------------------------------- |
| Subagent `model:` frontmatter / per-invocation / `CLAUDE_CODE_SUBAGENT_MODEL` | delegate DOWN for volume        | parent cache untouched               | CLAUDE-CODE-TODAY                      |
| Built-in Explore subagent                                                     | down (Haiku, read-only)         | parent cache untouched               | CLAUDE-CODE-TODAY, default-on          |
| `/model opusplan`                                                             | Opus plan → Sonnet execute      | one bounded switch per plan cycle    | CLAUDE-CODE-TODAY                      |
| Advisor tool (`/advisor`)                                                     | escalate UP at decision points  | explicitly cache-safe to toggle      | CLAUDE-CODE-TODAY (experimental) / SDK |
| `/effort`, `effort:` frontmatter                                              | intra-model tiering             | mid-session change invalidates cache | CLAUDE-CODE-TODAY / SDK                |
| `--fallback-model`, Fable classifier fallback, `availableModels`              | availability/content/governance | each switch is cache-cold on target  | CLAUDE-CODE-TODAY                      |
| RouteLLM / OpenRouter auto / Martian                                          | per-request difficulty routing  | breaks model-scoped caching          | GATEWAY-OR-SELF-HOST                   |

## Techniques [#techniques]

### 1. Per-subagent model downtiering (`model: haiku` frontmatter / per-invocation / `CLAUDE_CODE_SUBAGENT_MODEL`) [#1-per-subagent-model-downtiering-model-haiku-frontmatter--per-invocation--claude_code_subagent_model]

Run exploration, log-grinding, doc-fetching, and review subagents on Haiku/Sonnet under a Fable/Opus main thread; the verbose corpus is isolated from parent context AND billed 3–14.5x cheaper depending on model pair and content mix.

* **Layer:** turn-structure + infra (subagent config)
* **Mechanism:** model resolution order (T1, sub-agents docs): (1) `CLAUDE_CODE_SUBAGENT_MODEL` env var, which covers "all subagents and agent teams" and overrides everything (`inherit` restores normal resolution); (2) per-invocation `model` parameter Claude passes; (3) frontmatter `model:` (`sonnet`/`opus`/`haiku`/`fable`, full ID, or `inherit`); (4) main conversation model. **Default is `inherit`: subagents are NOT automatically cheap.** Costs page: "For simple subagent tasks, specify `model: haiku`"; sub-agents page lists "Control costs by routing tasks to faster, cheaper models like Haiku" as a core benefit. Frontmatter also takes `effort:` (overrides session effort while active). Forked subagents (`CLAUDE_CODE_FORK_SUBAGENT=1`) reuse the parent's prompt cache but run "Same as main session" model, so forking and downtiering are mutually exclusive per task.
* **Expected savings:** ESTIMATE on a 5-worker exploration fan-out, each 40K in / 3K out in Fable tokens: all-Fable = 5×(40K×$10 + 3K×$50)/1M = **$2.75**. Haiku with no tokenizer premium = **$0.275** (−90.0%); Haiku at the official/prose 1.3x divergence = $0.21 (−92.3%); repo-markdown-like 1.43x = **$0.19 (−93.0%)**. On the modeled heavy day ($22, Fable): two such fan-outs previously done inline at parent rates ≈ $5.50 → $0.38–0.55, \~22–23% off the day IF that work was being done at all (the context-isolation win compounds it; see 12-context-architecture.md).
* **Evidence tier:** T1 for every mechanism (live docs); savings arithmetic is ESTIMATE with assumptions stated.
* **Quality risk:** NEGATIVE-COST for read-only exploration/summarization — worker prose quality barely matters and main-context hygiene improves. RISKY if the cheap model writes code: Haiku 4.5 is 73.3% SWE-bench Verified vs \~79.6% Sonnet 4.6; failed edits manifest as retry loops in the parent that eat the savings. Degradation signature: subagent summaries that miss the target file or hallucinate paths. Falsification: blind-rate 20 paired summaries (inherit vs haiku); if raters detect quality loss on read-only tasks, the NEGATIVE-COST verdict dies. Also constrained by Haiku's 200k context.
* **Availability:** CLAUDE-CODE-TODAY
* **Effort to adopt:** minutes — one frontmatter line per agent file, or one env var for a blanket policy.
* **Composability:** stacks with `effort: low` frontmatter, parent prompt caching (untouched), and caveman-style output compression (15-output-discipline.md; the cavecrew agents in this repo compose both). Anti-synergy: fork mode (inherits parent model); `CLAUDE_CODE_SUBAGENT_MODEL` is global, so it also downtiers subagents you wanted strong.
* **Validation protocol:** 20 representative exploration prompts × {`model: inherit`, `model: haiku`} on this repo; record per-run cost from Console per-model usage, blind-rate the returned summaries (did the parent get what it needed, 0–2 scale), count parent follow-up turns triggered. Accept if cost cut ≥80% and neither rating nor follow-up count regresses.
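
The fan-out estimate above can be checked with a few lines of arithmetic. This is a sketch under the same assumptions as the estimate: token counts are given in Fable tokens, and both input and output shrink by the token ratio on the older tokenizer. `fanout_cost` is a name invented for this example.

```python
def fanout_cost(workers: int, in_tokens: int, out_tokens: int,
                price_in: float, price_out: float, token_ratio: float = 1.0) -> float:
    """Dollar cost of `workers` identical subagent runs.

    in_tokens / out_tokens are counted with the parent's (Fable) tokenizer.
    token_ratio > 1.0 models the same content needing fewer tokens on the
    worker's tokenizer; prices are $/MTok.
    """
    scale = 1.0 / token_ratio
    per_worker = (in_tokens * scale * price_in + out_tokens * scale * price_out) / 1_000_000
    return workers * per_worker


ALL_FABLE = fanout_cost(5, 40_000, 3_000, 10.0, 50.0)
HAIKU_NO_PREMIUM = fanout_cost(5, 40_000, 3_000, 1.0, 5.0)
HAIKU_PROSE = fanout_cost(5, 40_000, 3_000, 1.0, 5.0, token_ratio=1.3)
HAIKU_MARKDOWN = fanout_cost(5, 40_000, 3_000, 1.0, 5.0, token_ratio=1.43)

if __name__ == "__main__":
    for label, cost in [("all-Fable", ALL_FABLE), ("Haiku, no premium", HAIKU_NO_PREMIUM),
                        ("Haiku, 1.3x", HAIKU_PROSE), ("Haiku, 1.43x", HAIKU_MARKDOWN)]:
        cut = (1 - cost / ALL_FABLE) * 100
        print(f"{label:18s} ${cost:.3f}  ({cut:.1f}% below all-Fable)")
```

The calculation ignores the parent re-reading worker summaries, which technique 8 accounts for separately.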

### 2. Built-in Explore subagent on Haiku (default-on cheap routing you already have) [#2-built-in-explore-subagent-on-haiku-default-on-cheap-routing-you-already-have]

Claude Code already routes codebase search to Haiku with zero configuration — so don't build a custom cheap-search agent; learn to trigger this one.

* **Layer:** turn-structure (built-in)
* **Mechanism:** Explore is documented as "**Model**: Haiku (fast, low-latency)", read-only tools (Write/Edit denied). "Explore and Plan skip your CLAUDE.md files and the parent session's git status to keep research fast and inexpensive" — every other subagent loads both. (Correction to earlier sweep notes: Plan also skips CLAUDE.md, but Plan inherits the main model.) Claude picks a thoroughness level per invocation (quick/medium/very thorough) — effort-like tiering inside the cheap tier. Separately, background functionality (`--resume` summarization etc.) runs on the `ANTHROPIC_DEFAULT_HAIKU_MODEL` alias and costs "typically under $0.04 per session" (T1, costs page).
* **Expected savings:** unquantified by Anthropic. The lever: exploration corpora bill at $1/$5 instead of $10/$50 and never enter the main context. On the modeled profile, prompt-side tokens are 92.8% cache reads (local phase-0); Explore keeps new exploration off that meter entirely and onto Haiku's.
* **Evidence tier:** T1 (live sub-agents + costs pages).
* **Quality risk:** NEGATIVE-COST — read-only Haiku exploration is the canonical safe downtier; the docs frame it as cost+speed with no capability caveat. Degradation signature: Explore returns wrong/missing files and the main model re-searches inline. Falsification: `permissions.deny Agent(Explore)` for a week and compare session costs + re-search frequency.
* **Availability:** CLAUDE-CODE-TODAY (default-on)
* **Effort to adopt:** zero. To lean on it: phrase requests so search precedes edits ("find where X is handled, then…").
* **Composability:** template for technique 1; composes with everything. Anti-synergy: none, but it skips CLAUDE.md — repo-specific search conventions in <RepoFile path="AGENTS.md">AGENTS.md</RepoFile> don't reach it.
* **Validation protocol:** 10 search tasks with Explore enabled vs denied; measure total session cost (/usage) and whether the main thread's first edit touches the right file. Accept the default if enabled is cheaper with equal first-edit accuracy.

### 3. `opusplan`: plan-on-Opus, execute-on-Sonnet alias [#3-opusplan-plan-on-opus-execute-on-sonnet-alias]

One alias gives frontier planning and mid-tier execution, switching exactly at the plan boundary — the simplest shipped expensive-drafts/cheap-implements split.

* **Layer:** turn-structure (/model alias)
* **Mechanism:** documented as "Special mode that uses `opus` during plan mode, then switches to `sonnet` for execution" (T1, model-config). Plan-mode research additionally delegates to the read-only Plan subagent. Caveats verified live: the plan-mode Opus phase "runs with the standard 200K context window" (the automatic 1M upgrade applies to plain `opus`, not `opusplan`); on the Anthropic API `opus`→Opus 4.8 and `sonnet`→Sonnet 4.6 (Claude Platform on AWS: Opus 4.7; Bedrock/Vertex/Foundry: Opus 4.6/Sonnet 4.5 unless pinned via `ANTHROPIC_DEFAULT_*_MODEL`).
* **Expected savings:** ESTIMATE — execution-phase tokens (edits, tool calls, test loops; usually the bulk) bill at $3/$15 instead of $5/$25 = 40% per-token cut on that phase; output-heavy phases save most since output is 5x input price. If \~70% of an Opus-main session's tokens are execution-phase (ESTIMATE), end-to-end ≈ −28% vs all-Opus. No Anthropic-published end-to-end number. Vs an all-Sonnet baseline it costs MORE (the Opus plan phase) — this is a saver only for Opus-default users.
* **Evidence tier:** T1 mechanics; ESTIMATE savings.
* **Quality risk:** NEUTRAL — planning stays on Opus; execution lands on Sonnet 4.6, which Claude Code users preferred over Opus 4.5 59% of the time (T1, Anthropic) and trails Opus 4.6 by \~1.2pp SWE-bench (T3). Degradation signature: execution-phase failures on subtle refactors that all-Opus would have caught. Falsification: same 10 tasks `/model opus` vs `/model opusplan`, compare review-pass rate.
* **Availability:** CLAUDE-CODE-TODAY
* **Effort to adopt:** minutes — `/model opusplan`.
* **Composability:** composes with plan mode (costs page recommends it to prevent "expensive re-work when the initial direction is wrong"), effort medium on the Sonnet phase, and subagent downtiering. The model switch at the boundary invalidates cache once per plan cycle — bounded, predictable (see technique 9). Anti-synergy: not for Fable users (no fableplan alias exists).
* **Validation protocol:** 10 matched tasks under `opus` vs `opusplan`; track $ via Console per-model breakdown and PR review-pass rate. Accept if cost drops ≥20% with equal pass rate.
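
The −28% figure is a two-factor product, sketched below. The 70% execution share is the ESTIMATE from the text, not a measured value, and `blended_saving` is a hypothetical helper name.

```python
def blended_saving(execution_share: float,
                   opus_price: float = 5.0, sonnet_price: float = 3.0) -> float:
    """Fractional end-to-end saving vs an all-Opus session.

    execution_share: fraction of session tokens billed in the execution phase.
    Prices are $/MTok input; the output prices ($25 vs $15) give the same
    0.6 ratio, so one pair is enough for this estimate.
    """
    if not 0.0 <= execution_share <= 1.0:
        raise ValueError("execution_share must be between 0 and 1")
    per_token_cut = 1.0 - sonnet_price / opus_price  # 40% at list prices
    return execution_share * per_token_cut


if __name__ == "__main__":
    for share in (0.5, 0.7, 0.9):
        print(f"execution share {share:.0%}: about {blended_saving(share):.0%} saved")
```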

### 4. Advisor tool: cheap executor + strong advisor at decision points (measured NEGATIVE-COST) [#4-advisor-tool-cheap-executor--strong-advisor-at-decision-points-measured-negative-cost]

Invert the orchestrator: run the session on Sonnet/Haiku and let it consult Opus/Fable only at decision points — Anthropic measured the Sonnet+Opus pairing as both cheaper and better than Sonnet alone.

* **Layer:** turn-structure (server-side tool)
* **Mechanism:** experimental server tool (Claude Code v2.1.98+; `/advisor`, `advisorModel` setting, `--advisor` flag; Anthropic API only — not Bedrock/Vertex/Foundry). Claude decides when to consult — typically before committing to an approach, on recurring errors, before declaring done; "There is no setting to cap or force advisor calls." The advisor "receives the full conversation" and bills at the advisor model's rates. Pairing matrix enforced by the API: Haiku/Sonnet mains accept Fable/Opus/Sonnet advisors; Opus 4.6+ accepts Fable or same-or-newer Opus; **Fable main accepts only Fable** (v2.1.170+). Cache mechanics (verified verbatim, advisor docs): "Unlike changing model or effort level, toggling `/advisor` keeps the cached prefix intact"; but "The advisor model's own read of the conversation is not cached. Each advisor call processes the full transcript anew."
* **Expected savings:** Anthropic-published (claude.com/blog/the-advisor-strategy): Sonnet+Opus-advisor = "**2.7 percentage point increase on SWE-bench Multilingual** over Sonnet alone, while **reducing cost per agentic task by 11.9%**"; Haiku+Opus-advisor = BrowseComp **41.2% vs 19.7%** Haiku solo, and "trails Sonnet solo by **29%** in score but costs **85% less** per task". Consult tax ESTIMATE: 150K-token transcript × $5/M = **$0.75 per Opus consult**, uncached every time: fine at a few consults, material if chatty. On the modeled Fable-main day this technique cannot cut costs (Fable-only advisor); it argues for a Sonnet-main day instead.
* **Evidence tier:** T1 — vendor-published eval (n and harness not in the post) + live tool docs.
* **Quality risk:** NEGATIVE-COST at Sonnet+Opus (measured: quality up, cost down). **QUALITY-TRADE** at Haiku+Opus (−29% score vs Sonnet solo for −85% cost). Experimental: "Behavior, pricing, and availability may change." Degradation signature: advisor consults so frequent that uncached transcript reads exceed the saved wasted-exploration; or Claude ignoring advisor guidance. Falsification: replicate on your backlog (protocol below) — if cost/task rises vs solo, the published result didn't transfer.
* **Availability:** CLAUDE-CODE-TODAY (experimental) / SDK (`advisor_20260301` + beta header)
* **Effort to adopt:** minutes — `/advisor opus` on a Sonnet main session.
* **Composability:** explicitly cache-safe, so composes with all caching techniques (13-caching-exploitation.md). Complements technique 1: advisor = escalate UP at decision points; subagents = delegate DOWN for volume. Anti-synergy: Fable-main sessions; very long transcripts (consult tax grows linearly, uncached).
* **Validation protocol:** 20 matched tasks, Sonnet-solo vs Sonnet+Opus-advisor; per-task $ from Console (split by model), task success = tests pass + review accept; also log consult count per task. Accept if cost delta ≤0 and success ≥ solo, replicating the −11.9%/+2.7pp directionally.
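
Because each advisor call re-reads the whole transcript uncached, the consult tax is a sum over the transcript size at each consult. The sketch below assumes Opus input pricing of $5/MTok and ignores the advisor's output tokens; `consult_tax` is an illustrative name.

```python
from typing import Iterable


def consult_tax(transcript_tokens_per_consult: Iterable[int],
                advisor_input_price: float = 5.0) -> float:
    """Dollar cost of the advisor's uncached transcript reads.

    transcript_tokens_per_consult: transcript length (in advisor-model tokens)
    at the moment of each consult. advisor_input_price is $/MTok.
    """
    total_tokens = sum(transcript_tokens_per_consult)
    return total_tokens * advisor_input_price / 1_000_000


if __name__ == "__main__":
    # One consult at 150K tokens, the case priced in the text.
    print(f"single consult: ${consult_tax([150_000]):.2f}")
    # Three consults as the transcript grows (sizes are made up for illustration).
    print(f"three consults: ${consult_tax([50_000, 100_000, 150_000]):.2f}")
```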

### 5. Effort-level routing as intra-model tiering (`/effort`, `effort:` frontmatter) [#5-effort-level-routing-as-intra-model-tiering-effort-effort-frontmatter]

Before changing models, change effort: the same model at lower effort thinks less, preambles less, and makes fewer tool calls — and thinking is 54.8% of output tokens at max effort (local phase-0), billed at 5x input price.

* **Layer:** output (+ turn-structure via tool-call count)
* **Mechanism:** `output_config.effort` is a behavioral signal affecting ALL tokens, not a strict budget. Levels (model-config): Fable 5/Opus 4.8/4.7 = low/medium/high/xhigh/max; Opus 4.6/Sonnet 4.6 = low/medium/high/max. Default `high` everywhere except Opus 4.7 (`xhigh`). Set via `/effort`, `--effort`, `CLAUDE_CODE_EFFORT_LEVEL`, `effortLevel`, or `effort:` frontmatter in a skill/subagent (overrides session level while active). API effort docs map `low` to delegation ("significant token savings with some capability reduction… such as subagents") and recommend **`medium` for Sonnet 4.6** as "best balance… for most applications" even though the shipped default is `high`, which makes it a sanctioned, off-by-default saver. Traps: (1) "The effort scale is calibrated per model", so level names don't transfer; (2) **changing effort mid-session invalidates the prompt cache** (advisor page, verbatim above; same as a model switch); (3) `ultrathink` is an in-context request only, "the effort level sent to the API is unchanged" ("think hard" etc. are NOT recognized keywords); `ultracode` = xhigh + dynamic-workflow orchestration, session-only.
* **Expected savings:** Anthropic publishes NO per-level percentages — only "significant token savings" at low. Bounding on the modeled day: thinking 20% + visible output 17% = 37% of spend is effort-addressable ($8.1 of $22), plus second-order input savings from fewer tool round-trips. ESTIMATE: high→medium on Sonnet plausibly cuts 10–25% of output tokens; the validation protocol below is the only way to get a real number (this is the biggest quantification hole in the area).
* **Evidence tier:** T1 mechanics; T4/ESTIMATE magnitudes; thinking share is local measurement (phase-0).
* **Quality risk:** NEUTRAL at medium for Sonnet 4.6 (Anthropic's own recommendation). **QUALITY-TRADE** at low on intelligence-sensitive work ("some capability reduction"; docs reserve low for "short, scoped, latency-sensitive tasks that are not intelligence-sensitive"). Degradation signature: under-thought edits, skipped verification. Falsification: fixed task set at high vs medium; if pass rate drops at medium, the NEUTRAL verdict dies for your workload.
* **Availability:** CLAUDE-CODE-TODAY / SDK
* **Effort to adopt:** minutes — one command or frontmatter line.
* **Composability:** orthogonal to model choice — `model: haiku` + `effort: low` in one subagent frontmatter is the maximum-downtier worker. Anti-synergy: mid-session flips (cache invalidation — batch with `/clear` or task switches, technique 9).
* **Validation protocol:** 15 fixed tasks × {high, medium, low} on Sonnet 4.6; record output tokens, tool-call count, tests-pass rate. Publish the per-level % — it would be the first public number (see 31-validation-harness.md).

### 6. Tokenizer-divergence double saving when crossing the Opus-4.7 boundary [#6-tokenizer-divergence-double-saving-when-crossing-the-opus-47-boundary]

Downtiering from Fable 5/Opus 4.8 saves more than the price sheet says; uptiering to them costs more.

* **Layer:** infra (billing accounting; affects all routing math)
* **Mechanism:** Fable 5/Opus 4.8 use the Opus-4.7 tokenizer. Official tooltip (models overview): "the same text produces roughly 30% more tokens"; context tooltips encode 1.35x (\~555k vs \~750k words per 1M). Local and follow-up measurements show the premium is content-shaped: strong for prose/ASCII identifiers, near-neutral for code/CJK anchors.
* **Expected savings:** multiplies cross-boundary downtiers only on premium-heavy content: Fable→Haiku = 10x on code/CJK and \~13–14.5x on prose/ASCII; Fable→Sonnet = 3.33x code/CJK and \~4.3x prose/ASCII; Opus 4.8→Haiku = 5x code/CJK and \~6.5–7.3x prose/ASCII. Sonnet→Haiku gets price ratio only (3x, identical tokenizers). Penalty direction: a Fable/Opus-twin advisor reading a prose-heavy long transcript bills more tokens; code-heavy transcripts need their own count.
* **Evidence tier:** T1 (official tooltip) + local measurement (method above).
* **Quality risk:** NEGATIVE-COST — pure accounting, no behavior change. Failure mode is analytical: cost models reusing one token count across tiers are wrong in both directions. Falsification: rerun the table on your corpus; if ratios ≈1.0 the multiplier vanishes.
* **Availability:** CLAUDE-CODE-TODAY (automatic — it is how billing works)
* **Effort to adopt:** none for billing; minutes to re-derive your own multipliers with /tmp/ct.py.
* **Composability:** amplifies techniques 1, 3, 7 when routed content is prose/ASCII-heavy; taxes Fable advisors (technique 4) and Fable uptiers only to the extent their transcript has that mix.
* **Validation protocol:** corpus-level count\_tokens sweep (every file type in the repo, weighted by session-mix) to replace the sample table; report content classes separately and treat near-1.0 code/CJK ratios as expected, not failed measurements.
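
One way to turn a per-class sweep into a single routing multiplier is a share-weighted average. This is only a sketch: the class shares in the example are hypothetical placeholders, and the ratios are borrowed from the sample table rather than from a full corpus sweep.

```python
from typing import Dict, Tuple


def blended_token_ratio(classes: Dict[str, Tuple[float, float]]) -> float:
    """Share-weighted token ratio across content classes.

    classes maps a class name to (token_ratio, share), where share is that
    class's fraction of old-tokenizer tokens in the session mix. Shares are
    normalized, so they do not need to sum to exactly 1.
    """
    total_share = sum(share for _, share in classes.values())
    if total_share <= 0:
        raise ValueError("shares must sum to a positive number")
    weighted = sum(ratio * share for ratio, share in classes.values())
    return weighted / total_share


# Hypothetical session mix; replace both columns with your own sweep results.
EXAMPLE_MIX = {
    "rust": (1.320, 0.6),
    "markdown": (1.430, 0.3),
    "json": (1.452, 0.1),
}

if __name__ == "__main__":
    ratio = blended_token_ratio(EXAMPLE_MIX)
    print(f"blended ratio: {ratio:.3f}x, Fable->Haiku about {10 * ratio:.1f}x")
```

Reporting the per-class rows alongside the blended figure keeps near-1.0 classes visible instead of averaging them away.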

### 7. Quality-delta map: route by measured gap per task class, not tier superstition [#7-quality-delta-map-route-by-measured-gap-per-task-class-not-tier-superstition]

The data says the Sonnet–Opus gap is now tiny (\~1.2pp SWE-bench, T3) while the Haiku–Sonnet gap is real (\~6pp) — so the high-ROI default is Sonnet-by-default + Haiku-for-read-only + Opus/Fable-for-planning, not Opus-by-default.

* **Layer:** infra (decision data)
* **Mechanism:** verified anchors: Haiku 4.5 = 73.3% SWE-bench Verified and "90% of Sonnet 4.5's performance" in Augment's agentic eval at $1/$5 (T1, Anthropic news). Sonnet 4.6 "approaches Opus-level intelligence", preferred over Opus 4.5 by 59% of Claude Code users (T1). Opus 4.6 = 80.8% vs Sonnet 4.6 = 79.6% SWE-bench Verified (T3 aggregator — 1.2pp gap for a 1.67x price gap). Fable 5's "95% SWE-bench Verified" is an aggregator headline, UNVERIFIED vs primary; docs position Fable for "tasks larger than a single sitting" / ambiguous, long-horizon work. The costs page's own guidance matches: "Sonnet handles most coding tasks well and costs less than Opus. Reserve Opus for complex architectural decisions."
* **Expected savings:** Sonnet-default instead of Opus-default = 40% per-token at \~1.2pp measured coding gap; on an all-Opus modeled day that is up to \~40% of the model-rate component. Haiku for non-writing subtasks = further 3x (plus tokenizer effect under Fable/Opus parents).
* **Evidence tier:** T1 for Haiku 73.3%/Augment 90%/59% preference; T3 for 80.8/79.6 and all Fable 5 scores — verify against model cards before quoting downstream.
* **Quality risk:** NEUTRAL when split by task class. Known failure mode: cheap models on subtle debugging/architecture — the advisor (technique 4) exists precisely to backstop it. Degradation signature: rising retry-loop counts per task. Falsification: track retries/task per model on your own backlog; if Sonnet retries ≫ Opus retries, the 1.2pp story doesn't hold for your workload.
* **Availability:** n/a — informs configuration.
* **Effort to adopt:** none beyond choosing defaults.
* **Composability:** this is the routing table the other techniques implement.
* **Validation protocol:** month-long A/B of defaults (Opus vs Sonnet main) on comparable task streams; compare $/merged-PR and retry counts, not benchmark scores.

### 8. Fan-out economics: orchestrator/worker with cost guardrails (agent teams ≈ 7x tokens) [#8-fan-out-economics-orchestratorworker-with-cost-guardrails-agent-teams--7x-tokens]

Anthropic's blessed pattern — Sonnet orchestrating "a team of multiple Haiku 4.5s" — only saves money if you control the multipliers.

* **Layer:** turn-structure (orchestration)
* **Mechanism:** two fan-out surfaces: (1) subagents — results return to parent; cost = worker corpora at worker rates + summaries re-read at parent rates; (2) agent teams (`CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1`) — separate full instances: "**Agent teams use approximately 7x more tokens than standard sessions when teammates run in plan mode**, because each teammate maintains its own context window" (T1, costs page — note the plan-mode qualifier; not a universal multiplier). Official guardrails, verbatim: "**Use Sonnet for teammates**. It balances capability and cost for coordination tasks" (Sonnet, not Haiku — teammates write code); keep teams small; keep spawn prompts focused (each teammate auto-loads CLAUDE.md + MCP + skills — this repo's <RepoFile path="AGENTS.md">AGENTS.md</RepoFile> costs 2,744 Fable / 1,919 Sonnet tokens per teammate, local measurement); "Active teammates continue consuming tokens even if idle."
* **Expected savings:** a cost-CONTROL technique. Unmanaged: \~7x (teams) or \~Nx (subagents) multiplication. Against the documented "$13 per developer per active day" enterprise average, a 7x team day ≈ $91, i.e. \~7 normal days of spend per team-day (ESTIMATE). Managed: downtiered workers convert the multiplier into technique 1's 10–14.5x-cheaper-per-worker math (Fable→Haiku).
* **Evidence tier:** T1 (7x figure, guardrails, $13/day, $150–250/month, \<$30/day for 90% of users — costs page); orchestrator quote T1 (Haiku 4.5 announcement, sweep-verified).
* **Quality risk:** RISKY by default (token multiplication, idle burn); NEUTRAL with the documented guardrails. Degradation signature: /usage attribution showing idle teammates accruing. Falsification: same task solo vs 3-teammate team; if team tokens \< 3x solo, the multiplier folklore overstates.
* **Availability:** CLAUDE-CODE-TODAY (teams experimental, env-gated)
* **Effort to adopt:** hours — worker model choices in frontmatter/spawn prompts, team-size caps, teardown habits.
* **Composability:** combines techniques 1 + 5 (effort: low workers) + compressed worker output (15-output-discipline.md). Anti-synergy: parallel identical-prefix workers all pay full cache-write price unless the first request is staggered (API parallel-cache trap; see 13-caching-exploitation.md).
* **Validation protocol:** one representative task three ways (solo / subagent fan-out / 3-teammate team), full token accounting via /usage + Console; compare $/task and wall-clock at equal review-pass.
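
The two cost shapes described in the mechanism can be sketched as follows. The summary size is an assumed parameter (the text does not give one), tokenizer differences between worker and parent are ignored, and both function names are invented for this example.

```python
def subagent_fanout_cost(workers: int, worker_in: int, worker_out: int,
                         worker_price_in: float, worker_price_out: float,
                         summary_tokens: int, parent_price_in: float) -> float:
    """Worker corpora at worker rates plus summaries re-read at parent rates.

    Prices are $/MTok. summary_tokens is the size of each returned summary
    as the parent reads it once (an assumption; later turns may re-read it
    from cache at a lower rate).
    """
    worker_cost = workers * (worker_in * worker_price_in + worker_out * worker_price_out)
    reread_cost = workers * summary_tokens * parent_price_in
    return (worker_cost + reread_cost) / 1_000_000


def team_day_estimate(solo_day_dollars: float, multiplier: float) -> float:
    """Scale a solo day's spend by a token multiplier such as the ~7x team figure."""
    return solo_day_dollars * multiplier


if __name__ == "__main__":
    # Five Haiku workers (40K in / 3K out) returning 1K-token summaries to a Fable parent.
    cost = subagent_fanout_cost(5, 40_000, 3_000, 1.0, 5.0, 1_000, 10.0)
    print(f"subagent fan-out: ${cost:.3f}")
    print(f"7x team day at $13/day: ${team_day_estimate(13.0, 7.0):.0f}")
```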

### 9. Route at boundaries, not mid-conversation: cache-aware switching discipline [#9-route-at-boundaries-not-mid-conversation-cache-aware-switching-discipline]

Every mid-session `/model` or `/effort` change throws away the prompt cache and re-reads the whole history uncached — downtier via subagents/advisor/new-task instead, and savings stay savings.

* **Layer:** cache + turn-structure (session discipline)
* **Mechanism:** verified verbatim: the `/model` picker "asks for confirmation when the conversation has prior output, since the next response re-reads the full history without cached context" (model-config); effort changes invalidate too ("Unlike changing model or effort level, toggling /advisor keeps the cached prefix intact"; advisor page, linking the prompt-caching page's "actions that invalidate the cache" anchor). Cache-safe channels: subagents (parent prefix untouched), forks (explicitly reuse parent cache), advisor (explicitly safe), opusplan (one bounded switch per plan cycle). Fallback chains are availability routing, not cost routing: the switch "lasts for the current turn only."
* **Expected savings:*&#x2A; ESTIMATE (arithmetic): 150K-token cached history on Opus 4.8 = $0.075/turn in cache reads ($0.50/MTok). Switching to Sonnet 4.6: history re-tokenizes to \~115K (÷1.3), one uncached read + 5-min cache write at $3×1.25 = **$0.43*&#x2A;, then $0.035/turn — **\~8.9 further turns to break even on input alone** (faster counting output at $15 vs $25/MTok; never pays back if the session ends sooner). Preserves the phase-0 economics where cache reads are 92.8% of prompt tokens and 32% of dollars.
* **Evidence tier:** T1 mechanics; ESTIMATE break-even (assumptions stated).
* **Quality risk:** NEGATIVE-COST — pure waste avoidance, no quality dimension. Degradation signature in violation: an uncached-input spike in Console right after a switch. Falsification: if Console shows no uncached spike after a deliberate mid-session switch, the invalidation claim is wrong (it won't be — it's documented).
* **Availability:** CLAUDE-CODE-TODAY
* **Effort to adopt:** behavioral only — switch at `/clear`/task boundaries; prefer subagents/advisor for transient tier changes.
* **Composability:** precondition for every other technique's math; see 13-caching-exploitation.md.
* **Validation protocol:** instrument one deliberate mid-session Opus→Sonnet switch at 150K history; read the uncached-input line from Console for that request and compare to the $0.43 ESTIMATE; then route the same subtask through a Sonnet subagent and compare totals.
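The break-even ESTIMATE in this technique can be reproduced with a few lines of arithmetic. The sketch below uses only the figures stated above (150K history, $5 and $3 per MTok input, a 1.3 token ratio, a 1.25x cache-write factor); the 0.1x cache-read factor is inferred from the quoted $0.50/MTok and $0.035/turn figures, and the function name is illustrative.

```python
MTOK = 1_000_000


def switch_break_even(history_tokens, old_input_price, new_input_price,
                      token_ratio=1.3, cache_read_factor=0.1,
                      cache_write_factor=1.25):
    """Estimate the input-side cost of a mid-session downtier.

    Prices are $/MTok of base input. `token_ratio` is how many old-model
    tokens correspond to one new-model token for the same text.
    """
    # Per-turn cost of staying on the old model with a warm cache.
    old_read = history_tokens * old_input_price * cache_read_factor / MTOK
    # The same history, re-tokenized by the new model.
    new_tokens = history_tokens / token_ratio
    # Switch turn: full uncached read plus cache write on the new model.
    switch_turn = new_tokens * new_input_price * cache_write_factor / MTOK
    # Per-turn cost on the new model once its cache is warm.
    new_read = new_tokens * new_input_price * cache_read_factor / MTOK
    extra = switch_turn - old_read
    saving = old_read - new_read
    turns = extra / saving if saving > 0 else float("inf")
    return {"old_read": old_read, "switch_turn": switch_turn,
            "new_read": new_read, "turns": turns}


if __name__ == "__main__":
    r = switch_break_even(150_000, old_input_price=5.0, new_input_price=3.0)
    print({k: round(v, 3) for k, v in r.items()})
```

With the defaults this gives roughly $0.075 per cached turn before the switch, $0.43 for the switch turn, $0.035 per turn after, and about 8.9 turns to break even. Output-token savings are deliberately left out, as in the ESTIMATE.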

### 10. RouteLLM-style learned routing (external; paper-grade but stale for Claude stacks) [#10-routellm-style-learned-routing-external-paper-grade-but-stale-for-claude-stacks]

The academic gold standard for difficulty routing — with benchmark-specific numbers and mid-2024 training pairs.

* **Layer:** infra (gateway/self-host router)
* **Mechanism:** RouteLLM (LMSYS/UC Berkeley, ICLR 2025) trains routers (matrix factorization, BERT classifier, SW-ranking) on Chatbot Arena preference data to send each query to a strong or weak model; OpenAI-compatible drop-in server. For Claude Code it sits at `ANTHROPIC_BASE_URL`, where per-request model swaps fight model-scoped prompt caching and Fable thinking-block replay, and the router is blind to agentic turn structure.
* **Expected savings:** verified wording (README + LMSYS blog): "up to 85% while maintaining 95% GPT-4 performance on widely-used benchmarks like MT Bench"; per-benchmark: **85% (MT Bench) / 45% (MMLU) / 35% (GSM8K)**; best MT Bench = 95% GPT-4 performance at 26% GPT-4 calls (14% with augmented data). All vs all-GPT-4, routers trained on gpt-4-1106-preview vs mixtral-8x7b; no published evaluation on 2026 Claude pairs.
* **Evidence tier:** T2 (peer-reviewed) with explicit staleness flag for 2026 Claude use.
* **Quality risk:** **QUALITY-TRADE**, benchmark-dependent — "95% of GPT-4" is a calibrated benchmark score, not task success on your workload. Degradation signature: hard queries mis-routed cheap. Falsification: RouterArena-style evaluation (arXiv 2510.00202) on a Claude 4.x pair.
* **Availability:** GATEWAY-OR-SELF-HOST
* **Effort to adopt:** days–project for Claude Code (router server + translation layer + accepted cache/replay degradation); hours for stateless API apps.
* **Composability:** poor with Claude Code's cache-and-subagent economics; reasonable for stateless single-turn workloads.
* **Validation protocol:** before adopting, run your own 100-task replay through the router vs Sonnet-solo vs technique-7 defaults; compare $/task and success — in-product routing is the control arm.
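The replay comparison in the validation protocol reduces to per-arm bookkeeping of cost and success. A minimal sketch, assuming each replayed task yields one record per arm; the `TaskResult` fields and arm labels are hypothetical, and how cost and success are measured is left to the harness:

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    arm: str          # e.g. "router", "sonnet-solo", "in-product" (labels are illustrative)
    cost_usd: float   # total billed cost for this task on this arm
    success: bool     # whether the task passed review


def summarize(results):
    """Aggregate $/task, success rate, and $/success for each arm."""
    totals = {}
    for r in results:
        t = totals.setdefault(r.arm, {"tasks": 0, "cost": 0.0, "wins": 0})
        t["tasks"] += 1
        t["cost"] += r.cost_usd
        t["wins"] += int(r.success)
    summary = {}
    for arm, t in totals.items():
        summary[arm] = {
            "cost_per_task": t["cost"] / t["tasks"],
            "success_rate": t["wins"] / t["tasks"],
            # An arm with no successes has unbounded cost per success.
            "cost_per_success": t["cost"] / t["wins"] if t["wins"] else float("inf"),
        }
    return summary
```

Comparing on `cost_per_success` rather than `cost_per_task` keeps a router from looking cheap merely because it fails more often.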

### 11. Commercial gateway routers: OpenRouter Auto, Martian, LiteLLM — evidence-graded [#11-commercial-gateway-routers-openrouter-auto-martian-litellm--evidence-graded]

Gateway auto-routing exists and is priced sanely, but published evidence is thin-to-vendor-grade, and in-product primitives beat it for Claude Code.

* **Layer:** infra (gateway)
* **Mechanism:** OpenRouter Auto (`openrouter/auto`): NotDiamond-powered meta-model picks from a curated set; "priced at the same rate as the routed model" (no router fee); pins model+provider per conversation "to maximize prompt cache hits" — but publishes NO savings or quality numbers (docs; the direct old URL 404'd, wording via search excerpt). Martian: marketing claims "cutting costs by 20% to 97%" — no public methodology found; T4 vendor claim. LiteLLM Router: documented strategies (simple-shuffle, rate-limit-aware-v2, latency-based, usage-based, least-busy, cost-based, custom) are **load balancing across deployments of the same model group — difficulty tiering is explicitly not built in**; its documented Claude Code role is spend TRACKING on Bedrock/Vertex/Foundry (costs page: "several large enterprises reported using LiteLLM… to track spend by key"; "unaffiliated with Anthropic").
* **Expected savings:** OpenRouter: none published. Martian: 20–97% (vendor, unverified — do not propagate). LiteLLM: n/a (not a quality router).
* **Evidence tier:** T1 for documented mechanics/pricing; T4 for Martian/NotDiamond savings; T2 reference (RouterArena) for how to evaluate any of them.
* **Quality risk:** RISKY for agentic coding sessions (cache/replay/turn-structure blindness); NEUTRAL for stateless API traffic. Falsification: RouterArena leaderboard numbers on your task mix.
* **Availability:** GATEWAY-OR-SELF-HOST (Claude Code reaches gateways via `ANTHROPIC_BASE_URL`; `CLAUDE_CODE_ENABLE_GATEWAY_MODEL_DISCOVERY=1` populates the picker)
* **Effort to adopt:** hours–days (plumbing) + governance review.
* **Composability:** useful for multi-provider governance and spend tracking; redundant with and inferior to techniques 1–4 for the cost problem itself.
* **Validation protocol:** same as technique 10 — a 100-task replay with in-product routing as the control arm; adopt only if the gateway wins on $/success.

### 12. Availability/content fallback chains as routing hygiene (`--fallback-model`, Fable classifier fallback, `availableModels`) [#12-availabilitycontent-fallback-chains-as-routing-hygiene---fallback-model-fable-classifier-fallback-availablemodels]

Not a saver per se — routing infrastructure to configure correctly so cost routing doesn't silently break.

* **Layer:** infra (config)
* **Mechanism:** All T1, model-config: (1) `claude --fallback-model sonnet,haiku` / `fallbackModel` array (capped at 3 after dedupe) switches on overload/unavailability only — "Authentication, billing, rate-limit, request-size, and transport errors never trigger a switch"; "The switch lasts for the current turn only." (2) Fable 5 content fallback: classifier-flagged requests (cybersecurity/biology) re-run on Opus 4.8 and "The session then continues on that Opus model" until `/model fable`; can trigger "on the first request of a session" from CLAUDE.md/git context alone; biology workloads should "expect nearly all requests to reroute"; `--safe-mode` diagnoses, and a `/config` toggle can make it ask first. (3) `availableModels` (managed settings) restricts `/model`/`--model`/`ANTHROPIC_MODEL` and **silently drops out-of-list fallback-chain elements** ("dropped when the chain is read and never tried") — `["sonnet","haiku"]` is the bluntest shipped org-wide cost cap that still leaves Default usable.
* **Expected savings:** indirect — prevents failed-turn retries (each a full uncached re-submit) and enables org-level allowlisting; the Fable→Opus content fallback silently changes the billing model for whole session classes (cheaper per token, capability and cache-reset consequences, wrong model attribution in dashboards).
* **Evidence tier:** T1.
* **Quality risk:** NEUTRAL; the Fable→Opus fallback is a quality change you didn't choose — monitor for it in security-adjacent repos. Degradation signature: transcript notice + Console usage appearing under Opus for a "Fable" session. Falsification: start a session in a security-heavy repo and check the first-request model in Console.
* **Availability:** CLAUDE-CODE-TODAY (chains, classifier fallback, allowlist) / SDK (server-side `fallbacks` beta retries Fable refusals on `claude-opus-4-8` in one round trip)
* **Effort to adopt:** minutes — one flag/setting.
* **Composability:** interacts with technique 9 (every fallback switch is a cache-cold turn on the target) and with cost dashboards (attribute per-model).
* **Validation protocol:** simulate by pinning a retired model with a 2-element chain; confirm turn-scoped switching and that an allowlist drops the out-of-list element; in security repos, audit one week of Console usage for silent Fable→Opus sessions.
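The chain rules quoted in the Mechanism bullet (dedupe, a cap of 3, and silent dropping of elements outside `availableModels`) can be modeled in a few lines to predict what a given configuration will actually try. This is a toy model of the quoted behavior, not Claude Code's implementation; in particular, applying the allowlist before the cap is an assumption, and `resolve_fallback_chain` is a hypothetical name.

```python
def resolve_fallback_chain(chain, available_models=None, cap=3):
    """Return the fallback models that would remain after the documented rules.

    Assumed order: dedupe, then drop out-of-list elements, then cap.
    """
    seen = set()
    resolved = []
    for model in chain:
        if model in seen:
            continue  # duplicates are collapsed
        seen.add(model)
        if available_models is not None and model not in available_models:
            continue  # out-of-list elements are dropped and never tried
        resolved.append(model)
    return resolved[:cap]


if __name__ == "__main__":
    # With a ["sonnet", "haiku"] allowlist, an "opus" entry silently disappears.
    print(resolve_fallback_chain(["sonnet", "opus", "sonnet", "haiku"],
                                 available_models=["sonnet", "haiku"]))
```

Running a planned chain through a model like this before rollout makes the silent drop visible: a chain that resolves to fewer elements than intended is the misconfiguration this technique warns about.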

## Claims to kill [#claims-to-kill]

| Folklore claim                                                                | What's actually true                                                                                                                                                                                                                                             |
| ----------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| "RouteLLM cuts costs 85% at 95% GPT-4 quality"                                | Benchmark-specific: 85% MT Bench / 45% MMLU / 35% GSM8K; routers trained on mid-2024 gpt-4 vs mixtral; no Claude-4.x evaluation exists                                                                                                                           |
| "LiteLLM routes easy queries to cheap models"                                 | Its strategies are load balancers across deployments of one model group; difficulty tiering is an app-level concern (docs.litellm.ai)                                                                                                                            |
| "Subagents are automatically cheaper"                                         | Default is `model: inherit`; only built-in Explore is pinned to Haiku. An Opus/Fable session pays full parent rates on every default worker                                                                                                                      |
| "Switching to a cheaper model mid-session immediately saves money"            | The switch invalidates the model-scoped cache; "the next response re-reads the full history without cached context" — ESTIMATE \~$0.43 vs $0.075 on a 150K history, \~9 turns to break even. Same trap for mid-session `/effort` changes                         |
| "Downtiering always saves more than the price-sheet ratio (Fable→Haiku >10x)" | Over-broad: Fable/Opus 4.8 bill more tokens on prose/ASCII, but code/CJK can be neutral. Use 10x for code-heavy Fable→Haiku, \~13–14.5x for prose/markdown-heavy, and exactly 3x for Sonnet→Haiku (identical tokenizer).                                         |
| "Martian cuts costs 20–97%"                                                   | Vendor marketing, no public methodology; the valuation buzz is a Medium-article rumor                                                                                                                                                                            |
| "Always put the strongest model in charge and delegate down"                  | Anthropic's measured advisor data inverts it: Sonnet-main + Opus-advisor beat Sonnet alone (+2.7pp) while costing 11.9% LESS; docs recommend Sonnet (not Opus) teammates. Cheap-executor/strong-advisor is the only pattern with published negative-cost numbers |
| "Haiku for everything cheap, including teammates"                             | Costs page: "Use Sonnet for teammates" (teammates implement code; the \~6pp Haiku–Sonnet gap costs retry loops). Haiku is for "simple subagent tasks" — task-class-specific, not blanket                                                                         |
| "ultrathink raises the effort level"                                          | "The effort level sent to the API is unchanged" — it adds an in-context instruction only; "think hard" variants aren't even recognized keywords (model-config)                                                                                                   |

## Gaps — what would upgrade this file [#gaps--what-would-upgrade-this-file]

1. Sonnet/Opus/Fable SWE-bench Verified numbers (80.8 / 79.6 / "95%") are T3 aggregator transcriptions — verify against model cards/PDFs before quoting downstream.
2. No published end-to-end measurement of Haiku-subagent fan-out savings exists; the 90–93% figure is ESTIMATE arithmetic. A transcript-replay with per-model count\_tokens would make it T1.
3. Anthropic publishes zero per-level effort percentages — technique 5's validation protocol is the highest-value local experiment in this area.
4. Tokenizer measurements are n=5 here (n=1 per content type in phase-0); prose samples exceed the official 1.35x ceiling — a corpus-level sweep would tighten every multiplier.
5. The effort-invalidates-cache claim rests on the advisor page's sentence + its link anchor; the prompt-caching page itself was not fetched in this pass.
6. Advisor consult frequency is model-driven and uncapped — worst-case chatty-advisor cost on long transcripts (uncached full reads) is unmeasured.
7. RouterArena leaderboard specifics (which routers win, at what oracle fraction) were not extracted.

## Verification ledger [#verification-ledger]

| Number / claim                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | Source or method                                                                                                                                                                                  |
| --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Fable 5 $10/$50; Opus 4.8 $5/$25; Sonnet 4.6 $3/$15; Haiku 4.5 $1/$5 /MTok                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | [https://platform.claude.com/docs/en/about-claude/models/overview.md](https://platform.claude.com/docs/en/about-claude/models/overview.md) (fetched, verbatim table)                              |
| "roughly 30% more tokens" (Opus-4.7+ tokenizer); 1M = \~555k words (Opus 4.8) vs \~750k (Sonnet 4.6); Haiku context 200k; Fable max output 128k                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       | same page, tooltip text quoted verbatim                                                                                                                                                           |
| Tokenizer table: 830/830/629/629 … 183/183/126/126; ratios 1.320–1.452x                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | local: `python3 /tmp/ct.py &lt;model&gt; < sample`, samples described in-section                                                                                                                  |
| <RepoFile path="AGENTS.md">AGENTS.md</RepoFile> = 2,744 Fable tok (phase-0: 2,738) / 1,919 Sonnet-Haiku tok                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           | local /tmp/ct.py on repo <RepoFile path="AGENTS.md">AGENTS.md</RepoFile> (6,366 B)                                                                                                                |
| Phase-0: thinking 54.8% of output; prompt mix 0.44/6.73/92.83%; dollar split 32/29/20/17/2; \~$22/heavy day                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           | local phase-0 measurement (see 01-economics-and-measurement.md)                                                                                                                                   |
| +2.7pp SWE-bench Multilingual, −11.9% cost/task (Sonnet+Opus-advisor); BrowseComp 41.2% vs 19.7%; −29% score / −85% cost (Haiku+Opus vs Sonnet solo)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  | [https://claude.com/blog/the-advisor-strategy](https://claude.com/blog/the-advisor-strategy) (fetched, sentences quoted verbatim)                                                                 |
| Advisor: pairing matrix; Fable-main accepts only Fable; v2.1.98+/v2.1.170+; bills at advisor rates; toggle cache-safe; advisor transcript reads uncached, "anew"; "no setting to cap or force"; Anthropic API only; experimental wording                                                                                                                                                                                                                                                                                                                                                                                                                              | [https://code.claude.com/docs/en/advisor](https://code.claude.com/docs/en/advisor) (fetched, verbatim)                                                                                            |
| Subagent default `inherit`; resolution order (env > per-invocation > frontmatter > main); model values incl. `fable`; `effort:` frontmatter; "Control costs by routing tasks to faster, cheaper models like Haiku"; Explore "Model: Haiku", read-only; "Explore and Plan skip your CLAUDE.md files and the parent session's git status"; fork reuses parent cache, model "Same as main session"                                                                                                                                                                                                                                                                       | [https://code.claude.com/docs/en/sub-agents](https://code.claude.com/docs/en/sub-agents) (fetched, verbatim)                                                                                      |
| "For simple subagent tasks, specify model: haiku"; agent teams "approximately 7x more tokens… when teammates run in plan mode"; "Use Sonnet for teammates"; idle teammates consume; $13/dev/active-day, $150–250/month, \<$30/day for 90%; background "typically under $0.04 per session"; LiteLLM for spend tracking, "unaffiliated with Anthropic"; plan mode prevents "expensive re-work"                                                                                                                                                                                                                                                                          | [https://code.claude.com/docs/en/costs](https://code.claude.com/docs/en/costs) (fetched, verbatim)                                                                                                |
| opusplan = opus in plan mode, sonnet for execution; plan phase capped at 200K; /model picker warns next response "re-reads the full history without cached context"; alias resolution (API: opus→4.8, sonnet→4.6; Bedrock/Vertex/Foundry: 4.6/4.5); effort levels/defaults per model, "calibrated per model"; `CLAUDE_CODE_SUBAGENT_MODEL` covers "all subagents and agent teams"; fallback chains: cap 3, turn-only, never on auth/billing/rate-limit; `availableModels` drops out-of-list chain elements; Fable classifier fallback persists until `/model fable`, can fire on first request, biology "nearly all requests"; ultrathink leaves API effort unchanged | [https://code.claude.com/docs/en/model-config](https://code.claude.com/docs/en/model-config) (fetched, verbatim)                                                                                  |
| effort `low` = "significant token savings… such as subagents"; Sonnet 4.6 `medium` recommended despite `high` default                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 | [https://platform.claude.com/docs/en/build-with-claude/effort.md](https://platform.claude.com/docs/en/build-with-claude/effort.md)                                                                |
| Haiku 4.5 = 73.3% SWE-bench Verified (50-trial avg, 128K thinking budget); "90% of Sonnet 4.5" (Augment); Sonnet-orchestrating-Haikus quote                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           | [https://www.anthropic.com/news/claude-haiku-4-5](https://www.anthropic.com/news/claude-haiku-4-5)                                                                                                |
| Sonnet 4.6 preferred over Opus 4.5 by 59%; "approaches Opus-level intelligence"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       | [https://www.anthropic.com/news/claude-sonnet-4-6](https://www.anthropic.com/news/claude-sonnet-4-6)                                                                                              |
| Opus 4.6 = 80.8%, Sonnet 4.6 = 79.6% SWE-bench Verified; Fable 5 "95%" headline                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       | [https://www.morphllm.com/claude-benchmarks](https://www.morphllm.com/claude-benchmarks) (T3 aggregator; UNVERIFIED vs primary)                                                                   |
| RouteLLM "up to 85%… like MT Bench"; 85/45/35 per-benchmark; 26%/14% GPT-4-call fractions; trained on gpt-4-1106-preview vs mixtral-8x7b; ICLR 2025                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | [https://github.com/lm-sys/RouteLLM/blob/main/README.md](https://github.com/lm-sys/RouteLLM/blob/main/README.md) + [https://www.lmsys.org/blog/-routellm/](https://www.lmsys.org/blog/-routellm/) |
| RouterArena exists as standardized router evaluation                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  | [https://arxiv.org/abs/2510.00202](https://arxiv.org/abs/2510.00202)                                                                                                                              |
| OpenRouter Auto: NotDiamond-powered, billed at routed model's rate, conversation stickiness "to maximize prompt cache hits"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           | [https://openrouter.ai/docs/guides/routing/routers/auto-router](https://openrouter.ai/docs/guides/routing/routers/auto-router) (via search excerpt; direct old URL 404'd)                         |
| LiteLLM strategies = load balancing only                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              | [https://docs.litellm.ai/docs/routing](https://docs.litellm.ai/docs/routing)                                                                                                                      |
| Martian "20% to 97%"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  | [https://route.withmartian.com/](https://route.withmartian.com/) (vendor claim, no methodology)                                                                                                   |
| $2.75 → $0.19–0.28 fan-out (−90/93%); 10x code-heavy and \~13–14.5x prose-heavy effective Fable→Haiku; $0.43 vs $0.075 switch tax, 8.9-turn break-even; $0.75/Opus-consult on 150K transcript; 7×$13 ≈ $91 team-day                                                                                                                                                                                                                                                                                                                                                                                                                                                   | local arithmetic from verified prices + measured token ratios (formulas shown in-section) plus independent tokenizer correcti — ESTIMATE                                                          |
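
The tokenizer ratios and effective downtier multipliers in the ledger can be recomputed from its own entries. The sketch below takes the two token-count pairs quoted in the tokenizer row and the price row; it assumes the whole fan-out bill scales by the effective multiplier (input and output prices share the same 10x Fable→Haiku ratio), which reproduces the $0.19–0.28 range but is a simplification of the in-section formulas.

```python
# Input prices in $/MTok, copied from the ledger's pricing row.
INPUT_PRICE = {"fable": 10.0, "opus": 5.0, "sonnet": 3.0, "haiku": 1.0}


def token_ratio(premium_tokens, baseline_tokens):
    """Tokens billed by the premium tokenizer per baseline token, same text."""
    return premium_tokens / baseline_tokens


def effective_multiplier(src, dst, ratio=1.0):
    """Price ratio times token ratio: the real cost gap for a downtier hop."""
    return INPUT_PRICE[src] / INPUT_PRICE[dst] * ratio


# First and last sample pairs quoted in the ledger's tokenizer row.
ratio_low = token_ratio(830, 629)
ratio_high = token_ratio(183, 126)

code_mult = effective_multiplier("fable", "haiku")            # tokenizer-neutral content
prose_mult_low = effective_multiplier("fable", "haiku", ratio_low)
prose_mult_high = effective_multiplier("fable", "haiku", ratio_high)
sonnet_haiku = effective_multiplier("sonnet", "haiku")        # identical tokenizers

ALL_FABLE_FANOUT = 2.75  # $ for the 5-worker all-Fable fan-out
fanout_code = ALL_FABLE_FANOUT / code_mult
fanout_prose = ALL_FABLE_FANOUT / prose_mult_high

if __name__ == "__main__":
    print(f"token ratios: {ratio_low:.3f} to {ratio_high:.3f}")
    print(f"Fable->Haiku: {code_mult:.1f}x code, "
          f"{prose_mult_low:.1f}x to {prose_mult_high:.1f}x prose")
    print(f"fan-out: ${fanout_prose:.2f} to ${fanout_code:.2f}")
```

Because the multipliers rest on a handful of samples, rerunning this with corpus-level token counts is the cheapest way to tighten the range.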
