# 01 — Token Economics and Measurement (https://jackin.tailrocks.com/research/token-optimization/01-economics-and-measurement/)




Pricing and feature claims are verified against live Anthropic
documentation (URLs in the Verification ledger).

## TL;DR [#tldr]

* **Verified live pricing:** Fable 5 $10/$50 per MTok in/out, cache read $1
  (0.1×), 5-min cache write $12.50 (1.25×), 1-hour write $20 (2×); Sonnet 4.6 $3/$15; Haiku 4.5
  $1/$5; Batch API 50% off everything. **1M-token context is now standard-priced** on Fable
  5/Opus 4.8/Sonnet 4.6 — the brief's assumption of a long-context premium is outdated.
* **The exchange rates that decide everything:** 1 output token = 5 input tokens = 50 cache-read
  tokens. Thinking bills as output. A token avoided in *output* is worth 50× a token avoided in
  *cached input*.
* **Tokenizer divergence is official, but content-specific:** Anthropic documents that Opus
  4.7+/Fable 5 use a new tokenizer producing "roughly 30%" more tokens for the same text. Local
  follow-up shows the premium is strong on English/ASCII prose, but code/CJK can be near-neutral.
  Cross-model dollar math must count the target corpus, not just prices.
* **Ground truth lives in the API usage object** (`input_tokens`, `cache_creation_input_tokens`,
  `cache_read_input_tokens`, `output_tokens`); `count_tokens` is free (rate-limited 100–8,000
  RPM by tier) and is the experiment instrument; Claude Code exposes `/cost`, `/context`, and
  full OTel metrics (`claude_code.token.usage` by type/model/agent).
* **Modeled heavy-day profile (basis for stack math):** the $17 floor variant is \~25k uncached
  input, \~400k cache-write, \~5.5M cache-read, \~125k output tokens on Fable 5; the $22 working
  variant in 30-composed-stacks.md scales this to six sessions. The measured split is profile-specific; the stable
  invariant is that cache reads dominate token volume while output + cache writes dominate dollars.

***

## 1. Verified price table (live) [#1-verified-price-table-live]

From the live pricing page (platform.claude.com/docs/en/about-claude/pricing):

| Model             | Input | 5m cache write | 1h cache write | Cache read | Output |   Batch in/out |
| ----------------- | ----: | -------------: | -------------: | ---------: | -----: | -------------: |
| Claude Fable 5    |   $10 |         $12.50 |            $20 |      $1.00 |    $50 |       $5 / $25 |
| Claude Opus 4.8   |    $5 |          $6.25 |            $10 |      $0.50 |    $25 | $2.50 / $12.50 |
| Claude Sonnet 4.6 |    $3 |          $3.75 |             $6 |      $0.30 |    $15 |  $1.50 / $7.50 |
| Claude Haiku 4.5  |    $1 |          $1.25 |             $2 |      $0.10 |     $5 |  $0.50 / $2.50 |

All $/MTok. Multipliers are uniform across the lineup: cache write 1.25× (5-min TTL) or 2×
(1-hour TTL), cache read 0.1×, batch 0.5× — and they stack (batch + caching compose).
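
Because the multipliers are uniform, each row of the table can be rebuilt from a model's base input and output price. The following is a minimal sketch of that derivation; `derive_prices` and its dictionary keys are illustrative names chosen here, not identifiers from the Anthropic docs.

```python
# Multipliers as stated in the text: cache classes scale the base input price,
# batch halves both input and output.
CACHE_WRITE_5M = 1.25
CACHE_WRITE_1H = 2.0
CACHE_READ = 0.1
BATCH = 0.5


def derive_prices(input_price: float, output_price: float) -> dict[str, float]:
    """Return $/MTok for every token class, given a model's base list prices."""
    return {
        "input": input_price,
        "cache_write_5m": input_price * CACHE_WRITE_5M,
        "cache_write_1h": input_price * CACHE_WRITE_1H,
        "cache_read": input_price * CACHE_READ,
        "output": output_price,
        "batch_input": input_price * BATCH,
        "batch_output": output_price * BATCH,
    }


# Base prices copied from the table above.
for model, (inp, out) in {
    "Claude Fable 5": (10.0, 50.0),
    "Claude Sonnet 4.6": (3.0, 15.0),
}.items():
    print(model, derive_prices(inp, out))
```

Checking the derived values against the table is a quick way to spot a row that stops following the uniform multipliers after a pricing change.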

Verified side facts that matter for optimization math:

* **Long context:** Fable 5, Opus 4.8/4.7/4.6, Sonnet 4.6 include the full 1M-token window **at
  standard pricing** — "a 900k-token request is billed at the same per-token rate as a 9k-token
  request" (pricing page). No premium tier to engineer around anymore; the cost of a
  bloated context is linear, not super-linear — but quality degradation with context length is
  not (see 12-context-architecture.md).
* **Fast mode** (research preview): Opus 4.8 at $10/$50 — i.e., Opus-fast costs exactly Fable 5
  list. Speed, not savings.
* **US data residency** (`inference_geo: "us"`): 1.1× on every token class. Don't set it idly.
* **Tool-use system prompt** is billed and model-specific: 290 tokens on Opus 4.8, 497 on Sonnet
  4.6/Opus 4.6, 496 on Haiku 4.5 (`auto` choice; pricing page table). Local measurement on Fable
  5: \~318 tokens with one minimal tool. Any non-empty `tools` array pays this once per request.
* **Server tools:** web search $10/1,000 searches; web fetch free beyond tokens; bash tool +245
  input tokens; text editor tool +700 tokens (Claude 4.x).

## 2. The exchange rates — why token classes are different currencies [#2-the-exchange-rates--why-token-classes-are-different-currencies]

On Fable 5, per token:

| Class                             | $/MTok | Relative to cache read |
| --------------------------------- | -----: | ---------------------: |
| Cache read                        |  $1.00 |                     1× |
| Uncached input                    | $10.00 |                    10× |
| 5m cache write                    | $12.50 |                  12.5× |
| 1h cache write                    | $20.00 |                    20× |
| Output (visible **and thinking**) | $50.00 |                **50×** |

Consequences, mechanical but decisive:

1. **Output discipline is worth 50× cache-read discipline per token.** A technique that trims
   1,000 output tokens equals one that trims 50,000 cache-read tokens. This single ratio reorders
   most folklore tier lists, which obsess over input-side prompt slimming.
2. **Thinking is output.** Locally measured at 54.8% of output tokens in a max-effort session
   (02-baseline-audit.md). Any output-side technique that doesn't touch thinking (all style
   layers) caps out at the visible share.
3. **Cache reads are cheap, not free — and they're the volume king.** 92.8% of prompt-side
   tokens in the measured session were cache reads; at 0.1× they were still a major dollar line
   (32% in that session; 21% in an independent output-heavy session), because the entire
   conversation prefix is re-read on *every* API call.
   Context mass costs ≈ `prefix_tokens × calls_per_session` cache-read tokens, each billed at
   0.1× the input price, so a 2,738-token always-on CLAUDE.md chain costs \~52k cache-read tokens
   (2,738 × 19) over a 19-call session, plus its share of cache writes whenever the prefix
   re-forms.
4. **Break-even arithmetic for cache writes:** 5-min write (1.25×) pays for itself after **one**
   read within TTL (1.25 + 0.1 \< 2 × 1.0). 1-hour write (2×) needs ≥2 reads, because one read
   still loses (2 + 0.1 > 2 × 1.0) while two win (2 + 0.2 \< 3 × 1.0); the live docs confirm
   this: "caching pays off after just one cache read for the 5-minute duration… after two cache
   reads for the 1-hour duration". The idle-gap economics are re-derived from these multipliers
   in 13-caching-exploitation.md.
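
The break-even and context-mass arithmetic in the list above can be sketched in a few lines. This is an illustration under the stated multipliers only; the function names are hypothetical, and costs are expressed in units of one uncached pass over the prefix.

```python
def break_even_reads(write_mult: float, read_mult: float = 0.1) -> int:
    """Smallest number of cache reads after which caching beats not caching.

    cached   = write_mult + n * read_mult   (one write, then n reads)
    uncached = 1 + n                        (first request plus n repeats at 1.0x)
    """
    if read_mult >= 1.0:
        raise ValueError("caching never pays off if reads cost as much as input")
    n = 1
    while write_mult + n * read_mult >= 1 + n:
        n += 1
    return n


def prefix_cache_read_tokens(prefix_tokens: int, calls: int) -> int:
    """Cache-read tokens spent re-reading an always-on prefix on every call."""
    return prefix_tokens * calls


print(break_even_reads(1.25))  # 5-minute TTL write
print(break_even_reads(2.0))   # 1-hour TTL write
print(prefix_cache_read_tokens(2_738, 19))  # the CLAUDE.md chain from item 3
```

The same helper can be reused if the multipliers change, since nothing in it is specific to one model.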

## 3. Tokenizer divergence — tokens are not a stable unit across models [#3-tokenizer-divergence--tokens-are-not-a-stable-unit-across-models]

Official, from the live docs:

> "Opus 4.7 and later use a new tokenizer… This new tokenizer may use up to 35% more tokens for
> the same fixed text." (pricing page)
>
> "Claude Fable 5 … uses the tokenizer introduced with Claude Opus 4.7, which produces roughly
> 30% more tokens than models before Claude Opus 4.7 for the same text." (token-counting page)

Local confirmation (02-baseline-audit.md): identical text counts +15% (Python code) to +38%
(English prose) on Fable 5 vs Sonnet 4.6; CJK diverges least.

Implications:

* **Cross-tier routing saves more than list prices imply.** Moving prose-heavy work Fable 5 →
  Sonnet 4.6 cuts price per token 3.3× *and* tokens per text \~1.2–1.4×: effective \~4–4.6× on
  input classes. Quantified per task class in 16-model-routing-and-delegation.md.
* **Never reuse token counts measured on one tokenizer to budget another** (docs say this
  explicitly for migration). All measurements in this dossier name the model they were counted on.
* Chars/token measured locally on Fable 5: \~2.3–3.4 for English/markdown, \~1.4–1.6 for CJK;
  more granular tables are in 11-tokenizer-arbitrage.md.
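
The routing claim above combines two ratios: the list-price ratio and the tokenizer inflation for the same text. A minimal sketch of that product follows; `effective_input_ratio` is a hypothetical helper, and the inflation values are the local measurements quoted in this section, not official figures.

```python
def effective_input_ratio(src_price: float, dst_price: float, token_inflation: float) -> float:
    """How many times cheaper the same text is on the destination model.

    token_inflation is tokens(source model) / tokens(destination model) for
    identical text, e.g. about 1.38 for English prose and 1.15 for Python code
    in the local measurements cited above.
    """
    if dst_price <= 0 or token_inflation <= 0:
        raise ValueError("prices and inflation must be positive")
    return (src_price / dst_price) * token_inflation


# Fable 5 ($10/MTok input) to Sonnet 4.6 ($3/MTok input).
for label, inflation in {"prose": 1.38, "python": 1.15}.items():
    print(label, round(effective_input_ratio(10.0, 3.0, inflation), 2))
```

The inflation factor has to be measured on the corpus actually being routed, which is why the code takes it as a parameter rather than hard-coding 30%.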

## 4. Measurement instruments (how to see spend at all) [#4-measurement-instruments-how-to-see-spend-at-all]

**API usage object — the ground truth.** Every Messages response carries
`usage.input_tokens` (uncached), `usage.cache_creation_input_tokens` (further broken down in
`usage.cache_creation.ephemeral_5m_input_tokens` / `ephemeral_1h_input_tokens`),
`usage.cache_read_input_tokens`, `usage.output_tokens` (visible + thinking + tool\_use blocks).
Total prompt size = input + cache\_creation + cache\_read. All dossier arithmetic uses these fields.
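
As an illustration of how these fields turn into dollars, here is a small sketch that prices one usage object at the Fable 5 rates from section 1. The function names are hypothetical, and the sketch assumes every cache write used the 5-minute TTL; a session mixing TTLs would need the `ephemeral_5m_input_tokens` / `ephemeral_1h_input_tokens` breakdown instead.

```python
FABLE_5_PRICES = {  # $/MTok, copied from the price table in section 1
    "input_tokens": 10.00,
    "cache_creation_input_tokens": 12.50,  # assumption: all writes at 5-minute TTL
    "cache_read_input_tokens": 1.00,
    "output_tokens": 50.00,
}


def usage_cost_usd(usage: dict, prices: dict = FABLE_5_PRICES) -> float:
    """Dollar cost of one usage object; missing or null fields count as zero."""
    return sum((usage.get(field) or 0) * price for field, price in prices.items()) / 1_000_000


def prompt_tokens(usage: dict) -> int:
    """Total prompt size = input + cache_creation + cache_read."""
    return sum(
        usage.get(field) or 0
        for field in ("input_tokens", "cache_creation_input_tokens", "cache_read_input_tokens")
    )


# Summed usage of the measured 19-call session reported in section 5.
session = {
    "input_tokens": 5_475,
    "cache_creation_input_tokens": 84_693,
    "cache_read_input_tokens": 1_167_417,
    "output_tokens": 26_977,
}
print(prompt_tokens(session), round(usage_cost_usd(session), 2))
```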

**`count_tokens` — the free experiment instrument.** `POST /v1/messages/count_tokens` accepts
messages/system/tools/thinking exactly like a real request. Verified live (token-counting page): **free**, separate rate limit (100 RPM tier 1 → 8,000 RPM tier 4), counts are an
"estimate" that "may differ by a small amount" (system-added tokens are not billed), it never
touches the cache, and **thinking blocks from previous assistant turns are ignored** — matching
the production rule that prior-turn thinking is stripped from billed context (a built-in,
automatic saving documented in 18-provider-features.md).
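
A minimal sketch of calling the endpoint with only the standard library is shown below. The host (`api.anthropic.com`), the `anthropic-version` value, and the model ID passed in are assumptions to check against the current API reference; the helper names are hypothetical. Building the request is kept separate from sending it so the payload can be inspected offline.

```python
import json
import urllib.request

COUNT_TOKENS_URL = "https://api.anthropic.com/v1/messages/count_tokens"


def build_count_tokens_request(api_key, model, messages, system=None, tools=None):
    """Build (but do not send) a count_tokens request for the given payload."""
    body = {"model": model, "messages": messages}
    if system is not None:
        body["system"] = system
    if tools:
        body["tools"] = tools
    return urllib.request.Request(
        COUNT_TOKENS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "x-api-key": api_key,
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
        },
        method="POST",
    )


def count_tokens(request) -> int:
    """Send the request and return input_tokens (needs network access and a valid key)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["input_tokens"]
```

One way to use it as an experiment instrument is differential counting: count the same messages with and without a `tools` array (or a candidate system prompt) and subtract, which isolates the marginal token cost of that one component.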

**Claude Code surfaces.**

* `/cost` — per-session totals; `/context` — live context-window decomposition (system prompt,
  tools, MCP, memory files, messages). Quick, but session-scoped and manual.
* **OpenTelemetry** (`CLAUDE_CODE_ENABLE_TELEMETRY=1`, OTLP exporters): metrics
  `claude_code.token.usage` (attribute `type` ∈ `input` / `output` / `cacheRead` /
  `cacheCreation`, plus `model`, and `skill.name` / `plugin.name` / `agent.name` for attributing
  spend to skills, plugins, and subagents), `claude_code.cost.usage` (USD),
  `claude_code.active_time.total`, plus events: `claude_code.api_request` (per-call tokens+cost),
  `claude_code.compaction`, `claude_code.tool_decision`. `OTEL_LOG_RAW_API_BODIES=1` captures
  full request/response bodies (60 KB truncation; **thinking content is always redacted**).
  This is the right backbone for a personal token dashboard; the per-`agent.name` attribution is
  exactly what's needed to measure subagent economics (17-multi-agent-protocols.md).
  (code.claude.com/docs/en/monitoring-usage.)
* **Session JSONL transcripts** (`~/.claude/projects/<project>/<session>.jsonl`): per-call
  usage including cache fields. Two traps found locally (02-baseline-audit.md): the same
  `message.usage` repeats on every content-block line (dedup by `message.id` or overcount \~3×),
  and thinking text is redacted (`thinking: ""`), so thinking must be *inferred* as
  `output_tokens − count_tokens(visible blocks)`.
* **Console usage pages / ccusage-style analyzers** — covered with the market scan in
  03-prior-art-and-market-scan.md; JSONL-based analyzers inherit both transcript traps above.
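
The first transcript trap (repeated `message.usage` lines) can be handled with a dedup pass like the sketch below. The record layout is an assumption inferred from the description above, namely a top-level `message` object carrying `id` and `usage`; real transcripts may contain other record shapes, which this sketch simply skips.

```python
import json
from collections import Counter

USAGE_FIELDS = (
    "input_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
    "output_tokens",
)


def sum_transcript_usage(lines) -> dict:
    """Sum usage across transcript lines, counting each message.id only once."""
    seen = set()
    totals = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        message = json.loads(line).get("message")
        if not isinstance(message, dict):
            continue
        usage, message_id = message.get("usage"), message.get("id")
        if not usage or message_id is None or message_id in seen:
            continue  # no usage, or a repeat of an API call already counted
        seen.add(message_id)
        for field in USAGE_FIELDS:
            totals[field] += usage.get(field) or 0
    return dict(totals)


# Usage (the path is a placeholder):
# with open("session.jsonl", encoding="utf-8") as handle:
#     print(sum_transcript_usage(handle))
```

Keeping the first occurrence per `message.id` is a design choice of this sketch; since the text says the same usage repeats on every content-block line, any single occurrence should carry the same numbers.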

## 5. The modeled session profile (basis for all stack arithmetic) [#5-the-modeled-session-profile-basis-for-all-stack-arithmetic]

Measured base (this environment, 02-baseline-audit.md): one heavy orchestration session = 19 API
calls; 5,475 uncached input / 84,693 cache-write / 1,167,417 cache-read / 26,977 output tokens
(54.8% thinking at max effort).

**HEAVY-DAY profile.** Two internally consistent variants are used in this dossier: a $17
floor (5 sessions, 45% thinking; the table below) and a **$22 working figure (6 sessions, 55%
thinking)** that the area files (10–19) and the composed stacks (30) adopt; both scale the same
measured session, and every stack multiplier is unchanged across them. Table for the $17 floor
(assumption-explicit scaling; multiply rows by 1.2 and shift the thinking split for the $22
variant):

| Class                   | Tokens/day | × Fable 5 price |      $/day | Share |
| ----------------------- | ---------: | --------------: | ---------: | ----: |
| Uncached input          |     25,000 |        $10/MTok |      $0.25 |  1.5% |
| Cache write (5m)        |    400,000 |     $12.50/MTok |      $5.00 | 29.4% |
| Cache read              |  5,500,000 |         $1/MTok |      $5.50 | 32.4% |
| Output — thinking (45%) |     56,250 |        $50/MTok |      $2.81 | 16.5% |
| Output — visible (55%)  |     68,750 |        $50/MTok |      $3.44 | 20.2% |
| **Total**               |            |                 | **$17.00** |  100% |

Assumptions recorded: (a) main-loop only — subagent/workflow fan-out multiplies this profile and
is modeled separately in 17-multi-agent-protocols.md; (b) cache behavior healthy (92.8% read
share measured) — a session with idle gaps >5 min degrades writes into the dominant line;
(c) thinking share 45% (measured 54.8% at max effort; lower effort settings reduce it — sweep in
15-output-discipline.md). Sensitivity: at 55% thinking the output rows become $3.44/$2.81
reversed; total unchanged. At a 70% cache-read share (sloppier sessions), day cost rises \~$3–4
from extra writes.
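
The $17 floor table and the thinking-share sensitivity can be reproduced with a short sketch. Token counts and prices are copied from the table; the names `PROFILE`, `day_cost`, and `split_output` are illustrative only.

```python
PROFILE = {  # class: (tokens/day, $/MTok on Fable 5), from the $17 floor table
    "uncached_input": (25_000, 10.00),
    "cache_write_5m": (400_000, 12.50),
    "cache_read": (5_500_000, 1.00),
    "output": (125_000, 50.00),
}


def day_cost(profile: dict) -> dict:
    """Dollar cost per token class for one day."""
    return {name: tokens * price / 1_000_000 for name, (tokens, price) in profile.items()}


def split_output(output_usd: float, thinking_share: float) -> tuple:
    """Split the output line into (thinking, visible) dollars."""
    return output_usd * thinking_share, output_usd * (1 - thinking_share)


costs = day_cost(PROFILE)
total = sum(costs.values())
shares = {name: usd / total for name, usd in costs.items()}

for name, usd in costs.items():
    print(f"{name:15s} ${usd:5.2f}  {shares[name]:6.1%}")
print(f"total           ${total:5.2f}")

# Sensitivity: moving the thinking share changes the split, not the total,
# because thinking and visible output bill at the same rate.
print(split_output(costs["output"], 0.45), split_output(costs["output"], 0.55))
```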

**The $/task metric (brief §4).** All stack claims in 30-composed-stacks.md use:
`Nx = (baseline $/completed task) ÷ (optimized $/completed task)` at statistically
indistinguishable success on the validation suite (31-validation-harness.md). With \~10 completed
tasks/heavy day, baseline ≈ **$1.70/task** on this profile.
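
For concreteness, the metric can be written as two small functions. This is only a sketch of the arithmetic: the optimized figure in the example is a made-up placeholder, not a result from this dossier, and the success-parity check from the validation harness is outside its scope.

```python
def cost_per_task(total_usd: float, completed_tasks: int) -> float:
    """Total spend (including failed attempts) divided by completed tasks only."""
    if completed_tasks <= 0:
        raise ValueError("need at least one completed task")
    return total_usd / completed_tasks


def nx(baseline_per_task: float, optimized_per_task: float) -> float:
    """Nx = baseline $/completed task divided by optimized $/completed task."""
    if optimized_per_task <= 0:
        raise ValueError("optimized cost per task must be positive")
    return baseline_per_task / optimized_per_task


baseline = cost_per_task(17.00, 10)  # the $17 floor profile, ~10 tasks/heavy day
hypothetical_optimized = 0.85        # placeholder value for illustration only
print(round(baseline, 2), round(nx(baseline, hypothetical_optimized), 2))
```

Dividing by completed tasks rather than attempted ones means an optimization that lowers spend but also lowers the completion rate does not automatically score as a win.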

## Verification ledger [#verification-ledger]

| Claim / number                                                                                                                                                        | Basis                                                                                                                                        |
| --------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------- |
| Full price table, multipliers, batch 50%, 1M standard pricing, fast-mode prices, inference\_geo 1.1×, tool-use system prompt sizes, web search $10/1k                 | [https://platform.claude.com/docs/en/about-claude/pricing](https://platform.claude.com/docs/en/about-claude/pricing)                         |
| count\_tokens free, RPM tiers 100/2,000/4,000/8,000, estimate caveat, prior-turn thinking ignored, no caching                                                         | [https://platform.claude.com/docs/en/build-with-claude/token-counting](https://platform.claude.com/docs/en/build-with-claude/token-counting) |
| Tokenizer "roughly 30% more tokens" (Opus 4.7+ tokenizer)                                                                                                             | token-counting page, same access date; "up to 35%" on pricing page                                                                           |
| OTel metric names/attributes, OTEL\_LOG\_RAW\_API\_BODIES, thinking always redacted                                                                                   | [https://code.claude.com/docs/en/monitoring-usage](https://code.claude.com/docs/en/monitoring-usage)                                         |
| Session decomposition, thinking 54.8%, prompt mix 0.44/6.73/92.83, tokenizer +15–38% local (prose-specific; code/CJK near-neutral); session split is profile-specific | Local measurements, methods in 02-baseline-audit.md                                                                                          |
| Heavy-day profile                                                                                                                                                     | ESTIMATE — measured session × 5, assumptions stated in §5                                                                                    |
