# 11 — Tokenizer arbitrage (https://jackin.tailrocks.com/research/token-optimization/11-tokenizer-arbitrage/)



# 11 — Tokenizer arbitrage [#11--tokenizer-arbitrage]

**TL;DR**

* Claude bills BPE tokens, not characters. Re-encoding the *same* information changes cost with zero information loss. Biggest measured lever: serialization format — the same 10-record/7-field dataset spans &#x2A;*986 tokens (pretty XML) to 402 (CSV)** on `claude-fable-5`, a **2.45x spread*&#x2A; (local run A; a parallel local run B on a different dataset measured 3.1x). Minified JSON → CSV is &#x2A;*-34%*&#x2A;, pretty JSON → minified is &#x2A;*-29% (Fable) / -41% (Sonnet)**.
* Anthropic runs exactly **two live tokenizer families**: official docs confirm Fable 5/Mythos 5 use the Opus-4.7 tokenizer, "roughly 30% more tokens" than older models. Locally the premium is **an ASCII/English tax*&#x2A;: \~0% on CJK/digits/emoji, +15.6% Python, +34% minified JSON, +59% English prose, **+132% SCREAMING\_SNAKE**. Family equality is *exact*: Haiku 4.5 = Sonnet 4.6 and Opus 4.8 = Fable 5 on every probe.
* Effective price ratio for identical English prose: **Fable input costs 5.3x Sonnet** (not the 3.3x list ratio) and the same context window holds **37% less** English on the new tokenizer.
* Most character-level folklore is dead on measurement: indent width is literally free (862=862=862 tokens for 2sp/4sp/tab JSON), CSV=TSV=PSV to the exact token, YAML *loses&#x2A; to minified JSON, hex≈base64 per byte, wenyan/Chinese never beats English. Cheap real wins: epoch timestamps &#x2A;*-67%*&#x2A;, no thousands separators (&#x2A;*+50.6%*&#x2A; tax), de-hyphenated UUIDs &#x2A;*-17%*&#x2A;, ASCII tables vs box-drawing &#x2A;*-20 to -28%**.
* Everything here is testable in minutes for $0 via the **free `count_tokens` endpoint** (own rate pool, 100–8,000 RPM by tier) — the only authoritative counter; all cl100k-proxy "Claude calculators" are wrong by construction.

## Method [#method]

All "local" numbers: against the live `count_tokens` API via `/tmp/ct.py` (battery script `/tmp/tok-arb/measure.py`, raw results `/tmp/tok-arb/results.json`). Delta method: `tokens(payload) = count(payload as single user message) − overhead`, with overhead calibrated per model as `count("a") − 1` (measured: 6 on `claude-fable-5` and `claude-opus-4-8`, 7 on `claude-sonnet-4-6` and `claude-haiku-4-5`). Caveats: (a) docs say counts are an **estimate** that "may differ by a small amount" from billing; (b) payloads are n=1 samples — a parallel local battery the same day ("run B", different 10x7 dataset, same method) produced the same orderings with shifted magnitudes, quoted below where it widens the range; (c) isolated 1–12 digit probes on Sonnet read a constant +1 over the ceil(n/3) law (single-message boundary artifact); the law is confirmed exactly in-context (20 8-digit ints + 19 newlines = 79 Fable tokens = 20x3+19).

Dollar arithmetic uses the modeled heavy-day profile from 01-economics-and-measurement.md: \~$22/day at Fable 5 list prices, split 32% cache reads / 29% cache writes / 20% thinking / 17% visible output / 2% uncached input (local measurement).

## The measured format table (10 records x 7 fields: id, name, email, role, region, ISO date, float score) [#the-measured-format-table-10-records-x-7-fields-id-name-email-role-region-iso-date-float-score]

| Format                           | chars | Fable 5 tok         | Sonnet 4.6 tok      | vs minified JSON (Fable) |
| -------------------------------- | ----- | ------------------- | ------------------- | ------------------------ |
| XML, pretty                      | 2,222 | 986                 | 901                 | +61%                     |
| JSON indent=2                    | 1,824 | 862                 | 777                 | +41%                     |
| JSON indent=4                    | 2,144 | **862**             | **777**             | +41%                     |
| JSON indent=tab                  | 1,664 | **862**             | **777**             | +41%                     |
| YAML (block)                     | 1,291 | 628                 | 543                 | +2.8%                    |
| JSON minified                    | 1,343 | 611                 | 456                 | baseline                 |
| Markdown table, padded           | 1,103 | 544                 | 448                 | -11%                     |
| Markdown table, compact          | 773   | 440                 | 348                 | -28%                     |
| JSON `{"fields":[],"rows":[[]]}` | 877   | 437                 | 327                 | -28.5%                   |
| TOON                             | 755   | 419                 | 333                 | -31.4%                   |
| CSV / TSV / PSV                  | 721   | **402 / 402 / 402** | **317 / 317 / 317** | -34.2%                   |

T1, local run A. Run B (different field mix, shorter values → key overhead dominates more): XML 945, pretty 779, min 548, TOON 317, CSV 301 — i.e. min→CSV ranged &#x2A;*-34% to -45%** across the two samples. Savings scale with the key:value length ratio of *your* data; measure yours (see last technique).

***

### Tabular re-serialization: JSON record arrays → CSV/TSV, or TOON when you want guardrails [#tabular-re-serialization-json-record-arrays--csvtsv-or-toon-when-you-want-guardrails]

One-line pitch: state the schema once, not per record — the biggest input-format lever measured.

* **Layer:** input (tool results, context files, RAG payloads)
* **Mechanism:*&#x2A; JSON repeats every key plus quotes/braces per record; BPE charges for all of it. Column-once formats pay for keys a single time. Local: minified JSON 611 → CSV 402 Fable (&#x2A;*-34.2%*&#x2A;; Sonnet -30.5%; run B -45%); pretty JSON → CSV &#x2A;*-53.4%*&#x2A; (run B -61%). Delimiter is a pure no-op: CSV=TSV=PSV exactly (402/402/402 Fable, 317/317/317 Sonnet). TOON costs **+4.2% over CSV** locally (run B +5.3%; TOON repo's own figure +5.9% on o200k\_base, 67,778 vs 63,997 tokens) while adding `[N]` length and `{fields}` schema guardrails.
* **Expected savings:*&#x2A; ESTIMATE on the modeled profile: if structured record data is \~15% of prompt-side tokens (assumption), a 34% cut on that slice saves 0.34 x 0.15 x (32%+29%+2% of $22) ≈ &#x2A;*$0.71/day ≈ 3.2%** of day cost; rises toward 5% where pretty JSON was entering context.
* **Evidence tier:** T1 token counts (local). Accuracy: T3 — TOON repo's 209-question, 4-model benchmark (o200k counts): TOON 76.4% vs JSON 75.0% overall at 39.9% fewer tokens; claude-haiku row TOON 59.8% / JSON 57.4% / CSV 50.5% (CSV evaluable on only 109/209 flat questions). No current-generation Claude accuracy benchmark exists (gap).
* **Quality risk:** NEUTRAL for TOON; **RISKY for bare CSV** on retrieval-heavy tasks — repo benchmark shows CSV materially worst, and in its truncation-detection test CSV scored 100% vs TOON 0.0% the other way (TOON's declared `[N]` can mask silently dropped rows it was designed to catch — read the fine print both ways). Degradation manifests as wrong-row lookups.
* **Availability:** CLAUDE-CODE-TODAY — transform at the MCP/tool boundary or via a hook; a community Claude Code **UserPromptSubmit** hook already auto-converts JSON in prompts to TOON (gist).
* **Effort to adopt:** hours (hook or serializer change); only for uniform arrays — repo's own caveat: compact JSON often *wins* on deeply nested/non-uniform data.
* **Composability:** multiplies through caching (smaller static content = cheaper cache write AND 0.1x reads thereafter; see 13-caching-exploitation.md); stacks with numeric/ID normalization below (those dominate per-row cost); orthogonal to model routing.
* **Validation protocol:** take 50 retrieval questions over your own top-3 highest-volume tool outputs; run each task with JSON vs CSV vs TOON payloads on your production model; accept the cheapest format whose accuracy drop is \<1pp; record token delta via `count_tokens`.

### Minify JSON — and stop arguing about indentation, which is free [#minify-json--and-stop-arguing-about-indentation-which-is-free]

One-line pitch: strip pretty-printing for -29% (Fable) / -41% (Sonnet); indent *width* itself costs 0.

* **Layer:** input
* **Mechanism:** local: JSON at indent=2 / indent=4 / indent=tab is **token-identical** (862=862=862 Fable; 777=777=777 Sonnet) despite a 480-char spread — whitespace runs absorb into \~1 token regardless of width (8 spaces+x = 2 tokens, 24 spaces+x = 3 tokens, both families). The pretty-print tax is per *line&#x2A; and per delimiter-space, not per indent character: minifying removes it (862→611 Fable, &#x2A;*-29.1%*&#x2A;; 777→456 Sonnet, &#x2A;*-41.3%**). Same for code: a 19-line Python function at 4sp/2sp/tab = 200/200/201 Fable tokens (0.5% spread). One real whitespace cost found: markdown-table column-alignment *padding&#x2A; (many short mid-line runs) costs &#x2A;*+23.6%** over a compact table — padding ≠ indentation.
* **Expected savings:** 29–41% on any pretty-JSON slice entering context (default `jq`, many REST APIs, `python -m json.tool` output). ESTIMATE: if that's 5% of prompt tokens, \~1.0–1.5% of day cost, free.
* **Evidence tier:** T1 (local).
* **Quality risk:** NEUTRAL — models parse minified JSON fine; only human transcript readability suffers. Falsification: field-extraction QA on minified vs pretty payloads.
* **Availability:** CLAUDE-CODE-TODAY (`jq -c`, `separators=(",", ":")`, hook on tool output).
* **Effort to adopt:** minutes.
* **Composability:** the fallback inside tabular re-serialization for non-uniform data; cache-compatible.
* **Validation protocol:** 20 extraction questions against the same payload minified vs pretty; require identical answers; diff token counts with `count_tokens`.

### JSON fields+rows (array-of-arrays) — the no-new-format 80% solution [#json-fieldsrows-array-of-arrays--the-no-new-format-80-solution]

One-line pitch: `{"fields":[...],"rows":[[...]]}` keeps valid JSON and captures most of the CSV win.

* **Layer:** input
* **Mechanism:*&#x2A; key de-duplication expressed inside JSON. Local: 437 Fable tokens vs minified object-JSON 611 (&#x2A;*-28.5%**; Sonnet -28.3%), only 8.7% above CSV (402) and 4.3% above TOON (419). Deflates "you need TOON" marketing: the dominant saving is stating keys once, not the notation — independently corroborated by the ONTO preprint's finding that "eliminating key repetition is the dominant factor" while YAML's punctuation-only savings are 1–6%.
* **Expected savings:** \~28% on uniform-array JSON slices with zero new tooling.
* **Evidence tier:** T1 (local); ONTO preprint T4-leaning-T3 for the mechanism statement (its absolute numbers used the wrong tokenizer — see claims killed).
* **Quality risk:** NEUTRAL-to-RISKY: rows lose per-field self-description; no Claude accuracy benchmark for this exact shape. Degradation = column-index confusion on wide tables.
* **Availability:** CLAUDE-CODE-TODAY.
* **Effort to adopt:** minutes-to-hours (pure JSON transform; downstream parsers keep working).
* **Composability:** drop-in inside JSON-only pipelines where CSV/TOON would break schema validators.
* **Validation protocol:** same 50-question retrieval A/B as tabular re-serialization, with column-count ≥7 to stress positional addressing.

### Tokenizer-aware routing and budgeting: the Opus-4.7+ "30% tax" is an ASCII/English tax [#tokenizer-aware-routing-and-budgeting-the-opus-47-30-tax-is-an-asciienglish-tax]

One-line pitch: the premium is 0% to +132% by content, so route ASCII-heavy bulk work to the old-tokenizer family and never reuse counts across families.

* **Layer:** infra (model routing, budgeting)
* **Mechanism:** official docs : Fable 5/Mythos 5 "use the tokenizer introduced with Claude Opus 4.7, which produces roughly 30% more tokens than models before Claude Opus 4.7 for the same text… don't reuse token counts measured on a model before Claude Opus 4.7." Locally the family split is **exact*&#x2A;: Haiku 4.5 = Sonnet 4.6 and Opus 4.8 = Fable 5 on all probes (54/54 & 86/86 prose; 456/456 & 611/611 minified JSON; 153/153 & 355/355 SCREAMING\_SNAKE). Measured Fable-over-Sonnet premium by content: English prose &#x2A;*+59.3%** (86 vs 54; Phase-0 corpus \~+38%, official average \~30% — strongly sample-dependent), minified JSON +34.0%, CSV +26.8%, Python +15.6% (Phase-0: \~+15%), pretty JSON +10.9%, camelCase +85%, PascalCase +103%, &#x2A;*SCREAMING\_SNAKE +132%*&#x2A; — versus **-1.1% Chinese, 0.0% Japanese, -1.2% digit runs, -0.7% emoji**. Community framing "up to 35% more tokens… the 35% Tokenizer Tax" (UsageBox) is the right direction but still understates the ASCII tail.
* **Expected savings:** ESTIMATE from measured counts x verified list prices: identical English paragraph as input costs (86x$10)/(54x$3) = **5.3x** more on Fable than Sonnet (15.9x vs Haiku); same window holds **37.2% less** English prose (3.66 vs 5.83 chars/token). Routing a 1M-char English doc-triage subagent: Fable ≈ 273k tokens = $2.73 vs Sonnet ≈ 171k = $0.51 uncached input. CJK/numeric workloads: \~0% tokenizer win (route on capability/price only).
* **Evidence tier:** T1 (official docs + local reproduction).
* **Quality risk:** QUALITY-TRADE at routing level (cheaper family = less capable models — that's the real trade, covered elsewhere); NEGATIVE-COST at budgeting level (counting with the right model id is free and prevents 30–130% misestimates). Degradation = subagent task failures; falsify per task class.
* **Availability:** CLAUDE-CODE-TODAY (subagent `model:` selection; `count_tokens` with the target model id).
* **Effort to adopt:** minutes (budgeting) / hours (routing rules by content type).
* **Composability:** rescales every other technique's denominator — minification saves 41% on Sonnet but 29% on Fable; never port format-ROI numbers across the 4.7 boundary. Pairs with subagent delegation patterns in 12-context-architecture.md.
* **Validation protocol:** for each bulk task class, run 20 instances on both families; compare $/accepted-output (LLM-judge + human spot-check 20%); adopt the cheap family where acceptance is within 2pp.

### ID, hash, and timestamp representation [#id-hash-and-timestamp-representation]

One-line pitch: epoch beats ISO by 3x, de-hyphenate UUIDs, short integer IDs beat UUIDs by \~9x — but hex→base64 conversion saves nothing.

* **Layer:** input (content normalization in tool results, logs, DB dumps)
* **Mechanism:*&#x2A; local (Fable): 10 ISO-8601 timestamps = 149 tokens (14.9 each) vs 10 epoch ints = 49 (4.9 each), &#x2A;*-67.1%*&#x2A;. 8 hyphenated UUIDs = 214 vs de-hyphenated 178 (&#x2A;*-16.8%**). Digit-run law: numbers chunk at **ceil(n/3) tokens*&#x2A; (1–3 digits = 1 tok, …, 12 digits = 4) — so an 8-digit surrogate ID = 3 tokens vs a UUID's \~22–27 (&#x2A;*\~-89%**). sha256 as hex = 43.5 tok/digest vs base64 = 43.2 — a tie: base64's 33% character advantage is exactly cancelled by worse merges (1.04 chars/token vs hex 1.49). Any inline binary blob ≈ 1 token/char, \~4x worse than prose — keep binary out of context.
* **Expected savings:*&#x2A; dominated-column arithmetic: a 500-row tool result with one ISO timestamp + one UUID per row carries \~20.9k tokens in those two columns; epoch + 8-digit-int cuts them to \~4.0k (&#x2A;*-81%**). Session-level ESTIMATE: 1–4% of prompt tokens in log/DB-heavy work.
* **Evidence tier:** T1 (local).
* **Quality risk:** **QUALITY-TRADE for epoch** (models do date arithmetic more reliably on ISO; keep ISO when the task reasons about dates) and for truncated hashes (ambiguity). NEUTRAL for UUID de-hyphenation and surrogate IDs. Degradation = wrong relative-time reasoning; falsify with date-math QA.
* **Availability:** CLAUDE-CODE-TODAY (control the SELECT/serializer; or a normalizing hook on tool output).
* **Effort to adopt:** minutes where you own the query; hours for hook-based normalization of third-party tools.
* **Composability:** multiplies inside tabular re-serialization — ID/timestamp columns dominate per-row cost in real datasets.
* **Validation protocol:** 20 date-reasoning questions (ordering, deltas, "how long ago") on epoch vs ISO payloads; adopt epoch only for task classes with zero accuracy drop; verify column token deltas with `count_tokens`.

### Numeric formatting hygiene: no separators, minimal decimals, no scientific notation [#numeric-formatting-hygiene-no-separators-minimal-decimals-no-scientific-notation]

One-line pitch: comma/underscore grouping costs +50.6%; every \~3 excess decimal digits ≈ 1 wasted token per value.

* **Layer:** input
* **Mechanism:*&#x2A; local (20 8-digit ints): plain 79 Fable / 80 Sonnet; comma-grouped 119/120; underscore-grouped 119/120 (&#x2A;*+50.6%*&#x2A; — separators break 3-digit chunking and add their own tokens). Floats: 6dp = 119 vs 2dp = 99 (&#x2A;*-16.8%** for dropping 4 decimals); scientific notation worst at +40.4% over 2dp (1.43 chars/token). Numbers are family-neutral (79 vs 80) — numeric payloads carry no Fable tax.
* **Expected savings:** 33% on any grouped-number column (119→79); \~17% per 4 excess decimals on metric dumps. Session ESTIMATE: \<1% generally, several % for telemetry/metrics-heavy tool output.
* **Evidence tier:** T1 (local).
* **Quality risk:** NEGATIVE-COST for de-grouping (locale separators also occasionally parse as list delimiters); QUALITY-TRADE for precision truncation only if downstream math needs the digits.
* **Availability:** CLAUDE-CODE-TODAY (formatter flags: no locale grouping, `round(x,2)`).
* **Effort to adopt:** minutes.
* **Composability:** stacks inside tabular re-serialization; pairs with output-side rules in 15-output-discipline.md (told-to-emit numbers ride the \~5x output price).
* **Validation protocol:** numeric-comparison QA ("which row has max latency?") on grouped vs plain payloads — expect equal or better accuracy plain; token delta via `count_tokens`.

### Strip Unicode decoration from machine-bound text [#strip-unicode-decoration-from-machine-bound-text]

One-line pitch: emoji ≈ 2.7 tokens each, box-drawing tables +25–39% over ASCII — decoration is the most expensive text per character measured.

* **Layer:** input + output (tool output styling, agent reply style)
* **Mechanism:*&#x2A; local: 50 emoji = 137 Fable / 138 Sonnet tokens (**\~2.74 tokens/emoji** incl. variation selectors — an emoji outprices most English words). Box-drawn table 109 vs identical-layout ASCII `+--|&#x60; table 87 Fable (&#x2A;*+25.3%*&#x2A;; Sonnet 104 vs 75, &#x2A;*+38.7%**). Math-symbol soup \~0.6–0.9 chars/token vs English 3.66–5.83. Counterexample that kills "ASCII always wins": "→ "x20 = 22 Fable tokens vs "-> "x20 = 41 — Unicode arrow is \~half price on Fable (Sonnet: 21 = 21). Merge tables are arbitrary; count, don't assume.
* **Expected savings:** 20–28% of any specific TUI-style table or emoji-rich status block; session-level ESTIMATE 1–3% of tool-result bytes. Output-side: every emoji the agent skips saves \~2.7 tokens at $50/MTok — and this style rule is already in this repo's agent instructions (no-emoji rule), making the saving free here.
* **Evidence tier:** T1 (local).
* **Quality risk:** NEGATIVE-COST — same information, fewer tokens, plus fewer tokenizer-fragile glyphs. Falsification: none needed beyond confirming tools still parse.
* **Availability:** CLAUDE-CODE-TODAY (`NO_COLOR=1`, `--no-unicode` flags on CLIs the agent invokes; a style line in <RepoFile path="AGENTS.md">AGENTS.md</RepoFile> / output style).
* **Effort to adopt:** minutes (env vars + one instruction line).
* **Composability:** output-side wins ride the \~5x output price multiplier (see 15-output-discipline.md); input-side stacks with everything.
* **Validation protocol:** run the agent's usual test/lint/build commands with and without plain-ASCII env flags; diff tool-result token counts across 10 sessions; verify zero parsing regressions.

### Natural-language arbitrage — VERDICT: dead for Claude (QUALITY-TRADE, near-zero savings) [#natural-language-arbitrage--verdict-dead-for-claude-quality-trade-near-zero-savings]

One-line pitch: wenyan/Chinese prompting does not survive BPE; compressed English is the cheapest carrier on both Claude families.

* **Layer:** input + output (language choice)
* **Mechanism:** BPE vocabularies are ASCII/English-optimized: English runs 3.66 (Fable) – 5.83 (Sonnet) chars/token; Chinese 0.93, Japanese 1.10 — multi-byte scripts cost \~1 token/char, erasing their per-character density. Same-meaning paragraph, local: EN 315 chars/**86** Fable tokens; ZH 83 chars/**89** (+3.5% — Chinese costs *more*); JA **107*&#x2A; (+24%). On Sonnet the gap is brutal: EN 54 vs ZH 90 (&#x2A;*+66.7%**). Phase-0 ground truth: wenyan-full = 80.9% char cut but only **56.6% token cut*&#x2A;, wenyan-ultra 74.5% — vs plain-English caveman-ultra 58.5%, so the CJK detour adds little over aggressive English telegraphese. Independent replication (ai.rs, verified live): the famous "28% compression" wenyan claim "was character-counted, not token-counted"; real tiktoken runs put wenyan **+6.4% WORSE** than English shorthand on cl100k\_base (234 vs 220) and only -13.2% on o200k\_base. Mechanism predicted by Petrov et al. (NeurIPS 2023): up to **15x** cross-language tokenization differences; >4x even for byte-level models. One nuance: the Fable tokenizer taxes English (+59% here) but not CJK (±0%), so EN-vs-ZH narrows to parity on new models — narrows, never inverts.
* **Expected savings:** \~0% to negative vs telegraphic English. Folklore claims 75–80%; measured reality 37–57% vs *verbose* English only — which plain English compression beats without the costs below.
* **Evidence tier:** T1 (two independent + Phase-0) + T3 (ai.rs replication) + T2 (Petrov et al., arXiv:2305.15425).
* **Quality risk:** RISKY-to-QUALITY-TRADE: cross-lingual instruction-following degrades, transcripts become unreviewable by most operators, savings small-to-negative. Verdict: do not adopt.
* **Availability:** CLAUDE-CODE-TODAY (trivially — by *not* doing it; use English-register compression instead, covered in 03-prior-art-and-market-scan.md and 15-output-discipline.md).
* **Effort to adopt:** n/a.
* **Composability:** replaced by English style compression; anti-synergy with everything reviewable.
* **Validation protocol:** (for anyone tempted) translate your own 5 longest always-loaded prompts; `count_tokens` both versions on your production model; you will measure ≈0 or negative savings — then stop.

### Identifier casing and ASCII-pattern awareness — observe at design time, never refactor [#identifier-casing-and-ascii-pattern-awareness--observe-at-design-time-never-refactor]

One-line pitch: the same 40 two-word identifiers cost 143 (snake) to 355 (SCREAMING\_SNAKE) Fable tokens, and the cheap casing *flips* between tokenizer generations.

* **Layer:** input (schema/DSL/tool-name design, cost forecasting)
* **Mechanism:*&#x2A; local, identical word pairs: Fable snake\_case 143 = kebab 143 \< camelCase 178 (+24%) \< PascalCase 201 (+41%) \<\< SCREAMING\_SNAKE 355 (&#x2A;*+148%**). Sonnet *inverts* the order: camel 96 ≈ Pascal 99 \< snake 120 \<\< SCREAMING 153. Mixed-case/all-caps ASCII lost merge coverage in the new vocabulary — which is also why the Fable tax peaks at +132% on SCREAMING and +95% on ASCII operator soup. Use only when *defining* new DSLs, config keys, MCP tool/parameter names that will transit context thousands of times: prefer lowercase snake/kebab. Never rename existing code for tokens — grep-ability and convention dominate.
* **Expected savings:** up to 60% of the identifier component in pathological constants-heavy text (355→143); realistically 1–5% of a code-heavy session — ESTIMATE; compounds in always-loaded surfaces (tool schemas, <RepoFile path="AGENTS.md">AGENTS.md</RepoFile>) via cache reads.
* **Evidence tier:** T1 (local).
* **Quality risk:** NEUTRAL for new-schema choice; &#x2A;*RISKY if misread as "rename your codebase"** — do not.
* **Availability:** CLAUDE-CODE-TODAY (for new names only).
* **Effort to adopt:** zero at design time.
* **Composability:** applies to MCP schema design (see 02-baseline-audit.md tool-schema overhead) and any codegen target.
* **Validation protocol:** `count_tokens` the candidate tool schema / DSL sample in both casings under the production model before freezing names; pick the cheaper if otherwise tied.

### Free `count_tokens` as the arbitrage instrument — and the death of cl100k-proxy calculators [#free-count_tokens-as-the-arbitrage-instrument--and-the-death-of-cl100k-proxy-calculators]

One-line pitch: every claim in this file is checkable in seconds for $0; anything not measured this way is folklore.

* **Layer:** infra (measurement)
* **Mechanism:** official docs : token counting is "**free to use**" with its own RPM pool (tier 1/2/3/4 = 100/2,000/4,000/8,000), independent of message-creation limits; supports system prompts, tools, images, PDFs; counts are an estimate that "may differ by a small amount"; "You are not billed for system-added tokens"; previous-turn thinking blocks are ignored in counts; counting never triggers prompt caching; ZDR-eligible. The docs explicitly prescribe the migration method used throughout this file: count the same request twice, once per model id. Anthropic's tokenizer is unpublished, so third-party calculators and papers proxying with cl100k\_base are wrong by construction — the ONTO preprint (arXiv:2604.17512v1) literally claims cl100k\_base "correspond\[s] to the tokenizer used by GPT-4 and Claude models" (false), and since Anthropic's own two families diverge 0–132% by content, no constant factor can rescue a proxy.
* **Expected savings:** enabler, not a saver — prevents 30–130% (family) and 2–15x (language) budgeting errors; gates every other adoption decision here.
* **Evidence tier:** T1 (official docs + used for every number in this file).
* **Quality risk:** NEGATIVE-COST. Residual risk: documented estimate-vs-billing drift is unquantified (gap).
* **Availability:** CLAUDE-CODE-TODAY (API key or the Claude Code OAuth token — `/tmp/ct.py` is a 35-line working reference).
* **Effort to adopt:** minutes.
* **Composability:** prerequisite for all of the above; feeds the regression harness in 31-validation-harness.md.
* **Validation protocol:** self-validating; for billing-drift, compare `count_tokens` predictions vs `usage.input_tokens` on 20 real calls and record the residual.

## Claims killed (folklore that failed measurement) [#claims-killed-folklore-that-failed-measurement]

| Claim                                  | Reality (all unless noted)                                                                                                                                                           |
| -------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| "YAML halves JSON token cost"          | Only vs *pretty* JSON. YAML 628 vs minified JSON 611 on Fable (+2.8% worse); +19.1% worse on Sonnet. ONTO independently: YAML 1–6% below JSON. Minify or go tabular instead.         |
| "Tabs/shallower indent save context"   | 0.0%: 2sp/4sp/tab JSON = 862/862/862 (Fable), 777/777/777 (Sonnet); code 200/200/201. Indent width is free.                                                                          |
| "Wenyan compresses 75–80%"             | Char illusion: Phase-0 80.9% chars → 56.6% tokens; ai.rs tiktoken test: the 28% claim was char-counted; wenyan +6.4% *worse* than English shorthand on cl100k.                       |
| "Prompt in Chinese to save money"      | Same meaning: ZH 89 vs EN 86 Fable (+3.5%); ZH 90 vs EN 54 Sonnet (+67%). Never cheaper.                                                                                             |
| "Base64 beats hex (shorter)"           | Per digest 43.2 vs 43.5 tokens — a tie; base64 tokenizes at 1.04 chars/token. And keep binary out of context entirely.                                                               |
| "cl100k\_base ≈ Claude tokenizer"      | Tokenizer unpublished; Anthropic's own two families diverge 0–132% by content — no constant-factor proxy exists. (ONTO preprint repeats the error verbatim.)                         |
| "Fable tokenizer is uniformly +30–35%" | The average hides the structure: \~0% CJK/digits/emoji → +59% English prose → +132% SCREAMING\_SNAKE. It is an ASCII tax.                                                            |
| "Caveman mode cuts \~75%"              | Phase-0: 58.5% measured token cut at ultra register. Good technique, honest number.                                                                                                  |
| "TOON strictly dominates"              | TOON = CSV +4.2–5.9%; repo's own caveats: compact JSON wins on nested data; TOON scored 0.0% vs CSV's 100% on a truncation-detection probe. Its pitch is CSV-cost *with guardrails*. |
| "Compress with math symbols (∀∑≤)"     | 0.6–0.9 chars/token vs English 3.66–5.83; "∑" costs multiple tokens where "sum" costs \~1. Counterexample proving merge-specificity: "→" (22) is *cheaper* than "->" (41) on Fable.  |
| "Emoji are one token"                  | 137 tokens / 50 emoji ≈ 2.74 each, both families.                                                                                                                                    |

## Open gaps [#open-gaps]

1. All cross-content premiums are n=1 payloads; prose premium varies +38% (Phase-0 corpus) to +59% (this battery) around the official "roughly 30%" — a 1K-file corpus sweep via the free endpoint is the cheap next step. Likewise ZH chars/token read 0.93 here vs \~1.47 in Phase-0 (different text mixes; unresolved).
2. No format-vs-accuracy benchmark on Sonnet 4.6/Fable 5 — TOON's suite only covers claude-haiku; all CSV/TOON quality verdicts on current models are extrapolated.
3. Output-side format arbitrage untested: output bills \~5x input and thinking is 54.8% of output (Phase-0); whether instructing Claude to *emit* CSV/minified JSON harms generation correctness is the highest-value open experiment here (the 862→611→402 input ratios should roughly apply to emitted data).
4. Thinking-token tokenization is unobservable (transcripts redact thinking text) — format effects inside the biggest output bucket can't be verified.
5. Cache interplay shown only as arithmetic (smaller content → cheaper write and 0.1x reads immediately); not yet demonstrated end-to-end in a live session — see 13-caching-exploitation.md.
6. `count_tokens`-vs-billed drift ("may differ by a small amount") unmeasured.
7. All numbers are black-box behavior of an unpublished tokenizer that changed once (Opus 4.7) without a content-stable migration constant — re-measure after every model-family release.

## Verification ledger [#verification-ledger]

| Number                                                                                                                                                                                                                                                                                                                                                        | Source / method                                                                                                                                                         |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Format table (986/862/611/628/544/440/437/419/402 Fable; 901/777/456/543/448/348/327/333/317 Sonnet); 2.45x spread; -34.2% min→CSV; -29.1%/-41.3% pretty→min; TOON +4.2% vs CSV; fields+rows -28.5%; CSV=TSV=PSV                                                                                                                                              | Local battery `/tmp/tok-arb/measure.py` → `/tmp/tok-arb/results.json`, count\_tokens delta method                                                                       |
| Run B figures (945/779/548/317/301; -45% min→CSV; -61% pretty→CSV; TOON +5.3%)                                                                                                                                                                                                                                                                                | Parallel local battery, same method, different 10x7 dataset                                                                                                             |
| Indent-free results (862=862=862; 777=777=777; code 200/200/201; ws runs 2/3 tok); md padding +23.6%                                                                                                                                                                                                                                                          | Same local battery                                                                                                                                                      |
| Family equality (54/54, 86/86, 456/456, 611/611, 153/153, 355/355); overheads 6/6/7/7                                                                                                                                                                                                                                                                         | Same local battery: fable vs opus-4-8, sonnet vs haiku-4-5                                                                                                              |
| Fable premium by content (-1.2% to +132%); EN 3.66 vs 5.83 chars/tok; window -37.2%; 5.3x/15.9x price ratios                                                                                                                                                                                                                                                  | Same local battery + list prices below; ratio arithmetic ESTIMATE shown inline                                                                                          |
| "roughly 30% more tokens"; count\_tokens free; RPM 100/2,000/4,000/8,000; estimate disclaimer; system-token non-billing; prior-turn thinking ignored; no caching; ZDR                                                                                                                                                                                         | [https://platform.claude.com/docs/en/build-with-claude/token-counting](https://platform.claude.com/docs/en/build-with-claude/token-counting) (quotes verified verbatim) |
| Timestamps 149→49 (-67.1%); UUIDs 214→178 (-16.8%); digit law ceil(n/3) (79 = 20x3+19 exact); sha256 hex 43.5 vs b64 43.2; ints 79 vs comma 119 (+50.6%); floats 119/99; sci +40.4%; casing 143/178/201/355 vs 120/96/99/153; emoji 137/50; box 109 vs 87, 104 vs 75; arrows 22 vs 41 (Fable), 21=21 (Sonnet); EN/ZH/JA 86/89/107 (Fable), 54/90/107 (Sonnet) | Same local battery                                                                                                                                                      |
| TOON: 39.9% fewer tokens vs JSON at 76.4% vs 75.0% accuracy; +5.9% vs CSV (67,778 vs 63,997); o200k\_base via gpt-tokenizer; claude-haiku 59.8/57.4/50.5 (CSV on 109/209); nested-data caveat; truncation test CSV 100% vs TOON 0%                                                                                                                            | [https://github.com/toon-format/toon](https://github.com/toon-format/toon)                                                                                              |
| TOON press coverage (reports repo claims, no independent test)                                                                                                                                                                                                                                                                                                | [https://www.infoq.com/news/2025/11/toon-reduce-llm-cost-tokens/](https://www.infoq.com/news/2025/11/toon-reduce-llm-cost-tokens/)                                      |
| Claude Code JSON→TOON hook exists (UserPromptSubmit event, claims 30–60%)                                                                                                                                                                                                                                                                                     | [https://gist.github.com/maman/de31d48cd960366ce9caec32b569d32a](https://gist.github.com/maman/de31d48cd960366ce9caec32b569d32a)                                        |
| "up to 15 times" cross-language; ">4 times" byte-level (Petrov et al., NeurIPS 2023)                                                                                                                                                                                                                                                                          | [https://arxiv.org/abs/2305.15425](https://arxiv.org/abs/2305.15425)                                                                                                    |
| ai.rs wenyan debunk: 28% claim char-counted; +6.4% worse on cl100k (234 vs 220); -13.2% on o200k (191 vs 220)                                                                                                                                                                                                                                                 | [https://ai.rs/ai-developer/classical-chinese-agent-memory-compression](https://ai.rs/ai-developer/classical-chinese-agent-memory-compression)                          |
| ONTO: 46–51% vs JSON; YAML 1–6%; false "cl100k … used by GPT-4 and Claude models" quote; cross-tokenizer validation future work                                                                                                                                                                                                                               | [https://arxiv.org/html/2604.17512v1](https://arxiv.org/html/2604.17512v1)                                                                                              |
| UsageBox "up to 35% more tokens" framing; pricing table Fable $10/$50, Opus 4.8 $5/$25, Sonnet $3/$15, Haiku $1/$5 (matches reference pricing; article)                                                                                                                                                                                                       | [https://usagebox.com/articles/claude-fable-5-pricing-1m-context-real-cost](https://usagebox.com/articles/claude-fable-5-pricing-1m-context-real-cost)                  |
| Phase-0 ground truth: thinking 54.8% of output; wenyan-full 80.9% chars→56.6% tokens; wenyan-ultra 74.5%; caveman-ultra 58.5%; Fable premium \~+15% code / \~+38% prose corpus; day split 32/29/20/17/2                                                                                                                                                       | Local Phase-0 measurement (see 01-economics-and-measurement.md)                                                                                                         |
| Structured-data share of prompt tokens (\~15%); session-level 1–5% ranges                                                                                                                                                                                                                                                                                     | ESTIMATE — assumptions and arithmetic stated inline                                                                                                                     |
