11 — Tokenizer arbitrage
11 — Tokenizer arbitrage
TL;DR
- Claude bills BPE tokens, not characters. Re-encoding the same information changes cost with zero information loss. Biggest measured lever: serialization format — the same 10-record/7-field dataset spans 986 tokens (pretty XML) to 402 (CSV) on
claude-fable-5, a 2.45x spread (local run A; a parallel local run B on a different dataset measured 3.1x). Minified JSON → CSV is -34%, pretty JSON → minified is -29% (Fable) / -41% (Sonnet). - Anthropic runs exactly two live tokenizer families: official docs confirm Fable 5/Mythos 5 use the Opus-4.7 tokenizer, "roughly 30% more tokens" than older models. Locally the premium is an ASCII/English tax: ~0% on CJK/digits/emoji, +15.6% Python, +34% minified JSON, +59% English prose, +132% SCREAMING_SNAKE. Family equality is exact: Haiku 4.5 = Sonnet 4.6 and Opus 4.8 = Fable 5 on every probe.
- Effective price ratio for identical English prose: Fable input costs 5.3x Sonnet (not the 3.3x list ratio) and the same context window holds 37% less English on the new tokenizer.
- Most character-level folklore is dead on measurement: indent width is literally free (862=862=862 tokens for 2sp/4sp/tab JSON), CSV=TSV=PSV to the exact token, YAML loses to minified JSON, hex≈base64 per byte, wenyan/Chinese never beats English. Cheap real wins: epoch timestamps -67%, no thousands separators (+50.6% tax), de-hyphenated UUIDs -17%, ASCII tables vs box-drawing -20 to -28%.
- Everything here is testable in minutes for $0 via the free
count_tokensendpoint (own rate pool, 100–8,000 RPM by tier) — the only authoritative counter; all cl100k-proxy "Claude calculators" are wrong by construction.
Method
All "local" numbers: against the live count_tokens API via /tmp/ct.py (battery script /tmp/tok-arb/measure.py, raw results /tmp/tok-arb/results.json). Delta method: tokens(payload) = count(payload as single user message) − overhead, with overhead calibrated per model as count("a") − 1 (measured: 6 on claude-fable-5 and claude-opus-4-8, 7 on claude-sonnet-4-6 and claude-haiku-4-5). Caveats: (a) docs say counts are an estimate that "may differ by a small amount" from billing; (b) payloads are n=1 samples — a parallel local battery the same day ("run B", different 10x7 dataset, same method) produced the same orderings with shifted magnitudes, quoted below where it widens the range; (c) isolated 1–12 digit probes on Sonnet read a constant +1 over the ceil(n/3) law (single-message boundary artifact); the law is confirmed exactly in-context (20 8-digit ints + 19 newlines = 79 Fable tokens = 20x3+19).
Dollar arithmetic uses the modeled heavy-day profile from 01-economics-and-measurement.md: ~$22/day at Fable 5 list prices, split 32% cache reads / 29% cache writes / 20% thinking / 17% visible output / 2% uncached input (local measurement).
The measured format table (10 records x 7 fields: id, name, email, role, region, ISO date, float score)
| Format | chars | Fable 5 tok | Sonnet 4.6 tok | vs minified JSON (Fable) |
|---|---|---|---|---|
| XML, pretty | 2,222 | 986 | 901 | +61% |
| JSON indent=2 | 1,824 | 862 | 777 | +41% |
| JSON indent=4 | 2,144 | 862 | 777 | +41% |
| JSON indent=tab | 1,664 | 862 | 777 | +41% |
| YAML (block) | 1,291 | 628 | 543 | +2.8% |
| JSON minified | 1,343 | 611 | 456 | baseline |
| Markdown table, padded | 1,103 | 544 | 448 | -11% |
| Markdown table, compact | 773 | 440 | 348 | -28% |
JSON {"fields":[],"rows":[[]]} | 877 | 437 | 327 | -28.5% |
| TOON | 755 | 419 | 333 | -31.4% |
| CSV / TSV / PSV | 721 | 402 / 402 / 402 | 317 / 317 / 317 | -34.2% |
T1, local run A. Run B (different field mix, shorter values → key overhead dominates more): XML 945, pretty 779, min 548, TOON 317, CSV 301 — i.e. min→CSV ranged -34% to -45% across the two samples. Savings scale with the key:value length ratio of your data; measure yours (see last technique).
Tabular re-serialization: JSON record arrays → CSV/TSV, or TOON when you want guardrails
One-line pitch: state the schema once, not per record — the biggest input-format lever measured.
- Layer: input (tool results, context files, RAG payloads)
- Mechanism: JSON repeats every key plus quotes/braces per record; BPE charges for all of it. Column-once formats pay for keys a single time. Local: minified JSON 611 → CSV 402 Fable (-34.2%; Sonnet -30.5%; run B -45%); pretty JSON → CSV -53.4% (run B -61%). Delimiter is a pure no-op: CSV=TSV=PSV exactly (402/402/402 Fable, 317/317/317 Sonnet). TOON costs +4.2% over CSV locally (run B +5.3%; TOON repo's own figure +5.9% on o200k_base, 67,778 vs 63,997 tokens) while adding
[N]length and{fields}schema guardrails. - Expected savings: ESTIMATE on the modeled profile: if structured record data is ~15% of prompt-side tokens (assumption), a 34% cut on that slice saves 0.34 x 0.15 x (32%+29%+2% of $22) ≈ $0.71/day ≈ 3.2% of day cost; rises toward 5% where pretty JSON was entering context.
- Evidence tier: T1 token counts (local). Accuracy: T3 — TOON repo's 209-question, 4-model benchmark (o200k counts): TOON 76.4% vs JSON 75.0% overall at 39.9% fewer tokens; claude-haiku row TOON 59.8% / JSON 57.4% / CSV 50.5% (CSV evaluable on only 109/209 flat questions). No current-generation Claude accuracy benchmark exists (gap).
- Quality risk: NEUTRAL for TOON; RISKY for bare CSV on retrieval-heavy tasks — repo benchmark shows CSV materially worst, and in its truncation-detection test CSV scored 100% vs TOON 0.0% the other way (TOON's declared
[N]can mask silently dropped rows it was designed to catch — read the fine print both ways). Degradation manifests as wrong-row lookups. - Availability: CLAUDE-CODE-TODAY — transform at the MCP/tool boundary or via a hook; a community Claude Code UserPromptSubmit hook already auto-converts JSON in prompts to TOON (gist).
- Effort to adopt: hours (hook or serializer change); only for uniform arrays — repo's own caveat: compact JSON often wins on deeply nested/non-uniform data.
- Composability: multiplies through caching (smaller static content = cheaper cache write AND 0.1x reads thereafter; see 13-caching-exploitation.md); stacks with numeric/ID normalization below (those dominate per-row cost); orthogonal to model routing.
- Validation protocol: take 50 retrieval questions over your own top-3 highest-volume tool outputs; run each task with JSON vs CSV vs TOON payloads on your production model; accept the cheapest format whose accuracy drop is <1pp; record token delta via
count_tokens.
Minify JSON — and stop arguing about indentation, which is free
One-line pitch: strip pretty-printing for -29% (Fable) / -41% (Sonnet); indent width itself costs 0.
- Layer: input
- Mechanism: local: JSON at indent=2 / indent=4 / indent=tab is token-identical (862=862=862 Fable; 777=777=777 Sonnet) despite a 480-char spread — whitespace runs absorb into ~1 token regardless of width (8 spaces+x = 2 tokens, 24 spaces+x = 3 tokens, both families). The pretty-print tax is per line and per delimiter-space, not per indent character: minifying removes it (862→611 Fable, -29.1%; 777→456 Sonnet, -41.3%). Same for code: a 19-line Python function at 4sp/2sp/tab = 200/200/201 Fable tokens (0.5% spread). One real whitespace cost found: markdown-table column-alignment padding (many short mid-line runs) costs +23.6% over a compact table — padding ≠ indentation.
- Expected savings: 29–41% on any pretty-JSON slice entering context (default
jq, many REST APIs,python -m json.tooloutput). ESTIMATE: if that's 5% of prompt tokens, ~1.0–1.5% of day cost, free. - Evidence tier: T1 (local).
- Quality risk: NEUTRAL — models parse minified JSON fine; only human transcript readability suffers. Falsification: field-extraction QA on minified vs pretty payloads.
- Availability: CLAUDE-CODE-TODAY (
jq -c,separators=(",", ":"), hook on tool output). - Effort to adopt: minutes.
- Composability: the fallback inside tabular re-serialization for non-uniform data; cache-compatible.
- Validation protocol: 20 extraction questions against the same payload minified vs pretty; require identical answers; diff token counts with
count_tokens.
JSON fields+rows (array-of-arrays) — the no-new-format 80% solution
One-line pitch: {"fields":[...],"rows":[[...]]} keeps valid JSON and captures most of the CSV win.
- Layer: input
- Mechanism: key de-duplication expressed inside JSON. Local: 437 Fable tokens vs minified object-JSON 611 (-28.5%; Sonnet -28.3%), only 8.7% above CSV (402) and 4.3% above TOON (419). Deflates "you need TOON" marketing: the dominant saving is stating keys once, not the notation — independently corroborated by the ONTO preprint's finding that "eliminating key repetition is the dominant factor" while YAML's punctuation-only savings are 1–6%.
- Expected savings: ~28% on uniform-array JSON slices with zero new tooling.
- Evidence tier: T1 (local); ONTO preprint T4-leaning-T3 for the mechanism statement (its absolute numbers used the wrong tokenizer — see claims killed).
- Quality risk: NEUTRAL-to-RISKY: rows lose per-field self-description; no Claude accuracy benchmark for this exact shape. Degradation = column-index confusion on wide tables.
- Availability: CLAUDE-CODE-TODAY.
- Effort to adopt: minutes-to-hours (pure JSON transform; downstream parsers keep working).
- Composability: drop-in inside JSON-only pipelines where CSV/TOON would break schema validators.
- Validation protocol: same 50-question retrieval A/B as tabular re-serialization, with column-count ≥7 to stress positional addressing.
Tokenizer-aware routing and budgeting: the Opus-4.7+ "30% tax" is an ASCII/English tax
One-line pitch: the premium is 0% to +132% by content, so route ASCII-heavy bulk work to the old-tokenizer family and never reuse counts across families.
- Layer: infra (model routing, budgeting)
- Mechanism: official docs : Fable 5/Mythos 5 "use the tokenizer introduced with Claude Opus 4.7, which produces roughly 30% more tokens than models before Claude Opus 4.7 for the same text… don't reuse token counts measured on a model before Claude Opus 4.7." Locally the family split is exact: Haiku 4.5 = Sonnet 4.6 and Opus 4.8 = Fable 5 on all probes (54/54 & 86/86 prose; 456/456 & 611/611 minified JSON; 153/153 & 355/355 SCREAMING_SNAKE). Measured Fable-over-Sonnet premium by content: English prose +59.3% (86 vs 54; Phase-0 corpus ~+38%, official average ~30% — strongly sample-dependent), minified JSON +34.0%, CSV +26.8%, Python +15.6% (Phase-0: ~+15%), pretty JSON +10.9%, camelCase +85%, PascalCase +103%, SCREAMING_SNAKE +132% — versus -1.1% Chinese, 0.0% Japanese, -1.2% digit runs, -0.7% emoji. Community framing "up to 35% more tokens… the 35% Tokenizer Tax" (UsageBox) is the right direction but still understates the ASCII tail.
- Expected savings: ESTIMATE from measured counts x verified list prices: identical English paragraph as input costs (86x$10)/(54x$3) = 5.3x more on Fable than Sonnet (15.9x vs Haiku); same window holds 37.2% less English prose (3.66 vs 5.83 chars/token). Routing a 1M-char English doc-triage subagent: Fable ≈ 273k tokens = $2.73 vs Sonnet ≈ 171k = $0.51 uncached input. CJK/numeric workloads: ~0% tokenizer win (route on capability/price only).
- Evidence tier: T1 (official docs + local reproduction).
- Quality risk: QUALITY-TRADE at routing level (cheaper family = less capable models — that's the real trade, covered elsewhere); NEGATIVE-COST at budgeting level (counting with the right model id is free and prevents 30–130% misestimates). Degradation = subagent task failures; falsify per task class.
- Availability: CLAUDE-CODE-TODAY (subagent
model:selection;count_tokenswith the target model id). - Effort to adopt: minutes (budgeting) / hours (routing rules by content type).
- Composability: rescales every other technique's denominator — minification saves 41% on Sonnet but 29% on Fable; never port format-ROI numbers across the 4.7 boundary. Pairs with subagent delegation patterns in 12-context-architecture.md.
- Validation protocol: for each bulk task class, run 20 instances on both families; compare $/accepted-output (LLM-judge + human spot-check 20%); adopt the cheap family where acceptance is within 2pp.
ID, hash, and timestamp representation
One-line pitch: epoch beats ISO by 3x, de-hyphenate UUIDs, short integer IDs beat UUIDs by ~9x — but hex→base64 conversion saves nothing.
- Layer: input (content normalization in tool results, logs, DB dumps)
- Mechanism: local (Fable): 10 ISO-8601 timestamps = 149 tokens (14.9 each) vs 10 epoch ints = 49 (4.9 each), -67.1%. 8 hyphenated UUIDs = 214 vs de-hyphenated 178 (-16.8%). Digit-run law: numbers chunk at ceil(n/3) tokens (1–3 digits = 1 tok, …, 12 digits = 4) — so an 8-digit surrogate ID = 3 tokens vs a UUID's ~22–27 (~-89%). sha256 as hex = 43.5 tok/digest vs base64 = 43.2 — a tie: base64's 33% character advantage is exactly cancelled by worse merges (1.04 chars/token vs hex 1.49). Any inline binary blob ≈ 1 token/char, ~4x worse than prose — keep binary out of context.
- Expected savings: dominated-column arithmetic: a 500-row tool result with one ISO timestamp + one UUID per row carries ~20.9k tokens in those two columns; epoch + 8-digit-int cuts them to ~4.0k (-81%). Session-level ESTIMATE: 1–4% of prompt tokens in log/DB-heavy work.
- Evidence tier: T1 (local).
- Quality risk: QUALITY-TRADE for epoch (models do date arithmetic more reliably on ISO; keep ISO when the task reasons about dates) and for truncated hashes (ambiguity). NEUTRAL for UUID de-hyphenation and surrogate IDs. Degradation = wrong relative-time reasoning; falsify with date-math QA.
- Availability: CLAUDE-CODE-TODAY (control the SELECT/serializer; or a normalizing hook on tool output).
- Effort to adopt: minutes where you own the query; hours for hook-based normalization of third-party tools.
- Composability: multiplies inside tabular re-serialization — ID/timestamp columns dominate per-row cost in real datasets.
- Validation protocol: 20 date-reasoning questions (ordering, deltas, "how long ago") on epoch vs ISO payloads; adopt epoch only for task classes with zero accuracy drop; verify column token deltas with
count_tokens.
Numeric formatting hygiene: no separators, minimal decimals, no scientific notation
One-line pitch: comma/underscore grouping costs +50.6%; every ~3 excess decimal digits ≈ 1 wasted token per value.
- Layer: input
- Mechanism: local (20 8-digit ints): plain 79 Fable / 80 Sonnet; comma-grouped 119/120; underscore-grouped 119/120 (+50.6% — separators break 3-digit chunking and add their own tokens). Floats: 6dp = 119 vs 2dp = 99 (-16.8% for dropping 4 decimals); scientific notation worst at +40.4% over 2dp (1.43 chars/token). Numbers are family-neutral (79 vs 80) — numeric payloads carry no Fable tax.
- Expected savings: 33% on any grouped-number column (119→79); ~17% per 4 excess decimals on metric dumps. Session ESTIMATE: <1% generally, several % for telemetry/metrics-heavy tool output.
- Evidence tier: T1 (local).
- Quality risk: NEGATIVE-COST for de-grouping (locale separators also occasionally parse as list delimiters); QUALITY-TRADE for precision truncation only if downstream math needs the digits.
- Availability: CLAUDE-CODE-TODAY (formatter flags: no locale grouping,
round(x,2)). - Effort to adopt: minutes.
- Composability: stacks inside tabular re-serialization; pairs with output-side rules in 15-output-discipline.md (told-to-emit numbers ride the ~5x output price).
- Validation protocol: numeric-comparison QA ("which row has max latency?") on grouped vs plain payloads — expect equal or better accuracy plain; token delta via
count_tokens.
Strip Unicode decoration from machine-bound text
One-line pitch: emoji ≈ 2.7 tokens each, box-drawing tables +25–39% over ASCII — decoration is the most expensive text per character measured.
- Layer: input + output (tool output styling, agent reply style)
- Mechanism: local: 50 emoji = 137 Fable / 138 Sonnet tokens (~2.74 tokens/emoji incl. variation selectors — an emoji outprices most English words). Box-drawn table 109 vs identical-layout ASCII
+--|table 87 Fable (+25.3%; Sonnet 104 vs 75, +38.7%). Math-symbol soup ~0.6–0.9 chars/token vs English 3.66–5.83. Counterexample that kills "ASCII always wins": "→ "x20 = 22 Fable tokens vs "-> "x20 = 41 — Unicode arrow is ~half price on Fable (Sonnet: 21 = 21). Merge tables are arbitrary; count, don't assume. - Expected savings: 20–28% of any specific TUI-style table or emoji-rich status block; session-level ESTIMATE 1–3% of tool-result bytes. Output-side: every emoji the agent skips saves ~2.7 tokens at $50/MTok — and this style rule is already in this repo's agent instructions (no-emoji rule), making the saving free here.
- Evidence tier: T1 (local).
- Quality risk: NEGATIVE-COST — same information, fewer tokens, plus fewer tokenizer-fragile glyphs. Falsification: none needed beyond confirming tools still parse.
- Availability: CLAUDE-CODE-TODAY (
NO_COLOR=1,--no-unicodeflags on CLIs the agent invokes; a style line inAGENTS.md/ output style). - Effort to adopt: minutes (env vars + one instruction line).
- Composability: output-side wins ride the ~5x output price multiplier (see 15-output-discipline.md); input-side stacks with everything.
- Validation protocol: run the agent's usual test/lint/build commands with and without plain-ASCII env flags; diff tool-result token counts across 10 sessions; verify zero parsing regressions.
Natural-language arbitrage — VERDICT: dead for Claude (QUALITY-TRADE, near-zero savings)
One-line pitch: wenyan/Chinese prompting does not survive BPE; compressed English is the cheapest carrier on both Claude families.
- Layer: input + output (language choice)
- Mechanism: BPE vocabularies are ASCII/English-optimized: English runs 3.66 (Fable) – 5.83 (Sonnet) chars/token; Chinese 0.93, Japanese 1.10 — multi-byte scripts cost ~1 token/char, erasing their per-character density. Same-meaning paragraph, local: EN 315 chars/86 Fable tokens; ZH 83 chars/89 (+3.5% — Chinese costs more); JA 107 (+24%). On Sonnet the gap is brutal: EN 54 vs ZH 90 (+66.7%). Phase-0 ground truth: wenyan-full = 80.9% char cut but only 56.6% token cut, wenyan-ultra 74.5% — vs plain-English caveman-ultra 58.5%, so the CJK detour adds little over aggressive English telegraphese. Independent replication (ai.rs, verified live): the famous "28% compression" wenyan claim "was character-counted, not token-counted"; real tiktoken runs put wenyan +6.4% WORSE than English shorthand on cl100k_base (234 vs 220) and only -13.2% on o200k_base. Mechanism predicted by Petrov et al. (NeurIPS 2023): up to 15x cross-language tokenization differences; >4x even for byte-level models. One nuance: the Fable tokenizer taxes English (+59% here) but not CJK (±0%), so EN-vs-ZH narrows to parity on new models — narrows, never inverts.
- Expected savings: ~0% to negative vs telegraphic English. Folklore claims 75–80%; measured reality 37–57% vs verbose English only — which plain English compression beats without the costs below.
- Evidence tier: T1 (two independent + Phase-0) + T3 (ai.rs replication) + T2 (Petrov et al., arXiv:2305.15425).
- Quality risk: RISKY-to-QUALITY-TRADE: cross-lingual instruction-following degrades, transcripts become unreviewable by most operators, savings small-to-negative. Verdict: do not adopt.
- Availability: CLAUDE-CODE-TODAY (trivially — by not doing it; use English-register compression instead, covered in 03-prior-art-and-market-scan.md and 15-output-discipline.md).
- Effort to adopt: n/a.
- Composability: replaced by English style compression; anti-synergy with everything reviewable.
- Validation protocol: (for anyone tempted) translate your own 5 longest always-loaded prompts;
count_tokensboth versions on your production model; you will measure ≈0 or negative savings — then stop.
Identifier casing and ASCII-pattern awareness — observe at design time, never refactor
One-line pitch: the same 40 two-word identifiers cost 143 (snake) to 355 (SCREAMING_SNAKE) Fable tokens, and the cheap casing flips between tokenizer generations.
- Layer: input (schema/DSL/tool-name design, cost forecasting)
- Mechanism: local, identical word pairs: Fable snake_case 143 = kebab 143 < camelCase 178 (+24%) < PascalCase 201 (+41%) << SCREAMING_SNAKE 355 (+148%). Sonnet inverts the order: camel 96 ≈ Pascal 99 < snake 120 << SCREAMING 153. Mixed-case/all-caps ASCII lost merge coverage in the new vocabulary — which is also why the Fable tax peaks at +132% on SCREAMING and +95% on ASCII operator soup. Use only when defining new DSLs, config keys, MCP tool/parameter names that will transit context thousands of times: prefer lowercase snake/kebab. Never rename existing code for tokens — grep-ability and convention dominate.
- Expected savings: up to 60% of the identifier component in pathological constants-heavy text (355→143); realistically 1–5% of a code-heavy session — ESTIMATE; compounds in always-loaded surfaces (tool schemas,
AGENTS.md) via cache reads. - Evidence tier: T1 (local).
- Quality risk: NEUTRAL for new-schema choice; RISKY if misread as "rename your codebase" — do not.
- Availability: CLAUDE-CODE-TODAY (for new names only).
- Effort to adopt: zero at design time.
- Composability: applies to MCP schema design (see 02-baseline-audit.md tool-schema overhead) and any codegen target.
- Validation protocol:
count_tokensthe candidate tool schema / DSL sample in both casings under the production model before freezing names; pick the cheaper if otherwise tied.
Free count_tokens as the arbitrage instrument — and the death of cl100k-proxy calculators
One-line pitch: every claim in this file is checkable in seconds for $0; anything not measured this way is folklore.
- Layer: infra (measurement)
- Mechanism: official docs : token counting is "free to use" with its own RPM pool (tier 1/2/3/4 = 100/2,000/4,000/8,000), independent of message-creation limits; supports system prompts, tools, images, PDFs; counts are an estimate that "may differ by a small amount"; "You are not billed for system-added tokens"; previous-turn thinking blocks are ignored in counts; counting never triggers prompt caching; ZDR-eligible. The docs explicitly prescribe the migration method used throughout this file: count the same request twice, once per model id. Anthropic's tokenizer is unpublished, so third-party calculators and papers proxying with cl100k_base are wrong by construction — the ONTO preprint (arXiv:2604.17512v1) literally claims cl100k_base "correspond[s] to the tokenizer used by GPT-4 and Claude models" (false), and since Anthropic's own two families diverge 0–132% by content, no constant factor can rescue a proxy.
- Expected savings: enabler, not a saver — prevents 30–130% (family) and 2–15x (language) budgeting errors; gates every other adoption decision here.
- Evidence tier: T1 (official docs + used for every number in this file).
- Quality risk: NEGATIVE-COST. Residual risk: documented estimate-vs-billing drift is unquantified (gap).
- Availability: CLAUDE-CODE-TODAY (API key or the Claude Code OAuth token —
/tmp/ct.pyis a 35-line working reference). - Effort to adopt: minutes.
- Composability: prerequisite for all of the above; feeds the regression harness in 31-validation-harness.md.
- Validation protocol: self-validating; for billing-drift, compare
count_tokenspredictions vsusage.input_tokenson 20 real calls and record the residual.
Claims killed (folklore that failed measurement)
| Claim | Reality (all unless noted) |
|---|---|
| "YAML halves JSON token cost" | Only vs pretty JSON. YAML 628 vs minified JSON 611 on Fable (+2.8% worse); +19.1% worse on Sonnet. ONTO independently: YAML 1–6% below JSON. Minify or go tabular instead. |
| "Tabs/shallower indent save context" | 0.0%: 2sp/4sp/tab JSON = 862/862/862 (Fable), 777/777/777 (Sonnet); code 200/200/201. Indent width is free. |
| "Wenyan compresses 75–80%" | Char illusion: Phase-0 80.9% chars → 56.6% tokens; ai.rs tiktoken test: the 28% claim was char-counted; wenyan +6.4% worse than English shorthand on cl100k. |
| "Prompt in Chinese to save money" | Same meaning: ZH 89 vs EN 86 Fable (+3.5%); ZH 90 vs EN 54 Sonnet (+67%). Never cheaper. |
| "Base64 beats hex (shorter)" | Per digest 43.2 vs 43.5 tokens — a tie; base64 tokenizes at 1.04 chars/token. And keep binary out of context entirely. |
| "cl100k_base ≈ Claude tokenizer" | Tokenizer unpublished; Anthropic's own two families diverge 0–132% by content — no constant-factor proxy exists. (ONTO preprint repeats the error verbatim.) |
| "Fable tokenizer is uniformly +30–35%" | The average hides the structure: ~0% CJK/digits/emoji → +59% English prose → +132% SCREAMING_SNAKE. It is an ASCII tax. |
| "Caveman mode cuts ~75%" | Phase-0: 58.5% measured token cut at ultra register. Good technique, honest number. |
| "TOON strictly dominates" | TOON = CSV +4.2–5.9%; repo's own caveats: compact JSON wins on nested data; TOON scored 0.0% vs CSV's 100% on a truncation-detection probe. Its pitch is CSV-cost with guardrails. |
| "Compress with math symbols (∀∑≤)" | 0.6–0.9 chars/token vs English 3.66–5.83; "∑" costs multiple tokens where "sum" costs ~1. Counterexample proving merge-specificity: "→" (22) is cheaper than "->" (41) on Fable. |
| "Emoji are one token" | 137 tokens / 50 emoji ≈ 2.74 each, both families. |
Open gaps
- All cross-content premiums are n=1 payloads; prose premium varies +38% (Phase-0 corpus) to +59% (this battery) around the official "roughly 30%" — a 1K-file corpus sweep via the free endpoint is the cheap next step. Likewise ZH chars/token read 0.93 here vs ~1.47 in Phase-0 (different text mixes; unresolved).
- No format-vs-accuracy benchmark on Sonnet 4.6/Fable 5 — TOON's suite only covers claude-haiku; all CSV/TOON quality verdicts on current models are extrapolated.
- Output-side format arbitrage untested: output bills ~5x input and thinking is 54.8% of output (Phase-0); whether instructing Claude to emit CSV/minified JSON harms generation correctness is the highest-value open experiment here (the 862→611→402 input ratios should roughly apply to emitted data).
- Thinking-token tokenization is unobservable (transcripts redact thinking text) — format effects inside the biggest output bucket can't be verified.
- Cache interplay shown only as arithmetic (smaller content → cheaper write and 0.1x reads immediately); not yet demonstrated end-to-end in a live session — see 13-caching-exploitation.md.
count_tokens-vs-billed drift ("may differ by a small amount") unmeasured.- All numbers are black-box behavior of an unpublished tokenizer that changed once (Opus 4.7) without a content-stable migration constant — re-measure after every model-family release.
Verification ledger
| Number | Source / method |
|---|---|
| Format table (986/862/611/628/544/440/437/419/402 Fable; 901/777/456/543/448/348/327/333/317 Sonnet); 2.45x spread; -34.2% min→CSV; -29.1%/-41.3% pretty→min; TOON +4.2% vs CSV; fields+rows -28.5%; CSV=TSV=PSV | Local battery /tmp/tok-arb/measure.py → /tmp/tok-arb/results.json, count_tokens delta method |
| Run B figures (945/779/548/317/301; -45% min→CSV; -61% pretty→CSV; TOON +5.3%) | Parallel local battery, same method, different 10x7 dataset |
| Indent-free results (862=862=862; 777=777=777; code 200/200/201; ws runs 2/3 tok); md padding +23.6% | Same local battery |
| Family equality (54/54, 86/86, 456/456, 611/611, 153/153, 355/355); overheads 6/6/7/7 | Same local battery: fable vs opus-4-8, sonnet vs haiku-4-5 |
| Fable premium by content (-1.2% to +132%); EN 3.66 vs 5.83 chars/tok; window -37.2%; 5.3x/15.9x price ratios | Same local battery + list prices below; ratio arithmetic ESTIMATE shown inline |
| "roughly 30% more tokens"; count_tokens free; RPM 100/2,000/4,000/8,000; estimate disclaimer; system-token non-billing; prior-turn thinking ignored; no caching; ZDR | https://platform.claude.com/docs/en/build-with-claude/token-counting (quotes verified verbatim) |
| Timestamps 149→49 (-67.1%); UUIDs 214→178 (-16.8%); digit law ceil(n/3) (79 = 20x3+19 exact); sha256 hex 43.5 vs b64 43.2; ints 79 vs comma 119 (+50.6%); floats 119/99; sci +40.4%; casing 143/178/201/355 vs 120/96/99/153; emoji 137/50; box 109 vs 87, 104 vs 75; arrows 22 vs 41 (Fable), 21=21 (Sonnet); EN/ZH/JA 86/89/107 (Fable), 54/90/107 (Sonnet) | Same local battery |
| TOON: 39.9% fewer tokens vs JSON at 76.4% vs 75.0% accuracy; +5.9% vs CSV (67,778 vs 63,997); o200k_base via gpt-tokenizer; claude-haiku 59.8/57.4/50.5 (CSV on 109/209); nested-data caveat; truncation test CSV 100% vs TOON 0% | https://github.com/toon-format/toon |
| TOON press coverage (reports repo claims, no independent test) | https://www.infoq.com/news/2025/11/toon-reduce-llm-cost-tokens/ |
| Claude Code JSON→TOON hook exists (UserPromptSubmit event, claims 30–60%) | https://gist.github.com/maman/de31d48cd960366ce9caec32b569d32a |
| "up to 15 times" cross-language; ">4 times" byte-level (Petrov et al., NeurIPS 2023) | https://arxiv.org/abs/2305.15425 |
| ai.rs wenyan debunk: 28% claim char-counted; +6.4% worse on cl100k (234 vs 220); -13.2% on o200k (191 vs 220) | https://ai.rs/ai-developer/classical-chinese-agent-memory-compression |
| ONTO: 46–51% vs JSON; YAML 1–6%; false "cl100k … used by GPT-4 and Claude models" quote; cross-tokenizer validation future work | https://arxiv.org/html/2604.17512v1 |
| UsageBox "up to 35% more tokens" framing; pricing table Fable $10/$50, Opus 4.8 $5/$25, Sonnet $3/$15, Haiku $1/$5 (matches reference pricing; article) | https://usagebox.com/articles/claude-fable-5-pricing-1m-context-real-cost |
| Phase-0 ground truth: thinking 54.8% of output; wenyan-full 80.9% chars→56.6% tokens; wenyan-ultra 74.5%; caveman-ultra 58.5%; Fable premium ~+15% code / ~+38% prose corpus; day split 32/29/20/17/2 | Local Phase-0 measurement (see 01-economics-and-measurement.md) |
| Structured-data share of prompt tokens (~15%); session-level 1–5% ranges | ESTIMATE — assumptions and arithmetic stated inline |