42 — Multimodal token economics: images, screenshots, PDFs
Volume II area file for blind spot 2 (multimodal), which Volume
I left near-blank: a grep of all 4,303 dossier lines for image/screenshot/vision/PDF terms returns
zero substantive hits, the lone adjacent fact being one "~125k tokens per 500 kB PDF" page-size
estimate in 03-prior-art-and-market-scan.md:267 and 18-provider-features.md:165. Every number
below is either a live primary-doc quote or a local count_tokens
measurement with the method shown. Plain-language writing rules per Volume I §10.
TL;DR
- An image costs ⌈width/28⌉ × ⌈height/28⌉ visual tokens (28-pixel patches), billed at the model's normal input price. This is confirmed by local count_tokens runs, which reproduce Anthropic's published table to within the ~6–8-token message wrapper (e.g. 1000×1000 px = 1,296 visual tokens, measured 1,304). Volume I never stated the formula.
- The per-image cap differs ~3× across the two tokenizer families, and it is a routing lever. Docs state Opus 4.8 / Fable 5 cap around 4,784 tokens (≤2,576 px edge) and Sonnet 4.6 / Haiku 4.5 around 1,568 tokens (≤1,568 px). Local counts include wrapper/envelope effects and land around ~4,760 / ~1,520–1,570, i.e. ~3.0–3.1×. Haiku 4.5 ≡ Sonnet 4.6 on images (identical counts), extending Volume I's tokenizer-family equality to vision.
- A PDF costs ~1,500–3,000 tokens per page and roughly doubles the cost of the same text — because each page is billed as a rendered page-image plus extracted text. Measured: the same 50 text lines cost 1,605 tokens raw (Opus) but 3,182 tokens as a 1-page PDF (1.98×); on Sonnet 1,206 → 2,780 (2.30×). Per-page cost is linear and family-divergent: 3,152 tok/page (Opus) vs ~2,750 (Sonnet) for a dense page.
- A screenshot is a token bomb for textual content and a bargain only for visual content. A dense code screenful is 593–765 text tokens but a screenshot of it costs ~1,520–1,570 (Sonnet) to ~4,760 (Opus) — 2–6× more, with worse fidelity. Screenshots win only when the information is inherently visual (rendered layout, charts, a visual diff) or when the text equivalent would exceed the cap.
- Net new levers for the stack: route vision work to Sonnet/Haiku (~3.0–3.1× image saving),
downsample screenshots client-side to the family cap, prefer text/markdown over screenshots and
PDFs, and crop to the region of interest. All are CLAUDE-CODE-TODAY or hook-level and
NEGATIVE-COST-to-NEUTRAL on quality. None appear in Volume I.
Pricing and the modeled session profile are inherited from 01-economics-and-measurement.md. The
Fable-family tokenizer is measured on claude-opus-4-8 (its documented tokenizer twin) because
count_tokens rejects claude-fable-5 (see 40).
Method
No image or PDF tooling exists on this machine (no PIL, ImageMagick, or qpdf), so test assets were generated from the Python standard library:
- PNGs at controlled dimensions via zlib (/tmp/mkpng.py): a valid RGB PNG with a gradient (non-degenerate) so token cost reflects dimensions, not blank-image special cases.
- PDFs with byte-offset-correct xref tables via zlib/struct (/tmp/mkpdf.py): N text pages of Helvetica lines, optionally a final embedded-image page (the scanned-PDF case).
- Token counts via the free OAuth count_tokens endpoint (/tmp/ctimg.py, /tmp/ctpdf.py) sending real image/document content blocks. Counts are non-billable and cache-inert (Volume I, file 13). Raw input_tokens includes a constant ~6–8-token user-message wrapper (Volume I measured ~6–7); it is left in the tables and noted, never silently subtracted.
Real validation: the five PNGs in docs/public/ were measured against the synthetic curve and agree
exactly (512×512 icon = 369 tokens synthetic and real).
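The generator script itself is not reproduced in this file. As an illustration only, a standard-library gradient PNG could be built roughly as follows. This is a minimal sketch, assuming 8-bit RGB, filter type 0 on every scanline, and a single IDAT chunk; the names `png_chunk` and `make_gradient_png` are hypothetical and are not taken from /tmp/mkpng.py.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_chunk(tag: bytes, data: bytes) -> bytes:
    """Serialize one chunk: length, tag, data, then CRC32 over tag + data."""
    crc = zlib.crc32(tag + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc)


def make_gradient_png(width: int, height: int) -> bytes:
    """Return the bytes of an 8-bit RGB PNG filled with a simple gradient."""
    # IHDR: width, height, bit depth 8, color type 2 (RGB), then
    # compression, filter, and interlace methods all set to 0.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    raw = bytearray()
    for y in range(height):
        raw.append(0)  # filter type 0 (None) for this scanline
        for x in range(width):
            raw += bytes((
                x * 255 // max(width - 1, 1),   # red ramps left to right
                y * 255 // max(height - 1, 1),  # green ramps top to bottom
                (x + y) & 0xFF,                 # blue varies diagonally
            ))
    return (
        PNG_SIGNATURE
        + png_chunk(b"IHDR", ihdr)
        + png_chunk(b"IDAT", zlib.compress(bytes(raw)))
        + png_chunk(b"IEND", b"")
    )
```

Writing `make_gradient_png(1000, 1000)` to a file would give one test asset per row of the curve; the pure-Python pixel loop is slow for multi-megapixel sizes, which is acceptable for a handful of fixtures.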
Measured: the image-token curve
Visual tokens = ⌈width/28⌉ × ⌈height/28⌉, clamped to the model's edge limit and token budget. Raw
input_tokens below includes the wrapper; "patches" is the bare formula.
| Dimensions | Megapixels | patches (formula) | Opus 4.8 measured | Sonnet 4.6 measured | Haiku 4.5 |
|---|---|---|---|---|---|
| 256×256 | 0.07 | 100 | 108 | 110 | — |
| 512×512 | 0.26 | 361 | 369 | 371 | — |
| 1000×1000 | 1.00 | 1,296 | 1,304 | 1,306 | 1,306 |
| 1092×1092 | 1.19 | 1,521 | 1,529 | 1,531 | 1,531 |
| 1280×800 | 1.02 | 1,334 | 1,342 | 1,344 | — |
| 1920×1080 | 2.07 | 2,691 (Opus) | 2,699 | 1,570 (capped) | 1,570 |
| 1536×1536 | 2.36 | 3,025 (Opus) | 3,033 | 1,531 (capped) | — |
| 2560×1440 | 3.69 | over cap | 4,792 (capped) | 1,570 (capped) | 1,570 |
| 2048×2048 | 4.19 | over cap | 4,769 (capped) | 1,531 (capped) | — |
| 4000×3000 | 12.0 | over cap | 4,748 (capped) | 1,574 (capped) | 1,574 |
Two regimes. Below ~1.1 MP both families agree and track the patch formula. Above it each family clamps to its own budget: Sonnet/Haiku downscale to ≤1,568 px edge / ≤1,568 tokens; Opus/Fable to ≤2,576 px / ≤4,784 tokens. The clamp is why a 4 MP and a 12 MP image cost the same on a given model: extra resolution past the cap is discarded. Treat the published caps as model-side budgets and the measured rows as envelope-inclusive counts; exact totals vary by wrapper.
This reproduces Anthropic's published cost tables (platform.claude.com/docs/en/build-with-claude/vision#evaluate-image-size): Sonnet 1920×1080 = 1,560 (measured 1,570); Opus 1920×1080 = 2,691 (measured 2,699); both 1000×1000 = 1,296 (measured 1,304/1,306). The docs state the divergence directly: high-resolution models "can use up to approximately 3x more image tokens (4784 versus 1568 tokens per image)." Independent re-measurement found the practical cap around ~4,761 / ~1,523 after subtracting envelope assumptions, so the safe claim is ~3.0–3.1×, not an exact single-value ratio.
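The patch formula and the per-family budget can be expressed as a small estimator. This is a sketch, not the server's resize algorithm: it assumes the capped cost never exceeds the published budget and ignores the message wrapper, so capped rows in the table can differ by a few tokens. The family keys and function names are illustrative.

```python
PATCH = 28  # pixels per patch edge, from the formula in this section

# Published per-image token budgets quoted in this file.
BUDGET = {"opus-fable": 4784, "sonnet-haiku": 1568}


def patches(width: int, height: int) -> int:
    """Bare formula: ceil(width / 28) * ceil(height / 28), in integer math."""
    cols = (width + PATCH - 1) // PATCH
    rows = (height + PATCH - 1) // PATCH
    return cols * rows


def image_token_upper_bound(width: int, height: int, family: str) -> int:
    """Upper bound on visual tokens: the patch count, clamped to the budget.

    Assumption: the real downscale may land slightly under the budget
    (the docs list 1,560 for a 1920x1080 image on Sonnet, not 1,568).
    """
    return min(patches(width, height), BUDGET[family])


print(patches(1000, 1000))                                   # 1296
print(image_token_upper_bound(1920, 1080, "opus-fable"))     # 2691
print(image_token_upper_bound(4000, 3000, "sonnet-haiku"))   # 1568
```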
Measured: the PDF tax
Each PDF page is billed as a rendered page-image plus extracted text (Anthropic PDF docs: "The system converts each page of the document into an image. The text from each page is extracted and provided alongside each page's image"). The cost is therefore the image-cap floor plus the text.
| Document | Size | Opus 4.8 | Sonnet 4.6 | Opus tok/page |
|---|---|---|---|---|
| 1 page × 5 lines (sparse) | <1 KB | 1,742 | 1,700 | 1,742 |
| 1 page × 50 lines (dense) | 5 KB | 3,182 | 2,780 | 3,182 |
| 3 pages × 50 lines | 15 KB | 9,484 | 8,282 | 3,161 |
| 10 pages × 50 lines | 52 KB | 31,541 | 27,539 | 3,154 |
| 25 pages × 50 lines | 130 KB | 78,806 | 68,804 | 3,152 |
| 2 text + 1 image page | 723 KB | 7,886 | 7,083 | — |
Per-page cost is linear, at ~3,150 tokens (Opus) for a dense page. That sits at the top of the docs' "1,500–3,000 tokens per page" range and points the same way as Bedrock's two modes (text-only ≈1,000 tok/3 pages vs full-visual ≈7,000 tok/3 pages): rendering the page image accounts for the gap. The PDF tax itself: the identical 50 lines of text cost 1,605 tokens raw on Opus but 3,182 as a PDF (1.98×); on Sonnet 1,206 → 2,780 (2.30×). A sparse page still floors at ~1,700 because you pay for the page-image even with little text.
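The per-page and tax figures follow directly from the measured rows. A short sketch that recomputes them, using only numbers from the table above (the dictionary layout and function names are illustrative):

```python
# Measured dense-PDF rows from the table: pages -> (Opus 4.8, Sonnet 4.6) tokens.
DENSE_PDF = {
    1: (3182, 2780),
    3: (9484, 8282),
    10: (31541, 27539),
    25: (78806, 68804),
}
# The same 50 lines sent as plain text: (Opus 4.8, Sonnet 4.6) tokens.
RAW_TEXT_50_LINES = (1605, 1206)
OPUS, SONNET = 0, 1


def tokens_per_page(pages: int, family: int) -> float:
    """Average cost per page for a measured dense PDF."""
    return DENSE_PDF[pages][family] / pages


def pdf_tax(family: int) -> float:
    """Ratio of the 1-page PDF cost to the identical text sent raw."""
    return DENSE_PDF[1][family] / RAW_TEXT_50_LINES[family]


for pages in DENSE_PDF:
    print(pages, round(tokens_per_page(pages, OPUS)), round(tokens_per_page(pages, SONNET)))
print(round(pdf_tax(OPUS), 2))  # 1.98
```

The near-constant per-page value across 3, 10, and 25 pages is what supports reading the cost as linear in page count.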
Measured: screenshot vs. text break-even
What a screenshot replaces, as text:
| Content (one screenful) | As text — Opus | As text — Sonnet | As a full screenshot |
|---|---|---|---|
| 50 lines dense Rust (~2 KB) | 765 | 593 | 1,568 (Sonnet) – 4,784 (Opus) |
| 50 lines wide markdown prose (~4.6 KB) | 1,951 | 1,468 | ~1,520–1,570 (Sonnet) – ~4,760 (Opus) |
For textual content the text is cheaper on essentially every comparison, and it can extend past one screen; the screenshot caps at a single frame and loses exact characters. A screenshot is only cheaper when the information is inherently visual (a rendered chart, a layout bug, a visual diff), where the text description would be long or impossible. On the operator's current environment (Opus 4.8 main loop, measured: 465/560 calls), a full-frame screenshot costs ~4,760 tokens, so the bias toward text is strongest exactly where the operator is.
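The break-even rule can be written as a tiny decision helper. This is a sketch under stated assumptions: the full-frame costs are the approximate measured values from this file (the low end is used for Sonnet), `needs_pixels` is a hypothetical flag the caller sets for inherently visual content, and fidelity loss in the screenshot is not modeled.

```python
# Approximate measured cost of one full-frame screenshot, in tokens.
# Assumption: Sonnet uses the low end of the measured 1,520-1,570 range.
FULL_FRAME_TOKENS = {"opus": 4760, "sonnet": 1520}


def cheaper_form(text_tokens: int, family: str, needs_pixels: bool = False) -> str:
    """Pick 'text' or 'screenshot' by token cost for one screenful."""
    if needs_pixels:
        # Charts, layout bugs, visual diffs: text cannot carry the signal.
        return "screenshot"
    if text_tokens <= FULL_FRAME_TOKENS[family]:
        return "text"
    return "screenshot"


print(cheaper_form(765, "opus"))     # dense Rust on Opus -> text
print(cheaper_form(1468, "sonnet"))  # wide markdown on Sonnet -> text
```

Both measured rows in the table come out as text, including the closest case (1,468 text tokens against a ~1,520-token Sonnet frame).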
Techniques
M1. Vision-tier routing — send screenshots and PDFs to Sonnet/Haiku, not Opus/Fable
The single biggest multimodal lever: the same high-resolution image costs ~3.0–3.1× fewer tokens on the Sonnet/Haiku family because it clamps to the lower image-token budget.
- Coverage-delta: New. Volume I's routing file (16) and tokenizer file (11) cover the text premium but never images; "image"/"vision" is absent from both. The image cap divergence is a distinct, larger (~3.0–3.1×) effect.
- Layer: input (image/document token class) + routing.
- Mechanism: Sonnet 4.6 / Haiku 4.5 downscale any image to ≤1,568 px / ≤1,568 visual tokens; Opus 4.8 / Fable 5 allow ≤2,576 px / roughly 4,784 tokens. For screenshot- and PDF-heavy work the cheaper family caps the per-image cost at roughly a third.
- Expected savings: per full-frame screenshot, roughly 4,760 → 1,520–1,570 tokens = ~−67% on the image token class. A screenshot-driven debugging loop of, say, 20 frames/session shifts roughly 64k tokens off the expensive family; at cache-read rates that is modest in dollars but large in quota (file 41) and in window pressure. A 25-page PDF: 78,806 → 68,804 tokens (−12.7%, the text premium dominates once images are page-sized).
- Evidence tier: T1, local count_tokens (method above) + Anthropic vision docs.
- Quality risk: QUALITY-TRADE only if the visual needs >1,568-token fidelity (fine print in a hi-res screenshot, dense chart). For UI state, terminal output, and most diagrams, 1,568 tokens is ample. NEGATIVE-COST where a fresh-context cheaper model also reduces confusion. Falsify by running the vision task on both families and grading whether the answer changed.
- Availability: CLAUDE-CODE-TODAY; pin model: haiku/sonnet on the vision-handling subagent.
- Effort to adopt: minutes (subagent frontmatter).
- Composability: stacks with Volume I's tokenizer-arbitrage routing (11/16) and subagent fan-out (13 tech 4); the image-handling subagent quarantines the pixels off the main prefix.
- Validation protocol: screenshot 10 representative frames; count each on both families; run the actual vision task (e.g. "what's wrong in this UI?") on both; require equal task success; report image-token delta.
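The session-level figure under "Expected savings" is simple arithmetic on the measured per-frame costs. A sketch that reproduces it, assuming ~4,760 tokens per frame on Opus and ~1,550 on Sonnet/Haiku (the midpoint of the measured 1,520–1,570 range, chosen here for illustration):

```python
def routed_savings(frames: int, expensive: int = 4760, cheap: int = 1550):
    """Tokens saved, and fraction saved, by routing frames to the cheap family.

    Defaults are approximate measured full-frame costs from this file;
    the 1,550 midpoint is an assumption, not a measured value.
    """
    if frames <= 0:
        return 0, 0.0
    saved = frames * (expensive - cheap)
    return saved, saved / (frames * expensive)


saved, fraction = routed_savings(20)
print(saved)               # 64200
print(round(fraction, 2))  # 0.67
```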
M2. Downsample screenshots to the family cap before sending
A 4K screenshot and a 1,456×819 screenshot cost the same on Sonnet (both clamp to 1,568) — but the 4K one wasted bytes and risks the high-res Opus premium. Resize client-side to the cap.
- Coverage-delta: New. No resolution/detail control appears anywhere in Volume I (0 hits).
- Layer: input (image token class).
- Mechanism: Anthropic resizes server-side to the model's native resolution regardless, so sending pixels beyond the cap buys nothing. Pre-resizing to ≤1,568 px long edge (Sonnet/Haiku) or ≤2,576 px (Opus/Fable) guarantees you pay no high-res premium you didn't intend, and keeps text in the screenshot legible at the resolution the model actually sees.
- Expected savings: on Opus/Fable, a 1920×1080 or 2560×1440 screenshot downsized to ≤1.1 MP drops from 2,699 or 4,792 to ~1,300 tokens (up to −73%) when the extra fidelity is not needed. On Sonnet it changes nothing past the cap (already clamped), so this lever matters most on the high-res family, i.e. the operator's current Opus main loop.
- Evidence tier: T1 — local measurement (the curve clamps) + vision docs' resize rule.
- Quality risk: NEUTRAL when fidelity is sufficient; QUALITY-TRADE if you downscale below legibility for fine detail. Falsify by OCR/readback on the downsized image.
- Availability: CLAUDE-CODE-TODAY via a PreToolUse hook that resizes screenshots before they enter context (the screenshot tool path); SDK for programmatic capture.
- Effort to adopt: hours (a resize hook; needs an image lib in the container — see 44/jackin').
- Composability: pairs with M1 (route then size) and M5 (crop then size).
- Validation protocol: capture at native and at capped resolution; confirm identical task success and the expected token drop on Opus.
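The resize target for M2 can be computed without an image library; only the pixel resampling needs one. A minimal sketch that fits the long edge to the family limit while preserving aspect ratio (the edge limits are the published values quoted in this file; the function name and family keys are illustrative):

```python
# Published long-edge limits per family, in pixels.
MAX_EDGE = {"sonnet-haiku": 1568, "opus-fable": 2576}


def fit_long_edge(width: int, height: int, max_edge: int) -> tuple[int, int]:
    """Return (width, height) scaled so the long edge is at most max_edge."""
    long_edge = max(width, height)
    if long_edge <= max_edge:
        return width, height  # already within the limit, leave untouched
    new_w = max(1, round(width * max_edge / long_edge))
    new_h = max(1, round(height * max_edge / long_edge))
    return new_w, new_h


print(fit_long_edge(3840, 2160, MAX_EDGE["sonnet-haiku"]))  # (1568, 882)
print(fit_long_edge(1280, 800, MAX_EDGE["sonnet-haiku"]))   # (1280, 800)
```

Fitting the long edge is not by itself enough to land under the token budget: by the patch formula a 1568×882 frame is 56 × 32 = 1,792 patches, above 1,568, so a further downscale should still be expected on the Sonnet/Haiku family.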
M3. Text over screenshot for any textual content
Screens of code, logs, DOM, terminal output, and config are 2–6× cheaper as text than as a screenshot of the same screen — and text scrolls past one frame.
- Coverage-delta: New axis. Volume I's context-architecture file (12) argues "don't send it" for text (repo maps, grep-first) but never addresses the screenshot-vs-text choice (0 vision hits).
- Layer: input (choosing text class over image class).
- Mechanism: a full-frame screenshot is a flat 1,568–4,784 tokens regardless of how little text it shows; the same content as text is priced per token and is usually far smaller (dense code screenful 593–765; wide prose 1,468–1,951). Text also preserves exact characters (a screenshot can be downscaled below legibility) and is greppable/diffable downstream.
- Expected savings: replacing a screenshot of a code screen with the text: 1,568–4,784 → ~600–800 tokens = −50% to −85%. The bigger structural win is that text is not capped at one screen, so it scales to the actual content.
- Evidence tier: T1 — local measurement of both forms.
- Quality risk: NEGATIVE-COST for textual content (cheaper and exact). The only failure mode is losing genuinely visual signal (rendered layout, color, spatial relationships) — for those, use a screenshot (M6). Falsify by checking whether the task needed pixels at all.
- Availability: CLAUDE-CODE-TODAY; habit + tool choice (read files/run gh/curl --markdown instead of screenshotting; use accessibility-tree/DOM text instead of a browser screenshot when available).
- Effort to adopt: minutes (preference); hours to wire text-first browser tools.
- Composability: the multimodal sibling of Volume I's preprocessing/CLI-over-MCP (03 record 20) and repo-maps (12).
- Validation protocol: for 10 tasks where a screenshot was the instinct, try the text path first; require equal success; only fall back to pixels when text genuinely cannot carry the signal.
M4. Markdown/text over PDF — avoid the ~2× document tax
A PDF bills the rendered page-image plus the extracted text. If the same content exists as text/markdown/HTML, sending the PDF roughly doubles the tokens for no quality gain on textual documents.
- Coverage-delta: New. Volume I's only PDF reference is the "~125k tok/500 kB" page-size estimate (03:267, 18:165); the per-page mechanism and the text-vs-PDF tax are unmeasured there.
- Layer: input (document token class).
- Mechanism: measured PDF tax of 1.98× (Opus) / 2.30× (Sonnet) over the identical text; a sparse page still floors at ~1,700 tokens for its rendered image. For born-digital documents whose text is extractable (specs, READMEs, RFCs, API docs), feed the extracted text/markdown; reserve PDF input for documents whose visual layout carries meaning (charts, scanned forms, figures).
- Expected savings: a 25-page text-extractable PDF: 78,806 tokens as PDF vs ~40,000 as extracted text = ~−50%. For a single dense page, 3,182 → 1,605 (Opus), −50%.
- Evidence tier: T1 — local measurement + Anthropic PDF docs ("each page processed as text and image"; Bedrock text-only ≈1,000 vs full ≈7,000 tok/3 pages).
- Quality risk: NEGATIVE-COST for text-extractable docs (you lose nothing the model needs). QUALITY-TRADE if the document's charts/figures/layout are load-bearing — then keep the PDF (or send only the figure pages as images). Falsify by asking a layout-dependent question against both forms.
- Availability: CLAUDE-CODE-TODAY; extract with pdftotext/a tool, or fetch the HTML/markdown source instead of the PDF.
- Effort to adopt: minutes (extract step) to hours (a hook that auto-extracts text-only PDFs).
- Composability: stacks with prompt caching (cache the extracted text once); the figure-only subset pairs with M1 (route those pages to the cheap family).
- Validation protocol: for 5 real PDFs, compare task success on PDF vs extracted-text input; adopt text where success is equal; keep PDF only for the layout-dependent ones.
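The extract step for M4 could be wrapped like this. A sketch, assuming the poppler `pdftotext` binary is installed and on PATH (the Method section notes that no PDF tooling exists on the measurement machine, so this applies to a container that has it); `pdftotext_argv` and `extract_text` are hypothetical helper names.

```python
import shutil
import subprocess
from typing import Optional


def pdftotext_argv(pdf_path: str, keep_layout: bool = True) -> list[str]:
    """Build the command line; '-' as the output file sends text to stdout."""
    argv = ["pdftotext"]
    if keep_layout:
        argv.append("-layout")  # keep the physical column layout where possible
    argv += [pdf_path, "-"]
    return argv


def extract_text(pdf_path: str) -> Optional[str]:
    """Return the PDF's extracted text, or None if pdftotext is unavailable."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        pdftotext_argv(pdf_path), capture_output=True, text=True, check=True
    )
    return result.stdout
```

An empty or near-empty result is a hint that the PDF is scanned rather than born-digital, in which case the document block (or only its figure pages) is still the right input.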
M5. Crop to the region of interest instead of full-frame capture
Visual tokens scale with area; a crop of the relevant pane is a fraction of the patches of a full 2560×1440 frame.
- Coverage-delta: New (no cropping/region discussion in Volume I).
- Layer: input (image token class).
- Mechanism: ⌈w/28⌉ × ⌈h/28⌉ is area-proportional below the cap, so a 640×400 crop = 23 × 15 = 345 tokens vs a full 2560×1440 frame at 1,568–4,784. Capture the failing dialog, not the whole desktop.
- Expected savings: typical crop to ~10–25% of frame area = −75% to −90% of the image tokens below the cap; above the cap it also avoids triggering the high-res Opus budget.
- Evidence tier: T1 — the measured area-proportional curve.
- Quality risk: NEUTRAL if the crop contains the answer; RISKY if it clips needed context. Falsify by checking task success on crop vs full frame.
- Availability: CLAUDE-CODE-TODAY (capture-region tooling) / SDK.
- Effort to adopt: minutes-to-hours depending on capture tooling.
- Composability: crop → downsize (M2) → route (M1) compose multiplicatively on the image class.
- Validation protocol: 10 UI tasks, crop vs full; require equal success; report token delta.
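The crop saving can be checked with the patch formula alone. A sketch that compares a crop against the full frame under a given family budget; it assumes capped cost equals the published budget and ignores the message wrapper, and the function names are illustrative.

```python
def patches(width: int, height: int, patch: int = 28) -> int:
    """ceil(width / patch) * ceil(height / patch), in integer math."""
    return -(-width // patch) * -(-height // patch)


def crop_saving(crop: tuple[int, int], frame: tuple[int, int], budget: int):
    """Return (crop_tokens, frame_tokens, fraction_saved) under a token budget."""
    crop_tokens = min(patches(*crop), budget)
    frame_tokens = min(patches(*frame), budget)
    return crop_tokens, frame_tokens, 1 - crop_tokens / frame_tokens


# 640x400 crop of a 2560x1440 frame on each family's published budget.
print(crop_saving((640, 400), (2560, 1440), 4784))  # Opus/Fable budget
print(crop_saving((640, 400), (2560, 1440), 1568))  # Sonnet/Haiku budget
```

On these inputs the crop is 345 tokens either way, while the full frame is 4,784 on the Opus/Fable budget and 1,568 on Sonnet/Haiku, so the relative saving is larger on the high-resolution family.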
M6. Lazy vision — screenshot only when text navigation fails, and meter every frame
Treat a screenshot as a 1,568–4,784-token tool call, not a free observation; reach for it only after text paths (DOM, logs, file reads) are exhausted.
- Coverage-delta: New (the lazy-loading idea exists for tools/skills in 12, never for vision).
- Layer: turn-structure (when a vision observation enters context at all).
- Mechanism: each screenshot is the most expensive single observation a coding agent commonly emits — more than most tool results. A policy of "text first, pixels last," plus eviction of stale screenshots from context (they rarely need to persist many turns), keeps the image class small.
- Expected savings: workload-dependent; eliminating half of an exploratory loop's 20 screenshots saves 10 × ~1,568–4,784 = 15,680–47,840 tokens/session, concentrated in the image class and (post-cache) in quota.
- Evidence tier: T1 for per-frame cost; T4 for the session-level estimate (workload-dependent).
- Quality risk: NEUTRAL-to-NEGATIVE-COST — fewer stale frames is also less context rot (12). RISKY only if a needed visual is skipped. Falsify by tracking tasks that failed for lack of a screenshot.
- Availability: CLAUDE-CODE-TODAY (habit + an eviction hook for old image blocks).
- Effort to adopt: minutes (habit) to hours (eviction hook).
- Composability: pairs with context editing/observation masking (Volume I 12/18) applied to image blocks specifically.
- Validation protocol: instrument screenshots-per-task and their re-reference rate; evict frames not referenced within N turns; confirm no task-success drop.
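One way the eviction step could look, as a sketch over a plain list of Messages-API-style dicts. Assumptions, all of them simplifications: "turn" means one message, staleness is judged by age only (not by the re-reference rate described above), and the caller owns the message list as in an SDK loop; whether a Claude Code hook can rewrite context this way is not established here. The function name and placeholder text are hypothetical.

```python
PLACEHOLDER = {"type": "text", "text": "[screenshot evicted]"}


def evict_stale_images(messages: list[dict], keep_last_turns: int = 3) -> list[dict]:
    """Return a copy of messages with image blocks in older turns replaced.

    Messages within the last keep_last_turns entries are left untouched.
    String-valued content has no blocks and is passed through as-is.
    """
    cutoff = len(messages) - keep_last_turns
    result = []
    for index, message in enumerate(messages):
        content = message["content"]
        if index < cutoff and isinstance(content, list):
            content = [
                dict(PLACEHOLDER) if block.get("type") == "image" else block
                for block in content
            ]
        result.append({**message, "content": content})
    return result
```

Replacing the block with a short text marker, instead of deleting it, keeps the transcript readable and leaves a trace that a frame once existed at that point.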
Surprising findings
- The image-token formula is patches, not pixels (⌈w/28⌉×⌈h/28⌉), and the "÷750" folklore is a coincidental approximation (784 = 28² ≈ 750). Stating it as patches makes the cap behavior obvious.
- The high-resolution upgrade that makes Opus 4.7+/Fable better at "computer use, screenshot understanding, and document analysis" (vendor framing) is, on the cost axis, a 3× image-token tax on exactly those workloads: the same lever read two ways. An agent that screenshots a lot pays for fidelity it often does not need.
- A blank-ish PDF page is not cheap: ~1,700 tokens floor because you pay for the rendered page-image regardless of text content. PDFs are the most expensive common input per unit of information.
- Haiku 4.5 and Sonnet 4.6 return identical image token counts, just as Volume I found for text: the tokenizer family boundary is the same for vision.
Verification ledger
| # | Number / claim | Source or method |
|---|---|---|
| 1 | Image cost = ⌈w/28⌉×⌈h/28⌉ visual tokens; billed at input price | platform.claude.com/docs/en/build-with-claude/vision (live fetch) |
| 2 | Published caps: Opus 4.8/Fable 5/Opus 4.7 around 4,784 tok / ≤2,576 px edge; other models around 1,568 tok / ≤1,568 px; "~3x more (4784 vs 1568)" | same page |
| 3 | Doc cost tables (Sonnet 1920×1080=1,560, 2000×1500=1,564, 3840×2160=1,560; Opus 1920×1080=2,691, 2000×1500=3,888, 3840×2160=4,784) | same page |
| 4 | Measured image curve (256²=108/110 … 1000²=1,304/1,306 … capped rows around Opus ~4,750–4,792 / Sonnet-Haiku ~1,531–1,574; practical divergence ~3.0–3.1×) | /tmp/mkpng.py (zlib PNG) → /tmp/ctimg.py count_tokens on claude-opus-4-8 / claude-sonnet-4-6 / claude-haiku-4-5; independent re-check in 50 |
| 5 | Repo PNGs validate curve: icon 512×512 = 369; og-image 1200×630 = 997/999; og-github 1280×640 = 1,066/1,068 | count_tokens on docs/public/*.png |
| 6 | PDF: 1pg×5ln = 1,742/1,700; 1pg×50ln = 3,182/2,780; 3/10/25 pg = 9,484/31,541/78,806 (Opus, ~3,150 tok/pg); 2txt+1img = 7,886/7,083 | /tmp/mkpdf.py (zlib, correct xref) → /tmp/ctpdf.py |
| 7 | PDF tax: same 50 lines raw-text Opus 1,605 / Sonnet 1,206 vs PDF 3,182 / 2,780 = 1.98× / 2.30× | count_tokens on identical text vs its 1-page PDF |
| 8 | Per-page "1,500–3,000 tokens"; each page = page-image + extracted text; Bedrock text-only ≈1,000 vs full ≈7,000 tok/3 pages; limits 32 MB / 600 pages (100 for 200k-context) | platform.claude.com/docs/en/build-with-claude/pdf-support (live fetch) |
| 9 | Screenful as text: dense Rust (~2 KB) Opus 765 / Sonnet 593; wide markdown (~4.6 KB) Opus 1,951 / Sonnet 1,468 | count_tokens on real repo files (crates/jackin-capsule/src/git_context.rs L100-149; 03-prior-art-and-market-scan.md L1-50) |
| 10 | Wrapper constant ~6–8 tok ("a" = 7; empty rejected) | count_tokens probe |
| 11 | Local env runs Opus 4.8 main (465/560 calls) + Haiku subagents (95) | transcript scan, ~/.claude/projects/**/*.jsonl |