# 42 — Multimodal token economics: images, screenshots, PDFs (https://jackin.tailrocks.com/research/token-optimization/42-multimodal-token-economics/)



# 42 — Multimodal token economics: images, screenshots, PDFs [#42--multimodal-token-economics-images-screenshots-pdfs]

Volume II area file for blind spot 2 (multimodal), which Volume
I left near-blank: a grep of all 4,303 dossier lines for image/screenshot/vision/PDF terms returns
zero substantive hits, the lone adjacent fact being one "\~125k tokens per 500 kB PDF" page-size
estimate in `03-prior-art-and-market-scan.md:267` and `18-provider-features.md:165`. Every number
below is either a live primary-doc quote or a local `count_tokens`
measurement with the method shown. Plain-language writing rules per Volume I §10.

**TL;DR**

* **An image costs `⌈width/28⌉ × ⌈height/28⌉` visual tokens** (28-pixel patches), billed at the
  model's normal input price — confirmed by local `count_tokens` reproducing Anthropic's published
  table to within the \~6-token message wrapper (e.g. 1000×1000 px = 1,296 visual tokens, measured
  1,304). Volume I never stated the formula.
* **The per-image cap differs \~3× across the two tokenizer families, and it is a routing lever.**
  Docs state Opus 4.8 / Fable 5 cap around **4,784 tokens** (≤2,576 px edge) and Sonnet 4.6 /
  Haiku 4.5 around **1,568 tokens*&#x2A; (≤1,568 px). Local counts include wrapper/envelope effects and
  land around **\~4,760 / \~1,520–1,570*&#x2A;, i.e. &#x2A;*\~3.0–3.1×**. Haiku 4.5 ≡ Sonnet 4.6 on images (identical counts), extending Volume I's tokenizer-family
  equality to vision.
* **A PDF costs \~1,500–3,000 tokens per page and roughly doubles the cost of the same text** —
  because each page is billed as a rendered page-image *plus* extracted text. Measured: the same 50
  text lines cost 1,605 tokens raw (Opus) but &#x2A;*3,182 tokens as a 1-page PDF (1.98×)**; on Sonnet
  1,206 → 2,780 (&#x2A;*2.30×**). Per-page cost is linear and family-divergent: 3,152 tok/page (Opus) vs
  \~2,750 (Sonnet) for a dense page.
* **A screenshot is a token bomb for textual content and a bargain only for visual content.** A dense
  code screenful is 593–765 text tokens but a screenshot of it costs \~1,520–1,570 (Sonnet) to \~4,760 (Opus)
  — **2–6× more**, with worse fidelity. Screenshots win only when the information is inherently
  visual (rendered layout, charts, a visual diff) or when the text equivalent would exceed the cap.
* **Net new levers for the stack:** route vision work to Sonnet/Haiku (\~3.0–3.1× image saving),
  downsample screenshots client-side to the family cap, prefer text/markdown over screenshots and
  PDFs, and crop to the region of interest. All are `CLAUDE-CODE-TODAY` or hook-level and
  NEGATIVE-COST-to-NEUTRAL on quality. None appear in Volume I.

Pricing and the modeled session profile are inherited from `01-economics-and-measurement.md`. The
Fable-family tokenizer is measured on `claude-opus-4-8` (its documented tokenizer twin) because
`count_tokens` rejects `claude-fable-5` (see 40).

***

## Method [#method]

No image or PDF tooling exists on this machine (no PIL, ImageMagick, or qpdf), so test assets were generated from the Python standard library:

* **PNGs** at controlled dimensions via `zlib` (`/tmp/mkpng.py`): a valid RGB PNG with a gradient
  (non-degenerate) so token cost reflects dimensions, not blank-image special cases.
* **PDFs** with byte-offset-correct xref tables via `zlib`/`struct` (`/tmp/mkpdf.py`): N text pages
  of Helvetica lines, optionally a final embedded-image page (the scanned-PDF case).
* **Token counts** via the free OAuth `count_tokens` endpoint (`/tmp/ctimg.py`, `/tmp/ctpdf.py`)
  sending real `image` / `document` content blocks. Counts are non-billable and cache-inert (Volume
  I, file 13). Raw `input_tokens` includes a constant \~6–8-token user-message wrapper (Volume I
  measured \~6–7); it is left in the tables and noted, never silently subtracted.

Real validation: the five PNGs in `docs/public/` were measured against the synthetic curve and agree
exactly (512×512 icon = 369 tokens synthetic and real).

## Measured: the image-token curve [#measured-the-image-token-curve]

Visual tokens = `⌈width/28⌉ × ⌈height/28⌉`, clamped to the model's edge limit and token budget. Raw
`input_tokens` below includes the wrapper; "patches" is the bare formula.

| Dimensions | Megapixels | patches (formula) | Opus 4.8 measured  | Sonnet 4.6 measured | Haiku 4.5 |
| ---------- | ---------- | ----------------- | ------------------ | ------------------- | --------- |
| 256×256    | 0.07       | 100               | 108                | 110                 | —         |
| 512×512    | 0.26       | 361               | 369                | 371                 | —         |
| 1000×1000  | 1.00       | 1,296             | 1,304              | 1,306               | 1,306     |
| 1092×1092  | 1.19       | 1,521             | 1,529              | 1,531               | 1,531     |
| 1280×800   | 1.02       | 1,334             | 1,342              | 1,344               | —         |
| 1920×1080  | 2.07       | 2,673 (Opus)      | 2,699              | **1,570 (capped)**  | 1,570     |
| 1536×1536  | 2.36       | 3,025 (Opus)      | 3,033              | **1,531 (capped)**  | —         |
| 2560×1440  | 3.69       | over cap          | **4,792 (capped)** | **1,570 (capped)**  | 1,570     |
| 2048×2048  | 4.19       | over cap          | **4,769 (capped)** | **1,531 (capped)**  | —         |
| 4000×3000  | 12.0       | over cap          | **4,748 (capped)** | **1,574 (capped)**  | 1,574     |

Two regimes. **Below \~1.1 MP both families agree** and track the patch formula. &#x2A;*Above it each
family clamps to its own budget:** Sonnet/Haiku downscale to ≤1,568 px edge / ≤1,568 tokens; Opus/
Fable to ≤2,576 px / ≤4,784 tokens. The clamp is why a 4 MP and a 12 MP image cost the *same* on a
given model — extra resolution past the cap is discarded. Treat the published caps as model-side
budgets and the measured rows as envelope-inclusive counts; exact totals vary by wrapper.

This reproduces Anthropic's published cost tables (platform.claude.com/docs/en/build-with-claude/
vision#evaluate-image-size): Sonnet 1920×1080 = 1,560 (measured 1,570); Opus
1920×1080 = 2,691 (measured 2,699); both 1000×1000 = 1,296 (measured 1,304/1,306). The docs state the
divergence directly: high-resolution models "can use up to approximately 3x more image tokens (4784
versus 1568 tokens per image)." Independent re-measurement found the practical cap
around \~4,761 / \~1,523 after subtracting envelope assumptions, so the safe claim is &#x2A;*\~3.0–3.1×**,
not an exact single-value ratio.

## Measured: the PDF tax [#measured-the-pdf-tax]

Each PDF page is billed as a rendered page-image **plus** extracted text (Anthropic PDF docs,
: "The system converts each page of the document into an image. The text from
each page is extracted and provided alongside each page's image"). The cost is therefore the
image-cap floor *plus* the text.

| PDF                       | Size   | Opus 4.8 | Sonnet 4.6 | Opus tok/page |
| ------------------------- | ------ | -------- | ---------- | ------------- |
| 1 page × 5 lines (sparse) | \<1 KB | 1,742    | 1,700      | 1,742         |
| 1 page × 50 lines (dense) | 5 KB   | 3,182    | 2,780      | 3,182         |
| 3 pages × 50 lines        | 15 KB  | 9,484    | 8,282      | 3,161         |
| 10 pages × 50 lines       | 52 KB  | 31,541   | 27,539     | 3,154         |
| 25 pages × 50 lines       | 130 KB | 78,806   | 68,804     | 3,152         |
| 2 text + 1 image page     | 723 KB | 7,886    | 7,083      | —             |

Per-page cost is **linear and \~3,150 tokens (Opus) for a dense page**, matching the docs'
"1,500–3,000 tokens per page" and Bedrock's two modes (text-only ≈1,000 tok/3 pages vs full-visual
≈7,000 tok/3 pages — the image rendering is the \~2–3× difference). The **tax of the PDF wrapper**:
the identical 50 lines of text cost 1,605 tokens raw on Opus but 3,182 as a PDF (&#x2A;*1.98×**); on
Sonnet 1,206 → 2,780 (&#x2A;*2.30×**). A sparse page still floors at \~1,700 because you pay the page-image
even with little text.

## Measured: screenshot vs. text break-even [#measured-screenshot-vs-text-break-even]

What a screenshot replaces, as text:

| Content (one screenful)                 | As text — Opus | As text — Sonnet | As a full screenshot                    |
| --------------------------------------- | -------------- | ---------------- | --------------------------------------- |
| 50 lines dense Rust (\~2 KB)            | 765            | 593              | 1,568 (Sonnet) – 4,784 (Opus)           |
| 50 lines wide markdown prose (\~4.6 KB) | 1,951          | 1,468            | \~1,520–1,570 (Sonnet) – \~4,760 (Opus) |

For **textual** content the text is cheaper on essentially every comparison, and scrolls past one
screen; the screenshot caps at a single frame and loses exact characters. A screenshot is only
cheaper when the information is **inherently visual** — a rendered chart, a layout bug, a visual diff
— where the text description would be long or impossible. On the operator's current environment
(Opus 4.8 main loop, measured: 465/560 calls), a full-frame screenshot costs around \~4,760 tokens, so
the bias toward text is strongest exactly where the operator is.

***

## Techniques [#techniques]

### M1. Vision-tier routing — send screenshots and PDFs to Sonnet/Haiku, not Opus/Fable [#m1-vision-tier-routing--send-screenshots-and-pdfs-to-sonnethaiku-not-opusfable]

The single biggest multimodal lever: the same high-resolution image costs \~3.0–3.1× fewer tokens on
the Sonnet/Haiku family because it clamps to the lower image-token budget.

* **Coverage-delta:** New. Volume I's routing file (16) and tokenizer file (11) cover the **text**
  premium but never images; "image"/"vision" is absent from both
  . The image cap divergence is a distinct, larger (\~3.0–3.1×) effect.
* **Layer:** input (image/document token class) + routing.
* **Mechanism:** Sonnet 4.6 / Haiku 4.5 downscale any image to ≤1,568 px / ≤1,568 visual tokens;
  Opus 4.8 / Fable 5 allow ≤2,576 px / around ≤4,784 tokens. For screenshot- and PDF-heavy work the cheaper
  family caps the per-image cost at roughly a third.
* **Expected savings:*&#x2A; per full-frame screenshot, roughly 4,760 → 1,520–1,570 tokens = &#x2A;*\~−67%** on the image
  token class. A screenshot-driven debugging loop of, say, 20 frames/session shifts roughly
  64k tokens off the expensive family; at cache-read rates that is modest in dollars but
  large in **quota** (file 41) and in window pressure. A 25-page PDF: 78,806 → 68,804 tokens
  (−12.7%, the text premium dominates once images are page-sized).
* **Evidence tier:** T1 — local `count_tokens` (method above) + Anthropic vision docs.
* **Quality risk:** **QUALITY-TRADE only if the visual needs >1,568-token fidelity** (fine print in a
  hi-res screenshot, dense chart). For UI state, terminal output, and most diagrams, 1,568 tokens is
  ample. NEGATIVE-COST where a fresh-context cheaper model also reduces confusion. Falsify by running
  the vision task on both families and grading whether the answer changed.
* **Availability:** CLAUDE-CODE-TODAY — pin `model: haiku`/`sonnet` on the vision-handling subagent.
* **Effort to adopt:** minutes (subagent frontmatter).
* **Composability:** stacks with Volume I's tokenizer-arbitrage routing (11/16) and subagent fan-out
  (13 tech 4); the image-handling subagent quarantines the pixels off the main prefix.
* **Validation protocol:** screenshot 10 representative frames; count each on both families; run the
  actual vision task (e.g. "what's wrong in this UI?") on both; require equal task success; report
  image-token delta.

### M2. Downsample screenshots to the family cap before sending [#m2-downsample-screenshots-to-the-family-cap-before-sending]

A 4K screenshot and a 1,456×819 screenshot cost the *same* on Sonnet (both clamp to 1,568) — but the
4K one wasted bytes and risks the high-res Opus premium. Resize client-side to the cap.

* **Coverage-delta:** New. No resolution/detail control appears anywhere in Volume I (0 hits).
* **Layer:** input (image token class).
* **Mechanism:** Anthropic resizes server-side to the model's native resolution regardless, so
  sending pixels beyond the cap buys nothing. Pre-resizing to ≤1,568 px long edge (Sonnet/Haiku) or
  ≤2,576 px (Opus/Fable) guarantees you pay no high-res premium you didn't intend, and keeps text in
  the screenshot legible at the resolution the model actually sees.
* **Expected savings:** on Opus/Fable, a 2560×1440 screenshot downsized to ≤1.1 MP drops 2,699–4,792
  → \~1,300 tokens (&#x2A;*up to −73%**) when the extra fidelity is not needed. On Sonnet it changes
  nothing past the cap (already clamped) — so this lever matters most on the high-res family, i.e.
  the operator's current Opus main loop.
* **Evidence tier:** T1 — local measurement (the curve clamps) + vision docs' resize rule.
* **Quality risk:** **NEUTRAL** when fidelity is sufficient; QUALITY-TRADE if you downscale below
  legibility for fine detail. Falsify by OCR/readback on the downsized image.
* **Availability:** CLAUDE-CODE-TODAY via a PreToolUse hook that resizes screenshots before they
  enter context (the screenshot tool path); SDK for programmatic capture.
* **Effort to adopt:** hours (a resize hook; needs an image lib in the container — see 44/jackin').
* **Composability:** pairs with M1 (route then size) and M5 (crop then size).
* **Validation protocol:** capture at native and at capped resolution; confirm identical task success
  and the expected token drop on Opus.

### M3. Text over screenshot for any textual content [#m3-text-over-screenshot-for-any-textual-content]

Screens of code, logs, DOM, terminal output, and config are 2–6× cheaper as text than as a
screenshot of the same screen — and text scrolls past one frame.

* **Coverage-delta:** New axis. Volume I's context-architecture file (12) argues "don't send it" for
  text (repo maps, grep-first) but never addresses the screenshot-vs-text choice (0 vision hits).
* **Layer:** input (choosing text class over image class).
* **Mechanism:** a full-frame screenshot is a flat 1,568–4,784 tokens regardless of how little text
  it shows; the same content as text is priced per token and is usually far smaller (dense code
  screenful 593–765; wide prose 1,468–1,951). Text also preserves exact characters (a screenshot can
  be downscaled below legibility) and is greppable/diffable downstream.
* **Expected savings:*&#x2A; replacing a screenshot of a code screen with the text: 1,568–4,784 → \~600–800
  tokens = &#x2A;*−50% to −85%**. The bigger structural win is that text is not capped at one screen, so
  it scales to the actual content.
* **Evidence tier:** T1 — local measurement of both forms.
* **Quality risk:** **NEGATIVE-COST** for textual content (cheaper *and* exact). The only failure mode
  is losing genuinely visual signal (rendered layout, color, spatial relationships) — for those, use
  a screenshot (M6). Falsify by checking whether the task needed pixels at all.
* **Availability:** CLAUDE-CODE-TODAY — habit + tool choice (read files/run `gh`/`curl --markdown`
  instead of screenshotting; use accessibility-tree/DOM text instead of a browser screenshot when
  available).
* **Effort to adopt:** minutes (preference); hours to wire text-first browser tools.
* **Composability:** the multimodal sibling of Volume I's preprocessing/CLI-over-MCP (03 record 20)
  and repo-maps (12).
* **Validation protocol:** for 10 tasks where a screenshot was the instinct, try the text path first;
  require equal success; only fall back to pixels when text genuinely cannot carry the signal.

### M4. Markdown/text over PDF — avoid the \~2× document tax [#m4-markdowntext-over-pdf--avoid-the-2-document-tax]

A PDF bills the rendered page-image plus the extracted text. If the same content exists as
text/markdown/HTML, sending the PDF roughly doubles the tokens for no quality gain on textual
documents.

* **Coverage-delta:** New. Volume I's only PDF reference is the "\~125k tok/500 kB" page-size estimate
  (03:267, 18:165); the per-page mechanism and the text-vs-PDF tax are unmeasured there.
* **Layer:** input (document token class).
* **Mechanism:** measured PDF tax of 1.98× (Opus) / 2.30× (Sonnet) over the identical text; a sparse
  page still floors at \~1,700 tokens for its rendered image. For born-digital documents whose text is
  extractable (specs, READMEs, RFCs, API docs), feed the extracted text/markdown; reserve PDF input
  for documents whose *visual layout* carries meaning (charts, scanned forms, figures).
* **Expected savings:*&#x2A; a 25-page text-extractable PDF: 78,806 tokens as PDF vs \~40,000 as extracted
  text = &#x2A;*\~−50%*&#x2A;. For a single dense page, 3,182 → 1,605 (Opus), &#x2A;*−50%**.
* **Evidence tier:** T1 — local measurement + Anthropic PDF docs ("each page processed as text and
  image"; Bedrock text-only ≈1,000 vs full ≈7,000 tok/3 pages).
* **Quality risk:** **NEGATIVE-COST** for text-extractable docs (you lose nothing the model needs).
  QUALITY-TRADE if the document's charts/figures/layout are load-bearing — then keep the PDF (or send
  only the figure pages as images). Falsify by asking a layout-dependent question against both forms.
* **Availability:** CLAUDE-CODE-TODAY — extract with `pdftotext`/a tool, or fetch the HTML/markdown
  source instead of the PDF.
* **Effort to adopt:** minutes (extract step) to hours (a hook that auto-extracts text-only PDFs).
* **Composability:** stacks with prompt caching (cache the extracted text once); the figure-only
  subset pairs with M1 (route those pages to the cheap family).
* **Validation protocol:** for 5 real PDFs, compare task success on PDF vs extracted-text input;
  adopt text where success is equal; keep PDF only for the layout-dependent ones.

### M5. Crop to the region of interest instead of full-frame capture [#m5-crop-to-the-region-of-interest-instead-of-full-frame-capture]

Visual tokens scale with area; a crop of the relevant pane is a fraction of the patches of a full
2560×1440 frame.

* **Coverage-delta:** New (no cropping/region discussion in Volume I).
* **Layer:** input (image token class).
* **Mechanism:** `⌈w/28⌉ × ⌈h/28⌉` is area-proportional below the cap, so a 640×400 crop = \~330
  tokens vs a full 2560×1440 frame at 1,568–4,784. Capture the failing dialog, not the whole desktop.
* **Expected savings:*&#x2A; typical crop to \~10–25% of frame area = &#x2A;*−75% to −90%** of the image tokens
  below the cap; above the cap it also avoids triggering the high-res Opus budget.
* **Evidence tier:** T1 — the measured area-proportional curve.
* **Quality risk:** **NEUTRAL** if the crop contains the answer; RISKY if it clips needed context.
  Falsify by checking task success on crop vs full frame.
* **Availability:** CLAUDE-CODE-TODAY (capture-region tooling) / SDK.
* **Effort to adopt:** minutes-to-hours depending on capture tooling.
* **Composability:** crop → downsize (M2) → route (M1) compose multiplicatively on the image class.
* **Validation protocol:** 10 UI tasks, crop vs full; require equal success; report token delta.

### M6. Lazy vision — screenshot only when text navigation fails, and meter every frame [#m6-lazy-vision--screenshot-only-when-text-navigation-fails-and-meter-every-frame]

Treat a screenshot as a 1,568–4,784-token tool call, not a free observation; reach for it only after
text paths (DOM, logs, file reads) are exhausted.

* **Coverage-delta:** New (the lazy-loading idea exists for tools/skills in 12, never for vision).
* **Layer:** turn-structure (when a vision observation enters context at all).
* **Mechanism:** each screenshot is the most expensive single observation a coding agent commonly
  emits — more than most tool results. A policy of "text first, pixels last," plus eviction of stale
  screenshots from context (they rarely need to persist many turns), keeps the image class small.
* **Expected savings:** workload-dependent; eliminating half of an exploratory loop's 20
  screenshots saves 10 × \~1,568–4,784 = 15,680–47,840 tokens/session, concentrated in the image
  class and (post-cache) in quota.
* **Evidence tier:** T1 for per-frame cost; T4 for the session-level estimate (workload-dependent).
* **Quality risk:** **NEUTRAL-to-NEGATIVE-COST** — fewer stale frames is also less context rot
  (12). RISKY only if a needed visual is skipped. Falsify by tracking tasks that failed for lack of a
  screenshot.
* **Availability:** CLAUDE-CODE-TODAY (habit + an eviction hook for old image blocks).
* **Effort to adopt:** minutes (habit) to hours (eviction hook).
* **Composability:** pairs with context editing/observation masking (Volume I 12/18) applied to image
  blocks specifically.
* **Validation protocol:** instrument screenshots-per-task and their re-reference rate; evict frames
  not referenced within N turns; confirm no task-success drop.

***

## Surprising findings [#surprising-findings]

* The image-token formula is **patches, not pixels** (`⌈w/28⌉×⌈h/28⌉`), and the "÷750" folklore is a
  coincidental approximation (784 = 28² ≈ 750). Stating it as patches makes the cap behavior obvious.
* The high-resolution upgrade that makes Opus 4.7+/Fable better at "computer use, screenshot
  understanding, and document analysis" (vendor framing) is, on the cost axis, a **3× image-token tax
  on exactly those workloads** — the same lever read two ways. An agent that screenshots a lot pays
  for fidelity it often does not need.
* A blank-ish PDF page is not cheap: \~1,700 tokens floor because you pay for the rendered page-image
  regardless of text content. PDFs are the most expensive common input per unit of information.
* Haiku 4.5 and Sonnet 4.6 return byte-identical image counts, just as Volume I found for text —
  the tokenizer family boundary is the same for vision.

## Verification ledger [#verification-ledger]

| #  | Number / claim                                                                                                                                                               | Source or method                                                                                                                                 |
| -- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ |
| 1  | Image cost = `⌈w/28⌉×⌈h/28⌉` visual tokens; billed at input price                                                                                                            | platform.claude.com/docs/en/build-with-claude/vision (live fetch)                                                                                |
| 2  | Published caps: Opus 4.8/Fable 5/Opus 4.7 around 4,784 tok / ≤2,576 px edge; other models around 1,568 tok / ≤1,568 px; "\~3x more (4784 vs 1568)"                           | same page                                                                                                                                        |
| 3  | Doc cost tables (Sonnet 1920×1080=1,560, 2000×1500=1,564, 3840×2160=1,560; Opus 1920×1080=2,691, 2000×1500=3,888, 3840×2160=4,784)                                           | same page                                                                                                                                        |
| 4  | Measured image curve (256²=108/110 … 1000²=1,304/1,306 … capped rows around Opus \~4,750–4,792 / Sonnet-Haiku \~1,531–1,574; practical divergence \~3.0–3.1×)                | `/tmp/mkpng.py` (zlib PNG) → `/tmp/ctimg.py` count\_tokens on claude-opus-4-8 / claude-sonnet-4-6 / claude-haiku-4-5; independent re-check in 50 |
| 5  | Repo PNGs validate curve: icon 512×512 = 369; og-image 1200×630 = 997/999; og-github 1280×640 = 1,066/1,068                                                                  | count\_tokens on `docs/public/*.png`                                                                                                             |
| 6  | PDF: 1pg×5ln = 1,742/1,700; 1pg×50ln = 3,182/2,780; 3/10/25 pg = 9,484/31,541/78,806 (Opus, \~3,150 tok/pg); 2txt+1img = 7,886/7,083                                         | `/tmp/mkpdf.py` (zlib, correct xref) → `/tmp/ctpdf.py`                                                                                           |
| 7  | PDF tax: same 50 lines raw-text Opus 1,605 / Sonnet 1,206 vs PDF 3,182 / 2,780 = 1.98× / 2.30×                                                                               | count\_tokens on identical text vs its 1-page PDF                                                                                                |
| 8  | Per-page "1,500–3,000 tokens"; each page = page-image + extracted text; Bedrock text-only ≈1,000 vs full ≈7,000 tok/3 pages; limits 32 MB / 600 pages (100 for 200k-context) | platform.claude.com/docs/en/build-with-claude/pdf-support (live fetch)                                                                           |
| 9  | Screenful as text: dense Rust (\~2 KB) Opus 765 / Sonnet 593; wide markdown (\~4.6 KB) Opus 1,951 / Sonnet 1,468                                                             | count\_tokens on real repo files (`crates/jackin-capsule/src/git_context.rs` L100-149; `03-prior-art-and-market-scan.md` L1-50)                  |
| 10 | Wrapper constant \~6–8 tok ("a" = 7; empty rejected)                                                                                                                         | count\_tokens probe                                                                                                                              |
| 11 | Local env runs Opus 4.8 main (465/560 calls) + Haiku subagents (95)                                                                                                          | transcript scan, `~/.claude/projects/**/*.jsonl`                                                                                                 |
