# 15 — Output discipline and structured generation (https://jackin.tailrocks.com/research/token-optimization/15-output-discipline/)


# 15 — Output discipline and structured generation [#15--output-discipline-and-structured-generation]

**TL;DR**

* Output tokens are the 5x-priced class ($50/MTok out vs $10 in on Fable 5) and \~37% of a heavy day's dollars (thinking 20% + visible output 17%, local measurement). Locally, thinking is 54.8% of output tokens at max effort (n=1) — and on Fable 5 **only the effort parameter can touch that share**; thinking cannot be disabled.
* Biggest sanctioned number in this space: **Opus 4.5 at medium effort matched Sonnet 4.5's best SWE-bench Verified score with 76% fewer output tokens** (Anthropic announcement). No equivalent curve exists for Fable 5 / Opus 4.8 — reproduce before standardizing.
* Biggest *measured-here* number: &#x2A;*Edit-style diffs beat whole-file rewrites by 89.3%** on a realistic 5-line change to a 200-line repo file (347 vs 3,197 tok, count\_tokens, claude-fable-5) — and aider's benchmark says diff formats simultaneously *improve* quality (20%→61%). Rare NEGATIVE-COST.
* Suppressing post-edit code restatement saves a measured &#x2A;*352 tok/edit (91.4%)**; a "Here is the JSON…" wrapper costs a measured **29 tok/call**.
* Terseness has a floor: "be concise" captures only \~1.4x of a 4.8–11.2x theoretically-safe compression, degradation past per-problem token thresholds is a cliff not a slope, and >10x compression measurably increases hallucination. Route brevity at restatement and ceremony, never at reasoning. Techniques that trade quality are marked **QUALITY-TRADE**.

## Why this layer matters [#why-this-layer-matters]

On the modeled heavy day (\~$22, 6 sessions, \~27k output tok/session — see 01-economics-and-measurement.md), output is \~$8.14/day: \~$4.40 thinking + \~$3.74 visible (prose + tool arguments). Three structural facts shape everything below:

1. **Thinking is the majority of output (54.8% locally) and is prompt-immune on Fable 5.** "Thinking cannot be turned off on Fable 5. The session toggle, `alwaysThinkingEnabled`, and `MAX_THINKING_TOKENS=0` have no effect there" (code.claude.com/docs/en/model-config). Effort is the only knob.
2. **Tool arguments are output.** A Write call's `content` field is model-generated text billed at $50/MTok. Edit-vs-Write is therefore an output-discipline question, not a tooling preference.
3. **Visible prose is the smallest slice (\~17% of dollars) but the only slice prompts can address directly** — which is why prompt-level brevity folklore systematically overpromises (it can't reach the other 83%).

## Local measurements [#local-measurements]

Method: free `count_tokens` API, model `claude-fable-5`, each payload sent as a single user message; 1-char baseline message = 7 tok envelope, subtracted in % columns. Measurement script reproduced in 31-validation-harness.md style: simulate the exact JSON a tool call would carry. File under test: `crates/jackin-config/src/app_config_workspaces.rs` (200 lines); change under test: a realistic 5-line method addition with a 5-line unique anchor.

| Measurement                                                         | Tokens | Comparator                                                           | Tokens | Saved               |
| ------------------------------------------------------------------- | ------ | -------------------------------------------------------------------- | ------ | ------------------- |
| Edit tool JSON payload (old\_string+new\_string)                    | 347    | Write tool JSON payload (full new file)                              | 3,197  | **89.3%**           |
| Same, raw strings without JSON escaping                             | 256    | raw full file content                                                | 2,851  | 91.2%               |
| Terse post-edit summary (`Added has_workspace at &lt;file&gt;:20.`) | 40     | verbose summary restating an 11-line code block + narration + closer | 392    | **352 tok (91.4%)** |
| Bare JSON object (extraction answer)                                | 35     | same object in "Here is the JSON…" + fences + closer                 | 64     | **29 tok/call**     |
| Write-vs-Edit crossover (ESTIMATE from above)                       | —      | 3,190 net Write ÷ 340 net per-hunk Edit                              | —      | ≈ 9.4 hunks         |

These independently reproduce the Phase-0 sweep numbers (89.5% / 302 tok / 15–25 tok est.) within noise of payload phrasing. At $50/MTok: an avoided whole-file rewrite of this file saves \~$0.143; an avoided verbose summary \~$0.018; an avoided JSON wrapper \~$0.0015.

***

## Techniques [#techniques]

### 1. Effort parameter (`/effort`, `output_config.effort`) [#1-effort-parameter-effort-output_configeffort]

One-line pitch: Anthropic's only sanctioned whole-response throttle — one knob that shrinks thinking, prose, and tool-call count together, with the strongest published number in this entire dossier.

* **Layer:** output (all sub-classes: thinking, text, tool arguments).
* **Mechanism:** "Effort is a behavioral signal, not a strict token budget" (platform effort doc). Levels `low|medium|high|xhigh|max`; `high` "produces exactly the same behavior as omitting the parameter". On adaptive-thinking models (Fable 5, Opus 4.7+) effort is the recommended/only thinking-depth control; `budget_tokens` is deprecated (4.6) or rejected (4.7+). Lower effort documented to "combine multiple operations into fewer tool calls… proceed directly to action without preamble… use terse confirmation messages". Claude Code surfaces: `/effort` slider, `--effort`, `effortLevel` setting, `CLAUDE_CODE_EFFORT_LEVEL` env var, and per-skill/per-subagent `effort:` frontmatter. Defaults in Claude Code: `high` on Fable 5/Opus 4.8/Opus 4.6/Sonnet 4.6, `xhigh` on Opus 4.7 (raw API default is `high` everywhere — different surfaces, both). `max` is session-only and "prone to overthinking. Test before adopting broadly". `ultracode` = `xhigh` + workflow orchestration, not an API level; `ultrathink` = one-turn in-context instruction, API effort unchanged.
* **Expected savings:** T1 anchor: "Set to a medium effort level, Opus 4.5 matches Sonnet 4.5's best score on SWE-bench Verified, but uses 76% fewer output tokens"; at highest effort "+4.3 percentage points… 48% fewer tokens"; long-horizon "up to 65% fewer"; GitHub Copilot "cutting token usage in half" (Opus 4.5 announcement). ESTIMATE on the modeled profile *if* 76% transferred to Fable 5 high→medium: 0.76 × $8.14 ≈ $6.19/day, plus unmodeled second-order savings from fewer tool calls (smaller cache-write growth). That transfer is unproven — the figures are Opus 4.5 + SWE-bench only; the effort scale "is calibrated per model, so the same level name does not represent the same underlying value across models".
* **Evidence tier:** T1 (anthropic.com/news/claude-opus-4-5; platform.claude.com/docs/en/build-with-claude/effort; code.claude.com/docs/en/model-config — all).
* **Quality risk:** stepping down **from max**: NEGATIVE-COST (docs: max "may show diminishing returns and is prone to overthinking"). high→medium/low: **QUALITY-TRADE** — Opus 4.7+ "respects effort levels more strictly… scopes its work to what was asked rather than going above and beyond"; degradation manifests as shallower investigation, skipped verification, fewer edge cases handled. Notably Anthropic itself recommends `medium` as the default for Sonnet 4.6. Falsification: your eval suite passes at medium with ≥ high's pass rate.
* **Availability:** CLAUDE-CODE-TODAY.
* **Effort to adopt:** minutes (`/effort medium`; `effort: low` in subagent frontmatter).
* **Composability:** composes with everything below; it is the *only* lever on the 54.8% thinking share. Per-subagent `effort: low` under a high-effort main loop is a shipped, documented pattern almost no cost guide mentions. Anti-synergy: partially redundant with terse-by-instruction (#3) — low effort already produces terse confirmations.
* **Validation protocol:** fix a 10-task repo-representative suite (bugfix, refactor, test-add). Run 3x at high and at medium with identical prompts. Record `usage.output_tokens`, wall time, pass rate (tests green), and thinking share (output\_tokens minus count\_tokens of visible blocks — same method as Phase-0). Adopt medium where pass-rate delta ≤ 0; this experiment also closes the open thinking-fraction-by-effort gap (#2 in Gaps).

### 2. Diff-style edits (Edit) instead of whole-file rewrites (Write) [#2-diff-style-edits-edit-instead-of-whole-file-rewrites-write]

One-line pitch: a 5-line change costs 10x less via Edit than Write — measured here at 89.3% — and diff-style editing independently *improves* code quality, making this the cleanest NEGATIVE-COST item in the dossier.

* **Layer:** output (tool arguments).
* **Mechanism:** Edit emits only `old_string`+`new_string` (change + minimal unique anchor); Write re-emits the entire file as generated output billed at $50/MTok. Aider's complementary quality finding: diff-style formats make the model treat edits "as data" rather than chat, suppressing lazy placeholder comments; their flexible patch-application layer is load-bearing ("Experiments where flexible patching is disabled show a 9X increase in editing errors").
* **Expected savings:** (table above): 347 vs 3,197 tok JSON payloads = 89.3% (10.4x) on a 200-line file; $0.160 vs $0.017 per edit. Savings scale with file-size:change-size ratio. ESTIMATE on modeled profile: if 10 would-be whole-file rewrites/day become Edits, \~28.5k output tok ≈ $1.43/day. Crossover ESTIMATE: Write becomes cheaper only past \~9.4 scattered hunks of this size on this file (3,190 ÷ 340); below that, multiple Edit calls still win.
* **Evidence tier:** T1 — local reproduction (method above) + aider benchmark: GPT-4 Turbo (`gpt-4-1106-preview`) 20% → 61% on an 89-task refactoring benchmark, lazy comments on 12 → 4 of 89 tasks ("reduced laziness by 3X") (aider.chat/docs/unified-diffs.html). Caveat: aider numbers are Dec-2023/GPT-4-Turbo era — stale for Claude 4.6+/5, though Anthropic keeping string-replace as Claude Code's default editing primitive is implicit confirmation.
* **Quality risk:** &#x2A;*NEGATIVE-COST.** Residual: failed `old_string` matches force retry round-trips; degradation manifests as repeated "String to replace not found" tool errors. Falsification: count failed-match retries per 100 edits before/after; if retry waste exceeds rewrite savings on your codebase, the anchor-discipline (not the technique) is broken.
* **Availability:** CLAUDE-CODE-TODAY (default behavior).
* **Effort to adopt:** zero for defaults; minutes for a one-line CLAUDE.md guard: "prefer Edit over Write for existing files; never rewrite a file to change a few lines."
* **Composability:** disjoint from #3 (tool args vs prose) — they stack. Orthogonal to effort. Anti-synergy: none known.
* **Validation protocol:** replay 20 historical edits from transcripts both ways (reconstruct the Write payload from the post-state); count both payloads via count\_tokens; `cargo check` the results for equivalence; track failed-match retry rate. Accept if retries × full-payload cost \< 10% of measured savings.

### 3. Suppress restated code and post-action narration (terse-by-instruction) [#3-suppress-restated-code-and-post-action-narration-terse-by-instruction]

One-line pitch: every "Here's what the updated section now looks like:" + re-quoted block costs a measured 352 tokens the user can already see in the tool-call diff.

* **Layer:** output (visible prose).
* **Mechanism:** CLAUDE.md/system-prompt instruction: reference `file:line` instead of re-quoting; no preamble/postamble; no offer-to-do-more closers. The transcript already renders the diff, so restatement is pure duplication. Strongest precedent: Claude Code's own shipped system prompt — community-documented versions instruct answering in "fewer than 4 lines… unless user asks for detail" and to "minimize output tokens" (NOT locally verifiable: the v2.1.175 binary is a compiled Bun bundle, strings not greppable; T3 for that sub-claim).
* **Expected savings:**: verbose post-edit summary 392 tok vs terse 40 tok = 352 tok/edit (91.4%). ESTIMATE: 50 edits/heavy-day × 352 tok = 17.6k tok ≈ $0.88/day at $50/MTok; per modeled session, 10 verbose summaries ≈ 3.5k tok ≈ 29% of the 12.2k visible-output budget. Measure your actual restatement frequency before claiming the multiple.
* **Evidence tier:** T1 for the per-instance delta (local measurement); T3 for the shipped-prompt precedent; the documented low-effort behaviors ("proceed directly to action without preamble… terse confirmation messages", effort doc) confirm the same behaviors are sanctioned.
* **Quality risk:** NEUTRAL→**NEGATIVE-COST** for code restatement (information is duplicated); **RISKY if extended to caveats/warnings** — see #10. Degradation manifests as the operator re-asking "what changed?" or missing a warning that was squeezed out. Falsification: A/B the rule and count re-ask turns.
* **Availability:** CLAUDE-CODE-TODAY.
* **Effort to adopt:** minutes (1–2 CLAUDE.md lines). The rule itself is always-on input (root <RepoFile path="AGENTS.md">AGENTS.md</RepoFile> is already 2,738 tok, local measurement) but cached at 0.1x — a \~20-tok rule pays for itself after a single avoided restatement.
* **Composability:** stacks cleanly with #2 (disjoint token populations). Partially redundant with effort=low (#1) which already produces terse confirmations — adopting both, attribute savings carefully. Unknown: instruction persistence over 100+ turn sessions (Gaps #3).
* **Validation protocol:** two matched days of real sessions with/without the rule; per session count visible-output tokens per edit action (transcript parse) and operator re-ask rate. Accept if tokens/edit drops ≥30% with zero re-ask increase.

### 4. Chain of Draft / Concise CoT (visible-reasoning compression) — **QUALITY-TRADE** [#4-chain-of-draft--concise-cot-visible-reasoning-compression--quality-trade]

One-line pitch: "keep each thinking step to 5 words" cuts response tokens \~79–92% on benchmark tasks — but the famous "7.6% of tokens" is the single best-case cell, and on GSM8K it costs 4.3–4.4 accuracy points.

* **Layer:** output (visible prose on non-thinking calls).
* **Mechanism:** prompt-level cap on per-step (CoD) or whole-response (CCoT "Be concise") length, applied to *visible* chain-of-thought. Critical Claude Code caveat: on Fable 5/Opus 4.7+ reasoning happens in billed thinking blocks that prompts do not cap (effort does); CoD-style instructions only bite on non-thinking calls — Haiku 4.5 sidecars, thinking-disabled Sonnet subagents — or on visible answer structure.
* **Expected savings:** CoD (arXiv 2502.18600, Tables 1/3): GSM8K GPT-4o CoT 95.4% @ 205.1 tok vs CoD 91.1% @ 43.9 tok (−78.6% tokens, −4.3 pts); Claude 3.5 Sonnet 95.8% @ 190.0 vs 91.4% @ 39.8 (−79.1%, −4.4 pts); sports understanding Claude CoT 93.2% @ 189.4 vs CoD 97.3% @ 14.3 — the "7.6% of tokens" headline AND +4.1 pts. CCoT (arXiv 2401.05618): −48.70% response length both GPT-3.5/4, −22.67% average per-token cost, but −27.69% accuracy on math for GPT-3.5. On the modeled profile: main-loop impact ≈ 0 (thinking-immune); for a Haiku subagent fleet, \~79% off those calls' output.
* **Evidence tier:** T2 (peer-reviewed-track arXiv, exact tables). No published CoD results on Claude 4.x/5 thinking models — the technique predates adaptive thinking.
* **Quality risk:** **QUALITY-TRADE** on hard math/multi-step reasoning (cliff per #10); NEGATIVE-COST on classification/retrieval-shaped tasks (sports accuracy went UP). Known failure modes: zero-shot CoD on Claude 3.5 Sonnet "improved performance over direct answering by only 3.6%" (few-shot examples are load-bearing) and breaks down on \<3B models. Falsification: run your subagent task class with/without; reject if pass rate drops at all on reasoning-shaped tasks.
* **Availability:** CLAUDE-CODE-TODAY (for subagent prompts / sidecar API calls); pointless against the main Fable 5 loop.
* **Effort to adopt:** minutes-to-hours (one instruction + 2–3 few-shot examples in the subagent definition).
* **Composability:** does NOT stack with effort on thinking models — the model thinks privately AND drafts publicly = double-spend. Pair with `effort: low` Haiku subagents (#1) instead. Stacks with #5 for extraction sidecars.
* **Validation protocol:** 50-call eval on the actual subagent workload (e.g. investigator summaries) on Haiku 4.5, CoD vs plain, few-shot included; score with the main model as judge + spot-check; require ≤1 pt quality drop for ≥70% token cut.

### 5. Structured outputs (`output_config.format` json\_schema + strict tool use) [#5-structured-outputs-output_configformat-json_schema--strict-tool-use]

One-line pitch: grammar-constrained decoding kills both the prose wrapper around JSON and the 100%-loss class of parse-failure retries — GA on all current models including Fable 5, but with a documented cache-invalidation trap.

* **Layer:** output (format waste) + cache (pitfall).
* **Mechanism:** `output_config.format` with a JSON schema constrains decoding so the response IS the object — no fences, no "Here is the JSON you requested" (a measured 29 tok/call, table above), no trailing commentary. `tools[].strict: true` guarantees tool arguments match `input_schema`. "Reliable: No retries needed for schema violations." Grammar compiles on first use (latency), "cached for 24 hours from last use". Injects a hidden system prompt: "Your input token count is slightly higher" (size unquantified — Gaps #6).
* **Expected savings:** no published percentage. ESTIMATE: (wrapper ≈ 29 tok/call measured) + (parse-failure rate × full response cost); for a 500-tok extraction failing 5% of the time, retry elimination ≈ 25 tok/call expected value — the dominant term, not the prose trim. Main-loop impact ≈ 0 (not exposed there).
* **Evidence tier:** T1 (platform.claude.com/docs/en/build-with-claude/structured-outputs, all; wrapper figure local measurement).
* **Quality risk:** NEUTRAL on format. Two documented schema-violating edge cases: `stop_reason: "refusal"` (200 status, billed, schema overridden) and `stop_reason: "max_tokens"` (truncated JSON — "Retry with a higher max\_tokens"). Schema limits: no recursive schemas, no numeric `minimum`/`maximum`/`multipleOf`, max 20 strict tools, 24 optional params across strict schemas. Falsification: log schema-parse failures; should be exactly zero outside those two stop reasons.
* **Availability:** SDK (and subagent/SDK tool-schema paths). The Claude Code REPL does NOT expose schema-constrained final answers (open feature request, github.com/anthropics/claude-code/issues/9058) — though this very research task returns via a StructuredOutput tool schema, the subagent path.
* **Effort to adopt:** hours (schema design + pipeline wiring).
* **Composability:** **CACHE PITFALL** — "Changing the `output_config.format` parameter will invalidate any prompt cache for that conversation thread." Against the local profile (92.83% of prompt-side tokens are 0.1x cache reads), rotating schemas mid-thread converts reads back to 1x + 1.25x writes and can swamp the output saving. One stable schema per thread; see 13-caching-exploitation.md. Supersedes prefill (#6); redundant with stop sequences (#8) for JSON.
* **Validation protocol:** 200-call extraction batch, constrained vs unconstrained: measure parse-failure rate, total output tokens, input-token delta (quantifies the hidden prompt), and cache-hit rate across the thread. Accept if failures → 0 with input-token overhead \< wrapper savings and zero unintended cache invalidations.

### 6. Assistant prefill to skip preamble — a dying technique [#6-assistant-prefill-to-skip-preamble--a-dying-technique]

One-line pitch: the classic "prefill `{`" trick still works on non-thinking models but is documented incompatible with extended thinking — and Fable 5's thinking can't be disabled, so prefill is unavailable exactly where output is most expensive.

* **Layer:** output (request shape).
* **Mechanism:** send a partial assistant turn; the model continues from it, skipping preamble (docs: prefilling `{` "forces Claude to skip the preamble and directly output the JSON object"). Constraint: "Extended thinking is incompatible with prefilling… the API will return an error"; community migration guides document 400s on Opus 4.6+ adaptive-thinking requests with prefills (T3 sub-claim).
* **Expected savings:** the measured wrapper, 29 tok/call (local), at $50/MTok ≈ $0.0015/call — material only at pipeline volume.
* **Evidence tier:** T1 for mechanics/incompatibility (platform prefill doc — note: retrieved via search snippets; direct fetch redirected to the prompt-engineering overview, page may be mid-reorganization); T3 for the Opus 4.6 400-error reports (blog.laozhang.ai).
* **Quality risk:** NEUTRAL where it works; trailing-whitespace rules apply. Falsification: trivial — the API errors loudly when incompatible.
* **Availability:** SDK only, and only on Haiku 4.5 / thinking-disabled Sonnet. Impossible in Claude Code interactive and on Fable 5.
* **Effort to adopt:** minutes in raw API pipelines.
* **Composability:** mutually exclusive with thinking and with #1's thinking control; superseded by #5 (docs explicitly recommend structured outputs for guaranteed schemas). Treat as legacy for cheap-model sidecars only.
* **Validation protocol:** for a Haiku extraction sidecar, A/B prefill-`{` vs structured outputs on 100 calls: compare output tokens, input overhead (structured outputs' hidden prompt), latency (grammar compile), and failure rate; keep whichever wins — and re-test after any model bump, since this is the technique most likely to silently die.

### 7. `max_tokens` as a hard output budget — safety rail, not optimizer [#7-max_tokens-as-a-hard-output-budget--safety-rail-not-optimizer]

One-line pitch: a tight cap in agent loops *backfires*: a truncated tool\_use must be retried at a HIGHER limit, paying for the dead attempt in full.

* **Layer:** output / infra (circuit breaker).
* **Mechanism:** generation halts at the limit with `stop_reason: "max_tokens"`; the model does not plan toward the budget, it gets amputated mid-structure. Docs' own remediation: "If Claude's response is cut off due to hitting the max\_tokens limit, and the truncated response contains an incomplete tool use block, you'll need to retry the request with a higher max\_tokens value" . Anthropic's guidance for high-effort agents is the opposite of tight caps: "set a large max\_tokens… Starting at 64k tokens and tuning from there." On Fable 5, max\_tokens "is a hard limit on total output, thinking plus response text" — one pool.
* **Expected savings:** zero as an optimizer; positive only as a runaway-cost breaker (a 64k cap bounds worst-case single response at 64k × $50/MTok = $3.20).
* **Evidence tier:** T1 (platform.claude.com/docs/en/api/handling-stop-reasons + effort doc, both).
* **Quality risk:** **RISKY when set tight**: mid-structure truncation (schema-invalid JSON under #5 — documented edge case), lost work, retry double-billing. Degradation manifests as `max_tokens` stop reasons in logs. Falsification: any nonzero rate of max\_tokens stops on tool-bearing turns means the cap is destroying value.
* **Availability:** SDK (Claude Code manages it internally; not user-tunable per request in the REPL).
* **Effort to adopt:** minutes.
* **Composability:** use WITH effort (#1), never instead: effort shapes spend before generation, max\_tokens amputates after. Python SDK requires streaming above \~21k.
* **Validation protocol:** in pipelines, log stop\_reason distribution for a week; set max\_tokens at p99.9 of observed output + margin; alert on any `max_tokens` stop. Total cost must be ≤ uncapped baseline (it will be, since the cap should never fire).

### 8. Stop sequences for single-shot generation [#8-stop-sequences-for-single-shot-generation]

One-line pitch: sampler-enforced cutoff — "output the answer then END" + `stop_sequences:["END"]` means you never pay for trailing rationale or "let me know if…" closers in pipeline calls.

* **Layer:** output.
* **Mechanism:** custom strings halt decoding instantly (`stop_reason: "stop_sequence"`, matched value reported in `stop_sequence`). Unlike instructions, compliance is enforced by the sampler, not the model.
* **Expected savings:** no published numbers. ESTIMATE: bounded by typical trailing content, \~20–150 tok/call on extraction tasks; \~0 in Claude Code agent loops (turns already end on tool\_use/end\_turn, and thinking happens before text).
* **Evidence tier:** T1 for the mechanism (handling-stop-reasons doc); T4 for the savings magnitude on current models — Claude 4.6+/5 already end turns crisply; the advice was minted on 3.x-era chattiness.
* **Quality risk:** NEUTRAL with well-designed sentinels; RISKY if the sentinel can occur in legitimate output (premature cutoff). Falsification: scan outputs for the sentinel-as-content collision rate before deploying.
* **Availability:** SDK only (not exposed in Claude Code interactive).
* **Effort to adopt:** minutes.
* **Composability:** redundant with #5 (schema end = natural stop); useful where a schema is overkill (single values, free-text snippets). Stacks with #4/#6 on cheap-model sidecars.
* **Validation protocol:** 100-call A/B on the sidecar task; measure mean output tokens and answer-truncation rate (must be 0); adopt only if trailing-content baseline actually exists on the current model.

### 9. Terse-by-persona (output styles / caveman register) vs terse-by-instruction — **QUALITY-TRADE** at aggressive registers [#9-terse-by-persona-output-styles--caveman-register-vs-terse-by-instruction--quality-trade-at-aggressive-registers]

One-line pitch: persona-level brevity delivers a measured \~58% cut of visible prose and is structurally cache-stable — but its claimed durability edge over a plain instruction has never been measured, and research says style choice doesn't move the accuracy-length frontier anyway.

* **Layer:** output (visible prose), via system prompt (input-side rider, cached at 0.1x).
* **Mechanism:** Claude Code output styles replace part of the default system prompt (re-sent and cache-stable every turn); CLAUDE.md content rides as a user message after the default system prompt. The caveman plugin implements register-based persona compression as a skill. History: output styles deprecated v2.0.30 , un-deprecated v2.0.32 after backlash; `/output-style` command removed v2.1.91 but the `outputStyle` setting remains via `/config` (github.com/anthropics/claude-code/issues/10671).
* **Expected savings:** local Phase-0 ground truth : caveman ultra = &#x2A;*58.5%** token cut on visible prose (the plugin's "\~75%" claim is character-level folklore); wenyan-full = 80.9% char cut but only &#x2A;*56.6%*&#x2A; token cut (CJK ≈ 1.47 chars/tok vs 3.35 English); wenyan-ultra = 74.5% token cut. Applies ONLY to visible prose ≈ 17% of session dollars: upper bound 0.585 × $3.74 ≈ **$2.19/day** on the modeled profile if every visible token were prose (it isn't — tool args share that class, so real ceiling is lower).
* **Evidence tier:** T1 for the local cut measurements; T3 for plugin claims; T2 for the frontier argument — format/language variants "lie on a universal trade-off curve between response length and accuracy" incl. "in Chinese" and no-spaces variants (arXiv 2503.01141). The persona is a delivery vehicle for a length target, not a free lunch.
* **Quality risk:** **QUALITY-TRADE** at aggressive registers (#10 governs), plus a human-comprehension cost. Degradation manifests as dropped qualifiers and operator misreadings. Falsification: same task suite styled/unstyled, diff for lost caveats (grep negations/"not"/"however"), and have a second reader rate ambiguity.
* **Availability:** CLAUDE-CODE-TODAY (plugin install or `outputStyle` setting).
* **Effort to adopt:** minutes.
* **Composability:** persona text lives in the cached prefix (cheap at 0.1x); composes with #2/#3. Overlaps #1's low-effort terseness — stacking all three risks triple-pressuring length on hard tasks (see #10). Subagent variant (cavecrew, \~60% smaller results fed back into context) saves *input* downstream — quality side unmeasured (T4).
* **Validation protocol:** 10 mixed tasks × \{default, caveman-full, caveman-ultra}: measure visible-output tokens, task pass rate, caveat-retention (count negations/disclaimers in both outputs), operator comprehension quiz. Adopt the strongest register with zero caveat loss; the durability question (does persona hold over 100+ turns better than a turn-1 instruction?) is currently T4 — log tokens/turn over a long session to convert it.

### 10. The terseness floor: token-complexity thresholds, hallucination onset, negation loss [#10-the-terseness-floor-token-complexity-thresholds-hallucination-onset-negation-loss]

One-line pitch: the anti-claim that bounds every entry above — each problem has a sharp minimum-token threshold; "be concise" reaches only \~30% of safe compression; past \~10x, hallucination rises and the load-bearing small words (negations, disclaimers) drop first.

* **Layer:** output (design constraint over all of the above).
* **Mechanism:** token complexity (arXiv 2503.01141): per problem there exists a threshold τ "such that for any prompt, the LLM gets the answer correct iff the token length is above τ" — cliff, not slope; success is predictable from token count alone with accuracy "above 90% for most models and benchmarks". Safe compression therefore requires ADAPTIVE allocation (more tokens to harder problems) — which is exactly what adaptive thinking + effort implements at platform level and what blanket "always be brief" cannot do. Information-loss side: moderate (2–6x) prompt compression "typically preserves or even enhances LLM performance… while extremely aggressive ratios (>10x) incur pronounced information loss and sometimes increase hallucination rates" (arXiv 2505.00019); compression modules "may discard key information such as negation cues (e.g., 'not'), factual disclaimers, time and location markers" (CompressionAttack, arXiv 2510.22963). CCoT's −27.69% GPT-3.5 math penalty and CoD's zero-shot collapse are output-side instances of the same threshold effect.
* **Expected savings:** none — this bounds the others. Quantified headroom: BeConcise achieves 1.41x where 4.83x is theoretically safe (MMLU-Pro Math), 1.47x vs 11.16x (GSM8K), 1.26x vs 3.69x (MATH-500) — naive instructions capture under a third of safe compression, and the remainder requires per-problem adaptivity, not harder squeezing.
* **Evidence tier:** T2 (arXiv 2503.01141 verified directly; 2505.00019 / 2510.22963 / EMNLP-2025 aclanthology.org/2025.emnlp-main.389). The output-side analog of negation loss (do terse *output* registers drop caveats?) is unmeasured — T4 extrapolation.
* **Quality risk:** this entry IS the risk model: assume NEGATIVE-COST terseness up to \~2x on prose, QUALITY-TRADE beyond, cliff behavior past per-problem thresholds; audit compressed outputs for dropped negations/caveats specifically.
* **Availability:** NOT-USER-ACCESSIBLE (design principle, not a switch).
* **Effort to adopt:** n/a — practical rule: aim brevity pressure at easy/format-shaped content (summaries, confirmations, restatements — #2, #3, #5) and never at reasoning on hard problems (#1 medium-vs-high is the safe adaptive form; #4/#9 aggressive registers are the dangerous fixed form).
* **Composability:** explains why effort (adaptive) dominates fixed budgets (#7) and blanket personas (#9) at equal average savings.
* **Validation protocol:** for any adopted brevity stack, maintain a canary set of 5 hard-reasoning tasks; run weekly; any pass-rate drop triggers register rollback. Separately, diff terse-vs-default outputs for negation/disclaimer counts (closes the T4 gap).

***

## Claims to kill [#claims-to-kill]

| Folklore claim                                     | Why it dies                                                                                                                                                                                                    | Evidence                                               |
| -------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------ |
| "CoD matches CoT at 7.6% of the tokens"            | 7.6% is the single best-case cell (sports understanding, Claude 3.5 Sonnet); on GSM8K CoD loses 4.3–4.4 pts at \~21% of tokens; zero-shot gain collapses to +3.6%                                              | arXiv 2502.18600 Tables 1/3                            |
| "Set max\_tokens low to cap agent spend"           | Docs instruct retrying truncated tool\_use WITH HIGHER max\_tokens — tight caps bill the dead attempt plus the retry; under structured outputs truncation yields invalid JSON                                  | handling-stop-reasons + structured-outputs docs        |
| "Concise CoT is free"                              | −27.69% accuracy on GPT-3.5 math despite "negligible" average impact; degradation below per-problem thresholds is a cliff                                                                                      | arXiv 2401.05618 + 2503.01141                          |
| "Prefill `{` is current best practice for JSON"    | Incompatible with extended thinking ("the API will return an error"); thinking is non-disableable on Fable 5; structured outputs is the documented replacement                                                 | prefill doc (via snippets) + model-config doc          |
| "Caveman/persona compression cuts \~75%"           | Local measurement: 58.5% token cut at ultra register — 75% is character-level folklore; wenyan's 80.9% char cut collapses to 56.6% token cut (CJK \~1.47 chars/tok); no encoding shortcut around length itself | local measurement + arXiv 2503.01141                   |
| "'Be concise' captures most available compression" | Measured at 1.41x of 4.83x-safe (MMLU-Pro Math), 1.47x of 11.16x (GSM8K) — under a third; the rest needs per-problem adaptivity                                                                                | arXiv 2503.01141                                       |
| "Run everything at max effort"                     | "max… may show diminishing returns and is prone to overthinking. Test before adopting broadly"; "adds significant cost for relatively small quality gains"                                                     | code.claude.com model-config + platform effort doc     |
| "Diff formats are only a token optimization"       | Primarily a QUALITY intervention (20%→61%, lazy comments 12→4 of 89); the 89.3% token saving is the secondary effect                                                                                           | aider.chat/docs/unified-diffs.html + local measurement |

## Gaps (highest-value follow-ups) [#gaps-highest-value-follow-ups]

1. **No quality-vs-effort curve for Fable 5 / Opus 4.8** — the 76%/48% figures are Opus 4.5 + SWE-bench only; the usage object doesn't report what effort "did", so the knob can't be audited post-hoc.
2. **Thinking-fraction-by-effort is unmeasured anywhere.** Local n=1: 54.8% of output at max effort. The #1 validation protocol closes this for free.
3. **Persona durability vs instruction decay** over 100+ turns/compaction: zero published measurements; currently mechanism-reasoning only (system prompt is re-sent and cache-stable; turn-1 instructions scroll into compacted history).
4. **Caveat-drop rates in terse OUTPUT registers**: negation loss is measured only for input compression; the output-side analog is plausible (T4) and matters for review/safety-adjacent summaries.
5. **Edit-vs-Write crossover under real scattered-change distributions** incl. failed-match retries (local single-hunk estimate: \~9.4 hunks); aider's quality numbers are GPT-4-Turbo-era.
6. **Structured outputs' hidden system-prompt size** ("slightly higher") — measurable by diffing count\_tokens against a live request with `output_config`; not run here because count\_tokens in the local harness does not accept `output_config`.
7. **Claude Code REPL still lacks schema-constrained final answers** (issue #9058 open); only SDK/subagent tool-schema paths get the guarantee.

## Verification ledger [#verification-ledger]

| Number / claim                                                                                                                                                                                                                                | Value                                                                                | Source + access date OR local method                                                                                                                           |
| --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Medium effort = Sonnet 4.5's best SWE-bench, fewer output tokens                                                                                                                                                                              | 76% fewer                                                                            | anthropic.com/news/claude-opus-4-5                                                                                                                             |
| Highest effort vs Sonnet 4.5                                                                                                                                                                                                                  | +4.3 pts, 48% fewer tokens                                                           | same                                                                                                                                                           |
| Long-horizon / Copilot effort savings                                                                                                                                                                                                         | "up to 65%" / "half"                                                                 | same                                                                                                                                                           |
| Effort = "behavioral signal"; levels; low-effort terse behaviors; Sonnet 4.6 medium-recommended; 64k max\_tokens guidance; Fable 5 "hard limit… thinking plus response text"                                                                  | quotes                                                                               | platform.claude.com/docs/en/build-with-claude/effort                                                                                                           |
| Claude Code effort defaults (high on Fable 5/Opus 4.8/4.6/Sonnet 4.6; xhigh on Opus 4.7); max session-only + "prone to overthinking"; ultracode/ultrathink semantics; per-model calibration; frontmatter `effort`                             | quotes                                                                               | code.claude.com/docs/en/model-config                                                                                                                           |
| "Thinking cannot be turned off on Fable 5… MAX\_THINKING\_TOKENS=0 have no effect there"                                                                                                                                                      | quote                                                                                | same                                                                                                                                                           |
| Edit vs Write payloads (200-line file, 5-line change)                                                                                                                                                                                         | 347 vs 3,197 tok = 89.3%                                                             | local: count\_tokens API, claude-fable-5, script /tmp/measure\_output.py                                                                                       |
| Raw-string variant                                                                                                                                                                                                                            | 256 vs 2,851 = 91.2%                                                                 | same local method                                                                                                                                              |
| Verbose vs terse post-edit summary                                                                                                                                                                                                            | 392 vs 40 tok = 352 saved (91.4%)                                                    | same local method                                                                                                                                              |
| "Here is the JSON…" wrapper overhead                                                                                                                                                                                                          | 29 tok/call                                                                          | same local method                                                                                                                                              |
| Write/Edit crossover                                                                                                                                                                                                                          | ≈9.4 hunks                                                                           | ESTIMATE: 3,190 ÷ 340 net tok, local figures                                                                                                                   |
| Aider unified-diff benchmark                                                                                                                                                                                                                  | 20%→61%; lazy 12→4 of 89 ("3X"); 9X errors w/o flexible patching; gpt-4-1106-preview | aider.chat/docs/unified-diffs.html                                                                                                                             |
| CoD GSM8K                                                                                                                                                                                                                                     | GPT-4o 95.4\@205.1 vs 91.1\@43.9; Claude 95.8\@190.0 vs 91.4\@39.8                   | arxiv.org/html/2502.18600v2 Table 1                                                                                                                            |
| CoD sports understanding ("7.6%")                                                                                                                                                                                                             | 93.2\@189.4 vs 97.3\@14.3                                                            | same, Table 3                                                                                                                                                  |
| CoD zero-shot collapse / \<3B limit                                                                                                                                                                                                           | +3.6% over direct                                                                    | same                                                                                                                                                           |
| CCoT                                                                                                                                                                                                                                          | −48.70% length; −27.69% GPT-3.5 math; −22.67% per-token cost                         | arxiv.org/abs/2401.05618 abstract                                                                                                                              |
| Token-complexity thresholds; universal curve; success prediction                                                                                                                                                                              | "above 90% for most models and benchmarks" (sweep's "95%" not confirmed — corrected) | arxiv.org/html/2503.01141v1                                                                                                                                    |
| BeConcise vs theoretical compression                                                                                                                                                                                                          | 1.41x/4.83x (MMLU-Pro Math); 1.47x/11.16x (GSM8K); 1.26x/3.69x (MATH-500)            | same                                                                                                                                                           |
| 2–6x safe / >10x hallucination increase                                                                                                                                                                                                       | quote                                                                                | arxiv.org/html/2505.00019v1                                                                                                                                    |
| Negation/disclaimer loss under compression                                                                                                                                                                                                    | quote                                                                                | arxiv.org/html/2510.22963v2 (CompressionAttack)                                                                                                                |
| Budget-dependent optimal model/prompt                                                                                                                                                                                                         | 30-LLM study                                                                         | aclanthology.org/2025.emnlp-main.389                                                                                                                           |
| Structured outputs: GA models incl. Fable 5; "No retries needed"; hidden prompt "slightly higher"; 24h grammar cache; cache invalidation; schema limits (no recursion/numerics, 20 strict tools, 24 optional); refusal/max\_tokens edge cases | quotes                                                                               | platform.claude.com/docs/en/build-with-claude/structured-outputs                                                                                               |
| Incomplete tool\_use → "retry the request with a higher max\_tokens"; stop\_sequences mechanics; \~21k streaming note                                                                                                                         | quotes                                                                               | platform.claude.com/docs/en/api/handling-stop-reasons                                                                                                          |
| Prefill `{` mechanics + thinking incompatibility                                                                                                                                                                                              | quotes                                                                               | platform prefill doc via search snippets (direct fetch redirected — page possibly mid-reorganization); 400s on Opus 4.6: blog.laozhang.ai, sweep-verified (T3) |
| Claude Code REPL lacks schema-constrained output                                                                                                                                                                                              | open issue                                                                           | github.com/anthropics/claude-code/issues/9058                                                                                                                  |
| Output styles deprecate/un-deprecate/v2.1.91 removal                                                                                                                                                                                          | timeline                                                                             | github.com/anthropics/claude-code/issues/10671                                                                                                                 |
| Caveman/wenyan register cuts                                                                                                                                                                                                                  | 58.5% / 56.6% / 74.5% token; 80.9% char                                              | local Phase-0 measurement                                                                                                                                      |
| Thinking share of output at max effort                                                                                                                                                                                                        | 54.8% (n=1)                                                                          | local Phase-0: usage.output\_tokens − count\_tokens(visible blocks)                                                                                            |
| Session dollar split (reads 32/writes 29/thinking 20/visible 17/uncached 2)                                                                                                                                                                   | modeled day \~$22                                                                    | local Phase-0 heavy-session profile — see 01-economics-and-measurement.md                                                                                      |
| Root <RepoFile path="AGENTS.md">AGENTS.md</RepoFile> always-on size                                                                                                                                                                           | 2,738 tok                                                                            | local Phase-0 measurement                                                                                                                                      |
| Claude Code system prompt "\<4 lines"/"minimize output tokens"                                                                                                                                                                                | community-documented                                                                 | NOT locally verifiable (v2.1.175 compiled Bun bundle; strings not greppable); T3                                                                               |
| Pricing (Fable 5 $10/$50; cache 0.1x/1.25x/2x; batch 50%)                                                                                                                                                                                     | reference                                                                            | task brief, consistent with live docs                                                                                                                          |
| Dollar arithmetic ($6.19/day effort ceiling; $1.43/day Edit; $0.88/day terse; $2.19/day persona ceiling)                                                                                                                                      | ESTIMATES                                                                            | arithmetic on modeled profile shown inline                                                                                                                     |