jackin'
ResearchToken Optimization Research

15 — Output discipline and structured generation

15 — Output discipline and structured generation

TL;DR

  • Output tokens are the 5x-priced class ($50/MTok out vs $10 in on Fable 5) and ~37% of a heavy day's dollars (thinking 20% + visible output 17%, local measurement). Locally, thinking is 54.8% of output tokens at max effort (n=1) — and on Fable 5 only the effort parameter can touch that share; thinking cannot be disabled.
  • Biggest sanctioned number in this space: Opus 4.5 at medium effort matched Sonnet 4.5's best SWE-bench Verified score with 76% fewer output tokens (Anthropic announcement). No equivalent curve exists for Fable 5 / Opus 4.8 — reproduce before standardizing.
  • Biggest measured-here number: Edit-style diffs beat whole-file rewrites by 89.3% on a realistic 5-line change to a 200-line repo file (347 vs 3,197 tok, count_tokens, claude-fable-5) — and aider's benchmark says diff formats simultaneously improve quality (20%→61%). Rare NEGATIVE-COST.
  • Suppressing post-edit code restatement saves a measured 352 tok/edit (91.4%); a "Here is the JSON…" wrapper costs a measured 29 tok/call.
  • Terseness has a floor: "be concise" captures only ~1.4x of a 4.8–11.2x theoretically-safe compression, degradation past per-problem token thresholds is a cliff not a slope, and >10x compression measurably increases hallucination. Route brevity at restatement and ceremony, never at reasoning. Techniques that trade quality are marked QUALITY-TRADE.

Why this layer matters

On the modeled heavy day (~$22, 6 sessions, ~27k output tok/session — see 01-economics-and-measurement.md), output is ~$8.14/day: ~$4.40 thinking + ~$3.74 visible (prose + tool arguments). Three structural facts shape everything below:

  1. Thinking is the majority of output (54.8% locally) and is prompt-immune on Fable 5. "Thinking cannot be turned off on Fable 5. The session toggle, alwaysThinkingEnabled, and MAX_THINKING_TOKENS=0 have no effect there" (code.claude.com/docs/en/model-config). Effort is the only knob.
  2. Tool arguments are output. A Write call's content field is model-generated text billed at $50/MTok. Edit-vs-Write is therefore an output-discipline question, not a tooling preference.
  3. Visible prose is the smallest slice (~17% of dollars) but the only slice prompts can address directly — which is why prompt-level brevity folklore systematically overpromises (it can't reach the other 83%).

Local measurements

Method: free count_tokens API, model claude-fable-5, each payload sent as a single user message; 1-char baseline message = 7 tok envelope, subtracted in % columns. Measurement script reproduced in 31-validation-harness.md style: simulate the exact JSON a tool call would carry. File under test: crates/jackin-config/src/app_config_workspaces.rs (200 lines); change under test: a realistic 5-line method addition with a 5-line unique anchor.

MeasurementTokensComparatorTokensSaved
Edit tool JSON payload (old_string+new_string)347Write tool JSON payload (full new file)3,19789.3%
Same, raw strings without JSON escaping256raw full file content2,85191.2%
Terse post-edit summary (Added has_workspace at <file>:20.)40verbose summary restating an 11-line code block + narration + closer392352 tok (91.4%)
Bare JSON object (extraction answer)35same object in "Here is the JSON…" + fences + closer6429 tok/call
Write-vs-Edit crossover (ESTIMATE from above)3,190 net Write ÷ 340 net per-hunk Edit≈ 9.4 hunks

These independently reproduce the Phase-0 sweep numbers (89.5% / 302 tok / 15–25 tok est.) within noise of payload phrasing. At $50/MTok: an avoided whole-file rewrite of this file saves ~$0.143; an avoided verbose summary ~$0.018; an avoided JSON wrapper ~$0.0015.


Techniques

1. Effort parameter (/effort, output_config.effort)

One-line pitch: Anthropic's only sanctioned whole-response throttle — one knob that shrinks thinking, prose, and tool-call count together, with the strongest published number in this entire dossier.

  • Layer: output (all sub-classes: thinking, text, tool arguments).
  • Mechanism: "Effort is a behavioral signal, not a strict token budget" (platform effort doc). Levels low|medium|high|xhigh|max; high "produces exactly the same behavior as omitting the parameter". On adaptive-thinking models (Fable 5, Opus 4.7+) effort is the recommended/only thinking-depth control; budget_tokens is deprecated (4.6) or rejected (4.7+). Lower effort documented to "combine multiple operations into fewer tool calls… proceed directly to action without preamble… use terse confirmation messages". Claude Code surfaces: /effort slider, --effort, effortLevel setting, CLAUDE_CODE_EFFORT_LEVEL env var, and per-skill/per-subagent effort: frontmatter. Defaults in Claude Code: high on Fable 5/Opus 4.8/Opus 4.6/Sonnet 4.6, xhigh on Opus 4.7 (raw API default is high everywhere — different surfaces, both). max is session-only and "prone to overthinking. Test before adopting broadly". ultracode = xhigh + workflow orchestration, not an API level; ultrathink = one-turn in-context instruction, API effort unchanged.
  • Expected savings: T1 anchor: "Set to a medium effort level, Opus 4.5 matches Sonnet 4.5's best score on SWE-bench Verified, but uses 76% fewer output tokens"; at highest effort "+4.3 percentage points… 48% fewer tokens"; long-horizon "up to 65% fewer"; GitHub Copilot "cutting token usage in half" (Opus 4.5 announcement). ESTIMATE on the modeled profile if 76% transferred to Fable 5 high→medium: 0.76 × $8.14 ≈ $6.19/day, plus unmodeled second-order savings from fewer tool calls (smaller cache-write growth). That transfer is unproven — the figures are Opus 4.5 + SWE-bench only; the effort scale "is calibrated per model, so the same level name does not represent the same underlying value across models".
  • Evidence tier: T1 (anthropic.com/news/claude-opus-4-5; platform.claude.com/docs/en/build-with-claude/effort; code.claude.com/docs/en/model-config — all).
  • Quality risk: stepping down from max: NEGATIVE-COST (docs: max "may show diminishing returns and is prone to overthinking"). high→medium/low: QUALITY-TRADE — Opus 4.7+ "respects effort levels more strictly… scopes its work to what was asked rather than going above and beyond"; degradation manifests as shallower investigation, skipped verification, fewer edge cases handled. Notably Anthropic itself recommends medium as the default for Sonnet 4.6. Falsification: your eval suite passes at medium with ≥ high's pass rate.
  • Availability: CLAUDE-CODE-TODAY.
  • Effort to adopt: minutes (/effort medium; effort: low in subagent frontmatter).
  • Composability: composes with everything below; it is the only lever on the 54.8% thinking share. Per-subagent effort: low under a high-effort main loop is a shipped, documented pattern almost no cost guide mentions. Anti-synergy: partially redundant with terse-by-instruction (#3) — low effort already produces terse confirmations.
  • Validation protocol: fix a 10-task repo-representative suite (bugfix, refactor, test-add). Run 3x at high and at medium with identical prompts. Record usage.output_tokens, wall time, pass rate (tests green), and thinking share (output_tokens minus count_tokens of visible blocks — same method as Phase-0). Adopt medium where pass-rate delta ≤ 0; this experiment also closes the open thinking-fraction-by-effort gap (#2 in Gaps).

2. Diff-style edits (Edit) instead of whole-file rewrites (Write)

One-line pitch: a 5-line change costs 10x less via Edit than Write — measured here at 89.3% — and diff-style editing independently improves code quality, making this the cleanest NEGATIVE-COST item in the dossier.

  • Layer: output (tool arguments).
  • Mechanism: Edit emits only old_string+new_string (change + minimal unique anchor); Write re-emits the entire file as generated output billed at $50/MTok. Aider's complementary quality finding: diff-style formats make the model treat edits "as data" rather than chat, suppressing lazy placeholder comments; their flexible patch-application layer is load-bearing ("Experiments where flexible patching is disabled show a 9X increase in editing errors").
  • Expected savings: (table above): 347 vs 3,197 tok JSON payloads = 89.3% (10.4x) on a 200-line file; $0.160 vs $0.017 per edit. Savings scale with file-size:change-size ratio. ESTIMATE on modeled profile: if 10 would-be whole-file rewrites/day become Edits, ~28.5k output tok ≈ $1.43/day. Crossover ESTIMATE: Write becomes cheaper only past ~9.4 scattered hunks of this size on this file (3,190 ÷ 340); below that, multiple Edit calls still win.
  • Evidence tier: T1 — local reproduction (method above) + aider benchmark: GPT-4 Turbo (gpt-4-1106-preview) 20% → 61% on an 89-task refactoring benchmark, lazy comments on 12 → 4 of 89 tasks ("reduced laziness by 3X") (aider.chat/docs/unified-diffs.html). Caveat: aider numbers are Dec-2023/GPT-4-Turbo era — stale for Claude 4.6+/5, though Anthropic keeping string-replace as Claude Code's default editing primitive is implicit confirmation.
  • Quality risk: NEGATIVE-COST. Residual: failed old_string matches force retry round-trips; degradation manifests as repeated "String to replace not found" tool errors. Falsification: count failed-match retries per 100 edits before/after; if retry waste exceeds rewrite savings on your codebase, the anchor-discipline (not the technique) is broken.
  • Availability: CLAUDE-CODE-TODAY (default behavior).
  • Effort to adopt: zero for defaults; minutes for a one-line CLAUDE.md guard: "prefer Edit over Write for existing files; never rewrite a file to change a few lines."
  • Composability: disjoint from #3 (tool args vs prose) — they stack. Orthogonal to effort. Anti-synergy: none known.
  • Validation protocol: replay 20 historical edits from transcripts both ways (reconstruct the Write payload from the post-state); count both payloads via count_tokens; cargo check the results for equivalence; track failed-match retry rate. Accept if retries × full-payload cost < 10% of measured savings.

3. Suppress restated code and post-action narration (terse-by-instruction)

One-line pitch: every "Here's what the updated section now looks like:" + re-quoted block costs a measured 352 tokens the user can already see in the tool-call diff.

  • Layer: output (visible prose).
  • Mechanism: CLAUDE.md/system-prompt instruction: reference file:line instead of re-quoting; no preamble/postamble; no offer-to-do-more closers. The transcript already renders the diff, so restatement is pure duplication. Strongest precedent: Claude Code's own shipped system prompt — community-documented versions instruct answering in "fewer than 4 lines… unless user asks for detail" and to "minimize output tokens" (NOT locally verifiable: the v2.1.175 binary is a compiled Bun bundle, strings not greppable; T3 for that sub-claim).
  • Expected savings:: verbose post-edit summary 392 tok vs terse 40 tok = 352 tok/edit (91.4%). ESTIMATE: 50 edits/heavy-day × 352 tok = 17.6k tok ≈ $0.88/day at $50/MTok; per modeled session, 10 verbose summaries ≈ 3.5k tok ≈ 29% of the 12.2k visible-output budget. Measure your actual restatement frequency before claiming the multiple.
  • Evidence tier: T1 for the per-instance delta (local measurement); T3 for the shipped-prompt precedent; the documented low-effort behaviors ("proceed directly to action without preamble… terse confirmation messages", effort doc) confirm the same behaviors are sanctioned.
  • Quality risk: NEUTRAL→NEGATIVE-COST for code restatement (information is duplicated); RISKY if extended to caveats/warnings — see #10. Degradation manifests as the operator re-asking "what changed?" or missing a warning that was squeezed out. Falsification: A/B the rule and count re-ask turns.
  • Availability: CLAUDE-CODE-TODAY.
  • Effort to adopt: minutes (1–2 CLAUDE.md lines). The rule itself is always-on input (root AGENTS.md is already 2,738 tok, local measurement) but cached at 0.1x — a ~20-tok rule pays for itself after a single avoided restatement.
  • Composability: stacks cleanly with #2 (disjoint token populations). Partially redundant with effort=low (#1) which already produces terse confirmations — adopting both, attribute savings carefully. Unknown: instruction persistence over 100+ turn sessions (Gaps #3).
  • Validation protocol: two matched days of real sessions with/without the rule; per session count visible-output tokens per edit action (transcript parse) and operator re-ask rate. Accept if tokens/edit drops ≥30% with zero re-ask increase.

4. Chain of Draft / Concise CoT (visible-reasoning compression) — QUALITY-TRADE

One-line pitch: "keep each thinking step to 5 words" cuts response tokens ~79–92% on benchmark tasks — but the famous "7.6% of tokens" is the single best-case cell, and on GSM8K it costs 4.3–4.4 accuracy points.

  • Layer: output (visible prose on non-thinking calls).
  • Mechanism: prompt-level cap on per-step (CoD) or whole-response (CCoT "Be concise") length, applied to visible chain-of-thought. Critical Claude Code caveat: on Fable 5/Opus 4.7+ reasoning happens in billed thinking blocks that prompts do not cap (effort does); CoD-style instructions only bite on non-thinking calls — Haiku 4.5 sidecars, thinking-disabled Sonnet subagents — or on visible answer structure.
  • Expected savings: CoD (arXiv 2502.18600, Tables 1/3): GSM8K GPT-4o CoT 95.4% @ 205.1 tok vs CoD 91.1% @ 43.9 tok (−78.6% tokens, −4.3 pts); Claude 3.5 Sonnet 95.8% @ 190.0 vs 91.4% @ 39.8 (−79.1%, −4.4 pts); sports understanding Claude CoT 93.2% @ 189.4 vs CoD 97.3% @ 14.3 — the "7.6% of tokens" headline AND +4.1 pts. CCoT (arXiv 2401.05618): −48.70% response length both GPT-3.5/4, −22.67% average per-token cost, but −27.69% accuracy on math for GPT-3.5. On the modeled profile: main-loop impact ≈ 0 (thinking-immune); for a Haiku subagent fleet, ~79% off those calls' output.
  • Evidence tier: T2 (peer-reviewed-track arXiv, exact tables). No published CoD results on Claude 4.x/5 thinking models — the technique predates adaptive thinking.
  • Quality risk: QUALITY-TRADE on hard math/multi-step reasoning (cliff per #10); NEGATIVE-COST on classification/retrieval-shaped tasks (sports accuracy went UP). Known failure modes: zero-shot CoD on Claude 3.5 Sonnet "improved performance over direct answering by only 3.6%" (few-shot examples are load-bearing) and breaks down on <3B models. Falsification: run your subagent task class with/without; reject if pass rate drops at all on reasoning-shaped tasks.
  • Availability: CLAUDE-CODE-TODAY (for subagent prompts / sidecar API calls); pointless against the main Fable 5 loop.
  • Effort to adopt: minutes-to-hours (one instruction + 2–3 few-shot examples in the subagent definition).
  • Composability: does NOT stack with effort on thinking models — the model thinks privately AND drafts publicly = double-spend. Pair with effort: low Haiku subagents (#1) instead. Stacks with #5 for extraction sidecars.
  • Validation protocol: 50-call eval on the actual subagent workload (e.g. investigator summaries) on Haiku 4.5, CoD vs plain, few-shot included; score with the main model as judge + spot-check; require ≤1 pt quality drop for ≥70% token cut.

5. Structured outputs (output_config.format json_schema + strict tool use)

One-line pitch: grammar-constrained decoding kills both the prose wrapper around JSON and the 100%-loss class of parse-failure retries — GA on all current models including Fable 5, but with a documented cache-invalidation trap.

  • Layer: output (format waste) + cache (pitfall).
  • Mechanism: output_config.format with a JSON schema constrains decoding so the response IS the object — no fences, no "Here is the JSON you requested" (a measured 29 tok/call, table above), no trailing commentary. tools[].strict: true guarantees tool arguments match input_schema. "Reliable: No retries needed for schema violations." Grammar compiles on first use (latency), "cached for 24 hours from last use". Injects a hidden system prompt: "Your input token count is slightly higher" (size unquantified — Gaps #6).
  • Expected savings: no published percentage. ESTIMATE: (wrapper ≈ 29 tok/call measured) + (parse-failure rate × full response cost); for a 500-tok extraction failing 5% of the time, retry elimination ≈ 25 tok/call expected value — the dominant term, not the prose trim. Main-loop impact ≈ 0 (not exposed there).
  • Evidence tier: T1 (platform.claude.com/docs/en/build-with-claude/structured-outputs, all; wrapper figure local measurement).
  • Quality risk: NEUTRAL on format. Two documented schema-violating edge cases: stop_reason: "refusal" (200 status, billed, schema overridden) and stop_reason: "max_tokens" (truncated JSON — "Retry with a higher max_tokens"). Schema limits: no recursive schemas, no numeric minimum/maximum/multipleOf, max 20 strict tools, 24 optional params across strict schemas. Falsification: log schema-parse failures; should be exactly zero outside those two stop reasons.
  • Availability: SDK (and subagent/SDK tool-schema paths). The Claude Code REPL does NOT expose schema-constrained final answers (open feature request, github.com/anthropics/claude-code/issues/9058) — though this very research task returns via a StructuredOutput tool schema, the subagent path.
  • Effort to adopt: hours (schema design + pipeline wiring).
  • Composability: CACHE PITFALL — "Changing the output_config.format parameter will invalidate any prompt cache for that conversation thread." Against the local profile (92.83% of prompt-side tokens are 0.1x cache reads), rotating schemas mid-thread converts reads back to 1x + 1.25x writes and can swamp the output saving. One stable schema per thread; see 13-caching-exploitation.md. Supersedes prefill (#6); redundant with stop sequences (#8) for JSON.
  • Validation protocol: 200-call extraction batch, constrained vs unconstrained: measure parse-failure rate, total output tokens, input-token delta (quantifies the hidden prompt), and cache-hit rate across the thread. Accept if failures → 0 with input-token overhead < wrapper savings and zero unintended cache invalidations.

6. Assistant prefill to skip preamble — a dying technique

One-line pitch: the classic "prefill {" trick still works on non-thinking models but is documented incompatible with extended thinking — and Fable 5's thinking can't be disabled, so prefill is unavailable exactly where output is most expensive.

  • Layer: output (request shape).
  • Mechanism: send a partial assistant turn; the model continues from it, skipping preamble (docs: prefilling { "forces Claude to skip the preamble and directly output the JSON object"). Constraint: "Extended thinking is incompatible with prefilling… the API will return an error"; community migration guides document 400s on Opus 4.6+ adaptive-thinking requests with prefills (T3 sub-claim).
  • Expected savings: the measured wrapper, 29 tok/call (local), at $50/MTok ≈ $0.0015/call — material only at pipeline volume.
  • Evidence tier: T1 for mechanics/incompatibility (platform prefill doc — note: retrieved via search snippets; direct fetch redirected to the prompt-engineering overview, page may be mid-reorganization); T3 for the Opus 4.6 400-error reports (blog.laozhang.ai).
  • Quality risk: NEUTRAL where it works; trailing-whitespace rules apply. Falsification: trivial — the API errors loudly when incompatible.
  • Availability: SDK only, and only on Haiku 4.5 / thinking-disabled Sonnet. Impossible in Claude Code interactive and on Fable 5.
  • Effort to adopt: minutes in raw API pipelines.
  • Composability: mutually exclusive with thinking and with #1's thinking control; superseded by #5 (docs explicitly recommend structured outputs for guaranteed schemas). Treat as legacy for cheap-model sidecars only.
  • Validation protocol: for a Haiku extraction sidecar, A/B prefill-{ vs structured outputs on 100 calls: compare output tokens, input overhead (structured outputs' hidden prompt), latency (grammar compile), and failure rate; keep whichever wins — and re-test after any model bump, since this is the technique most likely to silently die.

7. max_tokens as a hard output budget — safety rail, not optimizer

One-line pitch: a tight cap in agent loops backfires: a truncated tool_use must be retried at a HIGHER limit, paying for the dead attempt in full.

  • Layer: output / infra (circuit breaker).
  • Mechanism: generation halts at the limit with stop_reason: "max_tokens"; the model does not plan toward the budget, it gets amputated mid-structure. Docs' own remediation: "If Claude's response is cut off due to hitting the max_tokens limit, and the truncated response contains an incomplete tool use block, you'll need to retry the request with a higher max_tokens value" . Anthropic's guidance for high-effort agents is the opposite of tight caps: "set a large max_tokens… Starting at 64k tokens and tuning from there." On Fable 5, max_tokens "is a hard limit on total output, thinking plus response text" — one pool.
  • Expected savings: zero as an optimizer; positive only as a runaway-cost breaker (a 64k cap bounds worst-case single response at 64k × $50/MTok = $3.20).
  • Evidence tier: T1 (platform.claude.com/docs/en/api/handling-stop-reasons + effort doc, both).
  • Quality risk: RISKY when set tight: mid-structure truncation (schema-invalid JSON under #5 — documented edge case), lost work, retry double-billing. Degradation manifests as max_tokens stop reasons in logs. Falsification: any nonzero rate of max_tokens stops on tool-bearing turns means the cap is destroying value.
  • Availability: SDK (Claude Code manages it internally; not user-tunable per request in the REPL).
  • Effort to adopt: minutes.
  • Composability: use WITH effort (#1), never instead: effort shapes spend before generation, max_tokens amputates after. Python SDK requires streaming above ~21k.
  • Validation protocol: in pipelines, log stop_reason distribution for a week; set max_tokens at p99.9 of observed output + margin; alert on any max_tokens stop. Total cost must be ≤ uncapped baseline (it will be, since the cap should never fire).

8. Stop sequences for single-shot generation

One-line pitch: sampler-enforced cutoff — "output the answer then END" + stop_sequences:["END"] means you never pay for trailing rationale or "let me know if…" closers in pipeline calls.

  • Layer: output.
  • Mechanism: custom strings halt decoding instantly (stop_reason: "stop_sequence", matched value reported in stop_sequence). Unlike instructions, compliance is enforced by the sampler, not the model.
  • Expected savings: no published numbers. ESTIMATE: bounded by typical trailing content, ~20–150 tok/call on extraction tasks; ~0 in Claude Code agent loops (turns already end on tool_use/end_turn, and thinking happens before text).
  • Evidence tier: T1 for the mechanism (handling-stop-reasons doc); T4 for the savings magnitude on current models — Claude 4.6+/5 already end turns crisply; the advice was minted on 3.x-era chattiness.
  • Quality risk: NEUTRAL with well-designed sentinels; RISKY if the sentinel can occur in legitimate output (premature cutoff). Falsification: scan outputs for the sentinel-as-content collision rate before deploying.
  • Availability: SDK only (not exposed in Claude Code interactive).
  • Effort to adopt: minutes.
  • Composability: redundant with #5 (schema end = natural stop); useful where a schema is overkill (single values, free-text snippets). Stacks with #4/#6 on cheap-model sidecars.
  • Validation protocol: 100-call A/B on the sidecar task; measure mean output tokens and answer-truncation rate (must be 0); adopt only if trailing-content baseline actually exists on the current model.

9. Terse-by-persona (output styles / caveman register) vs terse-by-instruction — QUALITY-TRADE at aggressive registers

One-line pitch: persona-level brevity delivers a measured ~58% cut of visible prose and is structurally cache-stable — but its claimed durability edge over a plain instruction has never been measured, and research says style choice doesn't move the accuracy-length frontier anyway.

  • Layer: output (visible prose), via system prompt (input-side rider, cached at 0.1x).
  • Mechanism: Claude Code output styles replace part of the default system prompt (re-sent and cache-stable every turn); CLAUDE.md content rides as a user message after the default system prompt. The caveman plugin implements register-based persona compression as a skill. History: output styles deprecated v2.0.30 , un-deprecated v2.0.32 after backlash; /output-style command removed v2.1.91 but the outputStyle setting remains via /config (github.com/anthropics/claude-code/issues/10671).
  • Expected savings: local Phase-0 ground truth : caveman ultra = 58.5% token cut on visible prose (the plugin's "~75%" claim is character-level folklore); wenyan-full = 80.9% char cut but only 56.6% token cut (CJK ≈ 1.47 chars/tok vs 3.35 English); wenyan-ultra = 74.5% token cut. Applies ONLY to visible prose ≈ 17% of session dollars: upper bound 0.585 × $3.74 ≈ $2.19/day on the modeled profile if every visible token were prose (it isn't — tool args share that class, so real ceiling is lower).
  • Evidence tier: T1 for the local cut measurements; T3 for plugin claims; T2 for the frontier argument — format/language variants "lie on a universal trade-off curve between response length and accuracy" incl. "in Chinese" and no-spaces variants (arXiv 2503.01141). The persona is a delivery vehicle for a length target, not a free lunch.
  • Quality risk: QUALITY-TRADE at aggressive registers (#10 governs), plus a human-comprehension cost. Degradation manifests as dropped qualifiers and operator misreadings. Falsification: same task suite styled/unstyled, diff for lost caveats (grep negations/"not"/"however"), and have a second reader rate ambiguity.
  • Availability: CLAUDE-CODE-TODAY (plugin install or outputStyle setting).
  • Effort to adopt: minutes.
  • Composability: persona text lives in the cached prefix (cheap at 0.1x); composes with #2/#3. Overlaps #1's low-effort terseness — stacking all three risks triple-pressuring length on hard tasks (see #10). Subagent variant (cavecrew, ~60% smaller results fed back into context) saves input downstream — quality side unmeasured (T4).
  • Validation protocol: 10 mixed tasks × {default, caveman-full, caveman-ultra}: measure visible-output tokens, task pass rate, caveat-retention (count negations/disclaimers in both outputs), operator comprehension quiz. Adopt the strongest register with zero caveat loss; the durability question (does persona hold over 100+ turns better than a turn-1 instruction?) is currently T4 — log tokens/turn over a long session to convert it.

10. The terseness floor: token-complexity thresholds, hallucination onset, negation loss

One-line pitch: the anti-claim that bounds every entry above — each problem has a sharp minimum-token threshold; "be concise" reaches only ~30% of safe compression; past ~10x, hallucination rises and the load-bearing small words (negations, disclaimers) drop first.

  • Layer: output (design constraint over all of the above).
  • Mechanism: token complexity (arXiv 2503.01141): per problem there exists a threshold τ "such that for any prompt, the LLM gets the answer correct iff the token length is above τ" — cliff, not slope; success is predictable from token count alone with accuracy "above 90% for most models and benchmarks". Safe compression therefore requires ADAPTIVE allocation (more tokens to harder problems) — which is exactly what adaptive thinking + effort implements at platform level and what blanket "always be brief" cannot do. Information-loss side: moderate (2–6x) prompt compression "typically preserves or even enhances LLM performance… while extremely aggressive ratios (>10x) incur pronounced information loss and sometimes increase hallucination rates" (arXiv 2505.00019); compression modules "may discard key information such as negation cues (e.g., 'not'), factual disclaimers, time and location markers" (CompressionAttack, arXiv 2510.22963). CCoT's −27.69% GPT-3.5 math penalty and CoD's zero-shot collapse are output-side instances of the same threshold effect.
  • Expected savings: none — this bounds the others. Quantified headroom: BeConcise achieves 1.41x where 4.83x is theoretically safe (MMLU-Pro Math), 1.47x vs 11.16x (GSM8K), 1.26x vs 3.69x (MATH-500) — naive instructions capture under a third of safe compression, and the remainder requires per-problem adaptivity, not harder squeezing.
  • Evidence tier: T2 (arXiv 2503.01141 verified directly; 2505.00019 / 2510.22963 / EMNLP-2025 aclanthology.org/2025.emnlp-main.389). The output-side analog of negation loss (do terse output registers drop caveats?) is unmeasured — T4 extrapolation.
  • Quality risk: this entry IS the risk model: assume NEGATIVE-COST terseness up to ~2x on prose, QUALITY-TRADE beyond, cliff behavior past per-problem thresholds; audit compressed outputs for dropped negations/caveats specifically.
  • Availability: NOT-USER-ACCESSIBLE (design principle, not a switch).
  • Effort to adopt: n/a — practical rule: aim brevity pressure at easy/format-shaped content (summaries, confirmations, restatements — #2, #3, #5) and never at reasoning on hard problems (#1 medium-vs-high is the safe adaptive form; #4/#9 aggressive registers are the dangerous fixed form).
  • Composability: explains why effort (adaptive) dominates fixed budgets (#7) and blanket personas (#9) at equal average savings.
  • Validation protocol: for any adopted brevity stack, maintain a canary set of 5 hard-reasoning tasks; run weekly; any pass-rate drop triggers register rollback. Separately, diff terse-vs-default outputs for negation/disclaimer counts (closes the T4 gap).

Claims to kill

Folklore claimWhy it diesEvidence
"CoD matches CoT at 7.6% of the tokens"7.6% is the single best-case cell (sports understanding, Claude 3.5 Sonnet); on GSM8K CoD loses 4.3–4.4 pts at ~21% of tokens; zero-shot gain collapses to +3.6%arXiv 2502.18600 Tables 1/3
"Set max_tokens low to cap agent spend"Docs instruct retrying truncated tool_use WITH HIGHER max_tokens — tight caps bill the dead attempt plus the retry; under structured outputs truncation yields invalid JSONhandling-stop-reasons + structured-outputs docs
"Concise CoT is free"−27.69% accuracy on GPT-3.5 math despite "negligible" average impact; degradation below per-problem thresholds is a cliffarXiv 2401.05618 + 2503.01141
"Prefill { is current best practice for JSON"Incompatible with extended thinking ("the API will return an error"); thinking is non-disableable on Fable 5; structured outputs is the documented replacementprefill doc (via snippets) + model-config doc
"Caveman/persona compression cuts ~75%"Local measurement: 58.5% token cut at ultra register — 75% is character-level folklore; wenyan's 80.9% char cut collapses to 56.6% token cut (CJK ~1.47 chars/tok); no encoding shortcut around length itselflocal measurement + arXiv 2503.01141
"'Be concise' captures most available compression"Measured at 1.41x of 4.83x-safe (MMLU-Pro Math), 1.47x of 11.16x (GSM8K) — under a third; the rest needs per-problem adaptivityarXiv 2503.01141
"Run everything at max effort""max… may show diminishing returns and is prone to overthinking. Test before adopting broadly"; "adds significant cost for relatively small quality gains"code.claude.com model-config + platform effort doc
"Diff formats are only a token optimization"Primarily a QUALITY intervention (20%→61%, lazy comments 12→4 of 89); the 89.3% token saving is the secondary effectaider.chat/docs/unified-diffs.html + local measurement

Gaps (highest-value follow-ups)

  1. No quality-vs-effort curve for Fable 5 / Opus 4.8 — the 76%/48% figures are Opus 4.5 + SWE-bench only; the usage object doesn't report what effort "did", so the knob can't be audited post-hoc.
  2. Thinking-fraction-by-effort is unmeasured anywhere. Local n=1: 54.8% of output at max effort. The #1 validation protocol closes this for free.
  3. Persona durability vs instruction decay over 100+ turns/compaction: zero published measurements; currently mechanism-reasoning only (system prompt is re-sent and cache-stable; turn-1 instructions scroll into compacted history).
  4. Caveat-drop rates in terse OUTPUT registers: negation loss is measured only for input compression; the output-side analog is plausible (T4) and matters for review/safety-adjacent summaries.
  5. Edit-vs-Write crossover under real scattered-change distributions incl. failed-match retries (local single-hunk estimate: ~9.4 hunks); aider's quality numbers are GPT-4-Turbo-era.
  6. Structured outputs' hidden system-prompt size ("slightly higher") — measurable by diffing count_tokens against a live request with output_config; not run here because count_tokens in the local harness does not accept output_config.
  7. Claude Code REPL still lacks schema-constrained final answers (issue #9058 open); only SDK/subagent tool-schema paths get the guarantee.

Verification ledger

Number / claimValueSource + access date OR local method
Medium effort = Sonnet 4.5's best SWE-bench, fewer output tokens76% feweranthropic.com/news/claude-opus-4-5
Highest effort vs Sonnet 4.5+4.3 pts, 48% fewer tokenssame
Long-horizon / Copilot effort savings"up to 65%" / "half"same
Effort = "behavioral signal"; levels; low-effort terse behaviors; Sonnet 4.6 medium-recommended; 64k max_tokens guidance; Fable 5 "hard limit… thinking plus response text"quotesplatform.claude.com/docs/en/build-with-claude/effort
Claude Code effort defaults (high on Fable 5/Opus 4.8/4.6/Sonnet 4.6; xhigh on Opus 4.7); max session-only + "prone to overthinking"; ultracode/ultrathink semantics; per-model calibration; frontmatter effortquotescode.claude.com/docs/en/model-config
"Thinking cannot be turned off on Fable 5… MAX_THINKING_TOKENS=0 have no effect there"quotesame
Edit vs Write payloads (200-line file, 5-line change)347 vs 3,197 tok = 89.3%local: count_tokens API, claude-fable-5, script /tmp/measure_output.py
Raw-string variant256 vs 2,851 = 91.2%same local method
Verbose vs terse post-edit summary392 vs 40 tok = 352 saved (91.4%)same local method
"Here is the JSON…" wrapper overhead29 tok/callsame local method
Write/Edit crossover≈9.4 hunksESTIMATE: 3,190 ÷ 340 net tok, local figures
Aider unified-diff benchmark20%→61%; lazy 12→4 of 89 ("3X"); 9X errors w/o flexible patching; gpt-4-1106-previewaider.chat/docs/unified-diffs.html
CoD GSM8KGPT-4o 95.4@205.1 vs 91.1@43.9; Claude 95.8@190.0 vs 91.4@39.8arxiv.org/html/2502.18600v2 Table 1
CoD sports understanding ("7.6%")93.2@189.4 vs 97.3@14.3same, Table 3
CoD zero-shot collapse / <3B limit+3.6% over directsame
CCoT−48.70% length; −27.69% GPT-3.5 math; −22.67% per-token costarxiv.org/abs/2401.05618 abstract
Token-complexity thresholds; universal curve; success prediction"above 90% for most models and benchmarks" (sweep's "95%" not confirmed — corrected)arxiv.org/html/2503.01141v1
BeConcise vs theoretical compression1.41x/4.83x (MMLU-Pro Math); 1.47x/11.16x (GSM8K); 1.26x/3.69x (MATH-500)same
2–6x safe / >10x hallucination increasequotearxiv.org/html/2505.00019v1
Negation/disclaimer loss under compressionquotearxiv.org/html/2510.22963v2 (CompressionAttack)
Budget-dependent optimal model/prompt30-LLM studyaclanthology.org/2025.emnlp-main.389
Structured outputs: GA models incl. Fable 5; "No retries needed"; hidden prompt "slightly higher"; 24h grammar cache; cache invalidation; schema limits (no recursion/numerics, 20 strict tools, 24 optional); refusal/max_tokens edge casesquotesplatform.claude.com/docs/en/build-with-claude/structured-outputs
Incomplete tool_use → "retry the request with a higher max_tokens"; stop_sequences mechanics; ~21k streaming notequotesplatform.claude.com/docs/en/api/handling-stop-reasons
Prefill { mechanics + thinking incompatibilityquotesplatform prefill doc via search snippets (direct fetch redirected — page possibly mid-reorganization); 400s on Opus 4.6: blog.laozhang.ai, sweep-verified (T3)
Claude Code REPL lacks schema-constrained outputopen issuegithub.com/anthropics/claude-code/issues/9058
Output styles deprecate/un-deprecate/v2.1.91 removaltimelinegithub.com/anthropics/claude-code/issues/10671
Caveman/wenyan register cuts58.5% / 56.6% / 74.5% token; 80.9% charlocal Phase-0 measurement
Thinking share of output at max effort54.8% (n=1)local Phase-0: usage.output_tokens − count_tokens(visible blocks)
Session dollar split (reads 32/writes 29/thinking 20/visible 17/uncached 2)modeled day ~$22local Phase-0 heavy-session profile — see 01-economics-and-measurement.md
Root AGENTS.md always-on size2,738 toklocal Phase-0 measurement
Claude Code system prompt "<4 lines"/"minimize output tokens"community-documentedNOT locally verifiable (v2.1.175 compiled Bun bundle; strings not greppable); T3
Pricing (Fable 5 $10/$50; cache 0.1x/1.25x/2x; batch 50%)referencetask brief, consistent with live docs
Dollar arithmetic ($6.19/day effort ceiling; $1.43/day Edit; $0.88/day terse; $2.19/day persona ceiling)ESTIMATESarithmetic on modeled profile shown inline

On this page