12 — Context architecture — what enters the window at all

TL;DR

This is the rare layer where cutting tokens and improving quality are the same act. Anthropic's tool search cuts upfront tokens 85% (~77K → ~8.7K) while RAISING MCP-eval accuracy (Opus 4: 49% → 74%); context editing cuts consumption 84% while improving agentic performance 29% (both T1).
The defaults already moved: MCP tool schemas are deferred by default in Claude Code — only names load until a tool is used. Most 2025-era "your MCP server eats 50K of context" advice is obsolete for Claude Code users. Local ground truth: 11 MCP schemas = 1,420 tok loaded vs ~60 tok deferred (95.8% cut).
Load-less beats load-then-evict, and dumb beats clever: grep signature outlines of two real Rust files measured 85.1% and 92.5% token cuts vs full reads (local, Fable 5 count_tokens); plain observation masking halves cost while matching or beating LLM summarization on SWE-bench Verified (T2, NeurIPS '25 workshop).
The biggest unmanaged variable is the harness itself: Claude Code's fixed prefix grew ~70K tokens in 5 days in April 2026 (median first-cache 38-52K → 112-119K; issue #45188 closed "not planned"). On the modeled heavy day that swing ≈ $13/day ESTIMATE — re-audit every release. Today (v2.1.175) a deferred-MCP subagent session's entire first-prompt cache footprint is 28,669 tok (locally measured twice).
Prefix economics compound through caching: every 10K of always-on prefix ≈ $0.01/turn in cache reads ≈ $1.89/day on the modeled profile (114 reads + 6 writes/day). Cache reads are 32% of heavy-session dollars (local Phase-0; see 01-economics-and-measurement.md).

Why this layer is mostly NEGATIVE-COST: the context-rot evidence base

Tokens in the window are not free even when cached, because model accuracy itself degrades with input length. Three converging results license treating pruning as quality work:

Lost in the Middle (Liu et al., TACL 2023, arXiv 2307.03172): "Performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts." U-shaped position curve. T2. Caveat: GPT-3.5-era subjects — port the direction, never the magnitudes.
Chroma context-rot report (July 2025, trychroma.com/research/context-rot): 18 models (5 Anthropic incl. Claude Opus 4, 7 OpenAI incl. o3/GPT-4.1, 3 Google, 3 Alibaba) all degrade as input grows, "even on simple tasks." "Even a single distractor reduces performance relative to the baseline (needle only), and adding four distractors compounds this degradation further." On LongMemEval, "The Claude models exhibit the most pronounced gap between focused and full prompt performance." T2.
Anthropic's interventional confirmation: removing stale context raised agentic performance 29% (claude.com/blog/context-management). T1 (internal eval).

One inversion to respect (anthropic.com/engineering/effective-context-engineering-for-ai-agents): "minimal does not necessarily mean short; you still need to give the agent sufficient information up front." Deleting real rules to save tokens is QUALITY-TRADE, not optimization.

The three sub-layers:

Layer	Question	Techniques (this file)	Typical lever
Never-load	Does it enter the prefix at all?	MCP deferral (12.1), prompt audit (12.2), skills (12.3), path-scoped rules (12.4)	85-96% of the artifact, forever
Load-less	Can a cheaper representation do the job?	Repo maps/outlines (12.5), LSP navigation (12.6), hook preprocessing (12.7)	85-92% per file/result
Evict	When does it leave?	Observation masking (12.8), /clear-/compact discipline (12.9), subagent isolation (12.10), external memory (12.11), learned pruning (12.12)	84-88% of stale history

Local measurements (this repo, Fable 5 count_tokens via /tmp/ct.py)

Artifact	Lines	Tokens	Note
`AGENTS.md` (root, always-on)	75	2,744	under the official 200-line rule; matches Phase-0's 2,738 net of harness overhead
`.github/AGENTS.md`	317	10,694	path-scoped: loads only when working there
`docs/AGENTS.md`	160	5,982	path-scoped
`crates/AGENTS.md`	111	2,821	path-scoped
`crates/jackin-tui-lookbook/AGENTS.md`	48	1,053	path-scoped
`docker/construct/AGENTS.md`	18	381	path-scoped
Scoped total	—	20,931	88.4% of this repo's 23,675-tok instruction mass loads on demand
scroll.rs full read	455	5,542	crates/jackin-tui/src/scroll.rs
scroll.rs signature outline	37	828	−85.1%
mount_info.rs full read	289	4,410	crates/jackin-console/src/mount_info.rs
mount_info.rs signature outline	15	331	−92.5%
`git ls-files` (1,245 files)	—	29,087	the bare tree costs 14.5% of a 200K window for zero code
Full repo content	—	≈3.3M ESTIMATE	11,241,888 bytes measured ÷ ~3.4 chars/tok
This session's first cache write	—	28,669	v2.1.175 subagent profile, 37 deferred tool names, incl. ~9K task prompt; from session JSONL `cache_creation_input_tokens`; independently matches the sweep session's figure

Arithmetic constants used below (modeled heavy day from local Phase-0, see 01-economics-and-measurement.md): 6 sessions × 19 API calls = 114 prefix reads/day; prefix cache-written ≥6×/day. Fable 5: cache read $1/MTok, cache write $12.5/MTok (5-min TTL). So each 1K tok of always-on prefix ≈ $0.114 reads + $0.075 writes ≈ $0.19/day; each 10K ≈ $1.89/day ≈ 8.6% of the $22 modeled day. Note: the official costs page reports enterprise average "around $13 per developer per active day," 90% below $30/day — the modeled $22 profile is a plausible heavy-user anchor.

Techniques

12.1 MCP schema deferral / tool search

Stop paying for tool definitions you never call: load tool NAMES only; fetch full JSON schemas on demand. Now the Claude Code default — the single biggest fixed-cost eliminator for MCP-heavy setups.

Layer: input — never-load (fixed prefix)
Mechanism: tools register with defer_loading; the model sees a one-line name list plus a search tool (tool_search_tool_regex_20251119 / _bm25_); on demand 3-5 matching tool_reference blocks expand into full schemas. Claude Code: on by default ("MCP tool definitions are deferred by default, so only tool names enter context until Claude uses a specific tool" — costs doc); ENABLE_TOOL_SEARCH=auto loads schemas upfront only when they fit within 10% of the window; =false loads everything.
Expected savings: Anthropic: "an 85% reduction in token usage" — ~77K → ~8.7K upfront for a 58-tool/five-server setup (~55K of it schemas). Local: 11 schemas 1,420 → ~60 tok (95.8%; Phase-0). Modeled profile: an undeferred GitHub-MCP-like 55K of schemas would cost ≈ 55 × $0.19 ≈ $10.4/day ESTIMATE (~47% on top of the $22 day); deferral reduces that to ~$0.01/day. This session live: 37 tools deferred to a name list, total first-cache 28,669 tok.
Evidence tier: T1 — anthropic.com/engineering/advanced-tool-use; platform.claude.com tool-search docs; code.claude.com/docs/en/costs ; local Phase-0 reproduction.
Quality risk: NEGATIVE-COST — Anthropic's own evals show accuracy rising with deferral: Opus 4 49% → 74%, Opus 4.5 79.5% → 88.1% on MCP evals, because schema noise is gone. Residual: search can miss an obscure tool — keep hot tools non-deferred.
Availability: CLAUDE-CODE-TODAY (default; verified live) / SDK (beta) / GATEWAY-OR-SELF-HOST for other harnesses.
Effort to adopt: zero in Claude Code; one flag on the API. Run /mcp and /context to see per-server costs.
Composability: stacks with everything; multiplies with caching (smaller prefix = cheaper writes AND reads). Orthogonal to eviction and repo maps.
Validation protocol: 20 representative tool-using tasks, A/B ENABLE_TOOL_SEARCH=false vs default; compare first-message cache_creation_input_tokens, total session tokens, and tool-selection correctness (did it pick the right tool first try). Expect: tokens down ≥80% on the schema mass, selection accuracy equal or better (Anthropic saw +8.6 to +25 points).

12.2 System-prompt and CLAUDE.md mass audit — re-run every release

Your fixed prefix bills into every request; measure it, keep CLAUDE.md under 200 lines, and re-measure after every Claude Code update, because the harness baseline itself swings by tens of thousands of tokens.

Layer: input — never-load (fixed prefix)
Mechanism: instruments: /context, /mcp, /usage, the free count_tokens API, and the ground-truth probe — first assistant message's cache_creation_input_tokens in the session JSONL under ~/.claude/projects/. Official: "Aim to keep CLAUDE.md under 200 lines by including only essentials"; auto memory loads only the first 200 lines / 25KB of MEMORY.md.
Expected savings: bounds harness drift: the initial prompt "grew by approximately 70K tokens" in ~5 days (median first-cache ~38-52K at v2.1.89-90 → ~80-88K at v2.1.91 → ~112-119K at v2.1.96, April 2026; issue #45188, closed as not planned). That swing ≈ 70 × $0.19 ≈ $13.2/day ESTIMATE (+60% of the modeled day). Local today: v2.1.175 subagent first-cache = 28,669 tok ≈ $0.029/turn in cache reads — post-bloat baselines came back down, but only measurement tells you. Per 10K of prefix: ~$0.01/turn read, ~$1.89/day total. (Corrects an upstream sweep note that said "$0.10/turn per 10K" — off by 10x against its own examples; 10K × $1/MTok = $0.01.)
Evidence tier: T3 for the volatility history (one reporter's JSONL analysis, methodology shown); T1-local for the instrument (reproduced twice here: 28,669 both sessions).
Quality risk: NEGATIVE-COST — bloated always-on instructions are noise the model must ignore. Inversion: minimal ≠ short; cutting load-bearing rules is QUALITY-TRADE.
Availability: CLAUDE-CODE-TODAY (all instruments built-in or free).
Effort to adopt: minutes — grep -o '"cache_creation_input_tokens":[0-9]*' SESSION.jsonl | head -1; pipe CLAUDE.md through /tmp/ct.py.
Composability: the meta-technique that prioritizes all others; prefix cuts compound via the 0.1x cache-read multiplier on every turn (cache reads = 32% of heavy-session dollars; see 13-caching-exploitation.md).
Validation protocol: per release: fresh session, one trivial prompt, read first cache_creation_input_tokens; alert on ±10K vs last release. For CLAUDE.md edits: count_tokens before/after + a 10-task regression run checking rule adherence (branch policy, brand spelling, etc.) is unchanged.

12.3 Skills as lazy loading (progressive disclosure)

Package workflow/reference docs as skills: ~100 always-on tokens buy unlimited on-demand content, vs CLAUDE.md where every word bills into every request forever.

Layer: input — never-load (instructions)
Mechanism: official three-level table : Level 1 metadata always loaded ≈ "~100 tokens per Skill"; Level 2 SKILL.md body "Under 5k tokens" loads when triggered; Level 3 bundled files/scripts "Effectively unlimited" — scripts run via bash and only their OUTPUT enters context. disable-model-invocation: true drops even the description: zero cost until you type /name. Description cap 1024 chars.
Expected savings: per workflow doc moved out of CLAUDE.md: ≤5K standing → ~100 tok standing (~50x standby cut). A 5K-tok doc: $0.95/day standing → $0.02/day + body cost only on the days it triggers (ESTIMATE, modeled profile). Counterweight at scale: April 2026 report attributed "~7KB for 47 skills injected on every turn" to the skill listing.
Evidence tier: T1 (platform.claude.com agent-skills overview; code.claude.com features-overview).
Quality risk: NEGATIVE-COST with crisp descriptions; RISKY if vague — wrong-skill or no-trigger is a failure mode deferral doesn't have. Post-/compact nuance (verified in docs page source): the skill LISTING is the one startup block not re-injected; invoked skill bodies re-inject capped at 5,000 tok/skill and 25,000 total, oldest dropped, truncation keeps the top of SKILL.md — put critical instructions first.
Availability: CLAUDE-CODE-TODAY / SDK (beta headers) / claude.ai.
Effort to adopt: low-medium (hours): split CLAUDE.md into SKILL.md files; the description becomes load-bearing.
Composability: official escape valve for the 200-line rule (12.2); skills: preloads fully into subagents (12.10) instead of your window.
Validation protocol: move one section to a skill; run 10 prompts that need it and 10 that don't; require 10/10 trigger rate, 0/10 false loads, and count_tokens-verified ~100-tok standing cost. Measure your real metadata cost — "~100 tok" is a rule of thumb, actual varies with description length.

12.4 Path-scoped rules (.claude/rules paths frontmatter, per-directory agent rules)

Conditional instruction loading: directory- or language-specific rules cost zero until a matching file is touched — CLAUDE.md semantics at skill economics.

Layer: input — never-load (instructions)
Mechanism: .claude/rules/*.md with paths frontmatter "only load when Claude works with matching files" (features-overview). Same pattern as this repo's per-directory agent rule auto-loading.
Expected savings: measured here: this repo keeps 20,931 of 23,675 instruction tokens (88.4%) path-scoped (table above). If those were all always-on: ≈ 21 × $0.19 ≈ $3.97/day ESTIMATE; sessions outside those subtrees never pay it. No official published numbers — trivially measurable per repo with count_tokens.
Evidence tier: T1 for the mechanism (official docs); T1-local for the savings arithmetic (measured token masses).
Quality risk: NEGATIVE-COST — rules arrive exactly when relevant, beside the file that triggered them. Risk: a rule you assumed global now fires only on path match; keep MUST-ALWAYS rules in CLAUDE.md or hooks.
Availability: CLAUDE-CODE-TODAY.
Effort to adopt: low (hours): split CLAUDE.md sections into scoped rule files.
Composability: the middle rung of the instruction ladder — CLAUDE.md (always) → rules (on path match) → skills (on invoke) (12.2, 12.3).
Validation protocol: count_tokens each scoped file; run sessions inside and outside the scoped paths verifying (a) /context shows the rule only when expected, (b) a seeded rule-violation prompt is caught inside the path. Confirm no always-on regression by re-reading first-cache size (12.2).

12.5 Repo maps and outlines instead of file dumps

Never paste whole files for orientation: a ranked, budgeted signature map gives the repo's shape at 1/7th-1/13th the tokens per file.

Layer: input — load-less (file content)
Mechanism: aider: tree-sitter symbol graph + graph ranking selects signatures into a fixed budget — "the --map-tokens switch, which defaults to 1k tokens," expanding "especially when no files have been added to the chat" (repomap docs). Manual Claude Code equivalent: grep-first navigation, partial Reads (offset/limit), ctags-style outlines.
Expected savings: local: 85.1% and 92.5% per file (table above). Modeled profile: a 5K-tok file read at call 5 of 19 costs ~$0.0625 write + ~$0.07 in 14 re-reads ≈ $0.133/session; an 828-tok outline ≈ $0.022 — ~$0.11 saved per orientation file per session; 10 orientation files × 6 sessions ≈ $6.6/day gross, outline discipline cuts ~83-90% of it (ESTIMATE). Anti-pattern measured: the bare 1,245-file tree = 29,087 tok — ranking + budget is the technique; the tree is not the map.
Evidence tier: T1 (shipped in aider). Gap: no map-vs-no-map A/B on 2025-26 models; aider's benchmarks are 2023-stale.
Quality risk: NEGATIVE-COST as designed (built to make models better in large repos; context-rot evidence predicts focused beats full even when full fits). Risk: outline omits the body you needed — mitigated because pointers expand on demand (Anthropic just-in-time retrieval pattern: "maintain lightweight identifiers (file paths, stored queries, web links)… dynamically load data at runtime").
Availability: CLAUDE-CODE-TODAY as discipline (CLAUDE.md rule + Explore subagent); aider ships it; GATEWAY-OR-SELF-HOST to replicate the ranked map.
Effort to adopt: minutes for the rule ("grep before reading; Read with offset/limit; never cat whole files for orientation"); replicating aider's PageRank map is a project.
Composability: feeds subagent isolation (12.10: Explore ≈ self-assembling repo map) and LSP navigation (12.6).
Validation protocol: 10 orientation questions ("where is X handled?") answered outline-first vs full-read; compare correctness and tokens/task. Expect equal correctness at 5-10x fewer file-content tokens; any wrong answer triggered by a missing body falsifies the outline pattern for that file class.

Ask the language server where a symbol lives and read just that body, instead of grep-then-read-whole-candidates. Officially endorsed; hard numbers unpublished.

Layer: input — load-less (file content)
Mechanism: LSP tools (go-to-definition, find-references, find_symbol) return bodies/locations directly. Official costs doc : "A single 'go to definition' call replaces what might otherwise be a grep followed by reading multiple candidate files"; installed language servers also report type errors after edits. Serena MCP is the self-host version; Claude Code has first-party code-intelligence plugins (and an LSP tool visible in this session's deferred list).
Expected savings: no published measurements anywhere (Serena README is qualitative — "much more token-efficient"; checked). Magnitude bound from 12.5: not-reading-the-rest-of-the-file is worth 85-92% per file touched (local). Counterweight: an LSP MCP server adds its own schemas — mitigated by default deferral (12.1).
Evidence tier: T3 (official mechanism endorsement + community anecdote; zero published benchmarks).
Quality risk: NEGATIVE-COST direction (precise navigation beats text search in big/typed codebases); LSP can be stale on unsaved/generated code.
Availability: CLAUDE-CODE-TODAY (code-intelligence plugins; Serena via .mcp.json) / GATEWAY-OR-SELF-HOST.
Effort to adopt: low (minutes-hours): install the language plugin; best on typed languages with healthy servers.
Composability: pairs with grep-first (grep finds candidates, LSP picks one) and outlines (12.5); diagnostics-after-edit replaces compile-run cycles (separate channel).
Validation protocol: the unclaimed cheap experiment: same 20 navigation/edit tasks with plugin on/off; record tokens/task, file-content tokens specifically, and solve rate. This would produce the first quantitative LSP-navigation number — none exists.

12.7 Hook-based preprocessing / tool-output filtering

Filter tool output BEFORE it becomes context: a PreToolUse hook rewriting npm test to pipe through grep turns ten-thousand-line logs into hundreds of tokens, deterministically, at zero standing cost.

Layer: input — load-less (tool results, pre-ingestion)
Mechanism: hooks run outside the conversation (context cost zero unless they return output). Official example (costs doc): PreToolUse on Bash detects test runners and rewrites to … 2>&1 | grep -A 5 -E '(FAIL|ERROR|error:)' | head -100 via updatedInput. Official claim: "Instead of Claude reading a 10,000-line log file to find errors, a hook can grep for ERROR and return only matching lines, reducing context from tens of thousands of tokens to hundreds."
Expected savings: ~99% on the targeted artifact (illustrative, not benchmarked). Scale check: 10K lines × ~60 chars ≈ 600KB ≈ ~180K tok (ESTIMATE at 3.35 chars/tok) — one unfiltered log can exceed the whole 200K window, so this is sometimes possible-vs-impossible, not just cheaper. Note the harness already has a backstop: oversized tool results persist to disk with a ~2KB inline preview (observed live this session: a 55.9KB WebFetch result).
Evidence tier: T1 for mechanism (official docs with full script); the savings figure is an official illustration, not a benchmark.
Quality risk: NEGATIVE-COST when the filter preserves failure signal; RISKY if it eats the line the model needed — keep filters conservative (grep -A/-B windows, head caps), and let the model re-run unfiltered on demand.
Availability: CLAUDE-CODE-TODAY.
Effort to adopt: medium (hours): write and maintain shell hooks; failure modes are yours.
Composability: composes with eviction (12.8) — masking can only evict what entered; hooks stop it entering. Cousin of skills' script pattern (12.3: script code never enters context, only output).
Validation protocol: seed N known test failures; run suite with/without filter; require the model to identify the same root cause in both arms while tokens drop ≥90% on the filtered arm. Any missed root cause = filter too aggressive; widen -A/-B.

12.8 Observation masking / tool-result eviction (context editing, microcompact)

Old tool outputs are the bulk of a long session and mostly dead weight: surgically clear them (keeping the call record) instead of summarizing the conversation. Cheapest eviction; measurably improves performance.

Layer: evict (conversation history)
Mechanism: replace stale tool_result blocks with a placeholder. API: context-management beta, clear_tool_uses_20250919. Claude Code: automatic — "It clears older tool outputs first, then summarizes the conversation if needed" (how-claude-code-works); community source-reads say microcompact keeps the ~5 most recent tool results and uses server-side cache edits to avoid invalidating the cached prefix (T3, constants unverified — see Gaps).
Expected savings: Anthropic: 84% token-consumption reduction in a 100-turn web-search eval, with context editing alone delivering "a 29% improvement" in performance . Independent T2: JetBrains "Complexity Trap" (arXiv 2508.21433, NeurIPS '25 DL4C) — observation masking "halves cost relative to the raw agent while matching, and sometimes slightly exceeding, the solve rate of LLM summarization" on SWE-bench Verified, 5 model configs; hybrid adds 7%/11% further. Modeled profile ESTIMATE: avg context/call ≈ 61.6K (1.17M÷19); prefix ≈ 28.7K leaves ~33K history/files; halving the history component saves ~1.87M cache-read tok/day ≈ $1.87/day (~8.5%) — the 84% figure belongs to much longer-horizon sessions.
Evidence tier: T1 (shipped + vendor eval) corroborated by T2 (independent peer-reviewed workshop paper).
Quality risk: NEGATIVE-COST — +29% with 84% fewer tokens; masking matches summarization solve rates at half cost. Risk: agent needs an evicted result and re-runs the tool (usually small re-read cost).
Availability: CLAUDE-CODE-TODAY (automatic, not user-tunable) / SDK (beta; Bedrock, Vertex).
Effort to adopt: zero in Claude Code; one config block on the API.
Composability: complements deferral (12.1: prefix vs history); interacts with caching — naive client-side deletion invalidates the cached prefix; Claude Code's cache-edit approach exists precisely for this (see 13-caching-exploitation.md).
Validation protocol: API replay of a 50+-turn agentic task with context_management on/off: compare completion rate, correctness, total tokens. Anthropic's 100-turn protocol predicts the on-arm completes tasks the off-arm fails on context exhaustion.

12.9 Compaction discipline: /clear over /compact, steered compaction

Compaction is lossy and costs a model call; clearing is free and lossless-by-irrelevance. /clear at task boundaries; steer /compact when you must continue; keep survival-critical rules in CLAUDE.md because summaries drop them.

Layer: evict (conversation history)
Mechanism: /clear resets the window ("Stale context wastes tokens on every subsequent message"; /rename then /resume to return). /compact summarizes via the model — preserving "architectural decisions, unresolved bugs, and implementation details," continuing with the summary "plus the five most recently accessed files" (eng blog). Steerable: /compact Focus on code samples and API usage, or a "Compact instructions" CLAUDE.md section. Auto-compact stops with a thrashing error if a huge artifact refills context after each pass.
Expected savings: the official interactive walkthrough encodes the post-compact summary as Math.round(sumTokens * 0.12) — 12% of accumulated conversation, i.e. an ~88% history cut (page source; an illustrative constant, not a measured distribution) — bracketing the API-side 84%. /clear saves 100% of dead history plus the compaction call. Cache caveat: both rewrite the prefix — next turn pays 1.25-2x write on new content; with writes already 29% of heavy-day dollars, compacting MORE often is not automatically cheaper.
Evidence tier: T1 for mechanics (official docs); the 12% ratio is a docs-simulation constant (flagged as such).
Quality risk: /clear at true boundaries: NEGATIVE-COST (fresh window beats rotted context). /compact mid-task: QUALITY-TRADE — official: "detailed instructions from early in the conversation may be lost." Post-compact reload set (verified in page source): system prompt, CLAUDE.md, memory, MCP reload; the skill listing does not.
Availability: CLAUDE-CODE-TODAY.
Effort to adopt: zero — behavioral; one CLAUDE.md section.
Composability: last-resort layer after 12.1/12.7/12.8 (Claude Code itself orders it: clear tool outputs first, summarize second). CLAUDE.md is the durable instruction channel across compactions (12.11).
Validation protocol: measure real compact ratios from your session JSONLs (pre/post token counts) vs the 12% constant; for quality, seed an early-conversation constraint and test whether post-compact behavior still honors it — if not, that constraint belonged in CLAUDE.md.

12.10 Subagent context isolation (context firewall)

Route bulk reading through a subagent: dozens of reads happen in a disposable window; only a 1-2K summary lands in yours. Protects main-context quality; does NOT necessarily cut total dollars.

Layer: never-load (main window) / spend-elsewhere
Mechanism: subagents get their own fresh context; the main thread pays the spawn prompt + returned summary. Built-in Explore/Plan omit CLAUDE.md and git status from startup; subagents run a slimmer system prompt than the interactive host.
Expected savings: main-window: 20 files × ~2K tok lands as a summary "often 1,000-2,000 tokens" instead of ~40K (eng blog). Total-dollar: often negative — the subagent pays its own, frequently cold-cache, bill; official: "Agent teams use approximately 7x more tokens than standard sessions when teammates run in plan mode" (costs doc — plan-mode agent teams specifically; do not generalize to single subagents). Cross-ref: the caveman cavecrew plugin claims ~60% smaller re-injected subagent output (plugin description; Phase-0 measured the underlying ultra register at 58.5% on visible prose).
Evidence tier: T1 for isolation semantics (official docs); the 1-2K summary is a description, not a guarantee.
Quality risk: NEUTRAL on cost, quality-protective for the main thread (stays under the rot knee, 12.0). RISKY for tight back-and-forth tasks — the summary boundary loses detail.
Availability: CLAUDE-CODE-TODAY (Task/Explore, custom agents).
Effort to adopt: zero built-in; custom agents need a .md definition.
Composability: multiplies with repo maps (12.5) and cheap-model routing (model: haiku for simple subagents — official tip); the GitHub-MCP mitigation (tool-search subagent, main-thread 51,000 → 8,500) is this pattern.
Validation protocol: same research question inline vs via Explore: record main-window growth, TOTAL dollars across both contexts (subagent usage included), and answer quality. Claiming dollar savings without summing the subagent's own bill is the category error to avoid.

12.11 Memory outside the window (memory tool, auto memory, note files)

State that must persist lives in files, not conversation: the window carries pointers, and eviction becomes safe because evicted facts are recoverable.

Layer: evict-enabler / never-load
Mechanism: API memory tool (beta): file-based store read/written across sessions. Claude Code: auto memory (MEMORY.md, "first 200 lines or 25KB, whichever comes first" auto-loaded), CLAUDE.md as the durable rule store, agent-maintained NOTES.md/todos (Anthropic structured-note-taking pattern).
Expected savings: indirect but measured: memory tool + context editing = +39% over baseline vs +29% editing alone — external memory adds 10 points on top of pure eviction. Auto memory's 25KB cap ≈ 7.6K tok ESTIMATE bounds standing cost by construction.
Evidence tier: T1 (shipped; vendor internal eval, no public methodology).
Quality risk: NEGATIVE-COST per the eval; risk is staleness/pollution — bad remembered "facts" persist across sessions. Audit MEMORY.md periodically.
Availability: SDK (memory tool beta) / CLAUDE-CODE-TODAY (auto memory, CLAUDE.md, note files).
Effort to adopt: low in Claude Code (already on); medium on the API (file backend).
Composability: the enabling pair of 12.8 (Anthropic ships and evaluates them together); what makes /clear safe mid-project (12.9).
Validation protocol: kill-and-resume test: end a session mid-task, resume fresh; measure tokens-to-reorientation and task continuity with notes/memory vs without. Plus a staleness audit: grep MEMORY.md for facts contradicted by the current codebase.

12.12 Learned context pruning (SWE-Pruner) — research frontier

A 0.6B-parameter "skimmer" prunes irrelevant code lines from agent context on the fly, goal-conditioned — the next step beyond static masking.

Layer: load-less (tool results, learned)
Mechanism: the agent states a retrieval goal; the skimmer scores and keeps only relevant lines from long observations before they enter the main model's context.
Expected savings: "23-54% token reduction on agent tasks like SWE-Bench Verified while even improving success rates; up to 14.84x compression on single-turn tasks like LongCodeQA with minimal performance impact" (arXiv 2601.16746 abstract, v4).
Evidence tier: T2 (peer-track preprint, single group, awaiting replication).
Quality risk: NEGATIVE-COST per the paper's own evals; unreplicated. Skimmer inference adds latency/cost the paper does not fully price at small scales.
Availability: GATEWAY-OR-SELF-HOST (research code; in no shipped harness found).
Effort to adopt: high (project): sidecar model via hook or gateway middleware.
Composability: generalizes hook-greps (12.7: learned filter vs regex); could stack under masking (prune on entry, mask on age).
Validation protocol: replicate on a SWE-bench Verified subset: solve rate and tokens with/without skimmer; require the paper's improved-success-rate direction to hold before trusting the cut.

Claims killed

"The GitHub MCP server eats ~50K of your Claude Code context" — stale for Claude Code: schemas are deferred by default (only names load; costs doc). Verified live: this v2.1.175 session defers 37 tools to a name list; entire first-cache = 28,669 tok. The ~42-55K/93-tool figure (getunblocked.com) survives only for raw API/gateway integrations without defer_loading.
unblocked.com's "46.9% cut" — internally inconsistent: the article labels "main-thread token usage dropping from 51,000 to 8,500" as "a 46.9% cut"; 51,000→8,500 is 83.3%. One of the figures measures something else or is wrong; re-measure before propagating either.
"Just give the model the repo tree" — measured dead: the bare tree of this 1,245-file repo is 29,087 tok (14.5% of a 200K window) for zero code content. Ranked + budgeted maps (aider: 1K default) are the technique; the tree is not the map.
"LLM summarization is the gold standard for agent context management" — killed by arXiv 2508.21433: dumb observation masking halves cost while matching-or-exceeding summarization's solve rate. Anthropic's shipped ordering agrees: clear tool outputs first, summarize last.
"Claude Code's system prompt is ~4K tokens" — the 4,200-tok "System prompt" in the official context-window walkthrough is a hardcoded JS simulation constant (tokens: 4200, page source), not a measurement. Real first-cache baselines: 28,669 (v2.1.175 subagent, local ×2), ~38-52K interactive (Apr 2026), ~112-119K during the bloat. Citing 4.2K understates the prefix by ~an order of magnitude.
"Lost in the Middle showed ~20-point drops, expect that on Fable 5" — 2023 GPT-3.5-era magnitudes don't transfer. Direction replicates (Chroma, 18 models incl. Claude Opus 4); magnitudes are model-specific. Quote the direction; re-measure the magnitude.
"Caveman-style compression cuts ~75% of tokens" — Phase-0 killed it: ultra register = 58.5% measured TOKEN cut; ~75% is character-level folklore. Applies to any "compress your CLAUDE.md by X%" claim measured in characters (wenyan-full: 80.9% chars, only 56.6% tokens).
"Skills metadata is free" — it's ~100 tok per skill, every request: tens of skills ≈ a few K of permanent prefix (the April report measured ~7KB/turn for 47 skills). Cheap and capped (disable-model-invocation: true → zero), not free.

Gaps and open experiments

No public aggregate token count for the CURRENT interactive (non-subagent) Claude Code system prompt at v2.1.17x — a 10-minute fresh-session JSONL probe closes it; repeat per release (demonstrated ±70K/5-day volatility).
LSP/Serena navigation has zero published quantitative benchmarks (12.6's protocol is cheap, unclaimed territory).
Anthropic's 85%/84%/29%/39% are internal evals without public harnesses — directionally corroborated by the independent T2 JetBrains result, unreplicable as stated.
The cache-economics interaction is unpublished: eviction rewrites cached prefix (1.25-2x write) while prefix-shrinking saves 0.1x reads forever; with reads at 32% and writes at 29% of heavy-day dollars, optimal eviction FREQUENCY is an open optimization problem (see 13-caching-exploitation.md).
Microcompact's exact constants ("keeps ~5 most recent tool results") rest on secondary source-reads; not officially documented (one community source returned HTTP 403).
Repo-map quality evidence is stale (aider design 2023, no modern A/B); SWE-Pruner is the nearest modern datapoint but tests learned pruning, not static maps.

Verification ledger

Number	Source / method (all accessed or)
85% token cut; ~77K→~8.7K; 58 tools ≈ 55K; Opus 4 49%→74%; Opus 4.5 79.5%→88.1%; PTC 43,588→27,297 (37%)	anthropic.com/engineering/advanced-tool-use (fetched; quotes verbatim)
84% token cut; +29% editing alone; +39% with memory tool; 100-turn eval	claude.com/blog/context-management (fetched; quotes verbatim)
Masking halves cost, matches/exceeds summarization solve rate; hybrid −7%/−11%; SWE-bench Verified, 5 configs; NeurIPS '25 DL4C	arxiv.org/abs/2508.21433 (fetched; abstract verbatim)
SWE-Pruner 23-54% cut, improved success; 14.84x LongCodeQA; 0.6B skimmer; v1, v4	arxiv.org/abs/2601.16746 (fetched; abstract verbatim)
aider --map-tokens default 1k; expansion when no files in chat	aider.chat/docs/repomap.html (fetched)
MCP deferral on by default; CLI (gh/aws) cheaper than MCP; 10,000-line-log hook claim + script; agent teams ≈7x in plan mode; model: haiku tip; /clear guidance; CLAUDE.md <200 lines; enterprise avg ~$13/dev/active-day, 90% <$30	code.claude.com/docs/en/costs (fetched; quotes verbatim)
System prompt 4,200 = simulation constant; MCP deferred ≈120 tok in walkthrough; ENABLE_TOOL_SEARCH=auto 10% threshold; compact summary = Math.round(sumTokens*0.12); skill listing not re-injected post-compact; skill bodies re-inject capped 5,000/skill, 25,000 total	code.claude.com/docs/en/context-window (fetched; page JS source grepped from persisted output)
Sub-agent summaries "often 1,000-2,000 tokens"; compaction preserves decisions + "five most recently accessed files"; just-in-time identifiers; minimal ≠ short	anthropic.com/engineering/effective-context-engineering-for-ai-agents (fetched; verbatim)
18 models (5 Anthropic/7 OpenAI/3 Google/3 Alibaba); single-distractor degradation, four compound; Claude largest focused-vs-full gap	trychroma.com/research/context-rot (fetched; verbatim)
GitHub MCP ~42K and 55K/93 tools; 51,000→8,500 labeled "46.9%"; CLI 4-32x cheaper/op; $3.20 vs $55.20 at 10K ops (Sonnet 4)	getunblocked.com/blog/github-mcp-token-cost (fetched; verbatim, inconsistency flagged)
~100 tok/skill; SKILL.md <5k; Level 3 unlimited; description ≤1024 chars	platform.claude.com/docs/en/agents-and-tools/agent-skills/overview (fetched; table verbatim)
First-cache 38-52K → 80-88K → 112-119K; "~70K in 5 days"; ~7KB/47 skills; closed not planned	github.com/anthropics/claude-code/issues/45188 (fetched)
`AGENTS.md` 2,744 tok/75 lines; scoped agent rule files 10,694 + 5,982 + 2,821 + 1,053 + 381 = 20,931 (88.4% scoped)	local: cat FILE \| python3 /tmp/ct.py claude-fable-5
scroll.rs 5,542 full vs 828 outline (−85.1%); mount_info.rs 4,410 vs 331 (−92.5%)	local: ct.py full file vs grep signature pattern (shown above)
git ls-files = 1,245 files = 29,087 tok; repo = 11,241,888 bytes ≈ 3.3M tok ESTIMATE	local: git ls-files \| ct.py; xargs wc -c; ÷3.4 chars/tok
First cache_creation_input_tokens = 28,669 (×2 sessions); harness v2.1.175; 37 deferred tool names	local: session JSONL grep; claude --version; system-reminder count
11 MCP schemas 1,420 vs ~60 deferred; dollar split 32/29/20/17/2; session profile (19 calls, 1.17M cache-read, 27K out); caveman 58.5%/56.6%/80.9%	local Phase-0 measurements (ground truth per dossier brief)
Fable 5 $10/$50/MTok; cache read 0.1x, write 1.25x (5min); all $/day arithmetic	dossier reference pricing (see 01-economics-and-measurement.md) + arithmetic shown inline, labeled ESTIMATE

12 — Context architecture — what enters the window at all

12 — Context architecture — what enters the window at all

Why this layer is mostly NEGATIVE-COST: the context-rot evidence base

Local measurements (this repo, Fable 5 count_tokens via /tmp/ct.py)

Techniques

12.1 MCP schema deferral / tool search

12.2 System-prompt and CLAUDE.md mass audit — re-run every release

12.3 Skills as lazy loading (progressive disclosure)

12.4 Path-scoped rules (.claude/rules paths frontmatter, per-directory agent rules)

12.5 Repo maps and outlines instead of file dumps

12.6 Symbol-graph / LSP navigation instead of file reads

12.7 Hook-based preprocessing / tool-output filtering

12.8 Observation masking / tool-result eviction (context editing, microcompact)

12.9 Compaction discipline: /clear over /compact, steered compaction

12.10 Subagent context isolation (context firewall)

12.11 Memory outside the window (memory tool, auto memory, note files)

12.12 Learned context pruning (SWE-Pruner) — research frontier

Claims killed

Gaps and open experiments

Verification ledger

On this page