44 — Fleet, team & multi-tenant cache economics (hosted)
Volume II area file for blind spot 4. Volume I covered the
self-hosted fleet tier fully (19:116-187, HiCache/LMCache across replicas) and the
excludeDynamicSections mechanism as one technique (13 tech 7), but left the hosted (Claude
API / subscription) cross-container story thin and explicitly flagged the dynamic-section size as
unmeasured (13 Gaps #6). This file builds the hosted-fleet model: when N containers share one
cached system prefix versus each paying its own, what a launcher (jackin') centralizes, and the
fleet×quota interaction. This is the jackin'-relevant gap — jackin' launches container fleets.
TL;DR
- The server prompt cache is workspace-scoped, not machine-scoped. N containers share one cached system prefix if and only if they run under the same workspace + same org + same model and send a byte-identical prefix. There is no machine, directory, container, or worktree key on the server cache; the "machine+directory / git-snapshot / worktree" rules Volume I cited (13 tech 7) describe Claude Code's local file cache (GitHub #17531), a different layer.
- A ~111-token dynamic block silently un-shares a ~28–34k-token prefix. The Agent SDK preset puts six per-container fields (working dir, git-repo flag, platform, shell, OS version, auto-memory paths) ahead of everything else; measured at ~111 tokens (≈201 with a short git status, local `count_tokens`). Because cache is an exact-prefix match, those ~111 varying tokens invalidate the entire downstream system prefix, forcing each container to write ~28–34k tokens at 1.25–2× instead of reading at 0.1×. `excludeDynamicSections` moves them into the first user message so the fleet shares one entry. This is the primary hosted-fleet lever, and it partially closes Volume I's unmeasured-size gap.
- Fleet cold-start: N simultaneous launches on a cold prefix pay N writes, not 1 write + (N−1) reads. A cache entry is readable only after the first response begins streaming; pre-warm once (`max_tokens:0`) or stagger the first request, then fan out, converting (N−1) × 1.25–2× writes into (N−1) × 0.1× reads. For a 30k prefix over a 25-container wave on Opus 4.8: ~$0.94 (staggered) vs ~$9.4 (naive) on the write line (ESTIMATE).
- Two fleet caching traps, both T3-fresh: (1) Agent-tool subagents ship with `enablePromptCaching` hardcoded false (GitHub #29966; ~378k wasted uncached tokens in one measured session), so fan-out fleets may miss caching entirely, a no-caching problem distinct from a sharing problem; (2) a mixed-version fleet (some SDKs < v0.2.98 TS / v0.1.58 Py) silently splits into cached and uncached cohorts because older clients ignore the flag.
- Multi-tenant boundary: org isolation is absolute; workspace is the sharing unit. Tenant-per-org → zero cross-tenant prefix sharing (each writes the prefix once per TTL). Tenant-as-workspace within one org → the shared system prefix pools. On a subscription, the whole fleet shares one pooled cap (file 41); a headless/SDK fleet can draw the separate API-rate credit, off the interactive cap, which is the cleanest way to keep a fleet from starving an operator's interactive seat.
Cache multipliers and the $22/day profile are inherited from Volume I; fleet dollar figures are ESTIMATE with arithmetic.
The cache layering Volume I conflated (clarification + applied correction)
There are three distinct caches, and a fleet's economics depend on the server one:
| Layer | Scope key | Where described | Fleet relevance |
|---|---|---|---|
| Server prompt cache (the 0.1× reads) | workspace + org + model + exact prefix | platform.claude.com prompt-caching | This is what N containers can share |
| Claude Code local file cache | machine + directory; git-snapshot; worktrees never share | code.claude.com prompt-caching / GitHub #17531 | Local read reuse, not the API cache |
| Subagent caching | per-spawn; enablePromptCaching default false for Agent-tool (#29966) | GitHub #29966 | May be off entirely in fan-out |
The original Volume I text (13 tech 7 + surprising-findings: "your git state is in the cache key… worktrees never
share") attributed git-snapshot/worktree/machine+directory keys to the cache scope citing the
prompt-caching docs. The hosted server cache has no such keys — it is workspace-scoped; the
git/worktree rules belong to Claude Code's local file cache. The two are easy to merge because
Claude Code's observed reuse blends both layers. This correction has now been applied to 13;
the practical upshot is favorable — hosted
fleets can share across machines/dirs, which Volume I's framing implied they could not.
When N containers share one prefix (the rule, sourced)
Server cache hit requires (platform.claude.com/docs/en/build-with-claude/prompt-caching):
- Same workspace (Claude API / Claude Platform on AWS / Microsoft Foundry isolate per workspace; Bedrock/Vertex isolate per org only, a wider sharing boundary there).
- Same org (absolute wall between orgs — byte-identical prompts never share across orgs).
- Same model (and same effort/fast-mode — Volume I's cache-key facts hold).
- Byte-identical prefix up to the `cache_control` block: one differing byte downstream of a match still costs from the divergence point; one differing byte upstream (the dynamic block) costs everything.
- Prefix ≥ the model minimum (Opus 4.8 = 1,024; Fable 5 = 512; Haiku 4.5 = 4,096; live values match Volume I 13 tech 11; sub-minimum prefixes silently cache nothing).
Given those, the fleet economics are: first container writes the prefix (1.25–2×); every other container that meets the rule reads it (0.1×). The entire game is making the prefix byte-identical across the fleet — which is exactly what the six dynamic fields prevent by default.
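The exact-prefix rule is easiest to see on a toy model. The sketch below is an illustration only: it treats a prompt as a list of opaque tokens and measures the longest common leading run between two containers, whereas the real server cache matches bytes up to a `cache_control` block. The names `shared_prefix_len`, `body`, `env_a`, and `env_b` are hypothetical.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common leading run of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Toy stand-ins: a 30-token shared system body and a 3-token per-container block.
body = [f"sys{i}" for i in range(30)]
env_a = ["cwd=/work/a", "shell=bash", "os=linux"]
env_b = ["cwd=/work/b", "shell=bash", "os=linux"]

# Dynamic block first (the default layout): the very first token differs.
front = shared_prefix_len(env_a + body, env_b + body)
# Dynamic block after the shared body: the whole body still matches.
back = shared_prefix_len(body + env_a, body + env_b)

print(front, back)  # 0 30
```

The same three varying tokens cost everything in the first layout and nothing in the second, which is why moving the dynamic block matters more than shrinking it.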
Techniques
F1. Workspace-pinned fleet cache sharing — auth the whole fleet to one workspace
The precondition for any cross-container reuse: same workspace + org + model, or the shared prefix never pools.
- Coverage-delta: New. Volume I 13 tech 7 covers the SDK flag; the workspace-scope rule (the isolation change) as the gating condition for hosted fleet sharing is not in Volume I.
- Layer: infra / fleet architecture.
- Mechanism: server caches are isolated per workspace (API/AWS-Platform/Foundry) or per org (Bedrock/Vertex). A fleet whose containers authenticate to different workspaces gets zero cross-container prefix reuse; pinning all production containers to one workspace (and one model) lets the shared system prefix cache once and be read N−1 times.
- Expected savings: turns each extra container's ~28–34k system prefix from a 1.25–2× write into a 0.1× read. For a 25-container fleet sharing a 30k prefix on Opus 4.8: write line ~$9.4 (all cold) → ~$0.94 (1 write + 24 reads) per TTL window (ESTIMATE on Volume I multipliers).
- Evidence tier: T1 (workspace-isolation docs); ESTIMATE for fleet dollars (no published fleet figures).
- Quality risk: NEUTRAL (pure cache routing). Multi-tenant caveat: do not co-workspace tenants that must not share a cache (org wall is the only hard isolation).
- Availability: SDK / GATEWAY (workspace = an API-key/credential property).
- Effort to adopt: hours (fleet auth config).
- Composability: precondition for F2/F3; pairs with the jackin' launcher (F6).
- Validation protocol: launch two containers in different dirs under one workspace with identical prefixes; assert container 2's first call shows `cache_read` > 0 on the system segment.
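The validation step can be automated once each container's first-call usage is captured. This is a minimal sketch that assumes the usage block is available as a plain dict carrying the API's `cache_read_input_tokens` and `cache_creation_input_tokens` counters; how the dicts are collected, and the helper names `classify_first_call` and `fleet_shares_prefix`, are hypothetical.

```python
def classify_first_call(usage: dict) -> str:
    """Label one call by its cache activity."""
    if usage.get("cache_read_input_tokens", 0) > 0:
        return "read"      # reused an existing cache entry
    if usage.get("cache_creation_input_tokens", 0) > 0:
        return "write"     # paid to create its own entry
    return "uncached"      # no cache activity at all


def fleet_shares_prefix(first_calls: list[dict]) -> bool:
    """Container 1 may write or read; every later container must read."""
    if len(first_calls) < 2:
        return False  # sharing cannot be shown with fewer than two containers
    return all(classify_first_call(u) == "read" for u in first_calls[1:])


writer = {"cache_creation_input_tokens": 30_000, "cache_read_input_tokens": 0}
reader = {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 30_000}

print(fleet_shares_prefix([writer, reader]))  # True
print(fleet_shares_prefix([writer, writer]))  # False
```

A `False` result with two `write` labels points at a workspace or prefix mismatch; an `uncached` label points at a prefix below the model minimum or caching being off.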
F2. excludeDynamicSections — make the prefix byte-identical across containers
Six per-container fields sit ahead of the prefix and bust it; move them into the first user message.
- Coverage-delta: Volume I 13 tech 7 names the flag; the enumerated six fields, the measured ~111-token size, the version gating, and the mixed-cohort trap are new (and partially close Volume I Gaps #6).
- Layer: input / system-prompt structure.
- Mechanism: the `claude_code` preset embeds working directory, git-repo flag, platform, shell, OS version, and auto-memory paths (~111 tokens measured; ~201 with a short git status) ahead of any append text. `excludeDynamicSections:true` (TS SDK ≥ v0.2.98) / `exclude_dynamic_sections:True` (Python ≥ v0.1.58) / CLI `--exclude-dynamic-system-prompt-sections` relocates them into the first user message so the system prefix is byte-identical fleet-wide. Older clients silently ignore it.
- Expected savings: the blast radius, not the 111 tokens. Each differing dynamic block forces a full ~28–34k-token prefix rewrite; sharing it saves (N−1) × prefix × (1.25–2.0 − 0.1) cap-weighted or dollar-weighted tokens per TTL.
- Evidence tier: T1 (SDK docs + enumerated fields) + local measurement of a representative block (ESTIMATE for the exact SDK-emitted size, which needs the SDK to measure).
- Quality risk: QUALITY-TRADE (mild, documented): the six fields "carry marginally less weight" in a user message; no quantified accuracy delta published. Enable when fleet reuse beats maximally-authoritative env context.
- Availability: SDK / CLI flag.
- Effort to adopt: hours (flag + byte-aligning model/effort/tool-set/append text across the fleet).
- Composability: requires F1 (same workspace); pin SDK version ≥ floor to avoid F2's cohort split.
- Validation protocol: run `count_tokens` on the preset system prompt with the flag on vs off to measure your dynamic-section size; then launch two relocated-context containers and confirm shared `cache_read`; run an env-sensitive task ("which directory are you in?") to confirm correctness.
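Because older clients silently ignore the flag, a pre-launch check of SDK versions against the floors quoted above can catch the mixed-cohort split before it costs anything. The sketch below assumes plain `x.y.z` version strings and a hypothetical inventory mapping container name to `(sdk, version)`; how that inventory is gathered is outside its scope.

```python
# Version floors quoted in the text for the dynamic-sections flag.
FLOORS = {"ts": (0, 2, 98), "py": (0, 1, 58)}


def parse_version(version: str) -> tuple[int, ...]:
    """Parse 'v0.2.98' or '0.2.98' into (0, 2, 98). Pre-release tags are not handled."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def split_cohorts(fleet: dict[str, tuple[str, str]]) -> tuple[list[str], list[str]]:
    """Return (containers that honour the flag, containers that ignore it)."""
    honours, ignores = [], []
    for name, (sdk, version) in sorted(fleet.items()):
        if parse_version(version) >= FLOORS[sdk]:
            honours.append(name)
        else:
            ignores.append(name)
    return honours, ignores


fleet = {
    "c1": ("ts", "0.2.98"),
    "c2": ("ts", "0.2.63"),
    "c3": ("py", "v0.1.58"),
    "c4": ("py", "0.1.9"),
}
honours, ignores = split_cohorts(fleet)
print(honours, ignores)  # ['c1', 'c3'] ['c2', 'c4']
```

Comparing tuples rather than strings matters here: as strings, `"0.1.9"` sorts above `"0.1.58"` and the stale client would pass.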
F3. Fleet cold-start — pre-warm once, then staggered fan-out
A simultaneous N-container launch on a cold prefix pays N writes; one warm-up converts that to 1 write + (N−1) reads.
- Coverage-delta: Volume I 13 tech 8 covers the concurrency rule for subagent waves; applying it to a container fleet cold-start with the pre-warm-then-fan-out pattern is new framing.
- Layer: cache / launch orchestration.
- Mechanism: a cache entry is readable only after the first response begins streaming; N parallel cold requests all pay the write. Fire one warm-up (`max_tokens:0`, which writes the cache and bills zero output) or one real request, await its first token, then launch the remaining N−1; they read at 0.1×.
- Expected savings: for N containers on prefix P: naive N × (1.25–2.0)P vs 1 × (1.25–2.0)P + (N−1) × 0.1P. At N=25, P=30k, Opus 4.8 1h-TTL: ~$15 → ~$1.5 on the write line (ESTIMATE), a ~10× cut on fleet cold-start input.
- Evidence tier: T1 (concurrency + `max_tokens:0` docs); ESTIMATE for fleet dollars.
- Quality risk: NEUTRAL (adds one TTFT of latency to the wave start; file 43).
- Availability: SDK (orchestrator awaits first token before fan-out).
- Effort to adopt: hours (launch sequencing in the orchestrator).
- Composability: essential with F1/F2; pairs with 1h TTL for bursty fleets (writes survive the wave); the jackin' launcher is the natural home (F6).
- Validation protocol: launch a wave simultaneously vs warm-then-fan-out; diff first-call `cache_read` (≈0 vs ≈prefix) and total `cache_creation`.
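The naive-versus-staggered formula can be evaluated directly in multiplier-weighted tokens, which keeps the comparison independent of price. This is a sketch of the arithmetic only, using the 0.1× read and 1.25× / 2× write multipliers quoted in the text; the function names are hypothetical, and the base price is deliberately left as a parameter because any dollar figure scales with whichever base price and TTL tier is assumed.

```python
READ_MULT = 0.1  # cache-read multiplier quoted in the text


def naive_weighted(n: int, prefix_tokens: int, write_mult: float) -> float:
    """All n containers launch cold, so each pays a full cache write."""
    return n * write_mult * prefix_tokens


def staggered_weighted(n: int, prefix_tokens: int, write_mult: float) -> float:
    """One container writes; the remaining n-1 read the warmed entry."""
    return write_mult * prefix_tokens + (n - 1) * READ_MULT * prefix_tokens


def to_dollars(weighted_tokens: float, base_per_mtok: float) -> float:
    """Convert weighted tokens to dollars at an assumed base price per million tokens."""
    return weighted_tokens / 1_000_000 * base_per_mtok


for write_mult in (1.25, 2.0):
    naive = naive_weighted(25, 30_000, write_mult)
    staged = staggered_weighted(25, 30_000, write_mult)
    print(f"{write_mult}x write: naive {naive:,.0f}, "
          f"staggered {staged:,.0f}, ratio {naive / staged:.1f}x")
# 1.25x write: naive 937,500, staggered 109,500, ratio 8.6x
# 2.0x write: naive 1,500,000, staggered 132,000, ratio 11.4x
```

For N=25 and P=30k the cut lands between roughly 8.6× and 11.4× depending on the TTL tier, which brackets the ~10× figure.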
F4. Audit subagent caching in fan-out fleets — it may be off by default
A fan-out fleet's biggest cache loss may not be sharing but that subagents cache nothing at all.
- Coverage-delta: New (GitHub #29966, post-Volume-I). Volume I measured cavecrew/Claude Code subagents writing 5m cache (13 tech 2); this reports Agent-tool subagents with caching off — a possible version/path-dependent conflict, flagged in 49.
- Layer: cache / multi-agent.
- Mechanism: Agent SDK / Claude Code subagents spawned via the Agent tool reportedly hardcode `enablePromptCaching=false` (vs the main REPL defaulting true), so each subagent call pays full uncached input; one measured session wasted ~378k tokens across 54 subagent calls (~7,013 uncached each). Issue open, unfixed at access date (Claude Code 2.1.63 / SDK 0.2.63).
- Expected savings: if confirmed in your version, enabling subagent caching (or routing fan-out through cached paths) recovers the full uncached input of every subagent call, potentially the largest single fleet leak.
- Evidence tier: T3 (one community-measured session; issue open, not Anthropic-confirmed). Verify in your own version before acting — Volume I's own measurement showed subagents writing cache, so this is version/path-specific.
- Quality risk: NEUTRAL (caching is transparent).
- Availability: depends on SDK/CLI version; audit via JSONL (`cache_read` / `cache_creation` on subagent calls).
- Effort to adopt: minutes to audit; the fix is upstream.
- Composability: interacts with Volume I 13 tech 4 (subagent fan-out economics assume caching); if caching is off, fan-out is far more expensive than Volume I modeled.
- Validation protocol: from subagent JSONL, check whether `cache_read_input_tokens` is ever >0; if always 0 with a >1,024-token repeated prefix, caching is disabled.
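The audit reduces to a single pass over the session log. The sketch below assumes a simplified, hypothetical record shape: one JSON object per line with a boolean `is_subagent` and a `usage` dict holding the API's token counters. Real session logs may nest or name these fields differently, so the two lookups would need adapting before use.

```python
import json


def audit_subagent_cache(jsonl_lines, min_prefix: int = 1024) -> dict:
    """Summarise cache activity on subagent calls large enough to be cacheable."""
    calls = calls_with_read = uncached_input = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        usage = record.get("usage")
        if not usage or not record.get("is_subagent"):  # hypothetical field names
            continue
        read = usage.get("cache_read_input_tokens", 0)
        created = usage.get("cache_creation_input_tokens", 0)
        plain = usage.get("input_tokens", 0)
        if read + created + plain < min_prefix:
            continue  # below the cacheable minimum, so a miss proves nothing
        calls += 1
        if read > 0:
            calls_with_read += 1
        else:
            uncached_input += plain
    return {
        "calls": calls,
        "calls_with_cache_read": calls_with_read,
        "uncached_input_tokens": uncached_input,
        "caching_looks_disabled": calls > 0 and calls_with_read == 0,
    }


sample = [
    json.dumps({"is_subagent": True, "usage": {"input_tokens": 7013}}),
    json.dumps({"is_subagent": True, "usage": {"input_tokens": 7013}}),
    json.dumps({"is_subagent": False, "usage": {"input_tokens": 500}}),
]
print(audit_subagent_cache(sample))
```

A `caching_looks_disabled` result is a prompt to verify in your own version, in line with the T3 evidence tier, and is not proof on its own.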
F5. Multi-tenant cache architecture — org is the wall, workspace is the sharing unit
Where you draw the org/workspace boundary decides whether tenants share or duplicate the system prefix.
- Coverage-delta: New. Volume I is single-operator; org/workspace multi-tenant cache boundaries are not analyzed.
- Layer: infra / tenancy.
- Mechanism: caches never cross orgs (hard isolation, a security/privacy guarantee). Within an org, caches isolate per workspace on API/AWS-Platform/Foundry. So: tenant-per-org = maximal isolation, zero shared-prefix benefit (each tenant re-writes the system prefix every TTL); tenants placed in one shared workspace within one org = the common system prefix pools across tenants while tenant data stays in the message body (uncached or separately keyed).
- Expected savings: for M tenants sharing a 30k system prefix in one org: M writes → 1 write + (M−1) reads per TTL, but only if the tenants share a workspace rather than being split across orgs. Weigh against the isolation requirement.
- Evidence tier: T1 (org/workspace isolation docs).
- Quality risk: NEUTRAL for cost; the security tradeoff (shared-org cache vs per-org isolation) is the real decision — keep tenants that must not share a cache in separate orgs.
- Availability: SDK / GATEWAY (tenancy = credential/workspace design).
- Effort to adopt: project (tenancy architecture).
- Composability: sets the boundary within which F1–F3 operate.
- Validation protocol: confirm cross-tenant cache behavior matches intent (a second tenant in the same workspace reads the shared prefix; a tenant in a separate org does not).
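The boundary rule can be modeled as a scope key: tenants whose keys match pool one cache entry, and every distinct key pays its own write. This is a sketch under the scoping rules stated in this file (workspace + org + model + exact prefix, with workspace dropped on the org-scoped platforms); the tenant dict shape and the function names are hypothetical.

```python
def cache_scope(tenant: dict, workspace_isolated: bool = True) -> tuple:
    """Scope key for one tenant; set workspace_isolated=False for org-scoped platforms."""
    key = [tenant["org"], tenant["model"], tenant["prefix"]]
    if workspace_isolated:
        key.append(tenant["workspace"])
    return tuple(key)


def writes_and_reads(tenants: list[dict], workspace_isolated: bool = True) -> tuple[int, int]:
    """Per TTL window: one write per distinct scope, reads for everyone else."""
    scopes = {cache_scope(t, workspace_isolated) for t in tenants}
    writes = len(scopes)
    return writes, len(tenants) - writes


PREFIX = "shared system prompt"
shared_ws = [{"org": "acme", "workspace": "prod", "model": "opus", "prefix": PREFIX}
             for _ in range(4)]
per_org = [{"org": f"tenant{i}", "workspace": "prod", "model": "opus", "prefix": PREFIX}
           for i in range(4)]
per_ws = [{"org": "acme", "workspace": f"tenant{i}", "model": "opus", "prefix": PREFIX}
          for i in range(4)]

print(writes_and_reads(shared_ws))                          # (1, 3)
print(writes_and_reads(per_org))                            # (4, 0)
print(writes_and_reads(per_ws))                             # (4, 0)
print(writes_and_reads(per_ws, workspace_isolated=False))   # (1, 3)
```

The last two lines show the platform dependence: one workspace per tenant isolates the cache where workspaces are the scope, and pools it where only the org is.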
F6. jackin'-baked fleet cache policy — centralize what every container must inherit
The launcher is the one place to enforce workspace pinning, the dynamic-section flag, pre-warm, version floor, and SDK-credit placement so the whole fleet inherits cache sharing without per-user discipline.
- Coverage-delta: Volume I K16 (20) proposes baking the optimization pack into jackin'; the specific hosted-fleet items (workspace auth, excludeDynamicSections, cold-start pre-warm, SDK-version floor, off-cap credit placement) are new and concrete.
- Layer: infrastructure (all fleet layers).
- Mechanism: jackin's launch env assembly is the chokepoint: pin every container to one workspace (F1), set `--exclude-dynamic-system-prompt-sections` / the SDK flag (F2), warm the shared prefix once before fanning out the wave (F3), enforce the SDK version floor (avoid F2's cohort split / F4), and place non-interactive fleet work on the separate API-rate SDK credit so it does not starve the operator's interactive cap (41).
- Expected savings: the de-duplicated fleet sum: one cached prefix instead of N; F3's ~10× cold-start cut; cap-protection from credit placement, delivered as infrastructure, not discipline.
- Evidence tier: T1 components (each lever above) + T4 for the integrated fleet total (no published integrated fleet number).
- Quality risk: NEUTRAL-to-NEGATIVE-COST (defaults encode the safe variants; F2's mild authority tradeoff is the only quality note).
- Availability: BUILDABLE (jackin' roadmap; insertion points mapped in Volume I 20 K16 / 32).
- Effort to adopt: high once, amortized across every launch.
- Composability: the composition layer for F1–F5; must respect tool-list stability (any tool-def change busts the whole fleet's prefix).
- Validation protocol: per-fleet before/after: cache-read ratio on container 2..N's first call, cold-start write total, and (subscription) interactive cap-% with fleet work on the SDK credit vs on the seat.
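The launch sequencing a launcher would own can be sketched with `asyncio`: finish one warm-up request, then release the rest of the wave together. This is an illustration, not jackin' code. `launch_wave` and the simulated `fake_send` are hypothetical, and the sketch waits for the whole warm-up response, a conservative stand-in for awaiting only the first streamed token.

```python
import asyncio


async def launch_wave(send_first_request, containers):
    """Warm the shared prefix with one container, then start the rest together."""
    if not containers:
        return []
    first, rest = containers[0], containers[1:]
    warm = await send_first_request(first)  # cache entry exists once this returns
    others = await asyncio.gather(*(send_first_request(c) for c in rest))
    return [warm, *others]


async def demo():
    cache_ready = False

    async def fake_send(name):
        # Simulated server: the entry is readable only after a response has started.
        nonlocal cache_ready
        if cache_ready:
            return (name, "read")
        await asyncio.sleep(0.01)
        cache_ready = True
        return (name, "write")

    names = [f"c{i}" for i in range(5)]
    naive = await asyncio.gather(*(fake_send(n) for n in names))  # all launch cold
    cache_ready = False                                           # reset to cold
    staged = await launch_wave(fake_send, names)
    return naive, staged


naive, staged = asyncio.run(demo())
print(sum(kind == "write" for _, kind in naive))   # 5
print(sum(kind == "write" for _, kind in staged))  # 1
```

In the simulation the naive wave pays five writes because every request is issued before any response has started; the staged wave pays one write and four reads at the cost of one extra round trip before fan-out.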
Surprising findings
- The hosted server cache is more shareable than Volume I implied: no machine/worktree key means a fleet across many containers and directories can read one cached prefix — the only real barriers are workspace/org and a single varying dynamic field.
- The thing that breaks fleet sharing is tiny (~111 tokens) but its blast radius is the whole ~28–34k prefix — a textbook case of cache economics being about position, not size.
- A fan-out fleet's worst cache outcome may be the opposite of a sharing failure: subagents that cache nothing (#29966). The fix is not better sharing but turning caching on.
- On a subscription, the fleet's binding constraint is the single pooled cap (41), not dollars — so the most important fleet lever may be placing non-interactive containers on the off-cap SDK credit rather than any cache trick.
Verification ledger
| # | Number / claim | Source (access) |
|---|---|---|
| 1 | Server cache workspace-isolated (API/AWS-Platform/Foundry); org-level on Bedrock/Vertex; org wall absolute | platform.claude.com/docs/en/build-with-claude/prompt-caching |
| 2 | No machine/dir/worktree key on server cache (that is Claude Code's local file cache / GitHub #17531) | platform.claude.com prompt-caching; github.com/anthropics/claude-code/issues/17531 |
| 3 | excludeDynamicSections moves 6 fields (cwd, git flag, platform, shell, OS version, auto-memory); TS ≥0.2.98 / Py ≥0.1.58; CLI flag; older clients ignore | code.claude.com/docs/en/agent-sdk/modifying-system-prompts |
| 4 | Dynamic block ≈111 tok (6 fields) / ≈201 with short git status; blast radius = full ~28–34k prefix | local count_tokens on a representative block + Volume I prefix size (02/13) |
| 5 | Concurrency rule (entry readable after first stream begins); N cold = N writes; pre-warm/stagger → 1 write + (N−1) reads | platform.claude.com prompt-caching |
| 6 | Subagent enablePromptCaching=false default; ~378k wasted tok/54 calls/session (open, unfixed) | github.com/anthropics/claude-code/issues/29966 (T3, one session) |
| 7 | Min cacheable prefix Opus 4.8 = 1,024; Fable 5 = 512; Haiku 4.5 = 4,096 (matches Volume I 13) | platform.claude.com prompt-caching |
| 8 | cache_hint/context_hint = sync-only routing hints (named in batch exclusions; no parameter reference) | platform.claude.com/docs/en/build-with-claude/batch-processing |
| 9 | Cache multipliers 0.1× read / 1.25× 5m write / 2× 1h write; Opus 4.8 base $5 | platform.claude.com prompt-caching |
| 10 | Subscription fleet shares one pooled cap; headless/SDK draws separate API-rate credit | support.claude.com (file 41 ledger) |