# 44 — Fleet, team & multi-tenant cache economics (hosted) (https://jackin.tailrocks.com/research/token-optimization/44-fleet-and-multitenant-cache/)



# 44 — Fleet, team & multi-tenant cache economics (hosted) [#44--fleet-team--multi-tenant-cache-economics-hosted]

Volume II area file for blind spot 4. Volume I covered the
**self-hosted** fleet tier fully (`19:116-187`, HiCache/LMCache across replicas) and the
excludeDynamicSections *mechanism* as one technique (`13` tech 7), but left the **hosted** (Claude
API / subscription) cross-container story thin and explicitly flagged the dynamic-section size as
unmeasured (`13` Gaps #6). This file builds the hosted-fleet model: when N containers share one
cached system prefix versus each paying its own, what a launcher (jackin') centralizes, and the
fleet×quota interaction. This is the jackin'-relevant gap — jackin' launches container fleets.

**TL;DR**

* **The server prompt cache is workspace-scoped, not machine-scoped.** N
  containers share **one** cached system prefix if and only if they run under the **same workspace +
  same org + same model** and send a **byte-identical prefix**. There is no machine, directory,
  container, or worktree key on the *server* cache — the "machine+directory / git-snapshot / worktree"
  rules Volume I cited (`13` tech 7) describe Claude Code's **local file cache** (GitHub #17531), a
  different layer.
* **A \~111-token dynamic block silently un-shares a \~28–34k-token prefix.** The Agent SDK preset puts
  six per-container fields (working dir, git-repo flag, platform, shell, OS version, auto-memory
  paths) **ahead** of everything else; measured at \~111 tokens (≈201 with a short git status, local
  `count_tokens`). Because cache is an exact-prefix match, those \~111 varying tokens
  invalidate the entire downstream system prefix, forcing each container to write \~28–34k tokens at
  1.25–2× instead of reading at 0.1×. `excludeDynamicSections` moves them into the first user message
  so the fleet shares one entry — the **primary hosted-fleet lever**, partially closing Volume I's
  unmeasured-size gap.
* **Fleet cold-start: N simultaneous launches on a cold prefix pay N writes, not 1 write + (N−1)
  reads.** A cache entry is readable only after the first response *begins streaming*; pre-warm once
  (`max_tokens:0`) or stagger the first request, then fan out — converting (N−1) × 1.25–2× writes into
  (N−1) × 0.1× reads. For a 30k prefix over a 25-container wave on Opus 4.8: \~$0.94 (staggered) vs
  \~$9.4 (naive) on the write line (ESTIMATE).
* **Two fleet caching traps, both T3-fresh:** (1) Agent-tool subagents ship with
  `enablePromptCaching` hardcoded **false** (GitHub #29966; \~378k wasted uncached tokens in one
  measured session) — fan-out fleets may miss caching entirely, a *no-caching* problem distinct from a
  *sharing* problem; (2) a **mixed-version fleet** (some SDKs \< v0.2.98 TS / v0.1.58 Py) silently
  splits into cached and uncached cohorts because older clients ignore the flag.
* **Multi-tenant boundary: org isolation is absolute; workspace is the sharing unit.** Tenant-per-org
  → zero cross-tenant prefix sharing (each writes the prefix once per TTL). Tenant-as-workspace within
  one org → the shared system prefix pools. And on a **subscription**, the whole fleet shares **one
  pooled cap** (file 41); a headless/SDK fleet can draw the separate API-rate credit, off
  the interactive cap — the cleanest way to keep a fleet from starving an operator's interactive seat.

Cache multipliers and the $22/day profile are inherited from Volume I; fleet dollar figures are
ESTIMATE with arithmetic.
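
The arithmetic behind the fleet estimates can be sketched as follows. This is a minimal sketch, not a billing tool: `fleet_prefix_cost` and its parameter names are illustrative, the multipliers are the Volume I values (0.1× read, 1.25× 5m write, 2× 1h write), and the result is expressed in base-input-token equivalents so that any per-token price can be applied afterward.

```python
READ_MULT = 0.1        # cache read, relative to the base input price
WRITE_MULT_5M = 1.25   # 5-minute-TTL cache write
WRITE_MULT_1H = 2.0    # 1-hour-TTL cache write


def fleet_prefix_cost(n: int, prefix_tokens: int, write_mult: float,
                      shared: bool) -> float:
    """Prefix cost of the first call of each of n containers, in
    base-input-token equivalents (multiply by your per-token price)."""
    if n <= 0:
        return 0.0
    if not shared:
        # Every container writes its own copy of the prefix.
        return n * write_mult * prefix_tokens
    # One container writes; the remaining n - 1 read the shared entry.
    return write_mult * prefix_tokens + (n - 1) * READ_MULT * prefix_tokens


naive = fleet_prefix_cost(25, 30_000, WRITE_MULT_1H, shared=False)
warm = fleet_prefix_cost(25, 30_000, WRITE_MULT_1H, shared=True)
print(f"naive={naive:,.0f} shared={warm:,.0f} ratio={naive / warm:.1f}x")
```

For N=25 the naive-to-shared ratio works out to about 8.6× at the 5m write rate and about 11.4× at the 1h write rate, which brackets the roughly 10× cut quoted for fleet cold-start.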

***

## The cache layering Volume I conflated (clarification + applied correction) [#the-cache-layering-volume-i-conflated-clarification--applied-correction]

There are three distinct caches, and a fleet's economics depend on the *server* one:

| Layer                                    | Scope key                                                              | Where described                                | Fleet relevance                     |
| ---------------------------------------- | ---------------------------------------------------------------------- | ---------------------------------------------- | ----------------------------------- |
| **Server prompt cache** (the 0.1× reads) | **workspace + org + model + exact prefix**                             | platform.claude.com prompt-caching             | This is what N containers can share |
| Claude Code **local file cache**         | machine + directory; git-snapshot; worktrees never share               | code.claude.com prompt-caching / GitHub #17531 | Local read reuse, not the API cache |
| **Subagent** caching                     | per-spawn; `enablePromptCaching` default false for Agent-tool (#29966) | GitHub #29966                                  | May be off entirely in fan-out      |

The original Volume I text (`13` tech 7 + surprising-findings: "your git state is in the cache key… worktrees never
share") attributed git-snapshot/worktree/machine+directory keys to the server cache scope, citing the
prompt-caching docs. The hosted **server** cache has no such keys: it is workspace-scoped, and the
git/worktree rules belong to Claude Code's **local file cache**. The two are easy to merge because
Claude Code's *observed* reuse blends both layers. This correction has now been applied to `13`.
The practical upshot is favorable: hosted
fleets *can* share across machines and directories, which Volume I's framing implied they could not.

## When N containers share one prefix (the rule, sourced) [#when-n-containers-share-one-prefix-the-rule-sourced]

Server cache hit requires (platform.claude.com/docs/en/build-with-claude/prompt-caching):

1. **Same workspace** (Claude API / Claude Platform on AWS / Microsoft Foundry isolate per workspace;
   Bedrock/Vertex isolate per org only, a *wider* sharing boundary there).
2. **Same org** (absolute wall between orgs — byte-identical prompts never share across orgs).
3. **Same model** (and same effort/fast-mode — Volume I's cache-key facts hold).
4. **Byte-identical prefix up to the `cache_control` block**: a differing byte late in the prefix
   costs only from the divergence point onward, while a differing byte at the very start (the
   dynamic block) costs the entire prefix.
5. **Prefix ≥ the model minimum** (Opus 4.8 = 1,024; Fable 5 = 512; Haiku 4.5 = 4,096 — live values
   match Volume I 13 tech 11; sub-minimum prefixes silently cache nothing).

Given those, the fleet economics are: **the first container writes the prefix (1.25–2×); every other
container that meets the rule reads it (0.1×).** The entire game is making the prefix byte-identical
across the fleet — which is exactly what the six dynamic fields prevent by default.
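
The sharing rule above can be modeled locally to predict which containers should end up reading one entry. This is a hedged sketch: `FirstCall`, `share_key`, and `share_groups` are hypothetical names, the SHA-256 digest is only a stand-in for "byte-identical prefix" (the server's real key format is not documented here), and the minimum-prefix rule is not modeled.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FirstCall:
    container: str
    org: str
    workspace: str
    model: str
    prefix: bytes  # everything up to the cache_control breakpoint


def share_key(call: FirstCall) -> tuple[str, str, str, str]:
    """Local model of the sharing rule: org + workspace + model + exact prefix."""
    digest = hashlib.sha256(call.prefix).hexdigest()
    return (call.org, call.workspace, call.model, digest)


def share_groups(calls: list[FirstCall]) -> dict[tuple, list[str]]:
    """Group containers that would be expected to share one cached prefix."""
    groups: dict[tuple, list[str]] = defaultdict(list)
    for call in calls:
        groups[share_key(call)].append(call.container)
    return dict(groups)


system = b"<shared system prompt and tool definitions>"
calls = [
    FirstCall("c1", "org-a", "ws-prod", "model-x", system),
    FirstCall("c2", "org-a", "ws-prod", "model-x", system),
    # A per-container line ahead of the shared text changes the digest.
    FirstCall("c3", "org-a", "ws-prod", "model-x", b"cwd: /work/c3\n" + system),
]
groups = share_groups(calls)
for members in groups.values():
    print(members)
```

Each group with more than one member corresponds to one write plus reads; every singleton group is a container paying its own write.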

***

## Techniques [#techniques]

### F1. Workspace-pinned fleet cache sharing — auth the whole fleet to one workspace [#f1-workspace-pinned-fleet-cache-sharing--auth-the-whole-fleet-to-one-workspace]

The precondition for any cross-container reuse: same workspace + org + model, or the shared prefix
never pools.

* **Coverage-delta:** New. Volume I 13 tech 7 covers the SDK flag; the **workspace-scope rule** (the
  isolation change) as the gating condition for hosted fleet sharing is not in Volume I.
* **Layer:** infra / fleet architecture.
* **Mechanism:** server caches are isolated per workspace (API/AWS-Platform/Foundry) or per org
  (Bedrock/Vertex). A fleet whose containers authenticate to different workspaces gets zero
  cross-container prefix reuse; pinning all production containers to one workspace (and one model)
  lets the shared system prefix cache once and be read N−1 times.
* **Expected savings:** turns each extra container's \~28–34k system prefix from a 1.25–2× write into a
  0.1× read. For a 25-container fleet sharing a 30k prefix on Opus 4.8: write line \~$9.4 (all cold) →
  \~$0.94 (1 write + 24 reads) per TTL window (ESTIMATE on Volume I multipliers).
* **Evidence tier:** T1 (workspace-isolation docs); ESTIMATE for fleet dollars (no
  published fleet figures).
* **Quality risk:** **NEUTRAL** (pure cache routing). Multi-tenant caveat: do not co-workspace tenants
  that must not share a cache (org wall is the only hard isolation).
* **Availability:** SDK / GATEWAY (workspace = an API-key/credential property).
* **Effort to adopt:** hours (fleet auth config).
* **Composability:** precondition for F2/F3; pairs with the jackin' launcher (F6).
* **Validation protocol:** launch two containers in different dirs under one workspace with identical
  prefixes; assert container 2's first call shows `cache_read>0` on the system segment.

### F2. excludeDynamicSections — make the prefix byte-identical across containers [#f2-excludedynamicsections--make-the-prefix-byte-identical-across-containers]

Six per-container fields sit ahead of the prefix and bust it; move them into the first user message.

* **Coverage-delta:** Volume I 13 tech 7 names the flag; **the enumerated six fields, the measured
  \~111-token size, the version gating, and the mixed-cohort trap** are new (and partially close
  Volume I Gaps #6).
* **Layer:** input / system-prompt structure.
* **Mechanism:** the `claude_code` preset embeds working directory, git-repo flag, platform, shell, OS
  version, and auto-memory paths (\~111 tokens measured; \~201 with a short git status) ahead of any
  append text. `excludeDynamicSections:true` (TS SDK ≥ v0.2.98) / `exclude_dynamic_sections:True`
  (Python ≥ v0.1.58) / CLI `--exclude-dynamic-system-prompt-sections` relocates them into the first
  user message so the system prefix is byte-identical fleet-wide. Older clients silently ignore it.
* **Expected savings:** the blast radius, not the 111 tokens: each differing dynamic block forces a
  full \~28–34k-token prefix rewrite; sharing it saves (N−1) × prefix × (1.25–2.0 − 0.1) cap-weighted
  or dollar-weighted tokens per TTL.
* **Evidence tier:** T1 (SDK docs + enumerated fields) + local measurement of a
  representative block (ESTIMATE for the exact SDK-emitted size, which needs the SDK to measure).
* **Quality risk:** **QUALITY-TRADE (mild, documented):** the six fields "carry marginally less
  weight" in a user message; no quantified accuracy delta published. Enable when fleet reuse beats
  maximally-authoritative env context.
* **Availability:** SDK / CLI flag.
* **Effort to adopt:** hours (flag + byte-aligning model/effort/tool-set/append text across the fleet).
* **Composability:** requires F1 (same workspace); pin the SDK version at or above the floor to avoid the mixed-version cohort split.
* **Validation protocol:** `count_tokens` the preset system prompt with the flag on vs off to measure
  *your* dynamic-section size; then launch two relocated-context containers and confirm shared
  `cache_read`; run an env-sensitive task ("which directory are you in?") to confirm correctness.
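
Why position matters more than size can be shown with a toy comparison of two containers' requests. This is an illustrative sketch only: it compares raw bytes rather than tokens, `STATIC` and `dynamic` are made-up stand-ins for the preset text and the per-container block, and it does not call the SDK.

```python
def common_prefix_len(a: bytes, b: bytes) -> int:
    """Length of the longest shared leading run of bytes."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


STATIC = b"You are a coding agent. " * 200  # stand-in for the shared system text


def dynamic(cwd: str) -> bytes:
    """Stand-in for the per-container dynamic block."""
    return f"cwd: {cwd}\n".encode()


# Default layout: the dynamic block sits ahead of the shared text.
default_a = dynamic("/work/a") + STATIC
default_b = dynamic("/work/b") + STATIC

# Relocated layout: shared text first, dynamic block after it
# (modeling the move into the first user message).
reloc_a = STATIC + dynamic("/work/a")
reloc_b = STATIC + dynamic("/work/b")

print("default  :", common_prefix_len(default_a, default_b), "bytes shared")
print("relocated:", common_prefix_len(reloc_a, reloc_b), "bytes shared")
```

In the default layout the two requests diverge within the first few bytes, so nothing downstream can match; in the relocated layout the whole shared text matches before the per-container bytes begin.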

### F3. Fleet cold-start — pre-warm once, then staggered fan-out [#f3-fleet-cold-start--pre-warm-once-then-staggered-fan-out]

A simultaneous N-container launch on a cold prefix pays N writes; one warm-up converts that to 1
write + (N−1) reads.

* **Coverage-delta:** Volume I 13 tech 8 covers the concurrency rule for subagent waves; applying it
  to a **container fleet cold-start** with the pre-warm-then-fan-out pattern is new framing.
* **Layer:** cache / launch orchestration.
* **Mechanism:** a cache entry is readable only after the first response begins streaming; N parallel
  cold requests all pay the write. Fire one warm-up (`max_tokens:0`, which writes the cache and bills
  zero output) or one real request, await its first token, then launch the remaining N−1 — they read
  at 0.1×.
* **Expected savings:** for N containers on prefix P: naive N × (1.25–2.0)P vs 1 × (1.25–2.0)P +
  (N−1) × 0.1P. At N=25, P=30k, Opus 4.8 1h-TTL: \~$15 → \~$1.5 on the write line (ESTIMATE) — a \~10×
  cut on fleet cold-start input.
* **Evidence tier:** T1 (concurrency + `max_tokens:0` docs); ESTIMATE for fleet dollars.
* **Quality risk:** **NEUTRAL** (adds one TTFT of latency to the wave start — file 43).
* **Availability:** SDK (orchestrator awaits first token before fan-out).
* **Effort to adopt:** hours (launch sequencing in the orchestrator).
* **Composability:** essential with F1/F2; pairs with 1h TTL for bursty fleets (writes survive the
  wave); the jackin' launcher is the natural home (F6).
* **Validation protocol:** launch a wave simultaneously vs warm-then-fan-out; diff first-call
  `cache_read` (≈0 vs ≈prefix) and total `cache_creation`.
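
The launch sequencing can be sketched with `asyncio`. This is a minimal sketch under stated assumptions: `start` is a hypothetical hook that resolves once a container's first response has begun streaming, and `make_fake_start` is a toy cache model used only to make the example runnable, not a real API client.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical hook: resolves once container i's first response has begun streaming.
StartFn = Callable[[int], Awaitable[str]]


async def staggered_launch(n: int, start: StartFn) -> list[str]:
    """Await the first container's first token, then fan out the rest."""
    if n <= 0:
        return []
    first = await start(0)  # the single cold write
    rest = await asyncio.gather(*(start(i) for i in range(1, n)))
    return [first, *rest]


async def naive_launch(n: int, start: StartFn) -> list[str]:
    """Start every container at once on a cold prefix."""
    return list(await asyncio.gather(*(start(i) for i in range(n))))


def make_fake_start() -> StartFn:
    """Toy model: the entry becomes readable only after a first response starts."""
    state = {"warm": False}

    async def start(i: int) -> str:
        outcome = "read" if state["warm"] else "write"
        await asyncio.sleep(0.01)  # simulated time to first token
        state["warm"] = True
        return outcome

    return start


naive = asyncio.run(naive_launch(5, make_fake_start()))
staggered = asyncio.run(staggered_launch(5, make_fake_start()))
print("naive writes:", naive.count("write"))
print("staggered writes:", staggered.count("write"))
```

In the toy model the naive wave records five writes and the staggered wave records one, mirroring the N writes versus 1 write + (N−1) reads pattern; a real orchestrator would replace `make_fake_start` with its own launch call.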

### F4. Audit subagent caching in fan-out fleets — it may be off by default [#f4-audit-subagent-caching-in-fan-out-fleets--it-may-be-off-by-default]

A fan-out fleet's biggest cache loss may not be a sharing failure but subagents that cache nothing at all.

* **Coverage-delta:** New (GitHub #29966, post-Volume-I). Volume I measured cavecrew/Claude Code
  subagents *writing* 5m cache (13 tech 2); this reports Agent-tool subagents with caching **off** —
  a possible version/path-dependent conflict, flagged in 49.
* **Layer:** cache / multi-agent.
* **Mechanism:** Agent SDK / Claude Code subagents spawned via the Agent tool reportedly hardcode
  `enablePromptCaching=false` (vs the main REPL defaulting true), so each subagent call pays full
  uncached input — one measured session wasted \~378k tokens across 54 subagent calls (\~7,013 uncached
  each). Issue open, unfixed at access date (Claude Code 2.1.63 / SDK 0.2.63).
* **Expected savings:** if confirmed in your version, enabling subagent caching (or routing fan-out
  through cached paths) recovers the full uncached input of every subagent call — potentially the
  largest single fleet leak.
* **Evidence tier:** T3 (one community-measured session; issue open, not Anthropic-confirmed). **Verify
  in your own version before acting** — Volume I's own measurement showed subagents writing cache, so
  this is version/path-specific.
* **Quality risk:** **NEUTRAL** (caching is transparent).
* **Availability:** depends on SDK/CLI version; audit via JSONL (`cache_read`/`cache_creation` on
  subagent calls).
* **Effort to adopt:** minutes to audit; the fix is upstream.
* **Composability:** interacts with Volume I 13 tech 4 (subagent fan-out economics assume caching);
  if caching is off, fan-out is far more expensive than Volume I modeled.
* **Validation protocol:** from subagent JSONL, check whether `cache_read_input_tokens` is ever >0; if
  always 0 with a >1,024-token repeated prefix, caching is disabled.
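
The audit can be sketched as a small JSONL scan. The record layout is an assumption: this sketch expects each line to be a JSON object with a top-level `usage` dict carrying `input_tokens`, `cache_read_input_tokens`, and `cache_creation_input_tokens`, so adjust the lookup to match your actual transcript format. The "suspect" flag is a heuristic (no reads at all while the mean uncached input exceeds the minimum cacheable prefix), not proof that caching is disabled.

```python
import json
from typing import Iterable


def audit_subagent_caching(lines: Iterable[str], min_prefix: int = 1024) -> dict:
    """Summarize cache usage across subagent calls recorded as JSONL."""
    calls = reads = creations = uncached = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        usage = json.loads(line).get("usage") or {}  # assumed record layout
        calls += 1
        reads += usage.get("cache_read_input_tokens", 0)
        creations += usage.get("cache_creation_input_tokens", 0)
        uncached += usage.get("input_tokens", 0)
    mean_uncached = uncached / calls if calls else 0.0
    return {
        "calls": calls,
        "cache_read": reads,
        "cache_creation": creations,
        "uncached_input": uncached,
        # Heuristic: repeated large inputs with zero reads suggest caching is off.
        "caching_suspect": calls > 1 and reads == 0 and mean_uncached > min_prefix,
    }


sample = [
    '{"usage": {"input_tokens": 7013}}',
    '{"usage": {"input_tokens": 7013}}',
    '{"usage": {"input_tokens": 7013}}',
]
print(audit_subagent_caching(sample))
```

A healthy run would show `cache_read` growing after the first call; a report with `caching_suspect` set is the cue to verify against your own SDK/CLI version before acting.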

### F5. Multi-tenant cache architecture — org is the wall, workspace is the sharing unit [#f5-multi-tenant-cache-architecture--org-is-the-wall-workspace-is-the-sharing-unit]

Where you draw the org/workspace boundary decides whether tenants share or duplicate the system
prefix.

* **Coverage-delta:** New. Volume I is single-operator; org/workspace multi-tenant cache boundaries
  are not analyzed.
* **Layer:** infra / tenancy.
* **Mechanism:** caches never cross orgs (hard isolation, a security/privacy guarantee). Within an org,
  caches isolate per workspace on API/AWS-Platform/Foundry. So: tenant-per-org = maximal isolation,
  zero shared-prefix benefit (each tenant re-writes the system prefix every TTL); tenant-as-workspace
  in one shared org = the common system prefix pools across tenants while tenant data stays in the
  message body (uncached or separately keyed).
* **Expected savings:** for M tenants sharing a 30k system prefix in one org: M writes → 1 write +
  (M−1) reads per TTL — but only if tenancy is modeled as workspaces, not orgs. Weigh against the
  isolation requirement.
* **Evidence tier:** T1 (org/workspace isolation docs).
* **Quality risk:** **NEUTRAL** for cost; the **security** tradeoff (shared-org cache vs per-org
  isolation) is the real decision — keep tenants that must not share a cache in separate orgs.
* **Availability:** SDK / GATEWAY (tenancy = credential/workspace design).
* **Effort to adopt:** project (tenancy architecture).
* **Composability:** sets the boundary within which F1–F3 operate.
* **Validation protocol:** confirm cross-tenant cache behavior matches intent (a second tenant in the
  same workspace reads the shared prefix; a tenant in a separate org does not).

### F6. jackin'-baked fleet cache policy — centralize what every container must inherit [#f6-jackin-baked-fleet-cache-policy--centralize-what-every-container-must-inherit]

The launcher is the one place to enforce workspace pinning, the dynamic-section flag, pre-warm,
version floor, and SDK-credit placement so the whole fleet inherits cache sharing without per-user
discipline.

* **Coverage-delta:** Volume I K16 (`20`) proposes baking the optimization pack into jackin'; the
  **specific hosted-fleet items** (workspace auth, excludeDynamicSections, cold-start pre-warm,
  SDK-version floor, off-cap credit placement) are new and concrete.
* **Layer:** infrastructure (all fleet layers).
* **Mechanism:** jackin's launch env assembly is the chokepoint: pin every container to one workspace
  (F1), set `--exclude-dynamic-system-prompt-sections` / the SDK flag (F2), warm the shared prefix
  once before fanning out the wave (F3), enforce the SDK version floor (avoid F2's cohort split / F4),
  and place non-interactive fleet work on the separate API-rate SDK credit so it
  does not starve the operator's interactive cap (41).
* **Expected savings:** the de-duplicated fleet sum: one cached prefix instead of N; F3's \~10×
  cold-start cut; cap-protection from credit placement — delivered as infrastructure, not discipline.
* **Evidence tier:** T1 components (each lever above) + T4 for the integrated fleet total (no published
  integrated fleet number).
* **Quality risk:** **NEUTRAL-to-NEGATIVE-COST** (defaults encode the safe variants; F2's mild
  authority tradeoff is the only quality note).
* **Availability:** BUILDABLE (jackin' roadmap; insertion points mapped in Volume I 20 K16 / 32).
* **Effort to adopt:** high once, amortized across every launch.
* **Composability:** the composition layer for F1–F5; must respect tool-list stability (any tool-def
  change busts the whole fleet's prefix).
* **Validation protocol:** per-fleet before/after: cache-read ratio on container 2..N's first call,
  cold-start write total, and (subscription) interactive cap-% with fleet work on the SDK credit vs on
  the seat.
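
A launcher-side pre-flight check for these policy items might look like the sketch below. It is not jackin's actual implementation: `ContainerSpec`, `policy_violations`, and the field names are hypothetical, and only the version floors (TS v0.2.98, Python v0.1.58) come from this file. It assumes plain numeric version strings such as `0.2.98` or `v0.2.98`.

```python
from dataclasses import dataclass

# Version floors for the dynamic-section flag, as cited in F2.
SDK_FLOORS = {"typescript": (0, 2, 98), "python": (0, 1, 58)}


def parse_version(text: str) -> tuple[int, ...]:
    """Parse 'v0.2.98' or '0.2.98' into a comparable tuple (numeric parts only)."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


@dataclass(frozen=True)
class ContainerSpec:
    name: str
    workspace: str
    model: str
    sdk: str          # "typescript" or "python"
    sdk_version: str


def policy_violations(fleet: list[ContainerSpec]) -> list[str]:
    """Return human-readable reasons the fleet would fail to share one prefix."""
    problems: list[str] = []
    if len({c.workspace for c in fleet}) > 1:
        problems.append("fleet spans more than one workspace (F1)")
    if len({c.model for c in fleet}) > 1:
        problems.append("fleet uses more than one model (F1)")
    for c in fleet:
        floor = SDK_FLOORS.get(c.sdk)
        if floor is None:
            problems.append(f"{c.name}: unknown SDK '{c.sdk}'")
        elif parse_version(c.sdk_version) < floor:
            problems.append(f"{c.name}: {c.sdk} SDK {c.sdk_version} is below the floor (F2)")
    return problems


fleet = [
    ContainerSpec("c1", "ws-prod", "model-x", "typescript", "0.2.98"),
    ContainerSpec("c2", "ws-prod", "model-x", "python", "v0.1.57"),
]
for problem in policy_violations(fleet):
    print(problem)
```

Running the check before the warm-up request keeps a single old client from silently splitting the fleet into cached and uncached cohorts.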

***

## Surprising findings [#surprising-findings]

* The hosted server cache is **more** shareable than Volume I implied: no machine/worktree key means a
  fleet across many containers and directories *can* read one cached prefix — the only real barriers
  are workspace/org and a single varying dynamic field.
* The thing that breaks fleet sharing is tiny (\~111 tokens) but its blast radius is the whole \~28–34k
  prefix — a textbook case of cache economics being about *position*, not size.
* A fan-out fleet's worst cache outcome may be the opposite of a sharing failure: subagents that cache
  *nothing* (#29966). The fix is not better sharing but turning caching on.
* On a subscription, the fleet's binding constraint is the **single pooled cap** (41), not dollars —
  so the most important fleet lever may be placing non-interactive containers on the off-cap
  SDK credit rather than any cache trick.

## Verification ledger [#verification-ledger]

| #  | Number / claim                                                                                                                                           | Source (access)                                                                    |
| -- | -------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------- |
| 1  | Server cache workspace-isolated (API/AWS-Platform/Foundry); org-level on Bedrock/Vertex; org wall absolute                                               | platform.claude.com/docs/en/build-with-claude/prompt-caching                       |
| 2  | No machine/dir/worktree key on server cache (that is Claude Code's local file cache / GitHub #17531)                                                     | platform.claude.com prompt-caching; github.com/anthropics/claude-code/issues/17531 |
| 3  | excludeDynamicSections moves 6 fields (cwd, git flag, platform, shell, OS version, auto-memory); TS ≥0.2.98 / Py ≥0.1.58; CLI flag; older clients ignore | code.claude.com/docs/en/agent-sdk/modifying-system-prompts                         |
| 4  | Dynamic block ≈111 tok (6 fields) / ≈201 with short git status; blast radius = full \~28–34k prefix                                                      | local count\_tokens on a representative block + Volume I prefix size (02/13)       |
| 5  | Concurrency rule (entry readable after first stream begins); N cold = N writes; pre-warm/stagger → 1 write + (N−1) reads                                 | platform.claude.com prompt-caching                                                 |
| 6  | Subagent enablePromptCaching=false default; \~378k wasted tok/54 calls/session (open, unfixed)                                                           | github.com/anthropics/claude-code/issues/29966 (T3, one session)                   |
| 7  | Min cacheable prefix Opus 4.8 = 1,024; Fable 5 = 512; Haiku 4.5 = 4,096 (matches Volume I 13)                                                            | platform.claude.com prompt-caching                                                 |
| 8  | cache\_hint/context\_hint = sync-only routing hints (named in batch exclusions; no parameter reference)                                                  | platform.claude.com/docs/en/build-with-claude/batch-processing                     |
| 9  | Cache multipliers 0.1× read / 1.25× 5m write / 2× 1h write; Opus 4.8 base $5                                                                             | platform.claude.com prompt-caching                                                 |
| 10 | Subscription fleet shares one pooled cap; headless/SDK draws separate API-rate credit                                                                    | support.claude.com (file 41 ledger)                                                |
