# Auth overwrite on new tab: root cause, fix, and long-term strategy


**Status**: Partially implemented — new-tab auth overwrite fixed; live cross-session token sync deferred to the [jackin' daemon](/reference/roadmap/jackin-daemon/) program

## Problem [#problem]

Opening a new agent tab inside a running jackin' container can silently revoke authentication for every existing session in that container. The failure presents as a sudden `401 authentication_error` or a `Please run /login` prompt mid-session — in a session that was working moments before, with no network change or token revocation on the host.

The root cause is a code path in `jackin-capsule`'s runtime setup that was designed to copy host credentials into the container on first boot, but ran unconditionally on every new-tab spawn, overwriting credentials that the agent had already refreshed during its session.

## Root cause [#root-cause]

When a new tab is opened — either an agent tab or a shell tab — the host runs `docker exec <container> /jackin/runtime/jackin-capsule new <agent>`. The capsule's `new` dispatch path calls `runtime_setup::run()`, which historically invoked `run_agent_setup()` on every execution, regardless of whether the container was booting for the first time or resuming for a new session.

`run_agent_setup()` dispatches to a per-agent `setup_*()` function — `setup_claude()`, `setup_codex()`, `setup_amp()`, `setup_kimi()`, `setup_opencode()`, `setup_grok()`. Each of these unconditionally copied the provisioned credential snapshot from the `/jackin/<agent>/` bind-mount into the agent's live config directory (`~/.claude/.credentials.json`, `~/.codex/auth.json`, etc.).

The `/jackin/<agent>/` mount is a bind-mount of a file that was written once, at `RoleState::prepare()` time in <RepoFile path="crates/jackin-runtime/src/instance/auth.rs">crates/jackin-runtime/src/instance/auth.rs</RepoFile>, during the initial container launch. That snapshot is frozen at the moment the container was first created. It does not update when the host's credentials rotate, and it does not reflect token refreshes the agent performs inside the container.

Claude Code, Codex, and other agents perform OAuth refresh-token rotations automatically during long sessions. After a rotation, the live in-container credential file (`~/.claude/.credentials.json`) holds the new token; the older snapshot at `/jackin/claude/credentials.json` holds the original, now-invalid token. The moment the operator opens a new tab, `setup_claude()` runs and overwrites the live file with the stale snapshot. Both the new tab and all existing tabs now present the revoked token on their next API call and receive `401`.

The specific sequence that triggers the failure:

1. Container starts; credentials copied from host snapshot → `~/.claude/.credentials.json`.
2. Agent runs for a while; Claude Code automatically refreshes the OAuth token; `~/.claude/.credentials.json` now holds the new token.
3. Operator opens a new tab (agent or shell).
4. `jackin-capsule new claude` runs `setup_claude()` unconditionally.
5. `setup_claude()` copies `/jackin/claude/credentials.json` (stale original snapshot) → `~/.claude/.credentials.json` (overwrites the fresh token).
6. Both tabs now hold the revoked token. Next API call from either tab → `401`.
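
The sequence above can be replayed in miniature with two plain files. This is a sketch under stated assumptions, not the capsule's code: `unconditional_setup` and `replay_overwrite` are hypothetical names, the file names are placeholders, and the token strings stand in for real credential JSON.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Pre-fix shape (hypothetical stand-in for a `setup_*()` function): copy the
/// launch-time snapshot over the live credential file on every invocation.
fn unconditional_setup(snapshot: &Path, live: &Path) -> io::Result<()> {
    // fs::copy replaces the destination's contents if it already exists.
    fs::copy(snapshot, live)?;
    Ok(())
}

/// Replays steps 1-5 inside `dir` and returns what the live credential file
/// holds afterwards.
fn replay_overwrite(dir: &Path) -> io::Result<String> {
    let snapshot = dir.join("snapshot-credentials.json");
    let live = dir.join("live-credentials.json");

    fs::write(&snapshot, "token-original")?; // frozen at container creation
    unconditional_setup(&snapshot, &live)?; // step 1: first boot
    fs::write(&live, "token-refreshed")?; // step 2: agent rotates the token
    unconditional_setup(&snapshot, &live)?; // steps 3-5: new tab re-runs setup
    fs::read_to_string(&live)
}
```

After the second `unconditional_setup` call the live file holds the original token again, which is the state every tab then reads in step 6.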

The failure was intermittent because OAuth refresh rotations are not deterministic — they happen when the token is near expiry, not on a fixed schedule. Short-lived sessions and sessions that never hit the refresh window worked fine, which made the bug hard to reproduce on demand.

The same overwrite pattern existed for every agent: Codex's `auth.json`, Amp's `secrets.json`, Kimi's credential directory, and OpenCode's `auth.json` all followed the same unconditional copy-on-every-exec shape.

## Fix shipped in this PR [#fix-shipped-in-this-pr]

The fix gates every credential copy on whether the requested agent has already been initialized inside the current container. Each agent gets its own marker under `/jackin/state/agent-auth/<agent>.done`.

`runtime_setup::run()` still performs container init once via `/jackin/state/container-init.done`, but `run_agent_setup()` now decides `copy_auth` from the per-agent marker. When `copy_auth` is `false`, the credential copy blocks are skipped entirely. The non-credential setup — `seed_home_dir()` (uses `CopyMode::SkipExisting`, already idempotent), MCP tool registration (`claude mcp add tirith`, `claude mcp add shellfirm`), and provider config writes (Codex MiniMax block, OpenCode JSON) — continues to run on every new-tab invocation because those are either idempotent or depend on stable env vars that do not change between tabs.

The first tab for an agent in a container sees no marker, so `copy_auth` is `true`, the agent gets the same initial auth bootstrap it always received, and setup writes that agent's marker after success. Later tabs for the same agent see the marker, so `copy_auth` is `false` and whatever credentials the agent has on disk are left untouched. A different agent opened later in the same container has its own marker, so its first tab still gets its auth bootstrap.
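
A minimal sketch of the per-agent gate described above. The helper names `marker_path` and `bootstrap_auth` are hypothetical; the real logic lives in `run_agent_setup()` and the `setup_*()` functions and may be shaped differently. The state directory is a parameter (inside the container it would be `/jackin/state`) so the sketch can run against a scratch directory.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Marker path for one agent, e.g. `<state_dir>/agent-auth/claude.done`.
fn marker_path(state_dir: &Path, agent: &str) -> PathBuf {
    state_dir.join("agent-auth").join(format!("{agent}.done"))
}

/// Copies the snapshot into the live credential file only on the agent's
/// first setup in this container. Returns the `copy_auth` decision.
fn bootstrap_auth(
    state_dir: &Path,
    agent: &str,
    snapshot: &Path,
    live: &Path,
) -> io::Result<bool> {
    let marker = marker_path(state_dir, agent);
    let copy_auth = !marker.exists();
    if copy_auth {
        fs::copy(snapshot, live)?;
        // Write the marker only after the copy succeeded, so a failed
        // bootstrap is retried on the next tab.
        if let Some(parent) = marker.parent() {
            fs::create_dir_all(parent)?;
        }
        fs::write(&marker, b"")?;
    }
    Ok(copy_auth)
}
```

Calling `bootstrap_auth` twice for `"claude"` copies once and then leaves the live file alone, while a later first call for `"codex"` still copies because its marker is separate.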

The relevant files changed:

* <RepoFile path="crates/jackin-capsule/src/runtime_setup.rs">crates/jackin-capsule/src/runtime\_setup.rs</RepoFile> — `run()`, `run_agent_setup()`, `setup_claude()`, `setup_codex()`, `setup_amp()`, `setup_kimi()`, `setup_opencode()`, `setup_grok()`.

## What this fix does not solve [#what-this-fix-does-not-solve]

The new-tab overwrite is now closed, but the auth snapshot model has three remaining fragility vectors that this fix does not address.

**1. Container re-launch after host token rotation.** When the operator ejects and reloads a container, `RoleState::prepare()` runs again, reads the host's current credential file, and writes a new snapshot. If the host token is currently valid, the re-launched container works correctly. The risk falls on its siblings: if the host token has rotated (e.g. because another running container triggered an OAuth refresh and the new token landed in the host's credential file), the new container gets the latest host token, but every *other* already-running container is still holding the older token from its own launch-time snapshot. This is the "parallel containers sharing one account" invalidation cascade described in the [live bidirectional auth sync](/reference/roadmap/live-auth-sync/) roadmap item.

**2. In-container token refresh does not propagate back to the host or sibling containers.** When Claude Code inside one container refreshes the OAuth token, the new token lives only in that container's writable layer. The host's `~/.claude/.credentials.json` still holds the pre-refresh value. If the operator opens a second container, or if a sibling container's access token expires, that container will attempt to use its own stale token, which the OAuth server may have already invalidated as a side effect of the first container's rotation.

Vectors 1 and 2 are the same class of problem: stale credentials in one of the participants of a multi-process auth lifecycle. They require a different class of solution: a long-running host process that watches credential files for changes and propagates updates in real time.

**3. The per-agent marker persists across container recreation.** The marker this fix adds lives at `/jackin/state/agent-auth/<agent>.done`, on the `/jackin/state` host bind-mount (`~/.jackin/data/<container>/state/`). That directory survives when a container is removed and recreated under the same name, which is exactly what the restore / `jackin hardline` path does (it reuses the recorded container name and its state directory). So a restore that recreates the container after a host re-login finds the marker already present, skips the credential copy, and the recreated container keeps running the pre-rotation token until the operator launches a genuinely fresh container or the live-sync protocol below removes the snapshot-copy model entirely. This is a deliberate trade — the fix prioritizes the frequent new-tab case over the rarer restore-after-host-rotation case — and `run_agent_setup` emits a warning when it takes the skip path but finds no credential file on disk, so the degraded state is visible in the capsule log instead of presenting as a silent unauthenticated start. The live-sync protocol is the structural fix; until it lands, restore-after-host-rotation is a known limitation.
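
The degraded state that triggers the warning can be written as a small predicate. This is a sketch only: `skip_path_warning` is a hypothetical name, the message text is invented for illustration, and the real check inside `run_agent_setup()` may use a different log format.

```rust
use std::path::Path;

/// Returns a warning message when the skip path is taken (marker present) but
/// the agent has no credential file on disk, and `None` otherwise.
fn skip_path_warning(marker: &Path, live_credentials: &Path, agent: &str) -> Option<String> {
    if marker.exists() && !live_credentials.exists() {
        // Illustrative wording, not the capsule's actual log line.
        Some(format!(
            "{agent}: auth marker present but no credential file on disk; credential copy skipped"
        ))
    } else {
        None
    }
}
```

The predicate stays silent on a normal later tab (marker and credential file both present) and on a first tab (no marker yet), so only the marker-without-credentials combination reaches the log.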

## Long-term solution: live auth sync via the jackin' daemon [#long-term-solution-live-auth-sync-via-the-jackin-daemon]

The correct long-term answer is the [Live bidirectional auth sync](/reference/roadmap/live-auth-sync/) feature, which depends on the [jackin' daemon](/reference/roadmap/jackin-daemon/). The architecture is already designed in those two roadmap items; this section analyzes the specific properties that make it the right answer to the auth fragility problem, along with the challenges it introduces.

### What the daemon provides that the fix does not [#what-the-daemon-provides-that-the-fix-does-not]

The daemon is a long-running per-operator-user host process. For auth sync, it does two things the fix cannot: it watches credential files for changes (via inotify on Linux, polling with optional Keychain callbacks on macOS) and it propagates those changes into running containers without requiring a restart. The propagation path uses a flock-protected shared store at `~/.jackin/auth-shared/<axis>/`, bind-mounted into containers so both sides see the same file; the in-container watcher (`jackin-auth-watcher`, a small static binary baked into the construct image) performs the symmetric in-container → shared-store push when an agent refreshes.
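
For the polling variant of the watcher, one polling step might look like the sketch below. It assumes change detection by modification time alone, and `poll_for_change` is a hypothetical name; the daemon design is not final, and an inotify-based watcher on Linux would not need this loop at all.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::SystemTime;

/// One polling step: returns the file's modification time when it differs
/// from the last one observed, or `None` when nothing changed.
fn poll_for_change(
    path: &Path,
    last_seen: Option<SystemTime>,
) -> io::Result<Option<SystemTime>> {
    let modified = fs::metadata(path)?.modified()?;
    if last_seen == Some(modified) {
        Ok(None)
    } else {
        Ok(Some(modified))
    }
}
```

A caller would keep the returned timestamp and pass it back on the next tick. Modification times have limited resolution on some filesystems, so a real watcher would likely also compare content before propagating.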

The result is that token rotation — whether initiated by the host or by an agent inside a container — becomes visible to all participants within seconds, without tearing down any session. The `401` class of failure becomes structurally impossible for operators who have enabled live sync mode.

### Challenges and open questions [#challenges-and-open-questions]

**OAuth single-grant safety.** OAuth refresh-token rotation is not a broadcast operation — only the party that holds the current refresh token can issue a new one. If two containers both hold a stale access token and both attempt a refresh simultaneously, one will succeed and the other will receive a `400 invalid_grant`. The daemon's last-writer-wins conflict resolution (by `mtime + checksum`) handles the propagation race after one party wins, but it does not prevent both parties from attempting the refresh before either sees the other's result. The live-auth-sync roadmap item notes this concern explicitly and defers the protocol-level answer to the design pass that picks the daemon shape: the most plausible mitigation is a per-axis advisory lock in the shared store that a container acquires before initiating a refresh and releases after the new token is written, so siblings wait rather than race.
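
A sketch of what last-writer-wins by `mtime + checksum` could look like, to make its limits concrete. Everything here is an assumption: the `Version` and `Resolution` types, the `resolve` function, the tie-break in favor of the local copy, and the use of the standard library's `DefaultHasher` as a stand-in for whatever checksum the daemon ends up using.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::time::SystemTime;

/// What each side knows about its copy of a credential file.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Version {
    mtime: SystemTime,
    checksum: u64,
}

/// Stand-in checksum; not cryptographic and not stable across Rust releases.
fn checksum(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

#[derive(Debug, PartialEq, Eq)]
enum Resolution {
    InSync,
    TakeLocal,
    TakeRemote,
}

/// Last-writer-wins: identical content needs no action; otherwise the copy
/// with the newer mtime wins. Ties keep the local copy (an arbitrary choice).
fn resolve(local: &Version, remote: &Version) -> Resolution {
    if local.checksum == remote.checksum {
        Resolution::InSync
    } else if remote.mtime > local.mtime {
        Resolution::TakeRemote
    } else {
        Resolution::TakeLocal
    }
}
```

Note that `resolve` only decides which already-written token to keep. It runs after a refresh has completed, so it cannot stop two containers from spending the same refresh token concurrently, which is why the advisory lock is proposed as a separate mechanism.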

**macOS Keychain write path.** On macOS, Claude Code and GitHub CLI store OAuth tokens in the system Keychain, not in a plaintext file. The daemon's host-side watcher must read from the Keychain (`security find-generic-password`) and write back to it (`security add-generic-password -U`) when a container-initiated rotation produces a new token. The read path is straightforward. The write path requires the daemon process to have Keychain access, which introduces a macOS-specific entitlements question that the daemon design pass must answer before implementation starts.

**Token visibility in the shared store.** The flock-protected shared store at `~/.jackin/auth-shared/<axis>/` is a plaintext file on disk. Any process running as the operator user can read it. This is not a widened attack surface compared to today — the credential files in `~/.claude/`, `~/.codex/`, etc. are already readable by the same user — but it is a centralization that creates a single path worth hardening. The [container credential exposure](/reference/roadmap/container-credential-exposure/) roadmap item is the longer-term answer; the shared store is an acceptable intermediate step.

**Daemon lifecycle.** The daemon must survive across container restarts and operator sessions. On macOS this means a LaunchAgent plist; on Linux a systemd user unit. The first-launch auto-install UX, version-skew handling between the CLI and a running daemon, and crash/restart policy are all open design questions captured in the daemon roadmap item. The live-sync adapter should not be designed until Phase 1 of the daemon ships, so the adapter can plug into the already-decided lifecycle rather than inventing its own.

**Backward compatibility with `sync` mode.** Today's `sync` mode (snapshot-at-launch) and the proposed `live` mode share the same semantic label space. The live-auth-sync roadmap item proposes renaming `sync` to `forward` or `snapshot` when live sync ships, so the word `sync` can be reclaimed for continuous bidirectional sync. That rename is a schema-version bump across `config.toml`, workspace files, and role manifests — all three versioned schemas will need migration steps. The rename should not happen before live sync ships (churn without payoff), and it must not be split across multiple PRs.

### Implementation phasing connecting this fix to the long-term answer [#implementation-phasing-connecting-this-fix-to-the-long-term-answer]

1. **This PR** — new-tab overwrite closed per agent. Auth on disk for each initialized agent inside a container is now stable for the container's lifetime unless the operator explicitly re-launches.
2. **[jackin' daemon Phase 1](/reference/roadmap/jackin-daemon/)** — lifecycle, install, control socket, log redaction. No watchers yet.
3. **[jackin' daemon Phase 4 — live auth sync](/reference/roadmap/live-auth-sync/)** — per-axis watcher adapters (`gh`, Claude, Codex, Amp, Kimi, OpenCode). Shared store bind-mount replaces per-container provisioned-snapshot mount for containers in live mode. In-container `jackin-auth-watcher` binary added to the construct image.
4. **`sync` → `forward`/`snapshot` rename** — schema version bump in the same PR as Phase 4.
5. **Full resolution** — operators on `live` mode see no auth-drift failures regardless of how many parallel containers are running or how often OAuth rotates.

## Related work [#related-work]

* [Auth reliability and convenience program](/reference/roadmap/auth-reliability-program/) — umbrella that places this fix as Phase 0 in the complete auth reliability sequence; read it for how this item connects to live sync, health visibility, and multi-company isolation.
* [jackin' daemon](/reference/roadmap/jackin-daemon/) — the umbrella long-running-process item. Lifecycle, install, control socket, and security posture are decided there. Live auth sync is a Phase 4 adapter against the daemon's plug-in surface.
* [Live bidirectional auth sync](/reference/roadmap/live-auth-sync/) — detailed architecture for the shared store, per-axis adapters, in-container watcher, and conflict resolution. The long-term answer to the auth fragility problem this PR partially addresses.
* [Reliable Claude authentication strategy](/reference/roadmap/claude-auth-strategy/) — design history for the `sync` / `token` / `ignore` mode set; the concurrent-session token-drift concerns documented there are what this PR's root-cause fix narrows.
* [Container credential exposure](/reference/roadmap/container-credential-exposure/) — threat model for tokens in containers. The shared store introduced by live sync is the next hardening target in that trajectory.
* <RepoFile path="crates/jackin-capsule/src/runtime_setup.rs">crates/jackin-capsule/src/runtime\_setup.rs</RepoFile> — the fixed file; `run()`, `run_agent_setup()`, and `setup_*()` are the changed functions.
* <RepoFile path="crates/jackin-runtime/src/instance/auth.rs">crates/jackin-runtime/src/instance/auth.rs</RepoFile> — `provision_*_auth()` functions that write the launch-time credential snapshot. The snapshot model is unchanged by this PR; it remains the input to first-boot auth bootstrap.
* <RepoFile path="crates/jackin-runtime/src/runtime/launch.rs">crates/jackin-runtime/src/runtime/launch.rs</RepoFile> — `agent_mounts()` that bind-mounts the provisioned snapshot into the container.