Auth health and operator visibility

Status: Open — design proposal

Problem

Auth failures in jackin' containers are invisible until they happen. The operator's first signal is a 401 or Please run /login prompt mid-session — inside a container that was working moments before, with no warning at launch and no way to audit auth state across running containers without entering each one individually. The underlying causes (expired token, missing credential file, stale sync snapshot, concurrent-session rotation) all look the same from inside the container and none of them surface proactively.

The current auth stack has no health layer. jackin' provisions credentials, launches the container, and then stops tracking. Whether the provisioned credentials are valid, how long they remain valid, and what state they are in across several simultaneously running containers is invisible to the operator.

This roadmap item adds the missing observability layer: pre-launch credential checks that surface problems before the container starts, a jackin auth status command and console indicators that let the operator audit health across all workspaces and running containers, and the foundation for a future recovery UX that turns silent failures into actionable prompts.

Why operator visibility is the right first step

The new-tab auth overwrite fix (shipped) and the live bidirectional auth sync design (planned, daemon-dependent) both address specific mechanisms of failure. But neither makes failures visible before they hurt, and neither gives the operator a way to answer "are all my running agent sessions currently authenticated?" without entering every container.

Operator visibility is orthogonal to live sync — it is useful before the daemon ships (because it surfaces problems the operator can fix manually) and it remains useful after the daemon ships (because it surfaces problems the daemon cannot fix, such as a token that was never valid in the first place, or an api_key env var that was never set for this workspace).

The design in this item is deliberately scoped to what can be built without a daemon — reading local credential files, checking env var presence, and parsing obvious expiry from token payloads. Phase 2 extends the same surface to query daemon-reported health when the daemon is available, turning the local snapshot into a live feed without replacing the local path.

What the feature covers

1. Pre-launch credential validity probe

Before docker run, jackin' checks the credentials that RoleState::prepare() is about to provision and emits a structured result into the existing launch summary. The check is local-only and lightweight — no network call, no round-trip to the auth provider. It reads files and env vars that are already accessible at provision time.

Per mode and agent:

Mode	Check
`sync`	Credential file exists at resolved source path; file is non-empty; file parses as valid JSON (for Claude `.credentials.json`, Codex `auth.json`, Amp `secrets.json`, OpenCode `auth.json`) or is a non-empty directory (Kimi `~/.kimi-code`). For Claude `.credentials.json`: decode the `oauthAccount.expiresAt` field if present and warn if expiry is within 7 days or already passed.
`oauth_token`	The resolved env var (`CLAUDE_CODE_OAUTH_TOKEN`) is set and non-empty; if the value is a JWT, decode the `exp` claim (no signature verification — just the payload) and warn within 7 days.
`api_key`	The resolved env var (`ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, etc.) is set and non-empty. API keys have no client-readable expiry.
`ignore`	Nothing to check; note in the summary that auth is not forwarded.
`token` (GitHub)	The resolved `GH_TOKEN` is set and non-empty; note scope pre-flight is a separate follow-up (tracked in GitHub CLI authentication strategy).
`sync` (GitHub)	`gh auth token --hostname github.com` exits zero; note the token is present but do not validate scopes.

The probe output lands in the existing launch summary — the structured display jackin' already emits before printing the container start line. Each auth axis gets a one-line result with a status icon: ✓ credentials found / ⚠ credentials not found / ⚠ token expires in N days / ✗ token expired / — auth not forwarded. The icons and line format must match the existing launch-summary field style so the summary reads as one coherent surface, not a mixed-format dump.

Failed probes are warnings, not hard errors, with one exception: if the mode is sync and no credential file was found at the resolved source path, the launch proceeds with a warning rather than blocking — the container may still work if the agent is in a mode that doesn't need the credential file (e.g. it uses an API key via env). A missing credential file for sync mode is the same state jackin' handles today (no forwarded auth, agent starts unauthenticated). The probe makes that state visible rather than silent.

If JACKIN_DEBUG=1, the probe output expands: it logs the resolved source path or env var name, the full credential structure (with token values redacted — same redaction model as GithubAuthContext::Debug), and the exact reason for any warning.

2. `jackin auth status` command

A new CLI subcommand that prints the auth health of each configured workspace and, when containers are running, each running instance.

jackin auth status

Output shape (example):

Workspace: work-project
  Role: the-architect
    claude     sync  ✓  ~/.config/company/.claude  (credentials found)
    github     sync  ✓  (gh auth token ok)
    codex      sync  ⚠  ~/.codex/auth.json  (file not found — agent will start unauthenticated)

Workspace: personal
  Role: agent-smith
    claude   token  ✓  CLAUDE_CODE_OAUTH_TOKEN (set, expires in 280 days)
    github  ignore  —  (not forwarded)
    codex    sync  ✓  ~/.codex/auth.json

Running instances:
  jk-abc123  work-project / the-architect   claude ✓  codex ⚠  github ✓
  jk-def456  personal / agent-smith         claude ✓  github —  codex ✓

The running-instance rows read from the per-instance state directory (~/.jackin/data/<instance>/) — specifically whether the provisioned credential files are present and non-empty — not from inside the container. This is local-only in Phase 1. Phase 2 adds a --live flag that queries each running container's daemon-reported health via the jackin' daemon socket when available.

jackin auth status <workspace> scopes output to one workspace.

jackin auth status --json emits machine-readable JSON for scripting and for the console's internal health-check path.

The command reads the same config resolution chain (resolve_mode, resolve_sync_source_dir) that the launch path uses, so the output reflects what a real launch would actually provision — not what is in the raw config files.

3. Console auth health indicators

The console surfaces auth health in two places.

Auth tab in Settings and workspace editor. Every agent row gains a health indicator column showing the result of the same local probe the CLI uses. When sync is active and the credential file is missing, the row shows a ⚠ with "not found" inline. When oauth_token mode is used and the token expires within 7 days, the row shows a ⚠ with "expires soon". The indicator refreshes when the Auth tab is focused. This makes the Auth tab a diagnostic surface, not just a configuration surface.

Instance row in the workspace tree. Each running instance in the workspace sidebar gains a small auth-health glyph (✓ / ⚠ / ✗) computed from the provisioned credential state at the time the instance was launched. The glyph is informational — pressing Enter on the instance still navigates to the session, not to an auth detail pane. A future detail pane is out of scope for this item.

In-container session indicator (deferred to Phase 2). Showing live health for running instances (e.g. "agent is currently experiencing auth errors") requires the agent runtime status authority or daemon-reported state — out of scope here.

4. Auth health in the launch summary

The pre-launch probe output (Section 1) slots into the existing launch summary at the auth field rows that jackin' already emits. Currently those rows say things like claude auth: forwarded host session (sync). After this feature ships they expand to:

claude auth: sync — ✓ credentials found (last modified: 2 days ago)
codex auth:  sync — ⚠ credentials not found (agent will start unauthenticated)
github auth: sync — ✓ gh token present

When JACKIN_DEBUG=1 the row adds the resolved path:

claude auth: sync — ✓ credentials found (~/.config/company/.claude/.credentials.json, last modified: 2d ago)

This does not require any launch-summary structural changes — it only extends the string value of the existing per-agent auth field.

What this does not cover

Live in-container auth health — knowing whether a running agent is currently hitting 401 requires the agent runtime status authority or the daemon. Phase 1 is static: the probe reads the provisioned state at launch time, not the live container state.
Token validity via network call — checking whether a token is actually accepted by the provider API is out of scope. The probe is local only. A failed network probe would make launches unreliable in air-gapped or offline environments.
GH token scope verification — whether the provisioned GitHub token has the scopes jackin' needs is tracked in the open questions of GitHub CLI authentication strategy.
Recovery UX — surfacing an inline "run /login now" or "run jackin workspace claude-token setup" prompt when the probe fails is a natural follow-on but requires the approval-surface work from host bridge. Phase 1 names the problem and points to the fix; it does not automate the fix.

Architecture

Probe implementation

The probe is a pure function probe_auth_health(mode, source_dir, env) -> AuthHealthResult defined in crates/jackin-runtime/src/instance/auth.rs alongside the existing provision_*_auth functions. It takes the same resolved inputs that provision_*_auth takes — the resolved mode, the resolved source path, and the resolved env — so it can be called from both the pre-launch path and the CLI command without duplicating resolution logic.

AuthHealthResult carries:

status: AuthHealthStatus — Healthy, Warning(reason), Error(reason), NotForwarded
resolved_path: Option<PathBuf> — the credential file or directory that was checked, for display
expires_at: Option<SystemTime> — decoded expiry when available, for the "expires in N days" message

The result flows into the launch summary formatter and into jackin auth status's display layer via the same struct, so the two surfaces show consistent health information.

CLI implementation

jackin auth status is a new subcommand under jackin — add the CLI module under crates/jackin/src/cli/ and dispatch from crates/jackin/src/app.rs, mirroring the shape of jackin workspace list. It reads config (same AppConfig::load_or_init path), walks all configured workspaces (or the named one), resolves auth config per workspace × role, and calls probe_auth_health for each agent × workspace pair. For running instances it reads the per-instance manifest directory (~/.jackin/data/<instance>/) to enumerate which instances are live and which provisioned credential files they carry.

Phase 2 adds --live which, when the daemon socket is present at ~/.jackin/run/jackin-daemon.sock, makes a health_query request per running instance and merges the daemon-reported health into the display.

Console integration

The Auth tab health indicators reuse AuthHealthResult by calling probe_auth_health lazily when the tab is focused, caching the result per auth axis until the tab is re-focused or the workspace config changes. The console's existing async background-task pattern (used for the workspace tree population and the 1Password picker) is the right shape here — the probe is fast (local I/O only) but the number of workspace × agent combinations can be large enough that a synchronous call on the draw thread is inadvisable.

Phases

Probe + launch summary — AuthHealthResult, probe_auth_health, launch-summary integration. No new commands, no console changes. Operators see auth health at every launch without any new interaction.
jackin auth status CLI — local-only, no daemon. Scoped output, JSON flag. Operators can audit auth health without launching a container.
Console Auth-tab indicators — background probe, health column in Auth tab, instance glyph in workspace tree. Operators see health without running CLI commands.
Daemon integration — --live flag on jackin auth status, in-container auth health from daemon when available. Operators see real-time health for running instances.

Implementation notes

The JWT decode for oauth_token mode and for Claude .credentials.json must not use a crypto library for signature verification — the goal is "read the exp claim", not "validate the token". A simple base64-decode of the payload segment (no crate needed for a three-field parse; use serde_json on the decoded bytes) keeps the dependency footprint minimal. The probe should never panic on malformed token payloads — return Warning("could not parse token expiry") rather than an error that stops the launch.

Redaction in debug output must follow the same pattern as GithubAuthContext's manual Debug impl — token fields are replaced with [REDACTED] in Debug formatting, never printed as-is. The probe must not introduce a new code path that logs raw token values.

Auth overwrite on new tab — shipped fix for the most common mid-session auth failure; health visibility is the complementary surface that makes the remaining failure modes visible before they hurt.
Live bidirectional auth sync — Phase 4 daemon adapter. Phase 4 of this item (--live flag) consumes the same daemon socket.
Workspace Claude token setup — the remaining pre-launch validity probe work listed in that item moves here; this item is the canonical home for all credential health checking.
jackin' daemon — Phase 4 dependency. Local probes and jackin auth status ship before the daemon; the daemon extends them.
GitHub CLI authentication strategy — GH token scope pre-flight is tracked there; this item covers only presence and format checks.
Agent runtime status authority — future source for live in-container auth failure signals. Out of scope for this item's Phase 1–3.
Agent Authentication → Choosing a sync source folder — shipped per-scope source-folder support. probe_auth_health must take the resolved source path as input so it checks the same account a launch would provision.

Auth health and operator visibility

On this page