
Session Keep and Resume

Status: Open — design proposal (Phase 2 — Live operator surface, Agent Orchestrator Research Program)

Problem

When an operator leaves a jackin' session, the exit path silently discards more than the operator expects, and the next launch cannot bring them back. This is the concrete cause of the "sometimes I lose important context" instability operators report.

Trace a normal clean exit (Stopped/0) today: crates/jackin-runtime/src/isolation/finalize.rs assesses each isolated mount, auto-deletes any worktree it judges safe (force_cleanup_isolated, no prompt), returns FinalizeDecision::Cleaned, the launcher tears the container down, and crates/jackin-runtime/src/runtime/launch.rs stamps the instance InstanceStatus::CleanExited. The per-instance agent home (~/.jackin/data/<container>/home/.claude, .claude.json, installed plugins, conversation history, every other runtime's home) is preserved on disk — but CleanExited is not a restore candidate (is_restore_candidate() in crates/jackin-runtime/src/instance/manifest.rs matches Active | Running | Crashed | PreservedDirty | PreservedUnpushed | RestoreAvailable | FailedSetup, never CleanExited). So the next jackin load / jackin console mints a fresh instance_id, a fresh container name, a fresh data directory, and an empty home. Yesterday's conversation is stranded on disk with no path back to it.

Two narrower problems compound it:

  • There is no "keep to resume" outcome for a clean session. Preservation only triggers when an isolated mount has uncommitted or unpushed Git state. A clean session you simply want to continue tomorrow has no button.
  • The exit prompt that does exist is hard to read on degraded terminals. The rich worktree-cleanup dialog now renders unconditionally (the plain-text fallback was removed), so a too-small or TERM=dumb terminal shows an unreadable surface for both the choice and its error popup.
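The restore-candidate exclusion traced above can be pictured with a minimal sketch. The variant names are the ones quoted in this section; the enum shape and the free function are assumptions for illustration and are not copied from manifest.rs.

```rust
/// Illustrative subset of the instance statuses named in this proposal.
/// Assumption: the real enum in manifest.rs may carry more variants or data.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InstanceStatus {
    Active,
    Running,
    Crashed,
    PreservedDirty,
    PreservedUnpushed,
    RestoreAvailable,
    FailedSetup,
    CleanExited,
}

/// Mirrors the match described above: every listed status is offered for
/// restore, and `CleanExited` falls through to `false`.
pub fn is_restore_candidate(status: InstanceStatus) -> bool {
    matches!(
        status,
        InstanceStatus::Active
            | InstanceStatus::Running
            | InstanceStatus::Crashed
            | InstanceStatus::PreservedDirty
            | InstanceStatus::PreservedUnpushed
            | InstanceStatus::RestoreAvailable
            | InstanceStatus::FailedSetup
    )
}
```

Under this shape a cleanly exited session is never offered on the next launch, even though its home directory is still on disk.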

The mirror-image failure is just as damaging: a clean exit leaves the per-instance data directory and its index entry behind forever. LoadCleanup::run in crates/jackin-runtime/src/runtime/launch.rs removes the container, DinD sidecar, certs volume, and network, but never touches ~/.jackin/data/<container>/, and it deliberately calls keep_socket_dir() so even the socket directory lingers. The instance is stamped clean_exited in ~/.jackin/data/instances.json but the row is never removed. The result, observed live, is a data directory holding a dozen jk-…-jackin-thearchitect/ folders and an instances.json still listing seven clean_exited instances whose containers are long gone, plus orphaned .lock siblings and stray *.repo.lock files. Every launch leaks; nothing reaps unless the operator remembers to run jackin prune instances by hand. A terminal outcome must clean up after itself — change the status, delete the filesystem state, and drop the index row — in the same exit path, not in a command the operator has to know exists.

The fix is the flow the operator described: a clear, rich exit decision — keep this to resume, or clean it up — governed by a per-workspace policy, and a rich resume-or-new choice at the next launch that rebinds the same persisted instance so the conversation, plugins, and isolated worktree come back as they were.

Goals

  • An interactive exit presents a legible, rich choice between keep to resume later and clean up now, with a graceful plain-text degrade when the terminal cannot render the rich surface.
  • A per-workspace cleanup policy controls the default: ask (today's behaviour — only interrupt on unfinished Git state, clean exits clean up automatically), keep (always retain, always resumable from the selection bar), and clean (always force-clean, no prompt). The product default is ask.
  • A kept instance becomes a first-class restore candidate, so the next time the operator selects that workspace + role + agent they are offered resume previous session vs start new.
  • Resuming rebinds the same instance — same container_base, same ~/.jackin/data/<container>/ — so home, plugins, auth, conversation, and the isolated worktree/clone return in place. A power-off or docker rm that removed the container but left the host data directory is fully recoverable.
  • Restore degrades by tier and reuses before it rebuilds: reconnect to a still-running container, docker start a stopped one, recreate a deleted one from the stored launch recipe reusing the same image, and rebuild the image only when it too is gone — pinned to the launch-time snapshot so the operator returns to the state that was working, never a re-resolved different one.
  • Restore never caches resolved secrets. The launch recipe stores references (op://… paths, env-var names, the GitHub auth mode), and 1Password values and tokens are re-asked / re-resolved on every restore.
  • Before reusing a clone or worktree mount on resume (and before discarding one on exit), the operator sees the actual uncommitted files and unpushed branches and acknowledges them. shared mounts are exempt — their data is the host's own and is never jackin-managed.

Background — most of the plumbing already exists

This item is roughly 70% assembly of shipped infrastructure, not greenfield. What already exists:

  • Durable per-instance host state. ~/.jackin/data/<container>/ survives docker rm: .jackin/instance.json (the manifest), .jackin/isolation.json (mount records), home/ (every agent runtime's home, including Claude plugins and .claude.json), the per-agent auth slots, and the materialised git/worktree/ and git/clone/ trees. See Runtime Instance Model.
  • A container-level restore flow. resolve_restore_candidate() in crates/jackin-runtime/src/runtime/launch.rs already presents an "Unfinished jackin instances" picker (start-fresh vs restore a prior instance, plus related-role recovery), and InstanceStatus::RestoreAvailable + mark_restore_available() already model "container gone, host state survives." This shipped as Unique container identity and restore.
  • Per-mount exit assessment. assess_cleanup() runs git status --porcelain, for-each-ref, and rev-list against every isolation record and fails closed (preserve) on any ambiguity. The PreservedDirty / PreservedUnpushed distinction already drives per-reason wording.
  • A rich dialog vocabulary. crates/jackin-runtime/src/runtime/progress.rs already exposes standalone_select_with_context, standalone_error_popup, text prompts, confirms, and the launch cockpit surface the exit and resume dialogs should reuse rather than reinvent.
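The fail-closed behaviour of the per-mount assessment can be sketched as a pure function over the probe results. This is an illustration only: the `Assessment` type, its variant names, and the `assess` signature are hypothetical, and the real assess_cleanup() may combine its Git probes differently.

```rust
/// Outcome of assessing one isolated mount (hypothetical names).
#[derive(Debug, PartialEq, Eq)]
pub enum Assessment {
    SafeToClean,
    PreserveDirty,
    PreserveUnpushed,
    /// A probe failed, so the mount is preserved rather than guessed at.
    PreserveUnknown,
}

/// `porcelain` is the captured output of `git status --porcelain`;
/// `commits_ahead` is the number of commits not yet on an upstream.
/// Either probe failing (`Err`) counts as ambiguity and preserves the mount.
pub fn assess(
    porcelain: Result<String, String>,
    commits_ahead: Result<u32, String>,
) -> Assessment {
    let (porcelain, ahead) = match (porcelain, commits_ahead) {
        (Ok(p), Ok(a)) => (p, a),
        _ => return Assessment::PreserveUnknown,
    };
    if !porcelain.trim().is_empty() {
        return Assessment::PreserveDirty;
    }
    if ahead > 0 {
        return Assessment::PreserveUnpushed;
    }
    Assessment::SafeToClean
}
```

The only path to `SafeToClean` is the one where every probe succeeded and reported nothing outstanding.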

The gap is the status model (clean exits orphan their home; "keep" is not an outcome), the policy (not configurable per workspace), and the operator surface (the exit/resume decisions are not framed as session continuity, and verification shows counts rather than content).

Current implementation review — flaws to address

Findings from auditing the exit/restore path and the in-flight rich-dialog change. These are the concrete defects this item should close.

ID | Severity | Finding
--- | --- | ---
F1 | High | Clean exit orphans context: Stopped/0 → worktree auto-deleted → CleanExited, which is_restore_candidate() excludes, so the preserved home is unreachable on the next launch. Root cause of lost context.
F2 | High | No "keep to resume" outcome exists for a clean session; preservation is gated entirely on dirty/unpushed Git state.
F3 | Medium | The restore picker is framed as mess-recovery ("Unfinished jackin instances"), not session continuity ("resume yesterday"), and does not include kept-clean instances.
F4 | Medium | Clone-mode mounts are assessed by worktree semantics incidentally (the record field is worktree_path but holds the clone dir too); there are no clone-specific tests, and force_cleanup_clone is rm -rf only.
F5 | Medium | Verification surfaces a path plus "has uncommitted changes", never the actual file list or unpushed branch list, and offers no acknowledge-the-detail step.
F6 | Low | The non-interactive preserve path only eprintln!s; nothing records whether the operator wanted to keep versus jackin' merely could not ask.
F7 | Low | The rich cleanup dialog renders unconditionally (enter_dialog skips the capability check) with no plain-text degrade; on a <80×24 or TERM=dumb terminal both the choice and its error popup are unreadable. The popup's own render failure is swallowed (let _ =) with no clog!.
F8 | Trivial | crates/jackin-runtime/src/isolation/finalize.rs cites a worktree-cleanup-assessment.mdx doc that does not exist; fold the policy table into this item or an internals page and fix the reference.
F9 | High | A terminal exit leaks its data directory: LoadCleanup::run removes Docker resources but never ~/.jackin/data/<container>/, so every clean exit accumulates a jk-…/ folder. Reaping only ever happens if the operator runs jackin prune instances manually.
F10 | High | The instance index row is never removed on a terminal outcome: clean_exited instances linger in instances.json indefinitely, so the index grows without bound and stale rows misrepresent what state still exists.
F11 | Medium | Even the explicit purge path leaks the sibling jk-….lock file (orphaned locks with no matching directory are visible on disk), and unrelated *.repo.lock detritus is never swept.
F12 | Medium | Stale active rows: instances whose container was removed externally keep status: active in the index, so neither prune nor restore treats them correctly; the index is never reconciled against live Docker state.
F13 | Medium | No docker start tier: start_container exists in crates/jackin-docker/src/docker_client.rs but is never called, so a stopped-but-present container forces a full pipeline re-run instead of a cheap restart (the highest-fidelity restore is left on the table).
F14 | High | Restore re-resolves against current config and the role repo's current HEAD instead of the launch-time snapshot. If the role repo advanced or the workspace config changed since launch, restore rebuilds a different image and mount/env shape than the session it claims to restore, which is a correctness bug for a "finish the work I started" feature.

Proposed flow

On exit

  1. The foreground session ends. The launcher resolves the workspace's cleanup policy (ask / keep / clean).

  2. clean → run today's Cleaned teardown directly, no prompt. keep → retain the instance directly (the Exit and keep outcome described in step 5), no prompt. ask → continue.

  3. Assess every isolated mount (worktree and clone; shared is skipped). If none is dirty or unpushed, exit normally and clean up automatically — ask does not interrupt a finished, clean session.

  4. If any mount has uncommitted files or unpushed branches, render a rich panel on the launch cockpit surface that shows the detail — the changed files and the ahead-of-upstream branches — and asks the operator to acknowledge.

  5. Present the decision: Return to agent (reconnect now to finish, today's ReturnToAgent), Exit and keep (preserve the worktree/clone and the home, tear down only the container, mark the instance restorable), or Exit and clean up (the operator has seen the unfinished work and insists — run the terminal-cleanup path). Return to agent is the default because it never loses work.

On launch (selecting a role/agent for a workspace)

  1. After the role/agent is chosen, resolve restorable instances for this (workspace, role, agent) — now including kept-clean instances, not only dirty/unpushed ones.

  2. If any exist, present a rich Resume previous session vs Start new choice in the selection surface, each resume candidate labelled with its date, agent, and a dirty/unpushed summary.

  3. Resume rebinds the same instance and walks the restore ladder (see Restore model): reconnect to a still-running container, docker start a stopped one, or recreate a deleted one from the stored launch recipe reusing the same image — rebuilding only when the image is gone. Home, plugins, auth, conversation, and the same worktree/clone return in place; only secret values are re-resolved. Verify-and-acknowledge runs before any worktree/clone is reused.

  4. Start new mints a fresh instance exactly as today. A keep-policy workspace always shows its kept instance here; this is the "always able to restart the previous session from the selection bar" guarantee.
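The launch steps above amount to filtering the index by the (workspace, role, agent) triple and placing Start new first. A minimal sketch follows; `InstanceRow`, `LaunchChoice`, and `launch_choices` are hypothetical names, and the `restorable` flag stands in for whatever the widened restore-candidate check returns.

```rust
/// One index row, reduced to the fields this sketch needs (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct InstanceRow {
    pub id: String,
    pub workspace: String,
    pub role: String,
    pub agent: String,
    /// True for kept-clean as well as dirty/unpushed instances.
    pub restorable: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum LaunchChoice {
    StartNew,
    Resume(String),
}

/// Builds the choice list: `StartNew` is always the first row, followed by
/// every restorable instance that matches the selected triple.
pub fn launch_choices(
    rows: &[InstanceRow],
    workspace: &str,
    role: &str,
    agent: &str,
) -> Vec<LaunchChoice> {
    let mut choices = vec![LaunchChoice::StartNew];
    choices.extend(
        rows.iter()
            .filter(|r| r.restorable)
            .filter(|r| r.workspace == workspace && r.role == role && r.agent == agent)
            .map(|r| LaunchChoice::Resume(r.id.clone())),
    );
    choices
}
```

A result holding only `StartNew` corresponds to the case where no prior instance exists and the launcher can skip the choice entirely.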

TUI design — screens, flow, and storage

Every screen below renders on the existing rich surface and obeys the canonical TUI Design Decisions: the shared jackin' brand pill + · + screen label on top, the forced-choice select_list (Filter: row over a ▸-marked list, Start … as the default first row), footer-only hints, and an opaque modal backdrop. Mockups use the same light, terminal-native vocabulary as the Launch Progress TUI: no heavy borders, compact labels, bright state words.

Exit flow

session ends (foreground attach returned)

  ├─ policy = clean ───────────────────────────► CLEAN  (no prompt)
  ├─ policy = keep  ───────────────────────────► KEEP   (no prompt)
  └─ policy = ask
        │  assess isolated mounts  (worktree + clone; shared skipped)
        ├─ all clean / all pushed ─────────────► CLEAN  (no prompt)
        └─ any dirty or unpushed
              │  Screen A — Unfinished work (acknowledge)

              Screen B — How to end this session
                 ├─ Return to agent  ────────────► reconnect now → re-assess on next exit
                 ├─ Exit and keep    ────────────► KEEP   (preserve, resume later to finish)
                 └─ Exit and clean up ───────────► CLEAN  (operator insists; discard)
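
The branch structure of the exit flow diagram above can be sketched as a pure function that decides whether a prompt is needed at all. Every type and function name here is hypothetical; the sketch follows the diagram, in which keep and clean resolve without a prompt.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CleanupPolicy {
    Ask,
    Keep,
    Clean,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MountKind {
    Worktree,
    Clone,
    Shared,
}

#[derive(Debug, Clone, Copy)]
pub struct MountState {
    pub kind: MountKind,
    pub dirty: bool,
    pub unpushed: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ExitStep {
    CleanNow,
    KeepNow,
    /// Show Screen A, then Screen B.
    PromptOperator,
}

/// Decides the next step when the foreground session ends.
pub fn next_exit_step(policy: CleanupPolicy, mounts: &[MountState]) -> ExitStep {
    match policy {
        CleanupPolicy::Clean => ExitStep::CleanNow,
        CleanupPolicy::Keep => ExitStep::KeepNow,
        CleanupPolicy::Ask => {
            // Shared mounts are skipped; only worktree and clone are assessed.
            let unfinished = mounts
                .iter()
                .filter(|m| m.kind != MountKind::Shared)
                .any(|m| m.dirty || m.unpushed);
            if unfinished {
                ExitStep::PromptOperator
            } else {
                ExitStep::CleanNow
            }
        }
    }
}
```

Keeping this decision free of I/O would let the three policies be unit-tested without a terminal or a container.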

Screen A — Unfinished work (shown only on ask + dirty/unpushed; the verify-and-acknowledge gate):

 jackin'  · session ended

 the-architect (claude) · workspace jackin · jk-sz2v4p0e

 This session has unfinished work in 1 isolated mount.

   worktree  /workspace/jackin
     uncommitted   3 files
       M  src/runtime/launch/mod.rs
       M  src/isolation/finalize/mod.rs
       ?? notes.md
     unpushed      1 branch
       feature/cleanup-flow   2 commits ahead of origin

 Review the above before choosing how to end the session.

Footer hint: Enter continue · Ctrl-C abort.

Screen B — How to end this session (forced-choice):

 jackin'  · session ended

 the-architect (claude) · workspace jackin · jk-sz2v4p0e
 1 isolated mount has unfinished work.

 Filter:
 ▸ Return to agent — keep working in this session now
   Exit and keep — preserve everything, resume later to finish the work
   Exit and clean up — discard the worktree and delete all instance state

Footer hint: ↑/↓ navigate · Enter select · Ctrl-C abort.

Return to agent reconnects to the live session (today's ReturnToAgent) and is the default first row because it never loses work. Exit and keep tears the container down to free Docker resources but retains the host state and marks the instance restorable — it is not a Ctrl-B D detach, which keeps the container running for jackin hardline. Exit and clean up is the deliberate-discard path: the operator has seen the unfinished work on Screen A and insists, so jackin' runs the terminal-cleanup path (see Data-directory and index lifecycle).

Launch flow

operator selects role + agent for a workspace
  │  resolve restorable instances for (workspace, role, agent)
  ├─ none ─────────────────────────────────────► START NEW (fresh instance_id)
  └─ one or more
        │  Screen C — Resume or start new
        ├─ Start new ──────────────────────────► START NEW
        └─ Resume <id>
              │  Screen D — Verify preserved state (acknowledge)

              inspect_container_state(container_base) → restore ladder
                Tier 0  Running          → hardline reconnect
                Tier 1  Stopped/exists    → docker start + reconnect
                Tier 2  NotFound, image   → docker run, reuse stored image_tag
                Tier 3  NotFound, no image → pinned rebuild → Tier 2
              (home, plugins, conversation return in place; secrets re-resolved)

Screen C — Resume or start new (forced-choice; Start new is the default first row per the launch-dialog rule):

 jackin'  · resume or start new

 the-architect (claude) · workspace jackin

 Filter:
 ▸ Start new session
   Resume  jk-sz2v4p0e · 2h ago · clean · ready
   Resume  jk-fme29j3j · 5h ago · 3 files dirty · 1 branch unpushed

Footer hint: ↑/↓ navigate · type to filter · Enter select · Ctrl-C abort.

Screen D — Verify preserved state (shown only when the chosen instance has a worktree/clone mount):

 jackin'  · resume jk-fme29j3j

 Restoring the-architect (claude) · workspace jackin
 Reusing the preserved worktree at /workspace/jackin:

   uncommitted   3 files   (M src/runtime/launch/mod.rs · ?? notes.md · …)
   unpushed      feature/cleanup-flow   2 commits ahead

 Host repo unchanged since this worktree was preserved — safe to reuse.

Footer hint: Enter resume · Esc back · Ctrl-C abort.

If the host repo has diverged since preservation, this screen states the conflict and the only safe actions are Esc back or starting new.

Storage layout — the unit that is kept or deleted

~/.jackin/
├── data/
│   ├── instances.json                  index — one row per instance {status, updated_at}
│   ├── instances.json.lock
│   ├── jk-<id>-<ws>-<role>/            ← THE PER-INSTANCE UNIT
│   │   ├── .jackin/
│   │   │   ├── instance.json           manifest (status, sessions, role, agent)
│   │   │   └── isolation.json          mount records (worktree | clone | shared)
│   │   ├── home/                       agent homes — .claude, .claude.json, .codex,
│   │   │                               amp, kimi, opencode  (conversation + plugins)
│   │   ├── claude/ codex/ amp/ …       per-agent auth slots
│   │   └── git/
│   │       ├── worktree/repo/<dst>/<container>/    materialized worktree
│   │       └── clone/repo/<dst>/<container>/       materialized clone
│   └── jk-<id>-<ws>-<role>.lock        per-instance lock (sibling of the dir)
└── sockets/
    └── jk-<id>-<ws>-<role>/            capsule socket dir (separate root)
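
The three per-instance locations in the layout above can be derived from the container name alone. A minimal sketch, assuming a hypothetical `InstancePaths` struct and `instance_paths` helper (neither is taken from the codebase):

```rust
use std::path::{Path, PathBuf};

/// The filesystem locations that make up one instance (hypothetical struct).
#[derive(Debug, PartialEq, Eq)]
pub struct InstancePaths {
    pub data_dir: PathBuf,
    pub lock_file: PathBuf,
    pub socket_dir: PathBuf,
}

/// Derives the per-instance paths from the jackin root (for example
/// `~/.jackin`, already expanded) and the container name.
pub fn instance_paths(jackin_root: &Path, container: &str) -> InstancePaths {
    let data = jackin_root.join("data");
    InstancePaths {
        data_dir: data.join(container),
        // The lock is a sibling of the data directory, not a child of it.
        lock_file: data.join(format!("{container}.lock")),
        // Sockets live under a separate root.
        socket_dir: jackin_root.join("sockets").join(container),
    }
}
```

Computing all three in one place makes it harder for a cleanup path to remove the directory and forget the lock or the socket directory.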

What is kept vs deleted, per outcome

Artifact | KEEP / resume-later | CLEAN / clean-up / clean policy
--- | --- | ---
Docker container + DinD + certs volume + network | removed (freed) | removed (freed)
data/jk-…/ (home, manifest, isolation, auth, git) | kept | deleted
materialized git/worktree or git/clone | kept | deleted
data/jk-….lock sibling | kept | deleted
sockets/jk-…/ | kept (recreated on resume) | deleted
instances.json row | kept, status → restore_available | removed

The right-hand column is the invariant the current code violates: today everything in it survives a clean exit. The only outcomes that may leave a jk-…/ directory or an index row behind are Keep and a keep-policy exit.

Instance status transitions

                jackin load / console


          ┌────────► active ─────────────────────────────────┐
          │           │  │                                    │
          │           │  └─ clean exit (ask, clean tree)      │ clean-up / clean policy
   resume │           │     · clean-up · clean policy ───────►│
 (rebind  │           │                                       ▼
  same id)│           └─ keep / dirty-keep ──► restore_available     delete fs
          │                                          │               + drop index row
          └──────────────────────────────────────────┘                    │

   crashed · superseded · failed_setup ──► (reaped by the same path) ──►  ⌫ gone

restore_available is the durable "container gone, host state survives" state — reached by an explicit Keep, and also the state a power-off or external docker rm should resolve to once index reconciliation (F12) runs. resume rebinds it back to active against the same identity.
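
The transitions described above fit a small state function. This is a sketch under assumptions: `Status`, `Event`, and `transition` are hypothetical names, only the two statuses central to keep/resume are modelled, and `None` stands for "filesystem state deleted and index row dropped".

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Active,
    RestoreAvailable,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    /// Explicit keep, or a keep-policy exit.
    Keep,
    /// Clean exit under ask, explicit clean-up, or the clean policy.
    TerminalCleanup,
    /// The operator chose to resume this instance.
    Resume,
    /// Reconciliation found the container gone (power-off, external docker rm).
    ContainerVanished,
}

/// Returns the next status, or `None` when the instance is gone.
/// Pairs that the diagram does not show leave the status unchanged.
pub fn transition(status: Status, event: Event) -> Option<Status> {
    match (status, event) {
        (Status::Active, Event::TerminalCleanup) => None,
        (Status::Active, Event::Keep | Event::ContainerVanished) => {
            Some(Status::RestoreAvailable)
        }
        (Status::RestoreAvailable, Event::Resume) => Some(Status::Active),
        (unchanged, _) => Some(unchanged),
    }
}
```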

Per-workspace cleanup policy

A new per-workspace setting selects the exit behaviour:

Policy | Behaviour
--- | ---
ask (default) | Clean exits clean up automatically; the operator is prompted only when an isolated mount has unfinished (dirty/unpushed) state.
keep | Always retain the instance and its mounts; the workspace is always resumable from the selection bar.
clean | Always force-clean on exit, no prompt.

This setting lives in the per-workspace file (~/.config/jackin/workspaces/<name>.toml), which is a versioned schema (CURRENT_WORKSPACE_VERSION). The implementing PR must ship the full migration set per the project's pre-release schema rule: a version bump, a WORKSPACE_MIGRATIONS step in crates/jackin-config/src/migrations.rs, a new tests/fixtures/migrations/workspace/from-<predecessor>/ fixture, a re-bake of existing after.toml fixtures, and a Timeline entry in Schema Versions. One version bump for the whole PR.

One-shot CLI overrides (--keep / --clean) should mirror the existing --git-pull / --no-git-pull precedent so a single launch can deviate without editing the workspace file.
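
How the workspace setting and a one-shot override might combine can be sketched as follows. The precedence shown (CLI flag over workspace file over the ask default) is inferred from the text above; the type names, the string spellings accepted by the parser, and `effective_policy` are assumptions for illustration.

```rust
use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum CleanupPolicy {
    /// The product default.
    #[default]
    Ask,
    Keep,
    Clean,
}

impl FromStr for CleanupPolicy {
    type Err = String;

    /// Parses the value as it might appear in the workspace file.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "ask" => Ok(Self::Ask),
            "keep" => Ok(Self::Keep),
            "clean" => Ok(Self::Clean),
            other => Err(format!("unknown cleanup policy: {other}")),
        }
    }
}

/// A one-shot override from `--keep` or `--clean`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CliOverride {
    Keep,
    Clean,
}

/// A CLI override wins for this launch; otherwise the workspace value is
/// used, falling back to `Ask` when the workspace does not set one.
pub fn effective_policy(
    workspace: Option<CleanupPolicy>,
    cli: Option<CliOverride>,
) -> CleanupPolicy {
    match cli {
        Some(CliOverride::Keep) => CleanupPolicy::Keep,
        Some(CliOverride::Clean) => CleanupPolicy::Clean,
        None => workspace.unwrap_or_default(),
    }
}
```

Modelling the override as a single optional value leaves the CLI parser responsible for rejecting `--keep` and `--clean` given together.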

Restore model — reuse first, recreate faithfully

Resuming reuses the same instance identity — the same container_base and the surviving ~/.jackin/data/<container>/, never a fresh copy. The goal is narrow and short-lived: get back to exactly the state that was working so the operator can finish the work they started, then clean it up. The state is not meant to live forever. So restore degrades gracefully by how much of the original still exists, and at every tier it reuses whatever is still there before it recreates anything — rebuilding risks landing on a different state (a moved role HEAD, a drifted base image) and breaking the very thing the operator is trying to recover.

The restore ladder

operator chooses Resume <id>
  │  inspect_container_state(container_base)
  ├─ Running          ─► Tier 0  hardline — reconnect to the live session, recreate nothing
  ├─ Stopped / exists ─► Tier 1  docker start + reconnect — same container, same writable layer
  ├─ NotFound, image present ─► Tier 2  docker run reusing the stored image_tag — no build
  └─ NotFound, image gone    ─► Tier 3  rebuild the image pinned to the launch snapshot, then Tier 2

Tier | Condition (inspect_container_state) | Action | What returns
--- | --- | --- | ---
0 | container Running | jackin hardline reconnect | the literal live session; nothing recreated
1 | container Stopped, still exists | docker start + reconnect | the literal container, full writable layer intact
2 | NotFound, image present | docker run reusing the stored image_tag | a functionally equal container; bind-mounted home/conversation intact
3 | NotFound, image gone | rebuild the image from the pinned inputs, then Tier 2 | same, after one pinned rebuild

Tier 0 and Tier 1 are the highest fidelity and the common case for the operator's scenario — battery died, machine restarted, docker stop on shutdown — because the container's writable layer is untouched, so everything the agent installed inside it (shell history, oh-my-zsh, ad-hoc tools, anything not bind-mounted) comes back exactly. Tier 1 (docker start) is a gap today (F13): the start_container API exists but is never called, so a stopped container currently forces a full re-run instead of a restart. Tier 2 reuses the persisted image_tag as-is. Tier 3 is the only tier that builds, and it must reproduce the same image — pinned to the role commit and base image recorded at first launch, not the current HEAD (F14).
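
Tier selection is a function of two observations: the container state and whether the stored image still exists. A minimal sketch, with `ContainerState`, `RestoreTier`, and both functions as hypothetical names (the real inspect_container_state may return a richer type):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ContainerState {
    Running,
    Stopped,
    NotFound,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RestoreTier {
    /// Tier 0: reconnect to the live session.
    Reconnect,
    /// Tier 1: docker start, then reconnect.
    Start,
    /// Tier 2: docker run reusing the stored image tag.
    RunFromImage,
    /// Tier 3: pinned rebuild, then proceed as Tier 2.
    PinnedRebuild,
}

/// Picks the cheapest tier that the surviving state allows.
pub fn select_tier(state: ContainerState, image_present: bool) -> RestoreTier {
    match (state, image_present) {
        (ContainerState::Running, _) => RestoreTier::Reconnect,
        (ContainerState::Stopped, _) => RestoreTier::Start,
        (ContainerState::NotFound, true) => RestoreTier::RunFromImage,
        (ContainerState::NotFound, false) => RestoreTier::PinnedRebuild,
    }
}

/// Only the tiers that keep the original container preserve its writable layer.
pub fn preserves_writable_layer(tier: RestoreTier) -> bool {
    matches!(tier, RestoreTier::Reconnect | RestoreTier::Start)
}
```

Image presence is ignored while the container exists, which matches the reuse-before-rebuild ordering of the ladder.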

Pin to the launch snapshot, do not re-resolve

Restore today re-runs the whole pipeline and re-resolves everything against current config and the role repo's current HEAD. For a "finish the work I started" feature that is a correctness bug (F14): a session must come back as it was, not as it would be launched fresh today. Restore must instead replay a launch recipe captured at first launch and stored on the manifest. The manifest already carries image_tag, DockerResources (container/dind/network/volume names), role_source_git, and role_source_ref; the recipe adds the rest.

Recipe field | Stored at launch? | On restore
--- | --- | ---
image_tag + the exact role commit SHA it was built from | tag ✅ today · add the pinned SHA | reuse the tag; rebuild only at Tier 3, pinned to that SHA
base / construct image reference | add | rebuild against the same base, never latest drift
mount plan (sources, destinations, isolation mode per mount) | add | re-materialize the same mounts; worktree/clone reuse the preserved tree
env var names and their source refs (op://…, ${env.VAR}, GitHub auth mode) | add | re-resolve the values fresh
docker run flags / network / DinD shape | partially (DockerResources) | replay the same shape
resolved secret values (1Password output, tokens) | never | re-asked / re-resolved every restore

Resolved secret values must never be persisted. jackin' already resolves 1Password references, operator env, and tokens fresh on every launch and passes them straight to docker run as env — they are never written to the manifest or data dir, and this design must not regress that. The recipe stores only the reference (op://vault/item/field, the env-var name, the GitHub auth mode); restore re-runs op read and re-resolves tokens with the operator's current access. Caching the resolved value would turn the per-instance data dir into a plaintext secret store — a security regression the design forbids.
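
The references-only rule can be made structural: if the recipe type has no field that can hold a resolved value, nothing can persist one by accident. The sketch below is an assumption-laden illustration; `SecretRef`, `LaunchRecipe`, its fields, and `resolve_env` are hypothetical, and the resolver is injected so that the actual lookup (for example running op read) stays outside the stored data.

```rust
use std::collections::BTreeMap;

/// Where a value comes from. Only the reference is stored, never the value.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum SecretRef {
    /// A 1Password reference such as `op://vault/item/field`.
    OnePassword(String),
    /// The name of a host environment variable.
    HostEnv(String),
}

/// The part of a launch recipe relevant here (hypothetical field set).
#[derive(Debug, Clone)]
pub struct LaunchRecipe {
    pub image_tag: String,
    pub role_commit_sha: String,
    /// Container env var name mapped to its source reference.
    pub env_refs: BTreeMap<String, SecretRef>,
}

/// Resolves every reference through the supplied resolver on each restore.
/// The returned map is meant to be handed to the container launch and then
/// dropped; it is never written back into the recipe.
pub fn resolve_env<F>(
    recipe: &LaunchRecipe,
    mut resolve: F,
) -> Result<BTreeMap<String, String>, String>
where
    F: FnMut(&SecretRef) -> Result<String, String>,
{
    let mut env = BTreeMap::new();
    for (name, reference) in &recipe.env_refs {
        env.insert(name.clone(), resolve(reference)?);
    }
    Ok(env)
}
```

Because resolution borrows the recipe immutably, a restore cannot mutate the stored recipe while resolving, and a failed lookup aborts the restore instead of launching with a missing value.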

Agent state: inside the container vs on the host

Why the ladder prefers reuse: some agent state is bind-mounted to the host and survives docker rm, and some lives only in the container's writable layer and does not.

  • Bind-mounted → survives removal (restored at every tier): the agent homes under /home/agent/ (.claude, .claude.json, .codex, amp, kimi, opencode — conversation history and installed-plugin state), the per-agent auth slots, and /jackin/state, all under ~/.jackin/data/<container>/.
  • Container writable layer → lost on removal (only Tiers 0–1 preserve it; Tiers 2–3 reseed from the image): shell rc and oh-my-zsh, ad-hoc tool installs, and the baked Claude-plugin layer. The derived image reinstalls plugins on rebuild and first-boot seeding repopulates the home defaults, so Tiers 2–3 recover a functionally equivalent container while Tiers 0–1 recover the literal one.

The immutable-snapshot alternative (mint a new id, copy the home + worktree, freeze the original) is deliberately deferred to Session snapshot and rollback, which targets pre-launch host rollback at a heavier disk/identity cost.

Mount verification on exit and restore

The verify-and-acknowledge step is the same logic on both edges (discarding on exit, reusing on resume) and must be a single shared helper, not two parallel copies:

  • Only worktree and clone mounts are checked; shared is exempt because its working tree is the host's own directory and never jackin-managed.
  • The operator sees the concrete evidence — the git status file list and the list of branches ahead of their upstream — not just a count, and acknowledges it before jackin' reuses or discards the tree.
  • clone assessment must be specified and tested in its own right (F4), not inherited implicitly from worktree semantics.
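
Showing the actual file list starts with parsing `git status --porcelain` output, which in its version 1 format puts two status columns, a space, and then the path on each line. The sketch below covers that parsing plus the shared-mount exemption; `ChangedFile`, `parse_porcelain`, `MountKind`, and `needs_verification` are hypothetical names, and rename entries and quoted paths are passed through unprocessed.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MountKind {
    Worktree,
    Clone,
    Shared,
}

/// One entry from `git status --porcelain` (v1).
#[derive(Debug, PartialEq, Eq)]
pub struct ChangedFile {
    /// The two status columns, for example " M" or "??".
    pub status: String,
    /// Everything after the separating space. Rename entries keep their
    /// "old -> new" form and quoted paths stay quoted in this sketch.
    pub path: String,
}

/// Parses porcelain v1 output into displayable entries, skipping lines that
/// are too short to carry a status and a path.
pub fn parse_porcelain(output: &str) -> Vec<ChangedFile> {
    output
        .lines()
        .filter_map(|line| {
            let status = line.get(..2)?;
            let path = line.get(3..)?;
            if path.is_empty() {
                return None;
            }
            Some(ChangedFile {
                status: status.to_string(),
                path: path.to_string(),
            })
        })
        .collect()
}

/// Shared mounts are the host's own directory and are never verified.
pub fn needs_verification(kind: MountKind) -> bool {
    kind != MountKind::Shared
}
```

Both the exit and resume edges could call the same two functions, which is the single-helper requirement stated above.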

Data-directory and index lifecycle (garbage collection)

A terminal outcome must leave nothing behind; a kept outcome must leave exactly the data needed to resume. This is the inverse guarantee to "keep," and it is non-negotiable — the absence of it is the second half of the operator's instability.

The required invariant, in order, whenever a session ends with a clean exit (ask policy, nothing dirty), an explicit Exit and clean up, or the clean policy:

  1. Stamp the instance's terminal status in the manifest and index (so a crash mid-cleanup leaves an honest record).
  2. Remove the per-instance filesystem state: ~/.jackin/data/<container>/, its sibling <container>.lock, and the socket directory ~/.jackin/sockets/<container>/.
  3. Remove the instance's row from ~/.jackin/data/instances.json.

Only keep / resume-later outcomes retain the directory and the index row (with a restorable status). Nothing else should ever leave a jk-…/ folder or an index row on disk.
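
The ordering of the three steps can be illustrated with the standard library alone. This is a sketch of the sequence, not a replacement for the existing purge helpers: `terminal_cleanup` and `remove_if_exists` are hypothetical, and an in-memory map stands in for the index file.

```rust
use std::collections::BTreeMap;
use std::fs;
use std::io;
use std::path::Path;

/// Removes a file or directory tree. A path that is already missing is not
/// an error, so a repeated cleanup is harmless.
fn remove_if_exists(path: &Path) -> io::Result<()> {
    let result = if path.is_dir() {
        fs::remove_dir_all(path)
    } else {
        fs::remove_file(path)
    };
    match result {
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(()),
        other => other,
    }
}

/// Runs the terminal-outcome sequence for one instance. `index` maps a
/// container name to its status string and stands in for instances.json.
pub fn terminal_cleanup(
    jackin_root: &Path,
    container: &str,
    index: &mut BTreeMap<String, String>,
) -> io::Result<()> {
    // 1. Stamp the terminal status first. If step 2 fails and returns early,
    //    the row stays behind with an honest status.
    if let Some(status) = index.get_mut(container) {
        *status = "clean_exited".to_string();
    }
    // 2. Remove the data directory, its sibling lock, and the socket directory.
    let data = jackin_root.join("data");
    remove_if_exists(&data.join(container))?;
    remove_if_exists(&data.join(format!("{container}.lock")))?;
    remove_if_exists(&jackin_root.join("sockets").join(container))?;
    // 3. Drop the index row only after the filesystem state is gone.
    index.remove(container);
    Ok(())
}
```

Dropping the row last means the index never claims an instance is gone while its files still exist.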

The mechanism already exists and must be reused, not reimplemented. prune_instances in crates/jackin-runtime/src/runtime/cleanup.rs already reaps CleanExited | Superseded | FailedSetup | Purged by calling purge_container_filesystem (removes the data directory) and InstanceIndex::remove_many (drops the index rows). The defect is purely that this runs only as the manual jackin prune instances command and is never invoked from the exit path. The fix is to call that same removal inline on a terminal outcome — per the project's reuse-before-writing rule, extend/route through the existing purge helpers rather than adding a parallel teardown in LoadCleanup::run.

Two supporting sweeps close the long tail:

  • Lock and detritus reaping (F11). Removing a data directory must also remove its <container>.lock sibling, and a launch-time sweep should drop orphaned jk-….lock files with no matching directory and stray *.repo.lock leftovers.
  • Index reconciliation (F12). On launch (or on a jackin prune instances run), reconcile each active row against live Docker state: an instance whose container no longer exists is downgraded to its true terminal status, which makes it eligible for the same reaping path instead of lingering as a false active.
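
Both sweeps reduce to set comparisons once the directory listing, the index, and the live container names are in hand. A minimal sketch with hypothetical function names; gathering the inputs (reading the data directory, querying Docker) is left out, and the choice of status for a stale row is left to the caller.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Returns per-instance lock files (`jk-….lock`) that have no matching data
/// directory. Other lock files, such as `instances.json.lock`, are ignored.
pub fn orphaned_locks(dir_names: &BTreeSet<String>, file_names: &[String]) -> Vec<String> {
    file_names
        .iter()
        .filter(|f| f.starts_with("jk-"))
        .filter_map(|f| f.strip_suffix(".lock").map(|stem| (f, stem)))
        .filter(|(_, stem)| !dir_names.contains(*stem))
        .map(|(f, _)| f.clone())
        .collect()
}

/// Returns the ids of index rows still marked "active" whose container is
/// not among the live ones.
pub fn stale_active_rows(
    index: &BTreeMap<String, String>,
    live_containers: &BTreeSet<String>,
) -> Vec<String> {
    index
        .iter()
        .filter(|(id, status)| status.as_str() == "active" && !live_containers.contains(*id))
        .map(|(id, _)| id.clone())
        .collect()
}
```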

Phases

  • Phase 0 — Exit-dialog hardening (this PR's area). Restore a plain-text degrade when the terminal cannot render the rich surface, clog! the swallowed popup failure, and de-flicker the double alt-screen enter (F7).
  • Phase 1 — Status model and terminal cleanup. Make "keep to resume" a real FinalizeDecision outcome; promote kept instances (including clean ones) to a restore candidate; stop auto-deleting the worktree when the operator keeps (F1, F2). Wire the terminal-outcome cleanup into the exit path so a clean/clean-up/clean exit removes the data directory, sibling lock, socket directory, and the instances.json row by routing through the existing purge_container_filesystem / InstanceIndex::remove_many helpers (F9, F10, F11). instance.json and the index are not versioned schemas, so a new InstanceStatus variant needs no migration.
  • Phase 2 — Rich exit cockpit. Move the exit decision onto the launch-progress surface; show dirty/unpushed detail with acknowledge (F3, F5, F6).
  • Phase 3 — Per-workspace policy. Add the ask / keep / clean setting with its full workspace-schema migration set and the --keep / --clean overrides.
  • Phase 4 — Resume-or-new and the restore ladder. Surface restorable sessions when selecting an agent; rebind the same instance; implement the tier ladder — Tier 0 hardline, Tier 1 docker start (F13), Tier 2 docker run reusing the stored image_tag, Tier 3 pinned rebuild — driven by inspect_container_state; persist the launch recipe on the manifest (pinned role commit SHA, base image reference, mount plan, env-var names + source refs) and replay it instead of re-resolving against current config (F14), re-resolving only secret values; wire verify-and-acknowledge into re-materialisation. The manifest is not a versioned schema, so the new fields need no migration.
  • Phase 5 — Clone parity, GC sweeps, tests, docs. Specify and test clone assessment (F4); add the launch-time lock/detritus reaping and index reconciliation sweeps (F11, F12); update Parallel Agents (operator), Runtime Instance Model (contributor), and TUI Design Decisions; fix the stale doc reference (F8).

Open questions

  • Should a keep-policy workspace cap how many kept instances accumulate per (workspace, role, agent), or prune the oldest automatically? The state is meant to be short-lived, so unbounded keeps that pile up data directories run against the intent.
  • When resuming, should jackin' re-run git_pull_on_entry semantics against the reused worktree, or treat the preserved tree as authoritative and skip the pull? (Leaning authoritative — the point is to return to the exact state, not advance it.)
  • In-container session continuity is now tier-dependent: Tier 0/1 reconnect to the live agent session for free (the container never died), while Tiers 2–3 restore the bind-mounted home/conversation but start a fresh agent process. Should Tiers 2–3 also attempt to resume the agent's own session log (e.g. claude --resume), or is restoring the home enough? This overlaps Console agent session control Phase 4 (session reconciliation) and should be scoped against it.
