# Session Keep and Resume (https://jackin.tailrocks.com/reference/roadmap/session-keep-and-resume/)


**Status**: Open — design proposal (Phase 2 — Live operator surface, [Agent Orchestrator Research Program](/reference/roadmap/agent-orchestrator-research/))

## Problem [#problem]

When an operator leaves a jackin' session, the exit path silently discards more than the operator expects, and the next launch cannot bring them back. This is the concrete cause of the "sometimes I lose important context" instability operators report.

Trace a normal clean exit (`Stopped/0`) today: <RepoFile path="crates/jackin-runtime/src/isolation/finalize.rs">crates/jackin-runtime/src/isolation/finalize.rs</RepoFile> assesses each isolated mount, **auto-deletes** any worktree it judges safe (`force_cleanup_isolated`, no prompt), returns `FinalizeDecision::Cleaned`, the launcher tears the container down, and <RepoFile path="crates/jackin-runtime/src/runtime/launch.rs">crates/jackin-runtime/src/runtime/launch.rs</RepoFile> stamps the instance `InstanceStatus::CleanExited`. The per-instance agent home (`~/.jackin/data/<container>/home/.claude`, `.claude.json`, installed plugins, conversation history, every other runtime's home) is preserved on disk — but `CleanExited` is **not** a restore candidate (`is_restore_candidate()` in <RepoFile path="crates/jackin-runtime/src/instance/manifest.rs">crates/jackin-runtime/src/instance/manifest.rs</RepoFile> matches `Active | Running | Crashed | PreservedDirty | PreservedUnpushed | RestoreAvailable | FailedSetup`, never `CleanExited`). So the next `jackin load` / `jackin console` mints a fresh `instance_id`, a fresh container name, a fresh data directory, and an empty home. Yesterday's conversation is stranded on disk with no path back to it.
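The exclusion can be pictured with a small sketch. The variant names come from the text above; the enum shape and the `matches!` body are an assumed reconstruction from this description, not a copy of what `manifest.rs` actually contains.

```rust
// Hypothetical reconstruction from the prose above. The real enum in
// crates/jackin-runtime/src/instance/manifest.rs may carry more variants or data.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InstanceStatus {
    Active,
    Running,
    Crashed,
    PreservedDirty,
    PreservedUnpushed,
    RestoreAvailable,
    FailedSetup,
    CleanExited,
}

impl InstanceStatus {
    /// Mirrors the match list quoted in the text: every listed status is a
    /// restore candidate, and `CleanExited` falls through to `false`.
    pub fn is_restore_candidate(self) -> bool {
        matches!(
            self,
            Self::Active
                | Self::Running
                | Self::Crashed
                | Self::PreservedDirty
                | Self::PreservedUnpushed
                | Self::RestoreAvailable
                | Self::FailedSetup
        )
    }
}
```

Under this shape, a session that exits cleanly is invisible to the restore picker even though its home directory is still on disk.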

Two narrower problems compound it:

* **There is no "keep to resume" outcome for a clean session.** Preservation only triggers when an isolated mount has uncommitted or unpushed Git state. A clean session you simply want to continue tomorrow has no button.
* **The exit prompt that does exist is hard to read on degraded terminals.** The rich worktree-cleanup dialog now renders unconditionally (the plain-text fallback was removed), so a too-small or `TERM=dumb` terminal shows an unreadable surface for both the choice and its error popup.

The mirror-image failure is just as damaging: a clean exit **leaves the per-instance data directory and its index entry behind forever.** `LoadCleanup::run` in <RepoFile path="crates/jackin-runtime/src/runtime/launch.rs">crates/jackin-runtime/src/runtime/launch.rs</RepoFile> removes the container, DinD sidecar, certs volume, and network, but never touches `~/.jackin/data/<container>/`, and it deliberately calls `keep_socket_dir()` so even the socket directory lingers. The instance is stamped `clean_exited` in `~/.jackin/data/instances.json` but the row is never removed. The result, observed live, is a data directory holding a dozen `jk-…-jackin-thearchitect/` folders and an `instances.json` still listing seven `clean_exited` instances whose containers are long gone, plus orphaned `.lock` siblings and stray `*.repo.lock` files. Every launch leaks; nothing reaps unless the operator remembers to run `jackin prune instances` by hand. A terminal outcome must clean up after itself (change the status, delete the filesystem state, and drop the index row) in the same exit path, not in a command the operator has to know exists.

The fix is the flow the operator described: a clear, rich exit decision — keep this to resume, or clean it up — governed by a per-workspace policy, and a rich resume-or-new choice at the next launch that rebinds the same persisted instance so the conversation, plugins, and isolated worktree come back as they were.

## Goals [#goals]

* An interactive exit presents a legible, rich choice between **keep to resume later** and **clean up now**, with a graceful plain-text degrade when the terminal cannot render the rich surface.
* A per-workspace **cleanup policy** controls the default: `ask` (today's behaviour — only interrupt on unfinished Git state, clean exits clean up automatically), `keep` (always retain, always resumable from the selection bar), and `clean` (always force-clean, no prompt). The product default is `ask`.
* A kept instance becomes a first-class restore candidate, so the next time the operator selects that workspace + role + agent they are offered **resume previous session** vs **start new**.
* Resuming rebinds the **same instance** — same `container_base`, same `~/.jackin/data/<container>/` — so home, plugins, auth, conversation, and the isolated worktree/clone return in place. A power-off or `docker rm` that removed the container but left the host data directory is fully recoverable.
* Restore **degrades by tier and reuses before it rebuilds**: reconnect to a still-running container, `docker start` a stopped one, recreate a deleted one from the stored launch recipe reusing the same image, and rebuild the image only when it too is gone — pinned to the launch-time snapshot so the operator returns to the state that was working, never a re-resolved different one.
* Restore **never caches resolved secrets**. The launch recipe stores references (`op://…` paths, env-var names, the GitHub auth mode), and 1Password values and tokens are re-asked / re-resolved on every restore.
* Before reusing a clone or worktree mount on resume (and before discarding one on exit), the operator **sees the actual uncommitted files and unpushed branches** and acknowledges them. `shared` mounts are exempt — their data is the host's own and is never jackin-managed.

## Background — most of the plumbing already exists [#background--most-of-the-plumbing-already-exists]

This item is roughly 70% assembly of shipped infrastructure, not greenfield. What already exists:

* **Durable per-instance host state.** `~/.jackin/data/<container>/` survives `docker rm`: `.jackin/instance.json` (the manifest), `.jackin/isolation.json` (mount records), `home/` (every agent runtime's home, including Claude plugins and `.claude.json`), the per-agent auth slots, and the materialised `git/worktree/` and `git/clone/` trees. See [Runtime Instance Model](/reference/runtime/runtime-instance-model/).
* **A container-level restore flow.** `resolve_restore_candidate()` in <RepoFile path="crates/jackin-runtime/src/runtime/launch.rs">crates/jackin-runtime/src/runtime/launch.rs</RepoFile> already presents an **"Unfinished jackin instances"** picker (start-fresh vs restore a prior instance, plus related-role recovery), and `InstanceStatus::RestoreAvailable` + `mark_restore_available()` already model "container gone, host state survives." This shipped as **Unique container identity and restore**.
* **Per-mount exit assessment.** `assess_cleanup()` runs `git status --porcelain`, `for-each-ref`, and `rev-list` against every isolation record and fails closed (preserve) on any ambiguity. The `PreservedDirty` / `PreservedUnpushed` distinction already drives per-reason wording.
* **A rich dialog vocabulary.** <RepoFile path="crates/jackin-runtime/src/runtime/progress.rs">crates/jackin-runtime/src/runtime/progress.rs</RepoFile> already exposes `standalone_select_with_context`, `standalone_error_popup`, text prompts, confirms, and the launch cockpit surface the exit and resume dialogs should reuse rather than reinvent.

The gap is the **status model** (clean exits orphan their home; "keep" is not an outcome), the **policy** (not configurable per workspace), and the **operator surface** (the exit/resume decisions are not framed as session continuity, and verification shows counts rather than content).

## Current implementation review — flaws to address [#current-implementation-review--flaws-to-address]

Findings from auditing the exit/restore path and the in-flight rich-dialog change. These are the concrete defects this item should close.

| ID  | Severity | Finding                                                                                                                                                                                                                                                                                                                                                        |
| --- | -------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| F1  | High     | Clean exit orphans context: `Stopped/0` → worktree auto-deleted → `CleanExited`, which `is_restore_candidate()` excludes, so the preserved home is unreachable on the next launch. **Root cause of lost context.**                                                                                                                                             |
| F2  | High     | No "keep to resume" outcome exists for a clean session; preservation is gated entirely on dirty/unpushed Git state.                                                                                                                                                                                                                                            |
| F3  | Medium   | The restore picker is framed as mess-recovery ("Unfinished jackin instances"), not session continuity ("resume yesterday"), and does not include kept-clean instances.                                                                                                                                                                                         |
| F4  | Medium   | `Clone`-mode mounts are assessed by worktree semantics incidentally (the record field is `worktree_path` but holds the clone dir too); there are no clone-specific tests, and `force_cleanup_clone` is `rm -rf` only.                                                                                                                                          |
| F5  | Medium   | Verification surfaces a path plus "has uncommitted changes" — never the actual file list or unpushed branch list, and offers no acknowledge-the-detail step.                                                                                                                                                                                                   |
| F6  | Low      | The non-interactive preserve path only `eprintln!`s; nothing records whether the operator *wanted* to keep versus jackin' merely *could not ask*.                                                                                                                                                                                                              |
| F7  | Low      | The rich cleanup dialog renders unconditionally (`enter_dialog` skips the capability check) with no plain-text degrade; on a `<80×24` or `TERM=dumb` terminal both the choice and its error popup are unreadable. The popup's own render failure is swallowed (`let _ =`) with no `clog!`.                                                                     |
| F8  | Trivial  | <RepoFile path="crates/jackin-runtime/src/isolation/finalize.rs">crates/jackin-runtime/src/isolation/finalize.rs</RepoFile> cites a `worktree-cleanup-assessment.mdx` doc that does not exist; fold the policy table into this item or an internals page and fix the reference.                                                                                |
| F9  | High     | A terminal exit leaks its data directory: `LoadCleanup::run` removes Docker resources but never `~/.jackin/data/<container>/`, so every clean exit accumulates a `jk-…/` folder. Reaping only ever happens if the operator runs `jackin prune instances` manually.                                                                                             |
| F10 | High     | The instance index row is never removed on a terminal outcome — `clean_exited` instances linger in `instances.json` indefinitely, so the index grows without bound and stale rows misrepresent what state still exists.                                                                                                                                        |
| F11 | Medium   | Even the explicit purge path leaks the sibling `jk-…​.lock` file (orphaned locks with no matching directory are visible on disk), and unrelated `*.repo.lock` detritus is never swept.                                                                                                                                                                         |
| F12 | Medium   | Stale `active` rows: instances whose container was removed externally keep `status: active` in the index, so neither prune nor restore treats them correctly — the index is never reconciled against live Docker state.                                                                                                                                        |
| F13 | Medium   | No `docker start` tier: `start_container` exists in <RepoFile path="crates/jackin-docker/src/docker_client.rs">crates/jackin-docker/src/docker_client.rs</RepoFile> but is never called, so a stopped-but-present container forces a full pipeline re-run instead of a cheap restart (the highest-fidelity restore is left on the table).                      |
| F14 | High     | Restore re-resolves against *current* config and the role repo's *current* `HEAD` instead of the launch-time snapshot. If the role repo advanced or the workspace config changed since launch, restore rebuilds a **different** image and mount/env shape than the session it claims to restore — a correctness bug for a "finish the work I started" feature. |

## Proposed flow [#proposed-flow]

### On exit [#on-exit]

<Steps>
  1. The foreground session ends. The launcher resolves the workspace's **cleanup policy** (`ask` / `keep` / `clean`).

  2. `clean` → run today's `Cleaned` teardown directly, no prompt. `keep` → skip straight to step 5 with "keep" preselected. `ask` → continue.

  3. Assess every isolated mount (`worktree` and `clone`; `shared` is skipped). If none is dirty or unpushed, exit normally and clean up automatically — `ask` does not interrupt a finished, clean session.

  4. If any mount has uncommitted files or unpushed branches, render a rich panel on the launch cockpit surface that **shows the detail** — the changed files and the ahead-of-upstream branches — and asks the operator to acknowledge.

  5. Present the decision: **Return to agent** (reconnect now to finish, today's `ReturnToAgent`), **Exit and keep** (preserve the worktree/clone and the home, tear down only the container, mark the instance restorable), or **Exit and clean up** (the operator has seen the unfinished work and insists — run the terminal-cleanup path). `Return to agent` is the default because it never loses work.
</Steps>
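The branching in steps 1 to 5 can be sketched as a pure function over the policy and the per-mount assessment. Every name here (`CleanupPolicy`, `MountAssessment`, `ExitStep`, `next_exit_step`) is illustrative and assumed for this sketch; none of them is an existing jackin' API.

```rust
// Illustrative only: these type and function names are assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CleanupPolicy {
    Ask,
    Keep,
    Clean,
}

/// Result of assessing one isolated mount (`shared` mounts are never assessed).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct MountAssessment {
    pub uncommitted_files: usize,
    pub unpushed_branches: usize,
}

impl MountAssessment {
    pub fn is_unfinished(&self) -> bool {
        self.uncommitted_files > 0 || self.unpushed_branches > 0
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExitStep {
    /// Step 2 (`clean`) and step 3 (`ask` with nothing unfinished): tear down, no prompt.
    CleanWithoutPrompt,
    /// Step 2 (`keep`): go straight to the step 5 decision with "keep" preselected.
    DecideWithKeepPreselected,
    /// Steps 4 and 5: show the unfinished-work detail, then present the decision.
    AcknowledgeThenDecide,
}

pub fn next_exit_step(policy: CleanupPolicy, mounts: &[MountAssessment]) -> ExitStep {
    match policy {
        CleanupPolicy::Clean => ExitStep::CleanWithoutPrompt,
        CleanupPolicy::Keep => ExitStep::DecideWithKeepPreselected,
        CleanupPolicy::Ask => {
            if mounts.iter().any(MountAssessment::is_unfinished) {
                ExitStep::AcknowledgeThenDecide
            } else {
                ExitStep::CleanWithoutPrompt
            }
        }
    }
}
```

Keeping this decision free of I/O would let the exit path be unit-tested without a terminal or a Git repository.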

### On launch (selecting a role/agent for a workspace) [#on-launch-selecting-a-roleagent-for-a-workspace]

<Steps>
  1. After the role/agent is chosen, resolve restorable instances for this `(workspace, role, agent)` — now including kept-clean instances, not only dirty/unpushed ones.

  2. If any exist, present a rich **Resume previous session vs Start new** choice in the selection surface, each resume candidate labelled with its date, agent, and a dirty/unpushed summary.

  3. **Resume** rebinds the same instance and walks the **restore ladder** (see *Restore model*): reconnect to a still-running container, `docker start` a stopped one, or recreate a deleted one from the stored launch recipe reusing the same image — rebuilding only when the image is gone. Home, plugins, auth, conversation, and the same worktree/clone return in place; only secret values are re-resolved. Verify-and-acknowledge runs before any worktree/clone is reused.

  4. **Start new** mints a fresh instance exactly as today. A `keep`-policy workspace always shows its kept instance here; this is the "always able to restart the previous session from the selection bar" guarantee.
</Steps>

## TUI design — screens, flow, and storage [#tui-design--screens-flow-and-storage]

Every screen below renders on the existing rich surface and obeys the canonical [TUI Design Decisions](/reference/tui//): the shared `jackin'` brand pill + `·` + screen label on top, the forced-choice `select_list` (`Filter:` row over a `▸`-marked list, `Start …` as the default first row), footer-only hints, and an opaque modal backdrop. Mockups use the same light, terminal-native vocabulary as the [Launch Progress TUI](/reference/roadmap/launch-progress-tui/) — no heavy borders, compact labels, bright state words.

### Exit flow [#exit-flow]

```text
session ends (foreground attach returned)
  │
  ├─ policy = clean ───────────────────────────► CLEAN  (no prompt)
  ├─ policy = keep  ───────────────────────────► KEEP   (no prompt)
  └─ policy = ask
        │  assess isolated mounts  (worktree + clone; shared skipped)
        ├─ all clean / all pushed ─────────────► CLEAN  (no prompt)
        └─ any dirty or unpushed
              │  Screen A — Unfinished work (acknowledge)
              ▼
              Screen B — How to end this session
                 ├─ Return to agent  ────────────► reconnect now → re-assess on next exit
                 ├─ Exit and keep    ────────────► KEEP   (preserve, resume later to finish)
                 └─ Exit and clean up ───────────► CLEAN  (operator insists; discard)
```

**Screen A — Unfinished work** (shown only on `ask` + dirty/unpushed; the verify-and-acknowledge gate):

```text
 jackin'  · session ended

 the-architect (claude) · workspace jackin · jk-sz2v4p0e

 This session has unfinished work in 1 isolated mount.

   worktree  /workspace/jackin
     uncommitted   3 files
       M  src/runtime/launch/mod.rs
       M  src/isolation/finalize/mod.rs
       ?? notes.md
     unpushed      1 branch
       feature/cleanup-flow   2 commits ahead of origin

 Review the above before choosing how to end the session.
```

Footer hint: `Enter continue · Ctrl-C abort`.
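The file list on Screen A could plausibly be derived from the `git status --porcelain` output that `assess_cleanup()` already collects. The following is a minimal sketch of that parsing step, assuming porcelain v1 output; `ChangedFile` and `parse_porcelain` are hypothetical names, not existing helpers.

```rust
/// One entry from `git status --porcelain` (v1 format): a two-character XY
/// status code, one space, then the path.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ChangedFile {
    pub status: String,
    pub path: String,
}

/// Parses porcelain v1 output into displayable entries. Sketch only: rename
/// lines (`old -> new`) are kept verbatim and quoted paths are not unescaped.
pub fn parse_porcelain(output: &str) -> Vec<ChangedFile> {
    output
        .lines()
        .filter_map(|line| {
            // Leading whitespace is significant (" M" means modified but
            // unstaged), so the line must not be trimmed before slicing.
            let status = line.get(..2)?;
            let path = line.get(3..)?;
            if path.is_empty() {
                return None;
            }
            Some(ChangedFile {
                status: status.to_string(),
                path: path.to_string(),
            })
        })
        .collect()
}
```

A real implementation would likely prefer the NUL-delimited `-z` form to handle unusual file names; that is left out here to keep the sketch short.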

**Screen B — How to end this session** (forced-choice):

```text
 jackin'  · session ended

 the-architect (claude) · workspace jackin · jk-sz2v4p0e
 1 isolated mount has unfinished work.

 Filter:
 ▸ Return to agent — keep working in this session now
   Exit and keep — preserve everything, resume later to finish the work
   Exit and clean up — discard the worktree and delete all instance state
```

Footer hint: `↑/↓ navigate · Enter select · Ctrl-C abort`.

`Return to agent` reconnects to the live session (today's `ReturnToAgent`) and is the default first row because it never loses work. `Exit and keep` tears the container down to free Docker resources but retains the host state and marks the instance restorable — it is **not** a `Ctrl-B D` detach, which keeps the container running for `jackin hardline`. `Exit and clean up` is the deliberate-discard path: the operator has seen the unfinished work on Screen A and insists, so jackin' runs the terminal-cleanup path (see *Data-directory and index lifecycle*).

### Launch flow [#launch-flow]

```text
operator selects role + agent for a workspace
  │  resolve restorable instances for (workspace, role, agent)
  ├─ none ─────────────────────────────────────► START NEW (fresh instance_id)
  └─ one or more
        │  Screen C — Resume or start new
        ├─ Start new ──────────────────────────► START NEW
        └─ Resume <id>
              │  Screen D — Verify preserved state (acknowledge)
              ▼
              inspect_container_state(container_base) → restore ladder
                Tier 0  Running          → hardline reconnect
                Tier 1  Stopped/exists    → docker start + reconnect
                Tier 2  NotFound, image   → docker run, reuse stored image_tag
                Tier 3  NotFound, no image → pinned rebuild → Tier 2
              (home, plugins, conversation return in place; secrets re-resolved)
```

**Screen C — Resume or start new** (forced-choice; `Start new` is the default first row per the launch-dialog rule):

```text
 jackin'  · resume or start new

 the-architect (claude) · workspace jackin

 Filter:
 ▸ Start new session
   Resume  jk-sz2v4p0e · 2h ago · clean · ready
   Resume  jk-fme29j3j · 5h ago · 3 files dirty · 1 branch unpushed
```

Footer hint: `↑/↓ navigate · type to filter · Enter select · Ctrl-C abort`.

**Screen D — Verify preserved state** (shown only when the chosen instance has a worktree/clone mount):

```text
 jackin'  · resume jk-fme29j3j

 Restoring the-architect (claude) · workspace jackin
 Reusing the preserved worktree at /workspace/jackin:

   uncommitted   3 files   (M src/runtime/launch/mod.rs · ?? notes.md · …)
   unpushed      feature/cleanup-flow   2 commits ahead

 Host repo unchanged since this worktree was preserved — safe to reuse.
```

Footer hint: `Enter resume · Esc back · Ctrl-C abort`. If the host repo **has** diverged since preservation, this screen states the conflict and the only safe actions are `Esc back` or starting new.

### Storage layout — the unit that is kept or deleted [#storage-layout--the-unit-that-is-kept-or-deleted]

```text
~/.jackin/
├── data/
│   ├── instances.json                  index — one row per instance {status, updated_at}
│   ├── instances.json.lock
│   ├── jk-<id>-<ws>-<role>/            ← THE PER-INSTANCE UNIT
│   │   ├── .jackin/
│   │   │   ├── instance.json           manifest (status, sessions, role, agent)
│   │   │   └── isolation.json          mount records (worktree | clone | shared)
│   │   ├── home/                       agent homes — .claude, .claude.json, .codex,
│   │   │                               amp, kimi, opencode  (conversation + plugins)
│   │   ├── claude/ codex/ amp/ …       per-agent auth slots
│   │   └── git/
│   │       ├── worktree/repo/<dst>/<container>/    materialized worktree
│   │       └── clone/repo/<dst>/<container>/       materialized clone
│   └── jk-<id>-<ws>-<role>.lock        per-instance lock (sibling of the dir)
└── sockets/
    └── jk-<id>-<ws>-<role>/            capsule socket dir (separate root)
```

### What is kept vs deleted, per outcome [#what-is-kept-vs-deleted-per-outcome]

| Artifact                                            | KEEP / resume-later                    | CLEAN / clean-up / `clean` policy |
| --------------------------------------------------- | -------------------------------------- | --------------------------------- |
| Docker container + DinD + certs volume + network    | removed (freed)                        | removed (freed)                   |
| `data/jk-…/` (home, manifest, isolation, auth, git) | **kept**                               | **deleted**                       |
| materialized `git/worktree` or `git/clone`          | **kept**                               | **deleted**                       |
| `data/jk-….lock` sibling                            | kept                                   | **deleted**                       |
| `sockets/jk-…/`                                     | kept (recreated on resume)             | **deleted**                       |
| `instances.json` row                                | **kept**, status → `restore_available` | **removed**                       |

The right-hand column is the invariant the current code violates: today everything in it survives a clean exit. The only outcomes that may leave a `jk-…/` directory or an index row behind are `Keep` and a `keep`-policy exit.
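The filesystem half of the right-hand column can be expressed as a list of paths derived from the storage layout. This is a sketch under stated assumptions: `terminal_cleanup_targets` and its parameter names are hypothetical, and removing the `instances.json` row is a separate step not shown here.

```rust
use std::path::{Path, PathBuf};

/// Filesystem artifacts a terminal (CLEAN) outcome must delete for one instance.
/// `jackin_root` is `~/.jackin` already expanded; `instance_dir_name` is the
/// per-instance `jk-…` directory name. Both parameter names are illustrative.
pub fn terminal_cleanup_targets(jackin_root: &Path, instance_dir_name: &str) -> Vec<PathBuf> {
    let data = jackin_root.join("data");
    vec![
        // home, manifest, isolation records, auth slots, materialized git trees
        data.join(instance_dir_name),
        // per-instance lock file that sits next to the directory
        data.join(format!("{instance_dir_name}.lock")),
        // capsule socket directory under the separate sockets root
        jackin_root.join("sockets").join(instance_dir_name),
    ]
}
```

Computing the targets in one place makes it harder for a cleanup path to forget the sibling lock or the socket directory, which is the leak described in F9 and F11.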

### Instance status transitions [#instance-status-transitions]

```text
                jackin load / console
                        │
                        ▼
          ┌────────► active ─────────────────────────────────┐
          │           │  │                                    │
          │           │  └─ clean exit (ask, clean tree)      │ clean-up / clean policy
   resume │           │     · clean-up · clean policy ───────►│
 (rebind  │           │                                       ▼
  same id)│           └─ keep / dirty-keep ──► restore_available     delete fs
          │                                          │               + drop index row
          └──────────────────────────────────────────┘                    │
                                                                           ▼
   crashed · superseded · failed_setup ──► (reaped by the same path) ──►  ⌫ gone
```

`restore_available` is the durable "container gone, host state survives" state — reached by an explicit `Keep`, and also the state a power-off or external `docker rm` should resolve to once index reconciliation (F12) runs. `resume` rebinds it back to `active` against the same identity.
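The core of the diagram reduces to three transitions. The sketch below models only the `active` and `restore_available` states and treats anything the diagram does not draw as rejected; `Status`, `Event`, `Outcome`, and `apply` are names assumed for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Active,
    RestoreAvailable,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    /// Explicit keep, dirty-keep, or a `keep`-policy exit.
    Keep,
    /// Clean exit under `ask`, explicit clean-up, or a `clean`-policy exit.
    CleanUp,
    /// Operator chose to resume; rebinds the same identity.
    Resume,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    /// The instance stays in the index with a new status.
    Becomes(Status),
    /// Filesystem state deleted and index row dropped.
    Removed,
    /// Not drawn in the diagram; this sketch refuses it.
    Rejected,
}

pub fn apply(current: Status, event: Event) -> Outcome {
    match (current, event) {
        (Status::Active, Event::Keep) => Outcome::Becomes(Status::RestoreAvailable),
        (Status::Active, Event::CleanUp) => Outcome::Removed,
        (Status::RestoreAvailable, Event::Resume) => Outcome::Becomes(Status::Active),
        _ => Outcome::Rejected,
    }
}
```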

## Per-workspace cleanup policy [#per-workspace-cleanup-policy]

A new per-workspace setting selects the exit behaviour:

| Policy          | Behaviour                                                                                                                       |
| --------------- | ------------------------------------------------------------------------------------------------------------------------------- |
| `ask` (default) | Clean exits clean up automatically; the operator is prompted only when an isolated mount has unfinished (dirty/unpushed) state. |
| `keep`          | Always retain the instance and its mounts; the workspace is always resumable from the selection bar.                            |
| `clean`         | Always force-clean on exit, no prompt.                                                                                          |

<Aside type="caution">
  This setting lives in the per-workspace file (`~/.config/jackin/workspaces/<name>.toml`), which is a **versioned schema** (`CURRENT_WORKSPACE_VERSION`). The implementing PR must ship the full migration set per the project's pre-release schema rule: a version bump, a `WORKSPACE_MIGRATIONS` step in <RepoFile path="crates/jackin-config/src/migrations.rs">crates/jackin-config/src/migrations.rs</RepoFile>, a new `tests/fixtures/migrations/workspace/from-<predecessor>/` fixture, a re-bake of existing `after.toml` fixtures, and a Timeline entry in [Schema Versions](/reference/runtime/schema-versions/). One version bump for the whole PR.
</Aside>

One-shot CLI overrides (`--keep` / `--clean`) should mirror the existing `--git-pull` / `--no-git-pull` precedent so a single launch can deviate without editing the workspace file.
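A minimal sketch of how the policy and its one-shot override might resolve, assuming the override simply wins over the workspace value. The type names, the lowercase string forms used for parsing, and `effective_policy` are assumptions for illustration; the actual config key and CLI wiring are not specified here.

```rust
use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum CleanupPolicy {
    /// Product default per the table above.
    #[default]
    Ask,
    Keep,
    Clean,
}

impl FromStr for CleanupPolicy {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "ask" => Ok(Self::Ask),
            "keep" => Ok(Self::Keep),
            "clean" => Ok(Self::Clean),
            other => Err(format!("unknown cleanup policy: {other}")),
        }
    }
}

/// One-shot override parsed from `--keep` / `--clean`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CliOverride {
    Keep,
    Clean,
}

/// The CLI flag wins for this launch only; otherwise use the workspace
/// setting, falling back to `ask` when the workspace file does not set one.
pub fn effective_policy(
    workspace: Option<CleanupPolicy>,
    cli: Option<CliOverride>,
) -> CleanupPolicy {
    match cli {
        Some(CliOverride::Keep) => CleanupPolicy::Keep,
        Some(CliOverride::Clean) => CleanupPolicy::Clean,
        None => workspace.unwrap_or_default(),
    }
}
```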

## Restore model — reuse first, recreate faithfully [#restore-model--reuse-first-recreate-faithfully]

Resuming **reuses the same instance identity** — the same `container_base` and the surviving `~/.jackin/data/<container>/`, never a fresh copy. The goal is narrow and short-lived: get back to exactly the state that was working so the operator can finish the work they started, then clean it up. The state is not meant to live forever. So restore degrades gracefully by how much of the original still exists, and at every tier it **reuses whatever is still there before it recreates anything** — rebuilding risks landing on a *different* state (a moved role `HEAD`, a drifted base image) and breaking the very thing the operator is trying to recover.

### The restore ladder [#the-restore-ladder]

```text
operator chooses Resume <id>
  │  inspect_container_state(container_base)
  ├─ Running          ─► Tier 0  hardline — reconnect to the live session, recreate nothing
  ├─ Stopped / exists ─► Tier 1  docker start + reconnect — same container, same writable layer
  ├─ NotFound, image present ─► Tier 2  docker run reusing the stored image_tag — no build
  └─ NotFound, image gone    ─► Tier 3  rebuild the image pinned to the launch snapshot, then Tier 2
```

| Tier | Condition (`inspect_container_state`) | Action                                                | What returns                                                          |
| ---- | ------------------------------------- | ----------------------------------------------------- | --------------------------------------------------------------------- |
| 0    | container **Running**                 | `jackin hardline` reconnect                           | the literal live session — nothing recreated                          |
| 1    | container **Stopped**, still exists   | `docker start` + reconnect                            | the literal container, full writable layer intact                     |
| 2    | **NotFound**, image present           | `docker run` reusing the stored `image_tag`           | a functionally equal container; bind-mounted home/conversation intact |
| 3    | **NotFound**, image gone              | rebuild the image from the pinned inputs, then Tier 2 | same, after one pinned rebuild                                        |

Tier 0 and Tier 1 are the highest fidelity and the common case for the operator's scenario — battery died, machine restarted, `docker stop` on shutdown — because the container's writable layer is untouched, so everything the agent installed *inside* it (shell history, `oh-my-zsh`, ad-hoc tools, anything not bind-mounted) comes back exactly. Tier 1 (`docker start`) is a **gap today** (F13): the `start_container` API exists but is never called, so a stopped container currently forces a full re-run instead of a restart. Tier 2 reuses the persisted `image_tag` as-is. Tier 3 is the only tier that builds, and it must reproduce the **same** image — pinned to the role commit and base image recorded at first launch, not the current `HEAD` (F14).
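The ladder reduces to a small decision over container state and image availability. The sketch below assumes `inspect_container_state` yields something equivalent to the three states in the table; `ContainerState`, `RestoreTier`, and `select_tier` are illustrative names rather than existing types.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ContainerState {
    Running,
    Stopped,
    NotFound,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RestoreTier {
    /// Tier 0: reconnect to the live session, recreate nothing.
    Reconnect,
    /// Tier 1: `docker start` the existing container, then reconnect.
    StartExisting,
    /// Tier 2: `docker run` reusing the stored `image_tag`, no build.
    RunFromImage,
    /// Tier 3: rebuild pinned to the launch snapshot, then proceed as Tier 2.
    PinnedRebuild,
}

/// Picks the highest-fidelity tier still available. Image presence only
/// matters once the container itself is gone.
pub fn select_tier(state: ContainerState, image_present: bool) -> RestoreTier {
    match (state, image_present) {
        (ContainerState::Running, _) => RestoreTier::Reconnect,
        (ContainerState::Stopped, _) => RestoreTier::StartExisting,
        (ContainerState::NotFound, true) => RestoreTier::RunFromImage,
        (ContainerState::NotFound, false) => RestoreTier::PinnedRebuild,
    }
}
```

Because the match is exhaustive over both inputs, adding a new container state would force the ladder to be revisited at compile time.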

### Pin to the launch snapshot, do not re-resolve [#pin-to-the-launch-snapshot-do-not-re-resolve]

Restore today re-runs the whole pipeline and re-resolves everything against *current* config and the role repo's *current* `HEAD`. For a "finish the work I started" feature that is a correctness bug (F14): a session must come back as it was, not as it would be launched fresh today. Restore must instead replay a **launch recipe** captured at first launch and stored on the manifest. The manifest already carries `image_tag`, `DockerResources` (container/dind/network/volume names), `role_source_git`, and `role_source_ref`; the recipe adds the rest.

| Recipe field                                                                           | Stored at launch?                    | On restore                                                              |
| -------------------------------------------------------------------------------------- | ------------------------------------ | ----------------------------------------------------------------------- |
| `image_tag` + the exact role commit SHA it was built from                              | tag ✅ today · **add** the pinned SHA | reuse the tag; rebuild only at Tier 3, pinned to that SHA               |
| base / construct image reference                                                       | **add**                              | rebuild against the same base, never `latest` drift                     |
| mount plan — sources, destinations, isolation mode per mount                           | **add**                              | re-materialize the same mounts; worktree/clone reuse the preserved tree |
| env var **names** and their **source refs** (`op://…`, `${env.VAR}`, GitHub auth mode) | **add**                              | re-resolve the *values* fresh                                           |
| docker run flags / network / DinD shape                                                | partially (`DockerResources`)        | replay the same shape                                                   |
| **resolved secret values** (1Password output, tokens)                                  | ❌ **never**                          | re-asked / re-resolved every restore                                    |

<Aside type="caution">
  Resolved secret values must never be persisted. jackin' already resolves 1Password references, operator env, and tokens fresh on every launch and passes them straight to `docker run` as env — they are never written to the manifest or data dir, and this design must not regress that. The recipe stores only the **reference** (`op://vault/item/field`, the env-var name, the GitHub auth mode); restore re-runs `op read` and re-resolves tokens with the operator's current access. Caching the resolved value would turn the per-instance data dir into a plaintext secret store — a security regression the design forbids.
</Aside>
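A minimal Rust sketch of what "store the reference, re-resolve the value" could look like. The names `LaunchRecipe`, `EnvSource`, and `resolve_env` are hypothetical and are not taken from the jackin' codebase; the point is only that the stored type has no field able to hold a resolved secret.

```rust
use std::collections::BTreeMap;

/// Where an env var's value comes from. Only the reference is stored;
/// there is deliberately no variant that carries a resolved value.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum EnvSource {
    /// A 1Password reference such as `op://vault/item/field`.
    OnePassword(String),
    /// Name of an operator environment variable, read again at restore time.
    OperatorEnv(String),
    /// A GitHub auth mode label (hypothetical representation).
    GitHubAuth(String),
}

/// Hypothetical recipe captured at first launch and stored on the manifest.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LaunchRecipe {
    pub image_tag: String,
    pub role_commit_sha: String,
    pub base_image_ref: String,
    /// Env var name mapped to its source reference, never to its value.
    pub env: BTreeMap<String, EnvSource>,
}

impl LaunchRecipe {
    /// Re-resolve every env value through a caller-supplied resolver.
    /// Resolved values exist only in the returned map, which the caller
    /// hands to the container launch and then drops.
    pub fn resolve_env<F>(&self, mut resolve: F) -> Result<BTreeMap<String, String>, String>
    where
        F: FnMut(&EnvSource) -> Result<String, String>,
    {
        let mut resolved = BTreeMap::new();
        for (name, source) in &self.env {
            resolved.insert(name.clone(), resolve(source)?);
        }
        Ok(resolved)
    }
}
```

Because the resolver is passed in on every restore, a revoked 1Password grant or rotated token surfaces as a resolution error at restore time rather than being masked by a stale cached value.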

### Agent state: inside the container vs on the host [#agent-state-inside-the-container-vs-on-the-host]

Why the ladder prefers reuse: some agent state is bind-mounted to the host and survives `docker rm`, and some lives only in the container's writable layer and does not.

* **Bind-mounted → survives removal** (restored at every tier): the agent homes under `/home/agent/` (`.claude`, `.claude.json`, `.codex`, amp, kimi, opencode — conversation history and installed-plugin state), the per-agent auth slots, and `/jackin/state`, all under `~/.jackin/data/<container>/`.
* **Container writable layer → lost on removal** (only Tiers 0–1 preserve it; Tiers 2–3 reseed from the image): shell rc and `oh-my-zsh`, ad-hoc tool installs, and the baked Claude-plugin layer. The derived image reinstalls plugins on rebuild and first-boot seeding repopulates the home defaults, so Tiers 2–3 recover a *functionally* equivalent container while Tiers 0–1 recover the *literal* one.
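The two bullets above reduce to a small lookup. The following is an illustrative Rust sketch only; `RestoreTier`, `StateLocation`, `Recovery`, and `recovery_for` are hypothetical names invented here, not types from the jackin' crates.

```rust
/// The four restore tiers named in this document.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RestoreTier {
    Tier0,
    Tier1,
    Tier2,
    Tier3,
}

/// Where a piece of agent state physically lives.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StateLocation {
    /// Under the per-instance data directory on the host.
    BindMounted,
    /// Only in the container's writable layer.
    WritableLayer,
}

/// What the operator gets back after a restore.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Recovery {
    /// The exact bytes that existed before the session ended.
    Literal,
    /// A functionally equivalent copy reseeded from the image.
    ReseededFromImage,
}

/// Bind-mounted state survives at every tier; writable-layer state
/// survives only while the original container still exists (Tiers 0 and 1).
pub fn recovery_for(tier: RestoreTier, location: StateLocation) -> Recovery {
    match (location, tier) {
        (StateLocation::BindMounted, _) => Recovery::Literal,
        (StateLocation::WritableLayer, RestoreTier::Tier0 | RestoreTier::Tier1) => {
            Recovery::Literal
        }
        (StateLocation::WritableLayer, RestoreTier::Tier2 | RestoreTier::Tier3) => {
            Recovery::ReseededFromImage
        }
    }
}
```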

The immutable-snapshot alternative (mint a new id, copy the home + worktree, freeze the original) is deliberately deferred to [Session snapshot and rollback](/reference/roadmap/session-snapshot-rollback/), which targets pre-launch host rollback at a heavier disk/identity cost.

## Mount verification on exit and restore [#mount-verification-on-exit-and-restore]

The verify-and-acknowledge step is the same logic on both edges (discarding on exit, reusing on resume) and must be a single shared helper, not two parallel copies:

* Only `worktree` and `clone` mounts are checked; `shared` is exempt because its working tree is the host's own directory and never jackin-managed.
* The operator sees the concrete evidence — the `git status` file list and the list of branches ahead of their upstream — not just a count, and acknowledges it before jackin' reuses or discards the tree.
* `clone` assessment must be specified and tested in its own right (F4), not inherited implicitly from worktree semantics.
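One possible shape for the single shared helper, sketched in Rust under stated assumptions: every type and function name below (`IsolationMode`, `Edge`, `MountEvidence`, `Mount`, `needs_acknowledgement`, `render_evidence`) is hypothetical, and the evidence is assumed to have been collected already by whatever runs `git status` and the ahead-of-upstream check.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IsolationMode {
    Shared,
    Worktree,
    Clone,
}

/// Which edge is asking: discarding on exit or reusing on resume.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Edge {
    Exit,
    Resume,
}

/// Concrete evidence shown to the operator, not just counts.
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct MountEvidence {
    /// Lines from the `git status` file list.
    pub status_lines: Vec<String>,
    /// Branches that are ahead of their upstream.
    pub branches_ahead: Vec<String>,
}

impl MountEvidence {
    pub fn is_clean(&self) -> bool {
        self.status_lines.is_empty() && self.branches_ahead.is_empty()
    }
}

pub struct Mount {
    pub destination: String,
    pub mode: IsolationMode,
    pub evidence: MountEvidence,
}

/// Shared by both edges: only worktree and clone mounts are checked,
/// and only those with something to show need an acknowledgement.
pub fn needs_acknowledgement(mounts: &[Mount]) -> Vec<&Mount> {
    mounts
        .iter()
        .filter(|m| matches!(m.mode, IsolationMode::Worktree | IsolationMode::Clone))
        .filter(|m| !m.evidence.is_clean())
        .collect()
}

/// Render the full evidence for one mount; only the verb differs per edge.
pub fn render_evidence(edge: Edge, mount: &Mount) -> String {
    let verb = match edge {
        Edge::Exit => "discarding",
        Edge::Resume => "reusing",
    };
    let mut out = format!("Before {verb} {}:\n", mount.destination);
    for line in &mount.evidence.status_lines {
        out.push_str(&format!("  changed: {line}\n"));
    }
    for branch in &mount.evidence.branches_ahead {
        out.push_str(&format!("  ahead of upstream: {branch}\n"));
    }
    out
}
```

Keeping the edge as a parameter rather than a second function is what prevents the two parallel copies the section warns against: the filter and the rendering are written once, and `clone` handling is an explicit match arm that can be tested on its own.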

## Data-directory and index lifecycle (garbage collection) [#data-directory-and-index-lifecycle-garbage-collection]

A terminal outcome must leave nothing behind; a kept outcome must leave exactly the data needed to resume. This is the inverse guarantee to "keep," and it is non-negotiable — the absence of it is the second half of the operator's instability.

The required invariant, in order, whenever a session ends with **clean exit** (`ask` policy, nothing dirty), an explicit **clean up now**, or the **`clean`** policy:

<Steps>
  1. Stamp the instance's terminal status in the manifest and index (so a crash mid-cleanup leaves an honest record).
  2. Remove the per-instance filesystem state: `~/.jackin/data/<container>/`, its sibling `<container>.lock`, and the socket directory `~/.jackin/sockets/<container>/`.
  3. Remove the instance's row from `~/.jackin/data/instances.json`.
</Steps>

Only **keep / resume-later** outcomes retain the directory and the index row (with a restorable status). Nothing else should ever leave a `jk-…/` folder or an index row on disk.
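The ordering of the three steps can be illustrated with a small standard-library sketch. It is not the project's implementation: `InstancePaths`, `finish_terminal`, and the in-memory `BTreeMap` standing in for `instances.json` are assumptions made for illustration, and the real fix is expected to route through the existing purge helpers instead.

```rust
use std::collections::BTreeMap;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Per-instance paths, following the layout described in the steps above.
pub struct InstancePaths {
    pub data_dir: PathBuf,
    pub lock_file: PathBuf,
    pub socket_dir: PathBuf,
}

impl InstancePaths {
    /// `jackin_root` plays the role of `~/.jackin` in this sketch.
    pub fn new(jackin_root: &Path, container: &str) -> Self {
        let data = jackin_root.join("data");
        Self {
            data_dir: data.join(container),
            lock_file: data.join(format!("{container}.lock")),
            socket_dir: jackin_root.join("sockets").join(container),
        }
    }
}

/// Treat "already gone" as success so the cleanup is safe to re-run.
fn ignore_not_found(result: io::Result<()>) -> io::Result<()> {
    match result {
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(()),
        other => other,
    }
}

/// Stamp, then remove files, then drop the index row, in that order.
/// If step 2 fails, the row is left behind carrying the terminal status.
pub fn finish_terminal(
    paths: &InstancePaths,
    index: &mut BTreeMap<String, String>,
    container: &str,
    terminal_status: &str,
) -> io::Result<()> {
    // Step 1: record the terminal status first.
    index.insert(container.to_string(), terminal_status.to_string());
    // Step 2: remove the data directory, its sibling lock, and the socket directory.
    ignore_not_found(fs::remove_dir_all(&paths.data_dir))?;
    ignore_not_found(fs::remove_file(&paths.lock_file))?;
    ignore_not_found(fs::remove_dir_all(&paths.socket_dir))?;
    // Step 3: only now drop the index row.
    index.remove(container);
    Ok(())
}
```

Stamping before deleting means a failure partway through leaves a row that a later prune can still find and reap, rather than an orphaned directory with no record.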

<Aside type="tip">
  The mechanism already exists and must be reused, not reimplemented. `prune_instances` in <RepoFile path="crates/jackin-runtime/src/runtime/cleanup.rs">crates/jackin-runtime/src/runtime/cleanup.rs</RepoFile> already reaps `CleanExited | Superseded | FailedSetup | Purged` by calling `purge_container_filesystem` (removes the data directory) and `InstanceIndex::remove_many` (drops the index rows). The defect is purely that this runs only as the manual `jackin prune instances` command and is never invoked from the exit path. The fix is to call that same removal inline on a terminal outcome — per the project's reuse-before-writing rule, extend/route through the existing purge helpers rather than adding a parallel teardown in `LoadCleanup::run`.
</Aside>

Two supporting sweeps close the long tail:

* **Lock and detritus reaping (F11).** Removing a data directory must also remove its `<container>.lock` sibling, and a launch-time sweep should drop orphaned `jk-….lock` files with no matching directory and stray `*.repo.lock` leftovers.
* **Index reconciliation (F12).** On launch (or on a `jackin prune instances` run), reconcile each `active` row against live Docker state: an instance whose container no longer exists is downgraded to its true terminal status, which makes it eligible for the same reaping path instead of lingering as a false `active`.
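Both sweeps are set comparisons, which the following Rust sketch makes concrete. All names (`orphaned_locks`, `reconcile`, `IndexRow`, `RowStatus`) are hypothetical. The sketch deliberately leaves `*.repo.lock` files alone, because what makes one "stray" is not specified here, and it collapses "its true terminal status" into a single `Terminal` placeholder.

```rust
use std::collections::BTreeSet;

/// Return `<container>.lock` file names that have no matching data directory.
/// `*.repo.lock` files are skipped: this sketch does not decide their fate.
pub fn orphaned_locks<'a>(
    lock_files: &'a [String],
    data_dirs: &BTreeSet<String>,
) -> Vec<&'a str> {
    lock_files
        .iter()
        .filter(|name| !name.ends_with(".repo.lock"))
        .filter_map(|name| name.strip_suffix(".lock").map(|stem| (name, stem)))
        .filter(|(_, stem)| !data_dirs.contains(*stem))
        .map(|(name, _)| name.as_str())
        .collect()
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RowStatus {
    Active,
    /// Placeholder for whichever terminal status the real code would assign.
    Terminal,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct IndexRow {
    pub container: String,
    pub status: RowStatus,
}

/// Downgrade every `Active` row whose container is not in the live set.
/// Returns how many rows were changed.
pub fn reconcile(rows: &mut [IndexRow], live_containers: &BTreeSet<String>) -> usize {
    let mut downgraded = 0;
    for row in rows.iter_mut() {
        if row.status == RowStatus::Active && !live_containers.contains(&row.container) {
            row.status = RowStatus::Terminal;
            downgraded += 1;
        }
    }
    downgraded
}
```

Keeping both functions pure (names in, names out) lets the sweep logic be unit-tested without Docker or a real `~/.jackin` tree; the caller supplies the directory listing and the live container set.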

## Phases [#phases]

* **Phase 0 — Exit-dialog hardening (this PR's area).** Restore a plain-text degrade when the terminal cannot render the rich surface, `clog!` the swallowed popup failure, and de-flicker the double alt-screen enter (F7).
* **Phase 1 — Status model and terminal cleanup.** Make "keep to resume" a real `FinalizeDecision` outcome; promote kept instances (including clean ones) to a restore candidate; stop auto-deleting the worktree when the operator keeps (F1, F2). Wire the terminal-outcome cleanup into the exit path so a clean/clean-up/`clean` exit removes the data directory, sibling lock, socket directory, and the `instances.json` row by routing through the existing `purge_container_filesystem` / `InstanceIndex::remove_many` helpers (F9, F10, F11). `instance.json` and the index are **not** versioned schemas, so a new `InstanceStatus` variant needs no migration.
* **Phase 2 — Rich exit cockpit.** Move the exit decision onto the launch-progress surface; show dirty/unpushed detail with acknowledge (F3, F5, F6).
* **Phase 3 — Per-workspace policy.** Add the `ask` / `keep` / `clean` setting with its full workspace-schema migration set and the `--keep` / `--clean` overrides.
* **Phase 4 — Resume-or-new and the restore ladder.** Surface restorable sessions when selecting an agent; rebind the same instance; implement the tier ladder — Tier 0 `hardline`, Tier 1 `docker start` (F13), Tier 2 `docker run` reusing the stored `image_tag`, Tier 3 pinned rebuild — driven by `inspect_container_state`; persist the launch recipe on the manifest (pinned role commit SHA, base image reference, mount plan, env-var names + source refs) and replay it instead of re-resolving against current config (F14), re-resolving only secret *values*; wire verify-and-acknowledge into re-materialisation. The manifest is not a versioned schema, so the new fields need no migration.
* **Phase 5 — Clone parity, GC sweeps, tests, docs.** Specify and test clone assessment (F4); add the launch-time lock/detritus reaping and index reconciliation sweeps (F11, F12); update [Parallel Agents](/guides/parallel-agents/) (operator), [Runtime Instance Model](/reference/runtime/runtime-instance-model/) (contributor), and [TUI Design Decisions](/reference/tui//); fix the stale doc reference (F8).
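As a rough illustration of the Phase 4 ladder, tier selection might reduce to a function of the observed container state and whether the stored image is still present. This Rust sketch is built on assumptions: `ObservedState` is a hypothetical stand-in for whatever `inspect_container_state` actually returns, and the mapping from state to tier is inferred from the tier descriptions above rather than taken from the code.

```rust
/// Hypothetical summary of what inspection found for the kept container.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ObservedState {
    /// Container exists and is running.
    Running,
    /// Container exists but is stopped.
    Stopped,
    /// Container has been removed.
    Missing,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LadderStep {
    /// Tier 0: reattach with `hardline`.
    Hardline,
    /// Tier 1: `docker start` the existing container.
    DockerStart,
    /// Tier 2: `docker run` reusing the stored `image_tag`.
    DockerRun,
    /// Tier 3: rebuild pinned to the recorded role commit.
    PinnedRebuild,
}

/// Pick the cheapest tier that can still work (assumed mapping).
pub fn choose_step(container: ObservedState, image_present: bool) -> LadderStep {
    match (container, image_present) {
        (ObservedState::Running, _) => LadderStep::Hardline,
        (ObservedState::Stopped, _) => LadderStep::DockerStart,
        (ObservedState::Missing, true) => LadderStep::DockerRun,
        (ObservedState::Missing, false) => LadderStep::PinnedRebuild,
    }
}
```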

## Open questions [#open-questions]

* Should a `keep`-policy workspace cap how many kept instances accumulate per `(workspace, role, agent)`, or prune the oldest automatically? The state is meant to be short-lived, so unbounded keeps that pile up data directories run against the intent.
* When resuming, should jackin' re-run `git_pull_on_entry` semantics against the reused worktree, or treat the preserved tree as authoritative and skip the pull? (Leaning authoritative — the point is to return to the exact state, not advance it.)
* In-container session continuity is now tier-dependent: Tier 0/1 reconnect to the live agent session for free (the container never died), while Tiers 2–3 restore the bind-mounted home/conversation but start a fresh agent process. Should Tiers 2–3 also attempt to resume the agent's own session log (e.g. `claude --resume`), or is restoring the home enough? This overlaps [Console agent session control](/reference/roadmap/console-agent-session-control/) Phase 4 (session reconciliation) and should be scoped against it.

## Cross-references [#cross-references]

* [Launch Progress TUI](/reference/roadmap/launch-progress-tui/) — the rich surface the exit cockpit and resume picker render into.
* [Console agent session control](/reference/roadmap/console-agent-session-control/) — instance discovery and the selection surface; Phase 4 session reconciliation is the in-container continuity counterpart.
* [Session snapshot and rollback](/reference/roadmap/session-snapshot-rollback/) — the deferred immutable-snapshot recovery model.
* [Runtime Instance Model](/reference/runtime/runtime-instance-model/) — the per-instance host layout this item rebinds on resume.