# Run Diagnostics (https://jackin.tailrocks.com/reference/runtime/diagnostics/)


Every `jackin` command invocation mints a **run ID** before any work begins and writes structured machine-readable events to a JSONL artifact on the host. The run ID is the single shareable artifact that lets an agent reconstruct the full story of a run — what stages executed, which containers started, where each container's capsule log lives, and what crashed — from a single file path.

## Run artifact location [#run-artifact-location]

```
~/.jackin/data/diagnostics/runs/<run-id>.jsonl
```

A run ID is a bare unique value with no prefix. An ID minted by jackin' is six hex characters (for example, `8b4766`); when a wrapper such as Parallax propagates its own run ID, jackin' adopts that value instead. The host CLI prints the run ID at startup in `--debug` mode.

Up to 200 run artifacts are retained; any artifact older than 30 days is pruned on the next run. Command-specific output logs (from `write_command_output`) are stored as sibling files named `<run-id>.<command-name>.log`.

The file is the **fallback sink**, gated on whether OTLP export is active — not on `--debug`. With no OTLP endpoint configured, the file is written (it is the only durable sink). With OTLP active, the backend is the sink and no file is written unless the operator forces it with `JACKIN_DIAGNOSTICS_FILE=1`. If OTLP is configured but the exporter cannot be built, jackin' falls back to writing the file and surfaces a compact `otlp` notice. When the file is gated off, `RunDiagnostics` still exists (it carries the run id and powers OTLP export and `active_run`) but holds no writer; `RunDiagnostics::persists()` reports whether a file is being written, and the command-output sidecars share the same gate.
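The gating rules above can be read as a small decision function. The following is a minimal sketch, not the crate's implementation: `file_sink_enabled` and its parameter names are hypothetical, and only the `JACKIN_DIAGNOSTICS_FILE=1` variable comes from this page.

```python
import os


def file_sink_enabled(otlp_endpoint_configured: bool,
                      exporter_built: bool,
                      force_file: bool) -> bool:
    """Return True when the run JSONL file should be written (hypothetical helper)."""
    if not otlp_endpoint_configured:
        # No OTLP endpoint: the file is the only durable sink.
        return True
    if not exporter_built:
        # OTLP was requested but the exporter could not be built: fall back to the file.
        return True
    # OTLP is the sink; the file is written only when the operator forces it.
    return force_file


def force_file_from_env(env=os.environ) -> bool:
    """Read the operator override described above."""
    return env.get("JACKIN_DIAGNOSTICS_FILE") == "1"
```

Note that `--debug` does not appear as an input: the gate depends only on the OTLP state and the override.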

Use `jackin diagnostics summary <run-id>` to print the slowest broad stages, nested timings, build-context sizes, Docker build steps, and cache decisions for a run without hand-parsing the JSONL file.

## JSONL event schema [#jsonl-event-schema]

Every line in the run artifact is a JSON object with the following fields (all required unless marked optional):

| Field      | Type           | Description                                                                                                   |
| ---------- | -------------- | ------------------------------------------------------------------------------------------------------------- |
| `ts_ms`    | integer        | Unix timestamp in milliseconds                                                                                |
| `run_id`   | string         | The run ID this event belongs to                                                                              |
| `trace_id` | string         | Same as `run_id` for now; reserved for future distributed trace correlation                                   |
| `span_id`  | string \| null | Optional tracing span identifier when the event is emitted inside an active span; `null` outside span context |
| `kind`     | string         | Event kind — see table below                                                                                  |
| `message`  | string         | Human-readable summary of the event                                                                           |
| `stage`    | string \| null | Optional launch-stage name for `stage_*` events                                                               |
| `detail`   | string \| null | Optional extra JSON payload — contents vary by `kind`                                                         |

**This schema is a contract.** The `--debug` triage workflow and agent-readable post-mortem analysis both depend on these field names. Renaming or removing fields requires a deliberate versioning decision.
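A consumer can check one line against the field table before relying on it. The sketch below is illustrative only: `parse_event` and `decode_detail` are hypothetical helper names, and it assumes the three optional fields (`span_id`, `stage`, `detail`) may be either absent or `null`.

```python
import json

# Field names come from the schema table; the split into required and
# optional follows the "Optional" wording in the descriptions.
REQUIRED_FIELDS = ("ts_ms", "run_id", "trace_id", "kind", "message")
OPTIONAL_FIELDS = ("span_id", "stage", "detail")


def parse_event(line: str) -> dict:
    """Parse one JSONL line and verify the required fields are present."""
    event = json.loads(line)
    if not isinstance(event, dict):
        raise ValueError("event line is not a JSON object")
    for field in REQUIRED_FIELDS:
        if event.get(field) is None:
            raise ValueError(f"missing required field: {field}")
    for field in OPTIONAL_FIELDS:
        # Treat an absent optional field the same as an explicit null.
        event.setdefault(field, None)
    return event


def decode_detail(event: dict):
    """Decode `detail` as JSON when possible, else return it as plain text."""
    detail = event.get("detail")
    if not isinstance(detail, str):
        return detail
    try:
        return json.loads(detail)
    except json.JSONDecodeError:
        return detail
```

`decode_detail` falls back to the raw string because some kinds, such as `compact` and `debug`, carry plain text rather than a JSON payload.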

## Event kinds [#event-kinds]

| `kind`                 | When emitted                                                           | `detail` payload                                                                                                                                                  |
| ---------------------- | ---------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `run`                  | At command start                                                       | none                                                                                                                                                              |
| `compact`              | Lifecycle breadcrumbs, action summaries                                | plain text or none                                                                                                                                                |
| `stage_started`        | A launch stage begins                                                  | none                                                                                                                                                              |
| `stage_done`           | A launch stage completes                                               | `{"duration_ms": N}` — wall-clock ms for that stage                                                                                                               |
| `run_summary`          | End of run                                                             | `{"stage_durations_ms": {...}, "stage_duration_histograms_ms": {...}, "event_counts": {...}, "cache_hits": N, "cache_misses": N}`                                 |
| `launch_plan`          | Restore/launch planning selected a foreground plan                     | `{"plan":"AttachExisting","reason":"...","container":"..."}` where `plan` is one of `AttachExisting`, `StartStopped`, `CreateFromValidImage`, or `BuildAndCreate` |
| `launch_plan_rejected` | Restore/launch planning rejected a faster plan                         | `{"plan":"...","reason":"...","container":"...","state":"..."}`                                                                                                   |
| `debug`                | `--debug` mode only; detailed trace lines                              | category string                                                                                                                                                   |
| `container_started`    | After `docker run -d` succeeds                                         | `{"container_name": "...", "capsule_log": "/path/to/multiplexer.log"}`                                                                                            |
| `container_exited`     | Container exited with code 0 after attach                              | `{"container_name": "...", "exit_code": 0, "oom_killed": false, "capsule_log": "..."}`                                                                            |
| `container_crash`      | Container exited non-zero or OOM                                       | same shape as `container_exited` with non-zero `exit_code` or `oom_killed: true`                                                                                  |
| `container_crash_log`  | Emitted alongside `container_crash` when docker logs tail is available | docker logs last lines as plain string                                                                                                                            |

## Container Lifecycle Tracking [#container-lifecycle-tracking]

A single run may start multiple containers — the role container plus a Docker-in-Docker sidecar. The `container_started` event records the container name and the **host path of the in-container capsule diagnostics log** (`~/.jackin/data/<container-name>/state/multiplexer.log`) so an agent reading the run JSONL can locate the per-container crash log without knowing the on-disk layout.

Example flow after a container crash:

1. The run JSONL records `container_started` with `"capsule_log": "~/.jackin/data/jk-xxxx-thearchitect/state/multiplexer.log"`.
2. The container exits non-zero → `container_crash` event with the exit code.
3. The caller fetches the last lines of `docker logs` → `container_crash_log` event with the evidence inline.
4. An agent reading the run JSONL has the full story: which container, its exit code, the docker log tail, **and the path to the capsule's own panic backtrace** in `multiplexer.log`.

This means sharing the run ID with an agent is sufficient to locate the root cause, even if the crash produced no host-side output.
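The flow above can be followed mechanically. This is a hedged sketch of how an agent might do it, not part of jackin': `crash_report` is a hypothetical name, and it assumes that `detail` for the container events is a JSON-encoded string and that a `container_crash_log` event follows the `container_crash` it belongs to.

```python
import json


def crash_report(lines):
    """Collect one record per crashed container from run JSONL lines."""
    capsule_logs = {}  # container name -> capsule log path from container_started
    crashes = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        kind = event.get("kind")
        detail = event.get("detail")
        if kind == "container_started":
            started = json.loads(detail)
            capsule_logs[started["container_name"]] = started["capsule_log"]
        elif kind == "container_crash":
            crash = json.loads(detail)
            name = crash["container_name"]
            crashes.append({
                "container_name": name,
                "exit_code": crash["exit_code"],
                "oom_killed": crash["oom_killed"],
                # Prefer the path on the crash event, else the one seen at start.
                "capsule_log": crash.get("capsule_log") or capsule_logs.get(name),
                "docker_log_tail": None,
            })
        elif kind == "container_crash_log" and crashes:
            # Plain-string evidence attached to the most recent crash.
            crashes[-1]["docker_log_tail"] = detail
    return crashes
```

Passing an open file handle for `~/.jackin/data/diagnostics/runs/<run-id>.jsonl` as `lines` yields the container, exit code, docker log tail, and capsule log path in one pass.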

## Stage Timings And Metrics [#stage-timings-and-metrics]

The `stage_done` event's `detail` always includes `duration_ms` — the wall-clock milliseconds between the corresponding `stage_started` and `stage_done` for the same stage name. At the end of every run, a single `run_summary` event records all stage durations as a compact JSON map:

```json
{"kind": "stage_done", "stage": "agent-binaries", ..., "detail": "{\"duration_ms\": 312}"}
...
{"kind": "run_summary", "detail": "{\"stage_durations_ms\":{\"identity\":45,\"agent-binaries\":312},\"stage_duration_histograms_ms\":{\"identity\":[45],\"agent-binaries\":[312]},\"event_counts\":{\"stage_done\":2},\"cache_hits\":0,\"cache_misses\":0}"}
```

This makes the run JSONL the primary source for spotting performance bottlenecks and correlating blocking stalls with async runtime work. The run summary also records event counts, stage-duration histograms, and cache hit/miss counters so metric evidence stays attached to the run ID instead of disappearing into a separate local log.
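For ad hoc inspection, the `run_summary` payload can be ranked directly. The sketch below is not how `jackin diagnostics summary` is implemented; `slowest_stages` is a hypothetical helper that assumes `detail` is a JSON-encoded string, as in the example above.

```python
import json


def slowest_stages(lines, top=3):
    """Return the `top` slowest stages as (stage, duration_ms) pairs."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("kind") != "run_summary":
            continue
        summary = json.loads(event["detail"])
        durations = summary["stage_durations_ms"]
        # Sort by duration, longest first.
        ranked = sorted(durations.items(), key=lambda item: item[1], reverse=True)
        return ranked[:top]
    return []  # no run_summary event found, e.g. the run is still in progress
```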

Launch diagnostics include nested credential timings such as `operator_env:<KEY>`, `github_env:<KEY>`, `role_state_prepare:github_auth`, and `role_state_prepare:<agent>_auth`. Details name value kinds or auth modes/outcomes only; resolved secret values are not written.

Launch-plan diagnostics explain why the foreground path chose a given repair level. For example, a missing current-role container records rejected `AttachExisting` / `StartStopped` plans and a selected `CreateFromValidImage` plan before the image decision proves whether that path can reuse a local image or must build. When launch reuses a valid local image but a refresh is still due, `CreateFromValidImage` keeps the restore reason and appends the image reason, such as `no_restore_candidate_valid_image:published_image_stale`.

Image cache diagnostics explain each derived-image decision. A valid local image emits `image_cache_hit` with `recipe_hash_match`; a rebuild emits `image_cache_miss` with the invalidation reason; a foreground-valid image with stale refresh inputs emits `image_refresh_background` so summaries show that launch reused the image while refresh work remains. Current reasons include broad failures such as `local_image_missing`, `missing_recipe_label`, `published_image_stale`, or `inspect_failed`, plus component-level recipe changes such as `construct_image_changed`, `base_image_changed`, `hooks_hash_changed`, `claude_plugin_recipe_changed`, and `host_identity_strategy_changed`.

Cold rebuild diagnostics also include `build_context_snapshot` after jackin' creates the immutable Docker build context. The event records file count, byte count, and the temporary context path so slow build-context creation can be separated from Docker build time.

## Implementation [#implementation]

The diagnostics system lives in <RepoFile path="crates/jackin-diagnostics/src/run.rs">crates/jackin-diagnostics/src/run.rs</RepoFile> (`RunDiagnostics` struct). One instance is held per process in a `OnceLock<Mutex<Option<Arc<RunDiagnostics>>>>`. The file sink is an `Option<Mutex<BufWriter<File>>>` — `None` when the file is gated off; when present, each event is serialized via `serde_json` and flushed immediately. Event counters always update (they feed the run summary, which OTLP also exports); only the JSONL write is skipped when the writer is absent.

`RunDiagnostics` methods:

* `compact(kind, message)` — lifecycle breadcrumb; written when the file sink is on.
* `stage(kind, stage, message, detail)` — stage event with optional detail; tracks `stage_started` / `stage_done` wall-clock timings.
* `debug(category, line)` — written only when `--debug` is active.
* `container_started(container_name, capsule_log_path)` — structured container lifecycle event.
* `container_exited(container_name, exit_code, oom_killed, capsule_log_path, crash_evidence)` — structured crash/exit event; emits an additional `container_crash_log` event when `crash_evidence` is `Some`.
* `emit_run_summary()` — writes the `run_summary` event with all accumulated stage durations.
* `summarize_run_file(path)` — reads an existing JSONL artifact and derives the operator-facing `jackin diagnostics summary` report.

Every `RunDiagnostics` write also emits a `tracing::info!` or `tracing::debug!` event (`kind: "debug"` events use DEBUG severity so level-based exporters can filter the firehose). `JackinDiagnosticsLayer` in <RepoFile path="crates/jackin-diagnostics/src/observability.rs">crates/jackin-diagnostics/src/observability.rs</RepoFile> bridges those tracing events back into the run JSONL schema, preserving span context when one is active. The same layer also captures the OpenTelemetry SDK's own `tracing` events (targets starting `opentelemetry`) at WARN and above and records them as `otlp_internal`. These are export failures that the OTLP layers cannot surface themselves, because `opentelemetry*` targets are filtered out of the log bridge to avoid an export→log→export feedback loop. The default build installs no terminal subscriber.

## OTLP export [#otlp-export]

jackin' exports OTLP over **gRPC only**, which is the default of the reference backend (Parallax) and what `parallax run` injects. A non-grpc `OTEL_EXPORTER_OTLP_PROTOCOL` (or per-signal variant) while an endpoint is configured is rejected at startup as a fatal `E016` rather than mis-sent (`first_unsupported_protocol` → `JackinError::UnsupportedOtlpProtocol`). The gRPC channel target is set to the endpoint verbatim: gRPC routes by service name, so no `/v1/<signal>` path is appended. With the `otlp` feature (default for the `jackin` binary) and an endpoint configured via the standard OpenTelemetry variables (`OTEL_EXPORTER_OTLP_ENDPOINT` for a base every signal derives from, or the per-signal `OTEL_EXPORTER_OTLP_{TRACES,LOGS,METRICS}_ENDPOINT` overrides), `init_tracing` installs span, log, and metric export beside the JSONL layer:

* **Spans** via `tracing-opentelemetry`: every `launch_stage` tracing span exports with its wall-clock duration and `stage` attribute, so per-stage timings render as a trace waterfall in the backend.
* **Logs** via `opentelemetry-appender-tracing`: every tracing event becomes a log record — the JSONL event stream (with `kind`/`stage`/`detail` as attributes and the message as the body) plus third-party crate telemetry such as bollard's request traces. The layer filter follows the debug flag: INFO level normally, DEBUG with `--debug` — the same two-tier rule as the rest of the telemetry surface.
* **Metrics** via an async-runtime `PeriodicReader` (5 s interval) over gRPC: `process.cpu.utilization` (a gauge, unit `1`; sysinfo per-core percent normalized to a 0..1 fraction) and `process.memory.usage` (an UpDownCounter per semconv, resident bytes), plus the stable tokio runtime counters `tokio.runtime.workers`, `tokio.runtime.alive.tasks`, and `tokio.runtime.global.queue.depth`. CPU and memory read through one shared sampler that refreshes sysinfo at most once per collect cycle — refreshing per-instrument would measure CPU over the microseconds between callbacks. The runtime gauges read jackin's **app** runtime handle, captured before entering the dedicated telemetry runtime (see below) — capturing it later would read the telemetry runtime, and reading it from the collect thread would yield none. Metric init is best-effort: a failed exporter build logs at DEBUG and never blocks span/log export or the run.
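The endpoint and protocol rules for the three signals can be sketched as follows. This is an illustration under stated assumptions, not the crate's logic: the function names are hypothetical (the real check is `first_unsupported_protocol`), an unset protocol variable is assumed to be accepted, and the per-signal protocol variables are assumed to follow the standard `OTEL_EXPORTER_OTLP_<SIGNAL>_PROTOCOL` naming.

```python
SIGNALS = ("TRACES", "LOGS", "METRICS")


def resolve_endpoints(env):
    """Map each signal to its endpoint: per-signal override first, else the base."""
    base = env.get("OTEL_EXPORTER_OTLP_ENDPOINT")
    resolved = {}
    for signal in SIGNALS:
        endpoint = env.get(f"OTEL_EXPORTER_OTLP_{signal}_ENDPOINT") or base
        if endpoint:
            # Used verbatim as the gRPC target: no /v1/<signal> path is appended.
            resolved[signal.lower()] = endpoint
    return resolved


def unsupported_protocol(env):
    """Return (variable, value) for the first non-grpc protocol, or None."""
    if not resolve_endpoints(env):
        return None  # nothing is exported, so there is nothing to reject
    names = ["OTEL_EXPORTER_OTLP_PROTOCOL"]
    names += [f"OTEL_EXPORTER_OTLP_{signal}_PROTOCOL" for signal in SIGNALS]
    for name in names:
        value = env.get(name)
        if value and value != "grpc":
            return (name, value)
    return None
```

A non-`None` result corresponds to the fatal `E016` case described above.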

The export filter is scoped to jackin's own telemetry: the directive silences the OTLP transport stack (`hyper`/`h2`/`tower`/`tonic`/`reqwest`/`opentelemetry*`) so the log bridge cannot re-export the exporter's own request logs — a feedback loop under `--debug` — and the backend is not flooded with dependency-internal spans. The silencing applies only to the export layers; `JackinDiagnosticsLayer` is unfiltered, so it still captures `opentelemetry*` WARN+ as `otlp_internal` and a failed export stays visible in the run file.

### Tag taxonomy [#tag-taxonomy]

All attribute keys live in `otel_keys` in <RepoFile path="crates/jackin-diagnostics/src/observability.rs">crates/jackin-diagnostics/src/observability.rs</RepoFile>, the one source of truth. Every key is dotted, never underscored: jackin's own keys are `jackin.component`, `jackin.screen.name`, `jackin.workspace`, `jackin.agent.selected`, …; the run ID uses the `parallax.run.id` key (Parallax is the reference backend, which promotes it to a queryable column: `parallax logs --run <id>`); `service.*` and `session.*` reuse the OpenTelemetry standard namespaces. There is no separate `jackin.run.id`; one dotted key groups the run. The OTLP resource carries `service.name=jackin`, `service.version`, `jackin.component=host`, and `parallax.run.id`. The `parallax.run.id` attribute is omitted when a wrapper already supplies it via `OTEL_RESOURCE_ATTRIBUTES`; the wrapper's value then wins and the env detector provides it. Because the run ID must be on the resource, `RunDiagnostics::start` mints it **before** installing the subscriber.
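The resource rule for `parallax.run.id` can be illustrated with a short sketch. The helper names and the version string are hypothetical; the only assumptions about the environment are that `OTEL_RESOURCE_ATTRIBUTES` uses the standard comma-separated `key=value` form and that the env detector merges those pairs separately.

```python
def parse_resource_attributes(raw: str) -> dict:
    """Parse the comma-separated key=value pairs of OTEL_RESOURCE_ATTRIBUTES."""
    attrs = {}
    for pair in raw.split(","):
        key, sep, value = pair.partition("=")
        if sep and key.strip():
            attrs[key.strip()] = value.strip()
    return attrs


def host_resource(run_id: str, version: str, env: dict) -> dict:
    """Attributes the host sets itself; wrapper-supplied ones are left to the env detector."""
    wrapper = parse_resource_attributes(env.get("OTEL_RESOURCE_ATTRIBUTES", ""))
    attrs = {
        "service.name": "jackin",
        "service.version": version,
        "jackin.component": "host",
    }
    if "parallax.run.id" not in wrapper:
        # No wrapper value: the host stamps its own run ID.
        attrs["parallax.run.id"] = run_id
    return attrs
```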

### Per-screen traces and span links [#per-screen-traces-and-span-links]

`screen.rs` (<RepoFile path="crates/jackin-diagnostics/src/screen.rs">crates/jackin-diagnostics/src/screen.rs</RepoFile>) models each TUI screen as its own trace. `enter_screen` starts a span, detaches it into a fresh trace with `set_parent(Context::new())`, and `add_link`s the previous screen's `SpanContext` — so screens are separate but navigable. The current screen is a thread-local (sound: host TUI navigation is single-threaded); `carry_link_forward` snapshots it across the `run_console` return so the launch trace links to the list it started from. `launch_trace` enters the `launch` screen, tags workspace/agent/provider, and `future.instrument`s `load_role` so the existing `launch_stage` spans re-root under the launch trace with no change to `jackin-launch`.

### Cross-process propagation [#cross-process-propagation]

When OTLP is active, `launch.rs` injects `TRACEPARENT` (the launch span, via `current_traceparent`), a container-reachable `OTEL_EXPORTER_OTLP_ENDPOINT` (`container_otlp` rewrites a loopback authority to `host.docker.internal` and flags `--add-host=host.docker.internal:host-gateway`), and `JACKIN_RUN_ID`. The capsule daemon (<RepoFile path="crates/jackin-capsule/src/telemetry.rs">crates/jackin-capsule/src/telemetry.rs</RepoFile>) calls `init_capsule_tracing`, which stamps `jackin.component=capsule`, a minted standard `session.id`, and the host `parallax.run.id`, and emits a session-start span linked to the launch via the parsed traceparent. The capsule's `clog!`/`cdebug!` lines bridge into OTLP logs (INFO/DEBUG by tier) correlated by `session.id`. The session exports per activity rather than under one long-lived span, so a SIGKILL only loses the in-flight tail.
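The loopback rewrite can be sketched in a few lines. This is an assumption-laden illustration rather than the behavior of `container_otlp` itself: `container_endpoint` is a hypothetical name, and treating `localhost`, `127.0.0.0/8`, and `::1` as the loopback authorities is a guess at what the real function matches.

```python
import ipaddress
from urllib.parse import urlsplit, urlunsplit


def is_loopback(host: str) -> bool:
    """True for `localhost` and for literal loopback IP addresses."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # a non-loopback hostname such as collector.example


def container_endpoint(endpoint: str):
    """Return (endpoint reachable from a container, whether --add-host is needed)."""
    parts = urlsplit(endpoint)
    host = parts.hostname
    if host is None or not is_loopback(host):
        return endpoint, False
    netloc = "host.docker.internal"
    if parts.port is not None:
        netloc += f":{parts.port}"
    # Any userinfo in the original authority is dropped in this sketch.
    return urlunsplit(parts._replace(netloc=netloc)), True
```

When the second value is `True`, the launch would also need `--add-host=host.docker.internal:host-gateway`, as described above.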

### Exporters and lifecycle [#exporters-and-lifecycle]

The gRPC/tonic exporter is async, so it cannot run on the SDK's stable dedicated-thread batch processors (they `block_on` the export off any reactor and panic / hang); it requires the async-runtime batch processors driven by a tokio runtime. Both the host and capsule use a current-thread tokio main, where plain `rt-tokio` deadlocks on flush (the `futures_executor::block_on` flush parks the only thread) and `rt-tokio-current-thread`'s isolated per-spawn runtime cannot drive tonic's h2 connection. jackin' therefore owns a **dedicated multi-thread telemetry runtime** (`otel_runtime`, one worker, held for the process lifetime) and builds every exporter/processor/reader inside its `enter()` guard so their workers — and tonic's connection driver — spawn onto it; the flush then parks the main thread while those worker threads complete the export. On the host, `ActiveRunGuard::drop` calls `shutdown_otlp` to force-flush all providers, so the tail flushes on **every** run exit — including `?` error early-returns — not just the success path; the capsule holds an equivalent `FlushGuard` for the daemon's lifetime. Without an endpoint configured the layers and the flush are no-ops, and `--no-default-features` removes the dependency entirely.

The operator-facing setup lives in the [Run Telemetry guide](/guides/run-telemetry/).

**The firehose never reaches the operator's screen, in either the full-screen TUI or plain CLI commands.** It flows only to the active sink (the run JSONL and/or OTLP export); `--debug` raises the captured detail level, it does not stream events to stderr. This is why `init_tracing` attaches no `fmt` layer: a layer writing to stdout/stderr would corrupt the alternate screen the console / launch cockpit owns, and would clutter ordinary CLI output. Operator-visible lifecycle lines are a separate, deliberately compact surface (`emit_compact_line`): printed to stderr on a plain CLI, but when a rich surface owns the screen they are *deferred* into the debug buffer and flushed at teardown rather than dropped — so a notice such as a failed export reaches the operator (and any wrapping parent process) without spewing over the cockpit, and without depending on the optional run file.

## Redaction [#redaction]

`container_name` and `capsule_log` are paths and identifiers, not secrets. The docker log tail (`crash_evidence`) is freeform text from the container process; env-var values that the agent printed to stdout/stderr may appear there. No filtering is applied — the assumption is that the operator who runs `jackin load` trusts the agent output they requested.

## See also [#see-also]

* <RepoFile path="crates/jackin-diagnostics/src/run.rs">crates/jackin-diagnostics/src/run.rs</RepoFile> — `RunDiagnostics` implementation
* <RepoFile path="crates/jackin-diagnostics/src/observability.rs">crates/jackin-diagnostics/src/observability.rs</RepoFile> — `tracing` subscriber initialization
* [Capsule debug crash triage](/reference/capsule/) — how to use the run JSONL + `multiplexer.log` pointer to trace a container crash