
Declarative Resource Limits per Agent

Status: Open — design proposal (Phase 1, Agent Orchestrator Research Program)

jackin’ runs every agent in a Docker container with whatever resource allocation the host gives it. On a developer laptop, six parallel agents can each spawn a cargo build, exhaust memory, OOM-kill the desktop, or starve each other for CPU. There’s no operator-facing control for this today; the operator’s only knob is “launch fewer agents.”

Docker exposes the right primitives (--memory, --cpus, --ulimit nofile=N, plus --memory-reservation for soft limits), but jackin’ doesn’t plumb any of them through.
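For illustration (values and image name hypothetical), the raw Docker invocation these flags map to looks like:

Terminal window
docker run --memory 16g --memory-reservation 12g --cpus 3.0 --ulimit nofile=16384:16384 <agent-image>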

multicode addresses this directly with four declarative fields: memory-high (soft limit, triggers reclaim), memory-max (hard limit, triggers OOM kill), cpu (quota as percentage), and nofile (FD ceiling on Apple-container backends).

Why this matters:

  • Parallel agents are unsafe today on resource-constrained hosts. This is a literal correctness gap: a runaway agent can take down the operator’s whole machine.
  • The autonomous queue (Phase 4) is unusable without it. Five queued agents with default-unlimited memory and CPU is a recipe for OOM kills the moment two of them happen to be running cargo build simultaneously.
  • Cross-backend resource translation is the right home for this design. Docker, Apple container, and the planned selectable sandbox backends each express limits differently — a declarative layer means each backend translates once.

Sources:

[isolation]
memory-high = "12 GiB" # soft limit; triggers cgroup memory.high
memory-max = "16 GiB" # hard limit; triggers OOM at this point
cpu = "300%" # 3 CPU cores worth of quota
nofile = 16384 # FD ceiling (Apple container only)

multicode parses these via the size crate (decimal 12 GB and binary 16 GiB both supported), expands shell variables, then maps them onto systemd-run --property MemoryHigh=... etc. — each backend has its own translator, but the config surface is uniform.
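To give a sense of that mapping (illustrative only; the exact properties multicode sets are its own business), the systemd-run side of the translation looks roughly like:

Terminal window
systemd-run --scope --property MemoryHigh=12G --property MemoryMax=16G --property CPUQuota=300% -- <agent-command>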

multicode also tracks runtime metrics that complement the limits: current RAM, CPU %, and, crucially, an OOM kill count (sampled from systemd’s memory pressure counters). When an agent gets OOM-killed, the operator sees it.

The right level for these fields is the role manifest, not the operator config or workspace config. Reasoning: limits scale with the toolchain (a Rust agent running cargo build needs more headroom than a Go agent), and the role is where toolchain choices live. Operator/workspace overrides can come later if a use case surfaces.

jackin.role.toml
version = "v1alpha2"
dockerfile = "Dockerfile"
[runtime.limits]
memory_high = "12 GiB" # soft (Docker --memory-reservation)
memory_max = "16 GiB" # hard (Docker --memory)
cpus = "3.0" # Docker --cpus (string for "1.5", "300%")
nofile = 16384 # Docker --ulimit nofile=N:N
[runtime.limits.oom]
preserve_state = true # don't auto-clean an OOM-killed instance
notify = true # surface OOM in console (depends on Phase 2 status)

memory_high is optional; when absent, the soft limit defaults to memory_max. nofile is optional and defaults to the host’s limit. cpus accepts both fractional and percentage forms ("3.0" and "300%" are equivalent).
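A minimal sketch of that parse/normalize step, assuming a Rust core (function names are hypothetical, not jackin’s actual API):

/// Parse "16 GiB" / "12 GB" / "16384" into bytes.
/// Binary suffixes (KiB/MiB/GiB) are powers of 1024; decimal (KB/MB/GB) powers of 1000.
/// e.g. parse_mem_size("12 GiB") == Some(12_884_901_888)
fn parse_mem_size(s: &str) -> Option<u64> {
    let s = s.trim();
    // Split at the first alphabetic character: "12 GiB" -> ("12 ", "GiB").
    let split = s.find(|c: char| c.is_ascii_alphabetic()).unwrap_or(s.len());
    let (num, unit) = s.split_at(split);
    let value: f64 = num.trim().parse().ok()?;
    let factor: u64 = match unit.trim().to_ascii_lowercase().as_str() {
        "" | "b" => 1,
        "kb" => 1_000,
        "mb" => 1_000_000,
        "gb" => 1_000_000_000,
        "kib" => 1 << 10,
        "mib" => 1 << 20,
        "gib" => 1 << 30,
        _ => return None,
    };
    Some((value * factor as f64) as u64)
}

/// Normalize "3.0" or "300%" to a fractional core count.
/// e.g. parse_cpus("300%") == Some(3.0) == parse_cpus("3.0")
fn parse_cpus(s: &str) -> Option<f64> {
    let s = s.trim();
    if let Some(pct) = s.strip_suffix('%') {
        pct.trim().parse::<f64>().ok().map(|p| p / 100.0)
    } else {
        s.parse::<f64>().ok()
    }
}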

Terminal window
jackin load <agent> --memory-max 8GiB --cpus 2.0

Operator override is a V1 nicety, not a config-file substitute. Useful for “this one launch is on a smaller machine.”

Each backend implements a ResourceLimits translator:

  • Docker (today): --memory, --memory-reservation, --cpus, --ulimit nofile=N:N. oom_score_adj if needed for preserve-state.
  • Apple container (when selectable backends ships): direct per-allocation limits.
  • systemd-run / bwrap (if it ever lands): cgroups properties.

A backend that can’t honor a declared limit (e.g. nofile on a container runtime that doesn’t expose it) emits a warning at launch and proceeds — not a hard error. Operators see the gap; the agent still runs.
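A minimal sketch of that seam, again assuming a Rust core (type and method names are hypothetical, not the shipped API):

/// Limits as parsed from [runtime.limits]; every field optional, absent = unlimited.
#[derive(Default)]
pub struct ResourceLimits {
    pub memory_high: Option<u64>, // bytes
    pub memory_max: Option<u64>,  // bytes
    pub cpus: Option<f64>,        // fractional cores
    pub nofile: Option<u64>,
}

/// Each backend turns the declarative limits into whatever its runtime
/// accepts, returning warnings for anything it cannot honor.
pub trait LimitTranslator {
    fn translate(&self, limits: &ResourceLimits) -> (Vec<String>, Vec<String>);
}

pub struct DockerTranslator;

impl LimitTranslator for DockerTranslator {
    fn translate(&self, l: &ResourceLimits) -> (Vec<String>, Vec<String>) {
        let mut args = Vec::new();
        if let Some(b) = l.memory_max {
            args.push(format!("--memory={b}b"));
        }
        if let Some(b) = l.memory_high {
            args.push(format!("--memory-reservation={b}b"));
        }
        if let Some(c) = l.cpus {
            args.push(format!("--cpus={c}"));
        }
        if let Some(n) = l.nofile {
            args.push(format!("--ulimit=nofile={n}:{n}"));
        }
        // Docker can honor all four fields, so no warnings here; a backend
        // without an nofile knob would push a warning instead of an arg.
        (args, Vec::new())
    }
}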

V1 scope:

  • [runtime.limits] block in jackin.role.toml with the four fields above.
  • --memory-max / --memory-high / --cpus / --nofile flags on jackin load for one-shot overrides.
  • Docker translator only in V1 — that’s the only backend.
  • size-crate-style parsing (binary and decimal); reuse a small parser rather than pulling in the dependency.
  • Defaults: no limits applied if the field is absent (matches today).
  • [runtime.limits.oom] block: preserve_state defaults to true, notify defaults to true (no-op until Phase 2 lands).

Deferred:

  • Per-workspace overrides. Manifest-level only in V1.
  • Disk I/O limits (--blkio-weight). Useful but harder to reason about; defer to user request.
  • Network bandwidth limits. Defer indefinitely.
  • Auto-pausing OOM-killed agents instead of killing the container. Docker doesn’t expose a clean way; revisit per-backend later.

Open questions:

  • Should cpus accept percentages explicitly? "3.0" is unambiguous for Docker. "300%" matches multicode but maps awkwardly to Docker (which doesn’t accept the percent sign). Recommended default: accept both at parse time, normalize to fractional core count internally.
  • Should manifest limits be inheritable across roles? If org/base declares memory_max=16GiB and org/derived extends it, does the derived role inherit it? Recommended default: yes, with override semantics — but role inheritance is a separate, larger design question and probably out of scope for V1.
  • OOM preserve_state interaction with worktree cleanup (the shipped per-branch safety policy described in Per-mount isolation). An OOM-killed instance should always be preserved. The cleanup helper already handles non-zero exits; OOM is a special case of that. Confirm the existing helper sees OOM as non-zero (one way to check on the Docker backend is sketched below).
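For that last check, the Docker backend can inspect the container after exit: an OOM-killed container typically exits 137 (SIGKILL), and State.OOMKilled flags the kernel OOM kill specifically (container name illustrative):

Terminal window
docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' <instance-container>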