Source: DESIGN-coord-spawn-atomic-gate.md — Rendered: 2026-05-19 14:58 UTC — Agents read the .md; humans read the .html.

DESIGN: Atomic Coord-Spawn Gate¶

Status: v3 — operator approved 2026-05-19 (G2 cleared on v3). v2 bundled handoffop drain-path RC marker backfill; v3 adds Change 7 — extend the coord-spawn lock to also cover handoffop drain-path spawn (closes the second half of the race that codex iter-1 surfaced in v2 review). Ready to re-dispatch. Author: coord agent 5f05977b (projects-fleet) Reviewers: operator (G2); codex + /review at PR time per CLAUDE.md §4. Target version: v0.12.2 (point release on top of v0.12.1). Created: 2026-05-19 Priority: P0 — duplicate coords observed in production (projects-spark: agents 72ea51b4 + 04f00601); operator: "we don't allow this." "only one coord for sure." Plus second observed P0: handoff replacement doesn't carry --remote-control (PR #163 fixes future fresh spawns but the deferred [P2] backfill case for pre-#163 coord handoffs remains). Parent: docs/DESIGN-rc-listener-lifecycle.md v8 (coord lifecycle ownership). Sibling of docs/DESIGN-rc-coord-auto-marker.md (v0.12.1 P0, PR #163). Bundles (per operator memory feedback_bundle_related_prs): atomic spawn gate + handoffop drain-path RC backfill (PR #163 deferred [P2] (2) "handoffop missing marker write for coord-handoff backfill"). Does NOT change: the coord skill's coordinator.lock (skill-owned post-spawn supervision lock — different concern), the TUI [a] keybinding flow, rc.Enabled semantics, or the worker dispatch path.

TL;DR¶

fleet dispatch --coord-spawn has a TOCTOU race. The veto at cmd/fleet/dispatch.go:522 reads coordRecordExistsInList against a pre-fetched agent-records snapshot; two concurrent dispatches both see "no existing coord" and both spawn. Confirmed against projects-spark today: - 72ea51b4 — has last_handoff_path, came from a handoff replacement. - 04f00601 — no handoff path, fresh [a] press from TUI mid-handoff.

Both are alive, both think they're the coord, both compete for the skill-owned coordinator.lock (which serializes ONLY supervisor ticks, not spawn).

This design adds an OS-level NB-flock at the dispatch CLI layer (coord-spawn.lock) that serializes the entire (veto-check, agent-record-write, spawn-return) tuple for a given project. Second-into-the-lock fails immediately with a clean error; first acquires + spawns + releases.

rc.Enabled and the existing veto stay intact. The lock is a new chokepoint, not a replacement.

Problem reproduction¶

$ ls ~/.fleet/agents/ | xargs -I{} cat ~/.fleet/agents/{} | jq -r 'select(.project=="projects-spark") | "\(.id) pid=\(.pid) handoff=\(.last_handoff_path)"'
04f00601 pid=79493 handoff=null
72ea51b4 pid=78455 handoff=/Users/pinkbear/.fleet/handoffs/49bc8a03-20260519-064913-3c66.md

Both alive (both PIDs respond to kill -0), both have distinct tmux sessions, both have agent records, both are running claude --dangerously-skip-permissions. The skill's coordinator.lock only blocks duplicate ticks; nothing blocks duplicate SPAWN.

Root cause (race trace)¶

Operator triggers handoff on 49bc8a03. Coord writes handoff doc, exits.
Handoff infrastructure starts spawning replacement (will become 72ea51b4).
Spawn is slow (fleet rc-listener-bootstrap-sk-3e98 + claude --dangerously-skip-permissions takes seconds; tmux session not yet alive).
Operator sees no live coord in TUI dashboard, presses [a] to attach.
TUI's [a] handler runs findExistingCoordForProject(m.records, "projects-spark"): - m.records is a stale snapshot from the last refresh tick. The handoff replacement hasn't registered yet. - Returns (nil, false) — falls through to startCoordSpawn.
TUI shells out fleet dispatch coord-projects-spark --project projects-spark --coord-spawn.
CLI's runDispatch reads agent-records list, hits veto at line 522: - coordStateFresh("projects-spark") may be FALSE (the handoff replacement hasn't ticked coord-state.json yet). - coordRecordExistsInList returns FALSE (handoff replacement hasn't written its record yet). - Veto SKIPPED.
runDispatch proceeds to write a fresh agent record (04f00601), spawn tmux session.
Meanwhile, the handoff replacement (72ea51b4) completes its own spawn-and-record-write.
Two coords, same project, both alive. Skill-side coordinator.lock then arbitrates ticks (the loser sees "lock busy" and exits cleanly), so they don't trample each other's state — but the OPERATOR sees a dashboard with two rows for the same project and both panes are responsive. Confusing + wasteful + violates the "one coord per project" invariant.

The race window is the gap between (handoff spawn started) and (handoff spawn registered + tmux session alive). On a busy mac the window is multi-second; on a fast machine still hundreds of milliseconds — enough for a fast double-press of [a] to race.

Design decision¶

Add an NB-flock at ~/.fleet/projects/<p>/.locks/coord-spawn.lock held across the entire coord-spawn critical section in runDispatch.

Acquired at the top of runDispatch when opts.coordSpawn == true (before any check or veto).
Mode: LOCK_EX | LOCK_NB. Held by the first dispatcher to acquire; second-into-the-lock fails immediately with a clean error.
Released via defer on function return — both success path and error path.
Lock file lives in .locks/ next to the existing coordinator.lock (skill-owned), state.lock (state-package owned), and per-worker worker-*.lock files. Sibling, not a replacement.
Operator-facing error on contention: refusing to spawn coord for project %q: another coord-spawn is in progress (lock pid: <pid>). attach to the in-flight coord via TUI [a] once it finishes booting.

Why NB-flock (not blocking with timeout): - Operator double-presses are visible: the second [a] press surfaces an explicit "another spawn in progress" message rather than a silent wait → the operator learns the correct mental model ("don't double-press; wait or attach"). - The first dispatcher finishes in <5s for happy-path spawns. A blocking flock would either timeout (re-introducing race) or sit for unbounded time on a stuck spawn. - Matches existing rc.go (internal/rc/rc.go:withLock) and skill coordinator.lock conventions — both NB-flock.

Why keep the existing veto at line 522: - Defense in depth. The veto catches cross-tmux-socket and cross-host cases (where the lock file might not be reachable, e.g., ~/.fleet/ mounted differently). Vanishingly rare today but not zero. - The veto's error message is also more informative for the "live coord, different socket" failure mode than a bare flock-contention error.

Implementation¶

Change 1 — `cmd/fleet/dispatch.go`¶

At the top of runDispatch, after opts.coordSpawn validation but BEFORE any other check (definitely before line 522):

// v0.12.2 P0: atomic coord-spawn gate. NB-flock per-project closes
// the TOCTOU race between two concurrent --coord-spawn dispatches
// (TUI double-press during handoff was producing duplicate coords —
// e.g. projects-spark agents 72ea51b4 + 04f00601 on 2026-05-19).
// Acquired BEFORE the veto/records-list read; held through agent
// record write + spawn invocation; released on runDispatch return.
//
// The existing live-coord veto at line ~522 stays — defense in depth
// for cross-socket / cross-host cases the flock doesn't reach.
if opts.coordSpawn && opts.project != "" {
    release, err := acquireCoordSpawnLock(opts.project)
    if err != nil {
        return err  // operator-friendly message inside acquireCoordSpawnLock
    }
    defer release()
}

Change 2 — new helper `cmd/fleet/coord_spawn_lock.go`¶

package main

import (
    "fmt"
    "os"
    "path/filepath"
    "syscall"

    "github.com/edisonshen/fleet/internal/state"
)

// acquireCoordSpawnLock takes an NB-flock on
// ~/.fleet/projects/<project>/.locks/coord-spawn.lock. Returns a
// release function on success. On contention, returns a clear
// operator-facing error.
//
// Naming: "coord-spawn.lock" — distinct from the skill-owned
// "coordinator.lock" (post-spawn supervisor lock) and the
// state-package "state.lock". This file gates SPAWN only.
func acquireCoordSpawnLock(project string) (release func(), err error) {
    pdir, err := state.ProjectDir(project)
    if err != nil {
        return nil, fmt.Errorf("acquireCoordSpawnLock: project dir: %w", err)
    }
    lockDir := filepath.Join(pdir, ".locks")
    if err := os.MkdirAll(lockDir, 0o755); err != nil {
        return nil, fmt.Errorf("acquireCoordSpawnLock: mkdir locks: %w", err)
    }
    lockPath := filepath.Join(lockDir, "coord-spawn.lock")
    f, err := os.OpenFile(lockPath, os.O_RDWR|os.O_CREATE, 0o644)
    if err != nil {
        return nil, fmt.Errorf("acquireCoordSpawnLock: open: %w", err)
    }
    if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
        _ = f.Close()
        if err == syscall.EWOULDBLOCK {
            // Read current holder for the error message — best-effort.
            holder := ""
            if b, rerr := os.ReadFile(lockPath); rerr == nil {
                holder = string(b)
            }
            return nil, fmt.Errorf(
                "refusing to spawn coord for project %q: another coord-spawn is in progress%s. "+
                    "attach to the in-flight coord via TUI [a] once it finishes booting",
                project,
                func() string {
                    if holder != "" {
                        return fmt.Sprintf(" (lock holder: %s)", holder)
                    }
                    return ""
                }(),
            )
        }
        return nil, fmt.Errorf("acquireCoordSpawnLock: flock: %w", err)
    }
    // Write holder body (PID-based; best-effort, not load-bearing).
    _ = f.Truncate(0)
    _, _ = fmt.Fprintf(f, "pid=%d\n", os.Getpid())
    return func() {
        _ = syscall.Flock(int(f.Fd()), syscall.LOCK_UN)
        _ = f.Close()
    }, nil
}

Change 3 — no changes to TUI¶

The TUI's [a] handler keeps its existing findExistingCoordForProject fast-path. On contention, the CLI's flock failure surfaces back as a flash message via the existing dispatch-failure path. No new TUI lock; CLI is the chokepoint.

Change 4 — no changes to `internal/rc`, `internal/spawn`, `internal/handoff/`¶

The lock is dispatch-layer concern. None of those packages need to know.

E2E test¶

New test file: cmd/fleet/dispatch_coord_spawn_lock_test.go.

T1 — second concurrent coord-spawn for same project fails fast¶

func TestCoordSpawn_SecondConcurrentFails(t *testing.T) {
    fleetHome := t.TempDir()
    t.Setenv("FLEET_HOME", fleetHome)
    project := "test-spawn-race"
    // Goroutine 1: acquire and HOLD the lock (simulates first dispatch in-flight).
    release, err := acquireCoordSpawnLock(project)
    if err != nil { t.Fatalf("first acquire: %v", err) }
    defer release()
    // Goroutine 2: attempt second acquire — should fail with EWOULDBLOCK + clean error.
    _, err = acquireCoordSpawnLock(project)
    if err == nil {
        t.Fatal("second acquire succeeded — race not closed")
    }
    if !strings.Contains(err.Error(), "another coord-spawn is in progress") {
        t.Fatalf("error message drift: %v", err)
    }
}

T2 — lock released on function return; subsequent acquire OK¶

func TestCoordSpawn_LockReleasedOnReturn(t *testing.T) {
    // first: acquire + release.
    // second: acquire — should succeed.
    // Verifies defer release() actually runs.
}

T3 — different projects don't block each other¶

func TestCoordSpawn_PerProjectLockIsolation(t *testing.T) {
    // acquire("proj-a") + hold.
    // acquire("proj-b") — should succeed (independent lock files).
}

T4 — empty project (legacy path) skips the lock¶

func TestCoordSpawn_EmptyProjectSkipsLock(t *testing.T) {
    // The runDispatch guard is `opts.coordSpawn && opts.project != ""`.
    // Test that with opts.project="" the spawn proceeds without acquiring.
    // (Worker-dispatch path; coord-spawn requires a project per validateOpts.)
    // This test exercises the helper-level branch.
}

T5 — operator-facing error message includes holder PID¶

func TestCoordSpawn_ContentionErrorIncludesHolder(t *testing.T) {
    // First acquire writes "pid=N\n" to the lock file.
    // Second acquire reads it back into the error message.
    // Verify pid format in error.
}

T6 — concurrent runDispatch invocations: only one spawns¶

Integration test (heavier — may be a build-tag-gated integration_test.go):

func TestRunDispatch_Concurrent_OnlyOneSpawns(t *testing.T) {
    // Two goroutines invoke runDispatch with --coord-spawn for the same project.
    // Assert: exactly ONE writes an agent record; the other returns the lock error.
    // Use a synchronization barrier to maximize the race window.
}

Acceptance gate¶

go build ./... clean.
go test -race -count=1 ./cmd/fleet/... green (T1–T6).
golangci-lint run ./cmd/fleet/ clean.
gofmt -l . clean (modulo pre-existing .codex-tmp/).
python3 -m pytest skills/ -q green (no skill changes; sanity).
The existing live-coord veto test in cmd/fleet/dispatch_test.go still passes (veto stays as defense in depth).
PR body documents:
The race trace (link this design doc).
That the existing veto is intentionally kept.
The operator-facing contention error for spark-style scenarios.

Out of scope¶

Cleaning up the existing projects-spark duplicate (operator decides between 72ea51b4 and 04f00601 and fleet rm's the loser).
TUI [a] debounce (UX polish; CLI gate is the load-bearing fix).
Cross-host lock support (single-host model; lock file is on local FS).
coordinator.lock consolidation (skill-owned, separate concern).
Backfilling lock for already-running coords.

Stack relationship¶

Branches off worker/rc-coord-auto-marker-282c (PR #163 head) — both PRs touch cmd/fleet/dispatch.go and cmd/fleet/handoff.go. Per memory feedback_stacked_prs_no_merge_wait, stack don't wait. Finisher rebases to main after #163 merges.

Change 6 — handoffop drain-path RC marker backfill (bundled)¶

PR #163's reviewer iter-7 caught one half of the drain-path RC injection bug (gating the inject on isCoordHandoffForAgent). The deferred [P2] in PR #163's body — "handoffop drain-path missing marker write for coord-handoff backfill" — is the other half: for coord handoffs that go through the drain path (TUI's background drainer processing a stale queue file), internal/handoffop/handoffop.go injects the flag but does NOT write the rc-enabled marker first. So drain-path handoffs of pre-v0.12.1 coords still spawn replacements without RC.

cmd/fleet/handoff.go (PR #163 line 767-790) writes the marker BEFORE the inject:

if isCoordHandoffForProject(oldRec.Project, oldRec.ID) {
    if err := writeMarkerFn(oldRec.Project); err != nil { /* warn + continue */ }
    rewritten := injectRemoteControlFlagProject(command, rcSessionName, oldRec.Project)
    if !sameCommand(rewritten, command) { rewrittenExecArgv = rewritten }
}

internal/handoffop/handoffop.go (around line 553-590 per iter-7 fix) ONLY gates the inject on isCoordHandoffForAgent — no marker write. Add the same marker-write helper call.

Implementation¶

In internal/handoffop/handoffop.go, inside the existing if isCoordHandoffForAgent(oldRec.Project, oldRec.ID) { ... } block (where the inject already runs), add the marker write BEFORE the inject:

if isCoordHandoffForAgent(oldRec.Project, oldRec.ID) {
    if err := rc.WriteMarker(oldRec.Project); err != nil {
        fmt.Fprintf(stderr,
            "warning: rc.WriteMarker(%q) failed during drain handoff: %v "+
            "(continuing with plain claude argv; run `fleet rc up %s` to recover)\n",
            oldRec.Project, err, oldRec.Project)
    }
    // existing inject call follows
}

Same package-var seam (writeMarkerFn = rc.WriteMarker) for test injection, mirroring cmd/fleet/dispatch.go. Failure is non-fatal — log + continue.

Tests¶

New e2e in internal/handoffop/handoffop_test.go:

T-drain-coord-marker-write — drain handoff of recognized coord writes the rc-enabled marker before injecting the flag.
T-drain-worker-no-marker-write — drain handoff of a worker (isCoordHandoffForAgent=false) does NOT write the marker (preserves push-storm protection).
T-drain-marker-write-failure-non-fatal — stub writeMarkerFn to inject failure; verify the inject still runs (now without marker, so inject no-ops) and the handoffop returns success.

Why bundled (not separate PR)¶

Per memory feedback_bundle_related_prs: "PR scope is business-logic; commit scope is a useful change — one PR per business unit." The atomic spawn-gate and the drain-path marker-write are both coord lifecycle correctness — same business unit. Operator explicitly approved bundling in 2026-05-19 chat.

PR #163's reviewer marked this as deferred [P2], but observed production behavior elevates it: existing pre-v0.12.1 coords (like projects-spark) need their NEXT drain-path handoff to backfill the marker, otherwise the user sees "handoff doesn't get remote-control" forever.

Change 7 — extend lock to handoffop drain-path (v3 addition)¶

Codex iter-1 [P1] from v2 review: cmd/fleet/dispatch.go:245-250 lock only covers concurrent runDispatch; does NOT serialize against handoffop.spawnAndRetire drain path. TUI [a] racing an in-flight drain replacement still double-spawns. This IS the production race trace at design-doc problem section.

Fix¶

internal/handoffop/handoffop.go already gates the marker-write + RC inject on isCoordHandoffForAgent(oldRec.Project, oldRec.ID). INSIDE that block, BEFORE the spawn call, acquire the SAME project-level lock that cmd/fleet/dispatch.go uses. Release after spawn returns.

Wiring constraint: internal/handoffop cannot import cmd/fleet (cycle). Solution — move acquireCoordSpawnLock from cmd/fleet/coord_spawn_lock.go to a NEW package internal/coordlock/coordlock.go (or internal/state/'s lock helpers). Both callers import from there.

// internal/coordlock/coordlock.go (NEW)
package coordlock

func Acquire(project string) (release func(), err error) { ... }

// cmd/fleet/dispatch.go — replace local acquireCoordSpawnLock call:
release, err := coordlock.Acquire(opts.project)

// internal/handoffop/handoffop.go — inside if isCoordHandoffForAgent(...) block:
if isCoordHandoffForAgent(oldRec.Project, oldRec.ID) {
    release, err := coordlock.Acquire(oldRec.Project)
    if err != nil {
        // Drain path can't fail-fast like dispatch can — there's already
        // a queue file committed. Surface the contention as a warning
        // and continue (defense-in-depth veto at retire time will catch
        // a duplicate that slips through; the operator-visible error
        // path is the same as for the dispatch CLI).
        fmt.Fprintf(stderr,
            "warning: coordlock.Acquire(%q) contended during drain handoff: %v "+
            "(continuing; the duplicate-coord risk window is the spawn time)\n",
            oldRec.Project, err)
    } else {
        defer release()
    }
    // existing marker write + inject + spawn calls follow
}

Why warn-and-continue (not fail-fast) on drain path: a drain handoff has a queue file already committed; bailing out would orphan the queue file. The dispatch CLI can fail-fast because no on-disk state was mutated. The drain path matches PR #163's existing pattern of "non-fatal warning" inside this block.

Tests (in addition to v2 §E2E)¶

T7 — coordlock.Acquire blocks across both call sites. Goroutine 1 acquires via cmd/fleet/dispatch.go (call coordlock.Acquire directly OR run runDispatch through a test harness). Goroutine 2 attempts via handoffop.spawnAndRetire. Second sees the contention warning, the FIRST holds. Verify no duplicate agent records written.
T8 — drain-path lock release on function return. Hold lock in handoffop block → return → second acquire succeeds.
T9 — drain-path contention degrades gracefully. When lock is held by another caller, drain emits the warning + continues to spawn. (Documents the chosen tradeoff; the duplicate-coord risk window persists in this case but the queue file is preserved.)

Files added/changed (delta from v2)¶

internal/coordlock/coordlock.go (NEW) — extracts acquireCoordSpawnLock. Stable public API for both callers.
internal/coordlock/coordlock_test.go (NEW) — moves T1–T5 here (T6 stays in cmd/fleet/ since it's a concurrent-runDispatch integration test).
cmd/fleet/coord_spawn_lock.go — DELETED. Caller in dispatch.go now imports + calls coordlock.Acquire.
cmd/fleet/dispatch.go — swap local call for coordlock.Acquire.
internal/handoffop/handoffop.go — add coordlock.Acquire inside if isCoordHandoffForAgent { ... }, BEFORE the existing marker write.
internal/handoffop/handoffop_test.go — add T7/T8/T9.

Why this is in scope for the SAME PR¶

Per memory feedback_bundle_related_prs: same business unit (coord lifecycle correctness against concurrent spawn). Codex iter-1 explicitly named this gap, so deferring it would mean shipping a known-incomplete fix to the EXACT bug the design doc claims to close.

Operator approval timestamp¶

G2 cleared (v2): 2026-05-19, operator answers to design questions in chat (NB-flock + immediate fail; operator handles spark dup cleanup). G2 cleared (v3): 2026-05-19, operator picked Path B in response to codex iter-1 [P1] surfaced by the v2 reviewer.

DESIGN: Atomic Coord-Spawn Gate¶

TL;DR¶

Problem reproduction¶

Root cause (race trace)¶

Design decision¶

Implementation¶

Change 1 — cmd/fleet/dispatch.go¶

Change 2 — new helper cmd/fleet/coord_spawn_lock.go¶

Change 3 — no changes to TUI¶

Change 4 — no changes to internal/rc, internal/spawn, internal/handoff/¶

E2E test¶

T1 — second concurrent coord-spawn for same project fails fast¶

T2 — lock released on function return; subsequent acquire OK¶

T3 — different projects don't block each other¶

T4 — empty project (legacy path) skips the lock¶

T5 — operator-facing error message includes holder PID¶

T6 — concurrent runDispatch invocations: only one spawns¶

Acceptance gate¶

Out of scope¶

Stack relationship¶

Change 6 — handoffop drain-path RC marker backfill (bundled)¶

Implementation¶

Tests¶

Why bundled (not separate PR)¶

Change 7 — extend lock to handoffop drain-path (v3 addition)¶

Fix¶

Tests (in addition to v2 §E2E)¶

Files added/changed (delta from v2)¶

Why this is in scope for the SAME PR¶

Operator approval timestamp¶

Change 1 — `cmd/fleet/dispatch.go`¶

Change 2 — new helper `cmd/fleet/coord_spawn_lock.go`¶

Change 4 — no changes to `internal/rc`, `internal/spawn`, `internal/handoff/`¶