Source: DESIGN-coord-spawn-atomic-gate.mdRendered: 2026-05-19 14:58 UTC — Agents read the .md; humans read the .html.

DESIGN: Atomic Coord-Spawn Gate

Status: v3 — operator approved 2026-05-19 (G2 cleared on v3). v2 bundled handoffop drain-path RC marker backfill; v3 adds Change 7 — extend the coord-spawn lock to also cover handoffop drain-path spawn (closes the second half of the race that codex iter-1 surfaced in v2 review). Ready to re-dispatch. Author: coord agent 5f05977b (projects-fleet) Reviewers: operator (G2); codex + /review at PR time per CLAUDE.md §4. Target version: v0.12.2 (point release on top of v0.12.1). Created: 2026-05-19 Priority: P0 — duplicate coords observed in production (projects-spark: agents 72ea51b4 + 04f00601); operator: "we don't allow this." "only one coord for sure." Plus second observed P0: handoff replacement doesn't carry --remote-control (PR #163 fixes future fresh spawns but the deferred [P2] backfill case for pre-#163 coord handoffs remains). Parent: docs/DESIGN-rc-listener-lifecycle.md v8 (coord lifecycle ownership). Sibling of docs/DESIGN-rc-coord-auto-marker.md (v0.12.1 P0, PR #163). Bundles (per operator memory feedback_bundle_related_prs): atomic spawn gate + handoffop drain-path RC backfill (PR #163 deferred [P2] (2) "handoffop missing marker write for coord-handoff backfill"). Does NOT change: the coord skill's coordinator.lock (skill-owned post-spawn supervision lock — different concern), the TUI [a] keybinding flow, rc.Enabled semantics, or the worker dispatch path.


TL;DR

fleet dispatch --coord-spawn has a TOCTOU race. The veto at cmd/fleet/dispatch.go:522 reads coordRecordExistsInList against a pre-fetched agent-records snapshot; two concurrent dispatches both see "no existing coord" and both spawn. Confirmed against projects-spark today: - 72ea51b4 — has last_handoff_path, came from a handoff replacement. - 04f00601 — no handoff path, fresh [a] press from TUI mid-handoff.

Both are alive, both think they're the coord, both compete for the skill-owned coordinator.lock (which serializes ONLY supervisor ticks, not spawn).

This design adds an OS-level NB-flock at the dispatch CLI layer (coord-spawn.lock) that serializes the entire (veto-check, agent-record-write, spawn-return) tuple for a given project. Second-into-the-lock fails immediately with a clean error; first acquires + spawns + releases.

rc.Enabled and the existing veto stay intact. The lock is a new chokepoint, not a replacement.


Problem reproduction

$ ls ~/.fleet/agents/ | xargs -I{} cat ~/.fleet/agents/{} | jq -r 'select(.project=="projects-spark") | "\(.id) pid=\(.pid) handoff=\(.last_handoff_path)"'
04f00601 pid=79493 handoff=null
72ea51b4 pid=78455 handoff=/Users/pinkbear/.fleet/handoffs/49bc8a03-20260519-064913-3c66.md

Both alive (both PIDs respond to kill -0), both have distinct tmux sessions, both have agent records, both are running claude --dangerously-skip-permissions. The skill's coordinator.lock only blocks duplicate ticks; nothing blocks duplicate SPAWN.


Root cause (race trace)

  1. Operator triggers handoff on 49bc8a03. Coord writes handoff doc, exits.
  2. Handoff infrastructure starts spawning replacement (will become 72ea51b4).
  3. Spawn is slow (fleet rc-listener-bootstrap-sk-3e98 + claude --dangerously-skip-permissions takes seconds; tmux session not yet alive).
  4. Operator sees no live coord in TUI dashboard, presses [a] to attach.
  5. TUI's [a] handler runs findExistingCoordForProject(m.records, "projects-spark"): - m.records is a stale snapshot from the last refresh tick. The handoff replacement hasn't registered yet. - Returns (nil, false) — falls through to startCoordSpawn.
  6. TUI shells out fleet dispatch coord-projects-spark --project projects-spark --coord-spawn.
  7. CLI's runDispatch reads agent-records list, hits veto at line 522: - coordStateFresh("projects-spark") may be FALSE (the handoff replacement hasn't ticked coord-state.json yet). - coordRecordExistsInList returns FALSE (handoff replacement hasn't written its record yet). - Veto SKIPPED.
  8. runDispatch proceeds to write a fresh agent record (04f00601), spawn tmux session.
  9. Meanwhile, the handoff replacement (72ea51b4) completes its own spawn-and-record-write.
  10. Two coords, same project, both alive. Skill-side coordinator.lock then arbitrates ticks (the loser sees "lock busy" and exits cleanly), so they don't trample each other's state — but the OPERATOR sees a dashboard with two rows for the same project and both panes are responsive. Confusing + wasteful + violates the "one coord per project" invariant.

The race window is the gap between (handoff spawn started) and (handoff spawn registered + tmux session alive). On a busy mac the window is multi-second; on a fast machine still hundreds of milliseconds — enough for a fast double-press of [a] to race.


Design decision

Add an NB-flock at ~/.fleet/projects/<p>/.locks/coord-spawn.lock held across the entire coord-spawn critical section in runDispatch.

Why NB-flock (not blocking with timeout): - Operator double-presses are visible: the second [a] press surfaces an explicit "another spawn in progress" message rather than a silent wait → the operator learns the correct mental model ("don't double-press; wait or attach"). - The first dispatcher finishes in <5s for happy-path spawns. A blocking flock would either timeout (re-introducing race) or sit for unbounded time on a stuck spawn. - Matches existing rc.go (internal/rc/rc.go:withLock) and skill coordinator.lock conventions — both NB-flock.

Why keep the existing veto at line 522: - Defense in depth. The veto catches cross-tmux-socket and cross-host cases (where the lock file might not be reachable, e.g., ~/.fleet/ mounted differently). Vanishingly rare today but not zero. - The veto's error message is also more informative for the "live coord, different socket" failure mode than a bare flock-contention error.


Implementation

Change 1 — cmd/fleet/dispatch.go

At the top of runDispatch, after opts.coordSpawn validation but BEFORE any other check (definitely before line 522):

// v0.12.2 P0: atomic coord-spawn gate. NB-flock per-project closes
// the TOCTOU race between two concurrent --coord-spawn dispatches
// (TUI double-press during handoff was producing duplicate coords —
// e.g. projects-spark agents 72ea51b4 + 04f00601 on 2026-05-19).
// Acquired BEFORE the veto/records-list read; held through agent
// record write + spawn invocation; released on runDispatch return.
//
// The existing live-coord veto at line ~522 stays — defense in depth
// for cross-socket / cross-host cases the flock doesn't reach.
if opts.coordSpawn && opts.project != "" {
    release, err := acquireCoordSpawnLock(opts.project)
    if err != nil {
        return err  // operator-friendly message inside acquireCoordSpawnLock
    }
    defer release()
}

Change 2 — new helper cmd/fleet/coord_spawn_lock.go

package main

import (
    "fmt"
    "os"
    "path/filepath"
    "syscall"

    "github.com/edisonshen/fleet/internal/state"
)

// acquireCoordSpawnLock takes an NB-flock on
// ~/.fleet/projects/<project>/.locks/coord-spawn.lock. Returns a
// release function on success. On contention, returns a clear
// operator-facing error.
//
// Naming: "coord-spawn.lock" — distinct from the skill-owned
// "coordinator.lock" (post-spawn supervisor lock) and the
// state-package "state.lock". This file gates SPAWN only.
func acquireCoordSpawnLock(project string) (release func(), err error) {
    pdir, err := state.ProjectDir(project)
    if err != nil {
        return nil, fmt.Errorf("acquireCoordSpawnLock: project dir: %w", err)
    }
    lockDir := filepath.Join(pdir, ".locks")
    if err := os.MkdirAll(lockDir, 0o755); err != nil {
        return nil, fmt.Errorf("acquireCoordSpawnLock: mkdir locks: %w", err)
    }
    lockPath := filepath.Join(lockDir, "coord-spawn.lock")
    f, err := os.OpenFile(lockPath, os.O_RDWR|os.O_CREATE, 0o644)
    if err != nil {
        return nil, fmt.Errorf("acquireCoordSpawnLock: open: %w", err)
    }
    if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
        _ = f.Close()
        if err == syscall.EWOULDBLOCK {
            // Read current holder for the error message — best-effort.
            holder := ""
            if b, rerr := os.ReadFile(lockPath); rerr == nil {
                holder = string(b)
            }
            return nil, fmt.Errorf(
                "refusing to spawn coord for project %q: another coord-spawn is in progress%s. "+
                    "attach to the in-flight coord via TUI [a] once it finishes booting",
                project,
                func() string {
                    if holder != "" {
                        return fmt.Sprintf(" (lock holder: %s)", holder)
                    }
                    return ""
                }(),
            )
        }
        return nil, fmt.Errorf("acquireCoordSpawnLock: flock: %w", err)
    }
    // Write holder body (PID-based; best-effort, not load-bearing).
    _ = f.Truncate(0)
    _, _ = fmt.Fprintf(f, "pid=%d\n", os.Getpid())
    return func() {
        _ = syscall.Flock(int(f.Fd()), syscall.LOCK_UN)
        _ = f.Close()
    }, nil
}

Change 3 — no changes to TUI

The TUI's [a] handler keeps its existing findExistingCoordForProject fast-path. On contention, the CLI's flock failure surfaces back as a flash message via the existing dispatch-failure path. No new TUI lock; CLI is the chokepoint.

Change 4 — no changes to internal/rc, internal/spawn, internal/handoff/

The lock is dispatch-layer concern. None of those packages need to know.


E2E test

New test file: cmd/fleet/dispatch_coord_spawn_lock_test.go.

T1 — second concurrent coord-spawn for same project fails fast

func TestCoordSpawn_SecondConcurrentFails(t *testing.T) {
    fleetHome := t.TempDir()
    t.Setenv("FLEET_HOME", fleetHome)
    project := "test-spawn-race"
    // Goroutine 1: acquire and HOLD the lock (simulates first dispatch in-flight).
    release, err := acquireCoordSpawnLock(project)
    if err != nil { t.Fatalf("first acquire: %v", err) }
    defer release()
    // Goroutine 2: attempt second acquire — should fail with EWOULDBLOCK + clean error.
    _, err = acquireCoordSpawnLock(project)
    if err == nil {
        t.Fatal("second acquire succeeded — race not closed")
    }
    if !strings.Contains(err.Error(), "another coord-spawn is in progress") {
        t.Fatalf("error message drift: %v", err)
    }
}

T2 — lock released on function return; subsequent acquire OK

func TestCoordSpawn_LockReleasedOnReturn(t *testing.T) {
    // first: acquire + release.
    // second: acquire — should succeed.
    // Verifies defer release() actually runs.
}

T3 — different projects don't block each other

func TestCoordSpawn_PerProjectLockIsolation(t *testing.T) {
    // acquire("proj-a") + hold.
    // acquire("proj-b") — should succeed (independent lock files).
}

T4 — empty project (legacy path) skips the lock

func TestCoordSpawn_EmptyProjectSkipsLock(t *testing.T) {
    // The runDispatch guard is `opts.coordSpawn && opts.project != ""`.
    // Test that with opts.project="" the spawn proceeds without acquiring.
    // (Worker-dispatch path; coord-spawn requires a project per validateOpts.)
    // This test exercises the helper-level branch.
}

T5 — operator-facing error message includes holder PID

func TestCoordSpawn_ContentionErrorIncludesHolder(t *testing.T) {
    // First acquire writes "pid=N\n" to the lock file.
    // Second acquire reads it back into the error message.
    // Verify pid format in error.
}

T6 — concurrent runDispatch invocations: only one spawns

Integration test (heavier — may be a build-tag-gated integration_test.go):

func TestRunDispatch_Concurrent_OnlyOneSpawns(t *testing.T) {
    // Two goroutines invoke runDispatch with --coord-spawn for the same project.
    // Assert: exactly ONE writes an agent record; the other returns the lock error.
    // Use a synchronization barrier to maximize the race window.
}

Acceptance gate


Out of scope


Stack relationship

Branches off worker/rc-coord-auto-marker-282c (PR #163 head) — both PRs touch cmd/fleet/dispatch.go and cmd/fleet/handoff.go. Per memory feedback_stacked_prs_no_merge_wait, stack don't wait. Finisher rebases to main after #163 merges.


Change 6 — handoffop drain-path RC marker backfill (bundled)

PR #163's reviewer iter-7 caught one half of the drain-path RC injection bug (gating the inject on isCoordHandoffForAgent). The deferred [P2] in PR #163's body — "handoffop drain-path missing marker write for coord-handoff backfill" — is the other half: for coord handoffs that go through the drain path (TUI's background drainer processing a stale queue file), internal/handoffop/handoffop.go injects the flag but does NOT write the rc-enabled marker first. So drain-path handoffs of pre-v0.12.1 coords still spawn replacements without RC.

cmd/fleet/handoff.go (PR #163 line 767-790) writes the marker BEFORE the inject:

if isCoordHandoffForProject(oldRec.Project, oldRec.ID) {
    if err := writeMarkerFn(oldRec.Project); err != nil { /* warn + continue */ }
    rewritten := injectRemoteControlFlagProject(command, rcSessionName, oldRec.Project)
    if !sameCommand(rewritten, command) { rewrittenExecArgv = rewritten }
}

internal/handoffop/handoffop.go (around line 553-590 per iter-7 fix) ONLY gates the inject on isCoordHandoffForAgent — no marker write. Add the same marker-write helper call.

Implementation

In internal/handoffop/handoffop.go, inside the existing if isCoordHandoffForAgent(oldRec.Project, oldRec.ID) { ... } block (where the inject already runs), add the marker write BEFORE the inject:

if isCoordHandoffForAgent(oldRec.Project, oldRec.ID) {
    if err := rc.WriteMarker(oldRec.Project); err != nil {
        fmt.Fprintf(stderr,
            "warning: rc.WriteMarker(%q) failed during drain handoff: %v "+
            "(continuing with plain claude argv; run `fleet rc up %s` to recover)\n",
            oldRec.Project, err, oldRec.Project)
    }
    // existing inject call follows
}

Same package-var seam (writeMarkerFn = rc.WriteMarker) for test injection, mirroring cmd/fleet/dispatch.go. Failure is non-fatal — log + continue.

Tests

New e2e in internal/handoffop/handoffop_test.go:

Why bundled (not separate PR)

Per memory feedback_bundle_related_prs: "PR scope is business-logic; commit scope is a useful change — one PR per business unit." The atomic spawn-gate and the drain-path marker-write are both coord lifecycle correctness — same business unit. Operator explicitly approved bundling in 2026-05-19 chat.

PR #163's reviewer marked this as deferred [P2], but observed production behavior elevates it: existing pre-v0.12.1 coords (like projects-spark) need their NEXT drain-path handoff to backfill the marker, otherwise the user sees "handoff doesn't get remote-control" forever.


Change 7 — extend lock to handoffop drain-path (v3 addition)

Codex iter-1 [P1] from v2 review: cmd/fleet/dispatch.go:245-250 lock only covers concurrent runDispatch; does NOT serialize against handoffop.spawnAndRetire drain path. TUI [a] racing an in-flight drain replacement still double-spawns. This IS the production race trace at design-doc problem section.

Fix

internal/handoffop/handoffop.go already gates the marker-write + RC inject on isCoordHandoffForAgent(oldRec.Project, oldRec.ID). INSIDE that block, BEFORE the spawn call, acquire the SAME project-level lock that cmd/fleet/dispatch.go uses. Release after spawn returns.

Wiring constraint: internal/handoffop cannot import cmd/fleet (cycle). Solution — move acquireCoordSpawnLock from cmd/fleet/coord_spawn_lock.go to a NEW package internal/coordlock/coordlock.go (or internal/state/'s lock helpers). Both callers import from there.

// internal/coordlock/coordlock.go (NEW)
package coordlock

func Acquire(project string) (release func(), err error) { ... }
// cmd/fleet/dispatch.go — replace local acquireCoordSpawnLock call:
release, err := coordlock.Acquire(opts.project)
// internal/handoffop/handoffop.go — inside if isCoordHandoffForAgent(...) block:
if isCoordHandoffForAgent(oldRec.Project, oldRec.ID) {
    release, err := coordlock.Acquire(oldRec.Project)
    if err != nil {
        // Drain path can't fail-fast like dispatch can — there's already
        // a queue file committed. Surface the contention as a warning
        // and continue (defense-in-depth veto at retire time will catch
        // a duplicate that slips through; the operator-visible error
        // path is the same as for the dispatch CLI).
        fmt.Fprintf(stderr,
            "warning: coordlock.Acquire(%q) contended during drain handoff: %v "+
            "(continuing; the duplicate-coord risk window is the spawn time)\n",
            oldRec.Project, err)
    } else {
        defer release()
    }
    // existing marker write + inject + spawn calls follow
}

Why warn-and-continue (not fail-fast) on drain path: a drain handoff has a queue file already committed; bailing out would orphan the queue file. The dispatch CLI can fail-fast because no on-disk state was mutated. The drain path matches PR #163's existing pattern of "non-fatal warning" inside this block.

Tests (in addition to v2 §E2E)

Files added/changed (delta from v2)

Why this is in scope for the SAME PR

Per memory feedback_bundle_related_prs: same business unit (coord lifecycle correctness against concurrent spawn). Codex iter-1 explicitly named this gap, so deferring it would mean shipping a known-incomplete fix to the EXACT bug the design doc claims to close.


Operator approval timestamp

G2 cleared (v2): 2026-05-19, operator answers to design questions in chat (NB-flock + immediate fail; operator handles spark dup cleanup). G2 cleared (v3): 2026-05-19, operator picked Path B in response to codex iter-1 [P1] surfaced by the v2 reviewer.