DESIGN: Atomic Coord-Spawn Gate¶
Status: v3 — operator approved 2026-05-19 (G2 cleared on v3). v2 bundled handoffop drain-path RC marker backfill; v3 adds Change 7 — extend the coord-spawn lock to also cover handoffop drain-path spawn (closes the second half of the race that codex iter-1 surfaced in v2 review). Ready to re-dispatch.
Author: coord agent 5f05977b (projects-fleet)
Reviewers: operator (G2); codex + /review at PR time per CLAUDE.md §4.
Target version: v0.12.2 (point release on top of v0.12.1).
Created: 2026-05-19
Priority: P0 — duplicate coords observed in production (projects-spark: agents 72ea51b4 + 04f00601); operator: "we don't allow this." "only one coord for sure." Plus second observed P0: handoff replacement doesn't carry --remote-control (PR #163 fixes future fresh spawns but the deferred [P2] backfill case for pre-#163 coord handoffs remains).
Parent: docs/DESIGN-rc-listener-lifecycle.md v8 (coord lifecycle ownership). Sibling of docs/DESIGN-rc-coord-auto-marker.md (v0.12.1 P0, PR #163).
Bundles (per operator memory feedback_bundle_related_prs): atomic spawn gate + handoffop drain-path RC backfill (PR #163 deferred [P2] (2) "handoffop missing marker write for coord-handoff backfill").
Does NOT change: the coord skill's coordinator.lock (skill-owned post-spawn supervision lock — different concern), the TUI [a] keybinding flow, rc.Enabled semantics, or the worker dispatch path.
TL;DR¶
fleet dispatch --coord-spawn has a TOCTOU race. The veto at cmd/fleet/dispatch.go:522 reads coordRecordExistsInList against a pre-fetched agent-records snapshot; two concurrent dispatches both see "no existing coord" and both spawn. Confirmed against projects-spark today:
- 72ea51b4 — has last_handoff_path, came from a handoff replacement.
- 04f00601 — no handoff path, fresh [a] press from TUI mid-handoff.
Both are alive, both think they're the coord, both compete for the skill-owned coordinator.lock (which serializes ONLY supervisor ticks, not spawn).
This design adds an OS-level NB-flock at the dispatch CLI layer (coord-spawn.lock) that serializes the entire (veto-check, agent-record-write, spawn-return) tuple for a given project. Second-into-the-lock fails immediately with a clean error; first acquires + spawns + releases.
rc.Enabled and the existing veto stay intact. The lock is a new chokepoint, not a replacement.
Problem reproduction¶
$ ls ~/.fleet/agents/ | xargs -I{} cat ~/.fleet/agents/{} | jq -r 'select(.project=="projects-spark") | "\(.id) pid=\(.pid) handoff=\(.last_handoff_path)"'
04f00601 pid=79493 handoff=null
72ea51b4 pid=78455 handoff=/Users/pinkbear/.fleet/handoffs/49bc8a03-20260519-064913-3c66.md
Both alive (both PIDs respond to kill -0), both have distinct tmux sessions, both have agent records, both are running claude --dangerously-skip-permissions. The skill's coordinator.lock only blocks duplicate ticks; nothing blocks duplicate SPAWN.
Root cause (race trace)¶
- Operator triggers handoff on
49bc8a03. Coord writes handoff doc, exits. - Handoff infrastructure starts spawning replacement (will become
72ea51b4). - Spawn is slow (fleet rc-listener-bootstrap-sk-3e98 + claude --dangerously-skip-permissions takes seconds; tmux session not yet alive).
- Operator sees no live coord in TUI dashboard, presses
[a]to attach. - TUI's
[a]handler runsfindExistingCoordForProject(m.records, "projects-spark"): -m.recordsis a stale snapshot from the last refresh tick. The handoff replacement hasn't registered yet. - Returns(nil, false)— falls through tostartCoordSpawn. - TUI shells out
fleet dispatch coord-projects-spark --project projects-spark --coord-spawn. - CLI's runDispatch reads agent-records list, hits veto at line 522:
-
coordStateFresh("projects-spark")may be FALSE (the handoff replacement hasn't ticked coord-state.json yet). -coordRecordExistsInListreturns FALSE (handoff replacement hasn't written its record yet). - Veto SKIPPED. - runDispatch proceeds to write a fresh agent record (
04f00601), spawn tmux session. - Meanwhile, the handoff replacement (
72ea51b4) completes its own spawn-and-record-write. - Two coords, same project, both alive. Skill-side
coordinator.lockthen arbitrates ticks (the loser sees "lock busy" and exits cleanly), so they don't trample each other's state — but the OPERATOR sees a dashboard with two rows for the same project and both panes are responsive. Confusing + wasteful + violates the "one coord per project" invariant.
The race window is the gap between (handoff spawn started) and (handoff spawn registered + tmux session alive). On a busy mac the window is multi-second; on a fast machine still hundreds of milliseconds — enough for a fast double-press of [a] to race.
Design decision¶
Add an NB-flock at ~/.fleet/projects/<p>/.locks/coord-spawn.lock held across the entire coord-spawn critical section in runDispatch.
- Acquired at the top of
runDispatchwhenopts.coordSpawn == true(before any check or veto). - Mode:
LOCK_EX | LOCK_NB. Held by the first dispatcher to acquire; second-into-the-lock fails immediately with a clean error. - Released via
deferon function return — both success path and error path. - Lock file lives in
.locks/next to the existingcoordinator.lock(skill-owned),state.lock(state-package owned), and per-workerworker-*.lockfiles. Sibling, not a replacement. - Operator-facing error on contention:
refusing to spawn coord for project %q: another coord-spawn is in progress (lock pid: <pid>). attach to the in-flight coord via TUI [a] once it finishes booting.
Why NB-flock (not blocking with timeout):
- Operator double-presses are visible: the second [a] press surfaces an explicit "another spawn in progress" message rather than a silent wait → the operator learns the correct mental model ("don't double-press; wait or attach").
- The first dispatcher finishes in <5s for happy-path spawns. A blocking flock would either timeout (re-introducing race) or sit for unbounded time on a stuck spawn.
- Matches existing rc.go (internal/rc/rc.go:withLock) and skill coordinator.lock conventions — both NB-flock.
Why keep the existing veto at line 522:
- Defense in depth. The veto catches cross-tmux-socket and cross-host cases (where the lock file might not be reachable, e.g., ~/.fleet/ mounted differently). Vanishingly rare today but not zero.
- The veto's error message is also more informative for the "live coord, different socket" failure mode than a bare flock-contention error.
Implementation¶
Change 1 — cmd/fleet/dispatch.go¶
At the top of runDispatch, after opts.coordSpawn validation but BEFORE any other check (definitely before line 522):
// v0.12.2 P0: atomic coord-spawn gate. NB-flock per-project closes
// the TOCTOU race between two concurrent --coord-spawn dispatches
// (TUI double-press during handoff was producing duplicate coords —
// e.g. projects-spark agents 72ea51b4 + 04f00601 on 2026-05-19).
// Acquired BEFORE the veto/records-list read; held through agent
// record write + spawn invocation; released on runDispatch return.
//
// The existing live-coord veto at line ~522 stays — defense in depth
// for cross-socket / cross-host cases the flock doesn't reach.
if opts.coordSpawn && opts.project != "" {
release, err := acquireCoordSpawnLock(opts.project)
if err != nil {
return err // operator-friendly message inside acquireCoordSpawnLock
}
defer release()
}
Change 2 — new helper cmd/fleet/coord_spawn_lock.go¶
package main
import (
"fmt"
"os"
"path/filepath"
"syscall"
"github.com/edisonshen/fleet/internal/state"
)
// acquireCoordSpawnLock takes an NB-flock on
// ~/.fleet/projects/<project>/.locks/coord-spawn.lock. Returns a
// release function on success. On contention, returns a clear
// operator-facing error.
//
// Naming: "coord-spawn.lock" — distinct from the skill-owned
// "coordinator.lock" (post-spawn supervisor lock) and the
// state-package "state.lock". This file gates SPAWN only.
func acquireCoordSpawnLock(project string) (release func(), err error) {
pdir, err := state.ProjectDir(project)
if err != nil {
return nil, fmt.Errorf("acquireCoordSpawnLock: project dir: %w", err)
}
lockDir := filepath.Join(pdir, ".locks")
if err := os.MkdirAll(lockDir, 0o755); err != nil {
return nil, fmt.Errorf("acquireCoordSpawnLock: mkdir locks: %w", err)
}
lockPath := filepath.Join(lockDir, "coord-spawn.lock")
f, err := os.OpenFile(lockPath, os.O_RDWR|os.O_CREATE, 0o644)
if err != nil {
return nil, fmt.Errorf("acquireCoordSpawnLock: open: %w", err)
}
if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
_ = f.Close()
if err == syscall.EWOULDBLOCK {
// Read current holder for the error message — best-effort.
holder := ""
if b, rerr := os.ReadFile(lockPath); rerr == nil {
holder = string(b)
}
return nil, fmt.Errorf(
"refusing to spawn coord for project %q: another coord-spawn is in progress%s. "+
"attach to the in-flight coord via TUI [a] once it finishes booting",
project,
func() string {
if holder != "" {
return fmt.Sprintf(" (lock holder: %s)", holder)
}
return ""
}(),
)
}
return nil, fmt.Errorf("acquireCoordSpawnLock: flock: %w", err)
}
// Write holder body (PID-based; best-effort, not load-bearing).
_ = f.Truncate(0)
_, _ = fmt.Fprintf(f, "pid=%d\n", os.Getpid())
return func() {
_ = syscall.Flock(int(f.Fd()), syscall.LOCK_UN)
_ = f.Close()
}, nil
}
Change 3 — no changes to TUI¶
The TUI's [a] handler keeps its existing findExistingCoordForProject fast-path. On contention, the CLI's flock failure surfaces back as a flash message via the existing dispatch-failure path. No new TUI lock; CLI is the chokepoint.
Change 4 — no changes to internal/rc, internal/spawn, internal/handoff/¶
The lock is dispatch-layer concern. None of those packages need to know.
E2E test¶
New test file: cmd/fleet/dispatch_coord_spawn_lock_test.go.
T1 — second concurrent coord-spawn for same project fails fast¶
func TestCoordSpawn_SecondConcurrentFails(t *testing.T) {
fleetHome := t.TempDir()
t.Setenv("FLEET_HOME", fleetHome)
project := "test-spawn-race"
// Goroutine 1: acquire and HOLD the lock (simulates first dispatch in-flight).
release, err := acquireCoordSpawnLock(project)
if err != nil { t.Fatalf("first acquire: %v", err) }
defer release()
// Goroutine 2: attempt second acquire — should fail with EWOULDBLOCK + clean error.
_, err = acquireCoordSpawnLock(project)
if err == nil {
t.Fatal("second acquire succeeded — race not closed")
}
if !strings.Contains(err.Error(), "another coord-spawn is in progress") {
t.Fatalf("error message drift: %v", err)
}
}
T2 — lock released on function return; subsequent acquire OK¶
func TestCoordSpawn_LockReleasedOnReturn(t *testing.T) {
// first: acquire + release.
// second: acquire — should succeed.
// Verifies defer release() actually runs.
}
T3 — different projects don't block each other¶
func TestCoordSpawn_PerProjectLockIsolation(t *testing.T) {
// acquire("proj-a") + hold.
// acquire("proj-b") — should succeed (independent lock files).
}
T4 — empty project (legacy path) skips the lock¶
func TestCoordSpawn_EmptyProjectSkipsLock(t *testing.T) {
// The runDispatch guard is `opts.coordSpawn && opts.project != ""`.
// Test that with opts.project="" the spawn proceeds without acquiring.
// (Worker-dispatch path; coord-spawn requires a project per validateOpts.)
// This test exercises the helper-level branch.
}
T5 — operator-facing error message includes holder PID¶
func TestCoordSpawn_ContentionErrorIncludesHolder(t *testing.T) {
// First acquire writes "pid=N\n" to the lock file.
// Second acquire reads it back into the error message.
// Verify pid format in error.
}
T6 — concurrent runDispatch invocations: only one spawns¶
Integration test (heavier — may be a build-tag-gated integration_test.go):
func TestRunDispatch_Concurrent_OnlyOneSpawns(t *testing.T) {
// Two goroutines invoke runDispatch with --coord-spawn for the same project.
// Assert: exactly ONE writes an agent record; the other returns the lock error.
// Use a synchronization barrier to maximize the race window.
}
Acceptance gate¶
go build ./...clean.go test -race -count=1 ./cmd/fleet/...green (T1–T6).golangci-lint run ./cmd/fleet/clean.gofmt -l .clean (modulo pre-existing.codex-tmp/).python3 -m pytest skills/ -qgreen (no skill changes; sanity).- The existing live-coord veto test in
cmd/fleet/dispatch_test.gostill passes (veto stays as defense in depth). - PR body documents:
- The race trace (link this design doc).
- That the existing veto is intentionally kept.
- The operator-facing contention error for spark-style scenarios.
Out of scope¶
- Cleaning up the existing
projects-sparkduplicate (operator decides between72ea51b4and04f00601andfleet rm's the loser). - TUI
[a]debounce (UX polish; CLI gate is the load-bearing fix). - Cross-host lock support (single-host model; lock file is on local FS).
coordinator.lockconsolidation (skill-owned, separate concern).- Backfilling lock for already-running coords.
Stack relationship¶
Branches off worker/rc-coord-auto-marker-282c (PR #163 head) — both PRs touch cmd/fleet/dispatch.go and cmd/fleet/handoff.go. Per memory feedback_stacked_prs_no_merge_wait, stack don't wait. Finisher rebases to main after #163 merges.
Change 6 — handoffop drain-path RC marker backfill (bundled)¶
PR #163's reviewer iter-7 caught one half of the drain-path RC injection bug (gating the inject on isCoordHandoffForAgent). The deferred [P2] in PR #163's body — "handoffop drain-path missing marker write for coord-handoff backfill" — is the other half: for coord handoffs that go through the drain path (TUI's background drainer processing a stale queue file), internal/handoffop/handoffop.go injects the flag but does NOT write the rc-enabled marker first. So drain-path handoffs of pre-v0.12.1 coords still spawn replacements without RC.
cmd/fleet/handoff.go (PR #163 line 767-790) writes the marker BEFORE the inject:
if isCoordHandoffForProject(oldRec.Project, oldRec.ID) {
if err := writeMarkerFn(oldRec.Project); err != nil { /* warn + continue */ }
rewritten := injectRemoteControlFlagProject(command, rcSessionName, oldRec.Project)
if !sameCommand(rewritten, command) { rewrittenExecArgv = rewritten }
}
internal/handoffop/handoffop.go (around line 553-590 per iter-7 fix) ONLY gates the inject on isCoordHandoffForAgent — no marker write. Add the same marker-write helper call.
Implementation¶
In internal/handoffop/handoffop.go, inside the existing if isCoordHandoffForAgent(oldRec.Project, oldRec.ID) { ... } block (where the inject already runs), add the marker write BEFORE the inject:
if isCoordHandoffForAgent(oldRec.Project, oldRec.ID) {
if err := rc.WriteMarker(oldRec.Project); err != nil {
fmt.Fprintf(stderr,
"warning: rc.WriteMarker(%q) failed during drain handoff: %v "+
"(continuing with plain claude argv; run `fleet rc up %s` to recover)\n",
oldRec.Project, err, oldRec.Project)
}
// existing inject call follows
}
Same package-var seam (writeMarkerFn = rc.WriteMarker) for test injection, mirroring cmd/fleet/dispatch.go. Failure is non-fatal — log + continue.
Tests¶
New e2e in internal/handoffop/handoffop_test.go:
- T-drain-coord-marker-write — drain handoff of recognized coord writes the rc-enabled marker before injecting the flag.
- T-drain-worker-no-marker-write — drain handoff of a worker (isCoordHandoffForAgent=false) does NOT write the marker (preserves push-storm protection).
- T-drain-marker-write-failure-non-fatal — stub
writeMarkerFnto inject failure; verify the inject still runs (now without marker, so inject no-ops) and the handoffop returns success.
Why bundled (not separate PR)¶
Per memory feedback_bundle_related_prs: "PR scope is business-logic; commit scope is a useful change — one PR per business unit." The atomic spawn-gate and the drain-path marker-write are both coord lifecycle correctness — same business unit. Operator explicitly approved bundling in 2026-05-19 chat.
PR #163's reviewer marked this as deferred [P2], but observed production behavior elevates it: existing pre-v0.12.1 coords (like projects-spark) need their NEXT drain-path handoff to backfill the marker, otherwise the user sees "handoff doesn't get remote-control" forever.
Change 7 — extend lock to handoffop drain-path (v3 addition)¶
Codex iter-1 [P1] from v2 review: cmd/fleet/dispatch.go:245-250 lock only covers concurrent runDispatch; does NOT serialize against handoffop.spawnAndRetire drain path. TUI [a] racing an in-flight drain replacement still double-spawns. This IS the production race trace at design-doc problem section.
Fix¶
internal/handoffop/handoffop.go already gates the marker-write + RC inject on isCoordHandoffForAgent(oldRec.Project, oldRec.ID). INSIDE that block, BEFORE the spawn call, acquire the SAME project-level lock that cmd/fleet/dispatch.go uses. Release after spawn returns.
Wiring constraint: internal/handoffop cannot import cmd/fleet (cycle). Solution — move acquireCoordSpawnLock from cmd/fleet/coord_spawn_lock.go to a NEW package internal/coordlock/coordlock.go (or internal/state/'s lock helpers). Both callers import from there.
// internal/coordlock/coordlock.go (NEW)
package coordlock
func Acquire(project string) (release func(), err error) { ... }
// cmd/fleet/dispatch.go — replace local acquireCoordSpawnLock call:
release, err := coordlock.Acquire(opts.project)
// internal/handoffop/handoffop.go — inside if isCoordHandoffForAgent(...) block:
if isCoordHandoffForAgent(oldRec.Project, oldRec.ID) {
release, err := coordlock.Acquire(oldRec.Project)
if err != nil {
// Drain path can't fail-fast like dispatch can — there's already
// a queue file committed. Surface the contention as a warning
// and continue (defense-in-depth veto at retire time will catch
// a duplicate that slips through; the operator-visible error
// path is the same as for the dispatch CLI).
fmt.Fprintf(stderr,
"warning: coordlock.Acquire(%q) contended during drain handoff: %v "+
"(continuing; the duplicate-coord risk window is the spawn time)\n",
oldRec.Project, err)
} else {
defer release()
}
// existing marker write + inject + spawn calls follow
}
Why warn-and-continue (not fail-fast) on drain path: a drain handoff has a queue file already committed; bailing out would orphan the queue file. The dispatch CLI can fail-fast because no on-disk state was mutated. The drain path matches PR #163's existing pattern of "non-fatal warning" inside this block.
Tests (in addition to v2 §E2E)¶
- T7 — coordlock.Acquire blocks across both call sites. Goroutine 1 acquires via
cmd/fleet/dispatch.go(callcoordlock.Acquiredirectly OR run runDispatch through a test harness). Goroutine 2 attempts viahandoffop.spawnAndRetire. Second sees the contention warning, the FIRST holds. Verify no duplicate agent records written. - T8 — drain-path lock release on function return. Hold lock in handoffop block → return → second acquire succeeds.
- T9 — drain-path contention degrades gracefully. When lock is held by another caller, drain emits the warning + continues to spawn. (Documents the chosen tradeoff; the duplicate-coord risk window persists in this case but the queue file is preserved.)
Files added/changed (delta from v2)¶
internal/coordlock/coordlock.go(NEW) — extractsacquireCoordSpawnLock. Stable public API for both callers.internal/coordlock/coordlock_test.go(NEW) — moves T1–T5 here (T6 stays incmd/fleet/since it's a concurrent-runDispatch integration test).cmd/fleet/coord_spawn_lock.go— DELETED. Caller indispatch.gonow imports + callscoordlock.Acquire.cmd/fleet/dispatch.go— swap local call forcoordlock.Acquire.internal/handoffop/handoffop.go— addcoordlock.Acquireinsideif isCoordHandoffForAgent { ... }, BEFORE the existing marker write.internal/handoffop/handoffop_test.go— add T7/T8/T9.
Why this is in scope for the SAME PR¶
Per memory feedback_bundle_related_prs: same business unit (coord lifecycle correctness against concurrent spawn). Codex iter-1 explicitly named this gap, so deferring it would mean shipping a known-incomplete fix to the EXACT bug the design doc claims to close.
Operator approval timestamp¶
G2 cleared (v2): 2026-05-19, operator answers to design questions in chat (NB-flock + immediate fail; operator handles spark dup cleanup). G2 cleared (v3): 2026-05-19, operator picked Path B in response to codex iter-1 [P1] surfaced by the v2 reviewer.