Source: DESIGN-reconcile-pr-by-branch.mdRendered: 2026-05-18 07:12 UTC — Agents read the .md; humans read the .html.

DESIGN: Reconcile by branch→PR lookup

Status: DRAFT v2 — codex round 1 findings folded; operator-approved by acclamation (2026-05-18 session) Author: coord agent (dd05ec05) Reviewers: codex round 1 — PASS with improvements (2026-05-18, findings folded in below) Target version: v0.11.x patch


Symptom

rc-listener-impl-v0-12-ed95 was stuck at status=in-progress (TUI showed !!) for ~4h after PR #159 merged (d5b3ad0 at 2026-05-18T04:52Z). The operator had to manually fleet tasks set rc-listener-impl-v0-12-ed95 status=done.

Root cause (codex round 1 confirmed + extended)

The supervisor reconcile pass in skills/coordinator/loop.py::_reconcile_pass (~line 2310) only completes PR-backed work from two recorded signals:

  1. state.json.phase == "done" AND state.json.pr_url != "" → flip to in-review, set pr_url. (Lines 2351, 2356–2392.)
  2. tasks.md.pr_url != "" AND gh pr checks reports merged → flip to done. (Lines 2421, 2423–2429.)

For rc-listener-impl-v0-12-ed95 both inputs were empty:

Codex round 1 surfaced an additional barrier: the mid-phase guard at loop.py:2346 short-circuits reconcile when state.json.phase in {review-pending, review-done} — so once the handoff chain went off-rails with phase=review-pending, reconcile skipped recovery forever, regardless of PR state on GitHub. The supervisor only raised attention; it could not recover.

Coord-state.json corroboration: ~/.fleet/projects/projects-fleet/coord-state.json contains "review_handoffs_dispatched": ["rc-listener-impl-v0-12-ed95:review-pending"]. The dispatch was registered but never reached completion; the dedupe gate at loop.py:3779 doesn't have a stale-agent recovery path independent of the PR-recovery path.

So merged PRs whose pr_url was never recorded AND whose state.json is stuck in a non-terminal phase are invisible to reconcile, forever.

Goals

Non-goals

Design

Three parts (single PR)

Part A — branch→PR fallback in _reconcile_pass.

The fallback must run BEFORE the mid-phase review-pending/review-done short-circuit at loop.py:2346, otherwise stuck-handoff tasks remain invisible. Insertion point: immediately after _is_worker_alive(...) returns false and before the mid_phase = _read_worker_state(...) block.

# Fallback: worker is gone, tasks.md has no pr_url. Worker may have
# shipped a PR outside the v0.2 state machine (off-rails finisher,
# operator-driven manual merge). Branch lookup is the durable handle.
# Runs BEFORE the review-pending/review-done short-circuit — stuck
# handoff phases must not hide a real merged PR.
if not t.pr_url and t.branch:
    pr = _gh_pr_by_branch(t.branch)   # see helper below
    if pr is not None:
        if pr.merged_at:
            actions.append(_ReconcileAction(
                slug=t.slug,
                new_status="done",
                set_pr_url=pr.url,
                clear_worker=True,
                delete_worker_dir=True,
                note=f"PR {pr.url} merged outside state machine",
                raised_to_user=True,
                raise_text=f"reconcile recovered {t.slug}: {pr.url}",
            ))
            continue
        elif pr.state == "OPEN":
            actions.append(_ReconcileAction(
                slug=t.slug,
                new_status="in-review",
                set_pr_url=pr.url,
                clear_worker=True,
                note=f"PR {pr.url} open; recovered from branch",
                raised_to_user=True,
            ))
            continue
        # CLOSED-unmerged: leave the task untouched; operator decides.

Helper _gh_pr_by_branch(branch: str) -> Optional[_PRSummary] — codex tightened the lookup contract:

Part B — fix _apply_reconcile() ordering bug (codex finding).

_apply_reconcile() at loop.py:2863 sets status FIRST, then pr_url SECOND at loop.py:2867. The docstring says the opposite ("pr_url must be durable on the task entry before the worker dir is rm-rf'd"). The _apply_sentinel() path at loop.py:3078–3084 already gets this right (pr_url, then status, then dir delete).

This creates a crash window: a tick that flipped status=in-review but crashed before writing pr_url leaves the task as in-review with empty pr_url — invisible to both reconcile paths until the next time state.json.phase=done+pr_url happens to be readable.

Fix: invert the order in _apply_reconcile(). New sequence:

1. clear_pr_url (if requested) — first; the field is durable before the rest.
2. set_pr_url (if requested)   — second; same reason.
3. new_status (if requested)   — third; only after pr_url is durable.
4. clear_worker, delete_worker_dir — last; per existing docstring.

This matches the comment intent in _apply_sentinel() and closes the crash window.

Part C — supervisor periodic-tick covers stuck-without-mtime case.

The supervisor at supervisor.py:1257 and :1309 is mtime-triggered + periodic. Stale state.json never changes mtime, so mtime-trigger alone doesn't catch the case. Periodic ticks (POLL_INTERVAL_S default) DO call _reconcile_inflight() — the Part A fallback runs there too. No supervisor-side change needed; the fix is centralized in loop.py.

Test must verify periodic reconcile catches the case without any mtime change to state.json.

Cost

Concurrency / TOCTOU

Failure modes

Failure Behavior
gh rate-limited / 5xx Helper returns None; reconcile leaves task untouched; next tick retries.
Branch shape matches >1 PR (rare; head reuse across forks) --limit 1 picks the first; operator can re-trigger manually if wrong.
Branch deleted post-merge gh pr list --search head:<branch> still finds the historical PR. No change.
Subprocess timeout Treat as None; no flip.

Test plan (codex-extended)

Unit tests in skills/coordinator/tests/test_loop_reconcile_by_branch.py (mock _gh_pr_by_branch and subprocess.run for the helper itself):

  1. Stale review-pending + merged PR by branch → action with new_status=done, set_pr_url=<url>, delete_worker_dir=True. Must run BEFORE the review-pending short-circuit — this is the load-bearing test for the codex-surfaced bug.
  2. Stale review-pending + open PR by branch → action with new_status=in-review, set_pr_url=<url>. State.json untouched.
  3. Multiple PRs for same branch (retry created PR #2) → fallback picks newest by createdAt; older PR ignored.
  4. No PR found → no action emitted (task stays as-is).
  5. Closed-unmerged PR → no action emitted; task left for operator.
  6. gh timeout → no action emitted; no requeue.
  7. gh error (non-zero exit) → no action emitted; no requeue.
  8. Task already has pr_url → fallback skipped; existing gh pr checks path runs unchanged.
  9. Task has no branch → fallback skipped; no gh call.
  10. Periodic supervisor reconcile catches the case without mtime changes to state.json — simulates the supervisor tick path; verifies stale phase=review-pending + merged PR flips task to done.
  11. _apply_reconcile() ordering (Part B test): with set_pr_url + new_status both set, verify the pr_url write happens BEFORE the status write in the _run_fleet call sequence. Asserts the codex-surfaced ordering invariant.

Integration test: spin up a fake-gh subprocess fixture; assert one gh call per stuck task per tick (no duplicate calls).

Out of scope (file as P2/P3 follow-ups if needed)

Changelog