DESIGN: Reconcile by branch→PR lookup¶
Status: DRAFT v2 — codex round 1 findings folded; operator-approved by acclamation (2026-05-18 session) Author: coord agent (dd05ec05) Reviewers: codex round 1 — PASS with improvements (2026-05-18, findings folded in below) Target version: v0.11.x patch
Symptom¶
rc-listener-impl-v0-12-ed95 was stuck at status=in-progress (TUI showed !!) for ~4h after PR #159 merged (d5b3ad0 at 2026-05-18T04:52Z). The operator had to manually fleet tasks set rc-listener-impl-v0-12-ed95 status=done.
Root cause (codex round 1 confirmed + extended)¶
The supervisor reconcile pass in skills/coordinator/loop.py::_reconcile_pass (~line 2310) only completes PR-backed work from two recorded signals:
state.json.phase == "done"ANDstate.json.pr_url != ""→ flip toin-review, setpr_url. (Lines 2351, 2356–2392.)tasks.md.pr_url != ""ANDgh pr checksreportsmerged→ flip todone. (Lines 2421, 2423–2429.)
For rc-listener-impl-v0-12-ed95 both inputs were empty:
- The worker chain went off-rails (review→finisher dispatch ran outside the v0.2 state-machine contract).
state.jsonfroze atphase=review-pendingupdated_at2026-05-17T07:47Z. - The finisher contract in
dispatch.py:695requiresfleet workers update --phase done --pr-url <url>after PR open — that call never landed. - Because state.json never reached
phase=done+pr_url,tasks.md.pr_urlwas never populated by the reconcile flip. - Without
pr_url, the CI-merged branch (path 2) never runs.
Codex round 1 surfaced an additional barrier: the mid-phase guard at loop.py:2346 short-circuits reconcile when state.json.phase in {review-pending, review-done} — so once the handoff chain went off-rails with phase=review-pending, reconcile skipped recovery forever, regardless of PR state on GitHub. The supervisor only raised attention; it could not recover.
Coord-state.json corroboration: ~/.fleet/projects/projects-fleet/coord-state.json contains "review_handoffs_dispatched": ["rc-listener-impl-v0-12-ed95:review-pending"]. The dispatch was registered but never reached completion; the dedupe gate at loop.py:3779 doesn't have a stale-agent recovery path independent of the PR-recovery path.
So merged PRs whose pr_url was never recorded AND whose state.json is stuck in a non-terminal phase are invisible to reconcile, forever.
Goals¶
- Detect merged PRs that bypassed the state-machine contract.
- Auto-flip stuck tasks to
done(orin-reviewif not yet merged) without operator intervention. - Single bounded
ghcall per stuck task per tick. - Idempotent: re-running on an already-correct state is a no-op.
Non-goals¶
- Rewriting the v0.2 worker→reviewer→finisher contract (that's PR2 of dispatch-lifecycle).
- Discovering PRs from arbitrary commits / squash history. Branch-name lookup is sufficient.
- Cross-host / cross-repo PR matching.
Design¶
Three parts (single PR)¶
Part A — branch→PR fallback in _reconcile_pass.
The fallback must run BEFORE the mid-phase review-pending/review-done short-circuit at loop.py:2346, otherwise stuck-handoff tasks remain invisible. Insertion point: immediately after _is_worker_alive(...) returns false and before the mid_phase = _read_worker_state(...) block.
# Fallback: worker is gone, tasks.md has no pr_url. Worker may have
# shipped a PR outside the v0.2 state machine (off-rails finisher,
# operator-driven manual merge). Branch lookup is the durable handle.
# Runs BEFORE the review-pending/review-done short-circuit — stuck
# handoff phases must not hide a real merged PR.
if not t.pr_url and t.branch:
pr = _gh_pr_by_branch(t.branch) # see helper below
if pr is not None:
if pr.merged_at:
actions.append(_ReconcileAction(
slug=t.slug,
new_status="done",
set_pr_url=pr.url,
clear_worker=True,
delete_worker_dir=True,
note=f"PR {pr.url} merged outside state machine",
raised_to_user=True,
raise_text=f"reconcile recovered {t.slug}: {pr.url}",
))
continue
elif pr.state == "OPEN":
actions.append(_ReconcileAction(
slug=t.slug,
new_status="in-review",
set_pr_url=pr.url,
clear_worker=True,
note=f"PR {pr.url} open; recovered from branch",
raised_to_user=True,
))
continue
# CLOSED-unmerged: leave the task untouched; operator decides.
Helper _gh_pr_by_branch(branch: str) -> Optional[_PRSummary] — codex tightened the lookup contract:
- Use
gh pr list --head <branch> --state all --json number,state,url,mergedAt,createdAt,updatedAt,headRefName --limit 10(mirrors existing_probe_branch_prs()atloop.py:2148;--headis the documented head-filter flag). - NOT
--search head:<branch> --limit 1— search-string matching is fuzzy andlimit 1picks the wrong PR if a retry created multiple PRs on the same branch. - Pick the newest matching PR by
createdAt(ties: highestnumber). - 5-second timeout; subprocess.TimeoutExpired → return None (treat as "no fallback this tick").
- gh error / empty result → return None. Never requeue a stale handoff just because GitHub was unavailable.
Part B — fix _apply_reconcile() ordering bug (codex finding).
_apply_reconcile() at loop.py:2863 sets status FIRST, then pr_url SECOND at loop.py:2867. The docstring says the opposite ("pr_url must be durable on the task entry before the worker dir is rm-rf'd"). The _apply_sentinel() path at loop.py:3078–3084 already gets this right (pr_url, then status, then dir delete).
This creates a crash window: a tick that flipped status=in-review but crashed before writing pr_url leaves the task as in-review with empty pr_url — invisible to both reconcile paths until the next time state.json.phase=done+pr_url happens to be readable.
Fix: invert the order in _apply_reconcile(). New sequence:
1. clear_pr_url (if requested) — first; the field is durable before the rest.
2. set_pr_url (if requested) — second; same reason.
3. new_status (if requested) — third; only after pr_url is durable.
4. clear_worker, delete_worker_dir — last; per existing docstring.
This matches the comment intent in _apply_sentinel() and closes the crash window.
Part C — supervisor periodic-tick covers stuck-without-mtime case.
The supervisor at supervisor.py:1257 and :1309 is mtime-triggered + periodic. Stale state.json never changes mtime, so mtime-trigger alone doesn't catch the case. Periodic ticks (POLL_INTERVAL_S default) DO call _reconcile_inflight() — the Part A fallback runs there too. No supervisor-side change needed; the fix is centralized in loop.py.
Test must verify periodic reconcile catches the case without any mtime change to state.json.
Cost¶
- Worst case: one extra
ghcall per stuck task per reconcile tick. Reconcile runs every ~30s. For a project with N stuck-without-pr tasks, that's N calls per tick. Today N is typically 0; abnormal-case is 1–3. gh pr list --searchwith--limit 1and--state allis a single GraphQL call (~200ms).
Concurrency / TOCTOU¶
- Two ticks could both observe the same merged PR and try to flip the task to
done. The action queue is serialized inside_apply_reconcile; second tick's stale read flips an already-done task to done (idempotent). - A worker that finishes between the gh-lookup and the action apply could ALSO write pr_url + flip to in-review via the state.json branch. The action queue's order-of-arrival wins; both branches write the same pr_url so no divergence.
Failure modes¶
| Failure | Behavior |
|---|---|
gh rate-limited / 5xx |
Helper returns None; reconcile leaves task untouched; next tick retries. |
| Branch shape matches >1 PR (rare; head reuse across forks) | --limit 1 picks the first; operator can re-trigger manually if wrong. |
| Branch deleted post-merge | gh pr list --search head:<branch> still finds the historical PR. No change. |
| Subprocess timeout | Treat as None; no flip. |
Test plan (codex-extended)¶
Unit tests in skills/coordinator/tests/test_loop_reconcile_by_branch.py (mock _gh_pr_by_branch and subprocess.run for the helper itself):
- Stale review-pending + merged PR by branch → action with
new_status=done,set_pr_url=<url>,delete_worker_dir=True. Must run BEFORE the review-pending short-circuit — this is the load-bearing test for the codex-surfaced bug. - Stale review-pending + open PR by branch → action with
new_status=in-review,set_pr_url=<url>. State.json untouched. - Multiple PRs for same branch (retry created PR #2) → fallback picks newest by
createdAt; older PR ignored. - No PR found → no action emitted (task stays as-is).
- Closed-unmerged PR → no action emitted; task left for operator.
ghtimeout → no action emitted; no requeue.gherror (non-zero exit) → no action emitted; no requeue.- Task already has pr_url → fallback skipped; existing
gh pr checkspath runs unchanged. - Task has no branch → fallback skipped; no
ghcall. - Periodic supervisor reconcile catches the case without mtime changes to state.json — simulates the supervisor tick path; verifies stale
phase=review-pending+ merged PR flips task to done. _apply_reconcile()ordering (Part B test): withset_pr_url+new_statusboth set, verify thepr_urlwrite happens BEFORE thestatuswrite in the_run_fleetcall sequence. Asserts the codex-surfaced ordering invariant.
Integration test: spin up a fake-gh subprocess fixture; assert one gh call per stuck task per tick (no duplicate calls).
Out of scope (file as P2/P3 follow-ups if needed)¶
- Reconcile via merge-commit message parsing (rescues squash-merge tasks whose branch was deleted before reconcile sees them).
- Reviewer/finisher contract hardening so the off-rails path can't happen (a deeper fix; this PR is the recovery net).
- TUI banner showing "stuck-without-pr" tasks.
review_handoffs_dispatcheddedupe-gate stale-agent recovery (codex called this out atloop.py:3779; track as separate task — the branch fallback covers the symptom, the dedupe-gate hardening prevents the silent re-dispatch).
Changelog¶
- 2026-05-18 v2 — Codex round 1 folded:
- Lookup uses
--head <branch>not--search head:<branch>;--limit 10and pick newest bycreatedAt. - Fallback must run BEFORE the
review-pending/review-doneshort-circuit atloop.py:2346(otherwise stuck-handoff tasks remain invisible). - Added Part B:
_apply_reconcile()ordering fix (pr_url before status). Codex surfaced the crash window. Bundled into this PR because the fallback writes both fields. - Added Part C: clarifies that periodic supervisor ticks pick up the fix via
_reconcile_inflight()— no supervisor-side change needed. - Test plan expanded from 7 → 11 tests; tests #1, #3, #10, #11 are codex-driven additions.
- Filed follow-up:
review_handoffs_dispatchedstale-agent recovery (separate from PR recovery). - 2026-05-18 v1 — initial draft, operator-driven during dd05ec05 coord session.