ADF Worktree Lifecycle Hardening -- Design Document
Gitea epic: terraphim/terraphim-ai#1567
Phase: 2 of disciplined development (Design)
Prerequisite: docs/design/adf-worktree-lifecycle-research.md (Phase 1, 1045 lines)
Author: claude-opus-4-7 (design agent)
Date: 2026-05-17
This document defines the implementation plan for the four-layer worktree-lifecycle hardening epic. No code is changed; no PR is opened. Implementation (Phase 3) starts only after human approval lands on epic #1567.
1. Overview and approach
The bigbox storm of 2026-05-17 (research section 1.1) compounded two independently-correctable defects into a feedback loop that produced 692 leaked git worktrees occupying 48 GiB. The fix is layered. No single patch is sufficient; each layer addresses a failure mode the others cannot.
| Layer | Issue | Defends against | Process context |
|------:|------:|-----------------|-----------------|
| 0 | #1562 | Schedule cursor never advancing under reconcile-timeout cancellation. Caps the per-storm fire rate from "every 90 s" to "every cron occurrence". | orchestrator process |
| 1 | #1569 | await-after-cleanup pattern. RAII WorktreeGuard runs cleanup from Drop on every exit path including future cancellation. | orchestrator process |
| 2 | #1570 | Residue from SIGKILL / OOM / panic-across-runtime that cleared the orchestrator process before any Drop could fire. | orchestrator process (user alex) |
| 3 | #1571 | Root-owned files inside worktrees that the orchestrator user cannot delete. | systemd ExecStartPre, root |
The layers compose: Layer 0 reduces blast radius for any Layer 1 regression by ~200x (research section 6.1); Layer 1 plugs the live leak; Layer 2 cleans residue at every restart; Layer 3 cleans residue the unprivileged sweep cannot. Each layer is independently mergeable and each carries its own acceptance test.
This design treats the eight open questions in research section 5 as resolved per the decisions captured in the task brief and section 2 below. Where additional sub-decisions emerged during design they are recorded inline with the section that introduces them.
The design is conservative on every axis: synchronous Drop
(established precedent at worktree_guard.rs:69), the existing
std::process::Command pattern (already used by
scope.rs:316-337), no new crate dependencies, no async-trait
machinery, no new schedulers. The single new concurrency primitive
is tokio::task::JoinSet from the existing
tokio = { features = ["full"] } in
crates/terraphim_orchestrator/Cargo.toml:23.
2. Scope
2.1 In scope
- Layer 0 (#1562): New cursor field
last_compound_review_fired_atonOrchestrator; cursor advance inside the fire branch atlib.rs:7148-7161; gate against the cursor on subsequent iterations. - Layer 1 (#1569): New git-aware
WorktreeGuardconstructor;WorktreeManager::create_worktreereturns the guard;compound.rsswarm spawn refactored toJoinSetwithkill_on_drop(true); guard declared before theJoinSetso Drop ordering is provable. - Layer 2 (#1570): New synchronous
WorktreeManager::sweep_stalemethod invoked fromOrchestrator::new; sweep covers both<worktree_root>/review-*and/tmp/adf-worktrees/*roots. - Layer 3 (#1571): New POSIX shell script
scripts/adf-setup/adf-cleanup.sh; corresponding shell-test underscripts/adf-setup/tests/;docs/operations/snippet for the systemdExecStartPre=line. - Cross-cutting: One new constant
WORKTREE_REVIEW_PREFIXinscope.rsso Layer 3 and Layer 2 reference the same literal; new tracing fields (swept_count,backlog_count,root_owned_skipped); updatedWorktreeGuardunit tests; newcompound.rscancellation property test; new shell test foradf-cleanup.sh.
2.2 Out of scope
- Migrating
WorktreeGuardto asyncDrop(unstable; research 3.1). - Replacing the
croncrate (research 7.2). - Moving compound review out of
reconcile_tick(research 7.1). - Per-agent worktree namespacing changes (research 7.3).
sccachelifecycle hooks (research 7.4).- Compound-review single-flight backpressure beyond the cursor fix (research 7.5).
- Reworking the per-agent path in
lib.rs:5392-5438to use the new git-aware guard constructor. That path already disarms its guard on success and explicitly callsremove_agent_worktree; folding it in expands review surface and risks regressing a working path. Recommendation: file a follow-up sibling issue if desired (research 7.6).
2.3 Avoid at all cost
- Async
Drop: would require nightly and break the stable-Rust guarantee. tokio::spawnfrom insideDrop: brittle at shutdown when the runtime handle may be gone (research 3.2, strategy D).- Detached cleanup queue: adds a moving part that needs separate visibility and complicates shutdown drain semantics (research 3.2, strategy C). The synchronous strategy is precedented and sufficient.
- Changing the worktree path convention: would invalidate all
three downstream layers at once (research 6.3). The literal
review-and/tmp/adf-worktrees/are frozen by this design. - Implicit reliance on
kill_on_dropas the sole subprocess kill path. The design useskill_on_drop(true)andJoinSet::aborton the wrapping tasks, in that order, and documents why both are needed.
2.4 Decisions made on research section 5 open questions
The task brief resolves the open questions. Recorded here for the reader who reviews this document in isolation:
| Research question | Resolution | Reference |
|-------------------|------------|-----------|
| 5.7 git worktree remove --force vs filesystem removal? | Hybrid: synchronous git worktree remove --force first, std::fs::remove_dir_all as fallback on non-zero exit. Layer 2's sweep_stale then runs git worktree prune. | Step 4 |
| 5.8 Cancellation propagation? | kill_on_drop(true) on Command at compound.rs:490 and JoinSet replaces the bare tokio::spawn loop. Aborting the JoinSet drops the wrapping tasks, which drop the Child handles, which now kill the subprocesses thanks to kill_on_drop. | Steps 5-7 |
| 5.3 Keep WorktreeGuard::keep()? | Yes. Cost is one method; future PR-review and product-development agents may need worktree hand-off semantics. YAGNI-allowed. | Step 3 |
| 5.2 Sweep scope: review-* only? | Both prefixes: <worktree_root>/review-* AND /tmp/adf-worktrees/*. Bigbox git worktree prune --verbose showed stale .git/worktrees/sentinel-* admin refs. | Step 10 |
| 5.1 / 5.6 adf-cleanup.sh source-of-truth location? | scripts/adf-setup/adf-cleanup.sh. Matches the existing scripts/adf-setup/ deployment vocabulary (TOML configs plus Python migrators live there today). | Step 13 |
| 5.4 Sweep timing? | From Orchestrator::new (synchronous), before construction returns. The sweep finishes before any tick thread is spawned (which happens later, in run() at lib.rs:1218-1225). Acceptance: zero review-* directories visible immediately after Orchestrator::new returns. | Step 11 |
2.5 Note on the brief's start_dual_mode reference
The task brief mentions Orchestrator::start_dual_mode as the
post-condition surface for Layer 2. No such method exists in
crates/terraphim_orchestrator/src/lib.rs (verified by grep). The
analogous async entry point is Orchestrator::run at lib.rs:1053.
The design proceeds against Orchestrator::new (sync) for the sweep
call site, which is strictly stronger: sweep completes before any
async work begins.
3. Step-by-step implementation plan
Each step is intended to map to one reviewable commit. Steps within a layer must merge in the order shown; layers themselves can land independently (see section 7 rollout sequence).
Layer 0 (issue #1562) -- schedule cursor
Step 1. Add cursor field to Orchestrator struct.
- File:
crates/terraphim_orchestrator/src/lib.rs. - Edit at
:241(afterlast_cron_fire): addlast_compound_review_fired_at: Option<chrono::DateTime<chrono::Utc>>. - Initialise at
:817(alongsidelast_tick_time):None. - Dependencies: none.
- Tests: covered by Step 2's test.
- Rollback: delete the field; the rest of the codebase has no other reference yet.
Step 2. Advance cursor inside fire branch at
lib.rs:7137-7161; gate subsequent iterations against it.
- File:
crates/terraphim_orchestrator/src/lib.rs. - Replace the
should_fireblock with a fire-time computation mirroring the per-agent loop at:7110-7130. - Cursor is updated before
handle_schedule_eventis awaited, guaranteeing the next iteration sees the new value even if the reconcile-timeout wrapper cancels the future mid-await. - Dependencies: Step 1.
- Tests: new
test_compound_review_cursor_advances_on_cancellationinlib.rstest module. Plant alast_tick_timevalue in the past via the existingset_last_tick_timehelper at:7712; install a cron expression that fires "1 second ago"; callcheck_cron_schedulesonce; verifylast_compound_review_fired_atisSome(_); call again and verify the field did not change. - Rollback: revert to the previous
should_fireblock; cursor field becomes dead.
Layer 1 (issue #1569) -- RAII WorktreeGuard in compound
Step 3. Add WorktreeGuard::for_managed constructor and extend
internal cleanup to use git-aware teardown.
- File:
crates/terraphim_orchestrator/src/worktree_guard.rs. - Add a new field
repo_path: Option<PathBuf>to the struct. WhenSome,cleanuprunsgit -C <repo> worktree remove --force <path>synchronously viastd::process::Command::new("git").status(), falling back tostd::fs::remove_dir_allon non-zero exit or executable-not-found. - Add
pub fn for_managed(repo_path: impl AsRef<Path>, worktree_path: impl AsRef<Path>) -> Self. - Keep
pub fn new(path: impl AsRef<Path>) -> Selfsemantically unchanged:repo_path = None, falls through to the existing filesystem-only removal path. The per-agent caller inlib.rs:5392-5438is unaffected. - Keep
pub fn keep(mut self): still setsshould_cleanup = false. Documented for future hand-off cases per research 5.3. - Dependencies: none.
- Tests: extend the
mod testsblock:test_managed_guard_invokes_git_remove.test_managed_guard_fallback_on_git_failure.test_managed_guard_keep_disarms.
- Rollback: delete
for_managedandrepo_path; the existing per-agent path is unchanged.
Step 4. Change WorktreeManager::create_worktree to return the
guard.
- File:
crates/terraphim_orchestrator/src/scope.rs. - New return type:
Result<WorktreeGuard, std::io::Error>. - The existing implementation at
:260-301is preserved; the only change is the return value. - Add
pub const WORKTREE_REVIEW_PREFIX: &str = "review-";at module scope. Single source of truth for Layer 2 and Layer 3. - Dependencies: Step 3.
- Tests: update existing
test_create_worktree,test_remove_worktree,test_cleanup_alltests to receive a guard and call.keep()where the test wants to inspect the worktree before removal. - Rollback: revert the return type.
Step 5. Refactor compound.rs:296-370 to hold the guard
declaratively and use JoinSet for the swarm.
- File:
crates/terraphim_orchestrator/src/compound.rs. - The guard is declared first (so it drops last).
- A
JoinSet<AgentResult>is declared second (so it drops first on cancellation; aborts every task before the guard's Drop runs). - The bare
tokio::spawnloop at:319-330becomestasks.spawn(async move { run_single_agent(...).await }). - The collection loop at
:341-365becomestokio::time::timeout_at(collect_deadline, tasks.join_next()).Ok(Some(Err(join_err)))logs atwarnand continues;Ok(None)exits the loop normally. - The explicit cleanup at
:367-370is removed; the guard'sDropperforms it. - Dependencies: Step 4.
- Tests: see Step 8.
- Rollback: revert to the previous explicit cleanup and switch back
to
tokio::spawn.
Step 6. Add kill_on_drop(true) to the Command builder at
compound.rs:490-528.
- File:
crates/terraphim_orchestrator/src/compound.rs. - Single-line change:
cmd.kill_on_drop(true);immediately after thetokio::process::Command::new(cli_tool)constructor. - Without it, aborting the JoinSet wrapping task drops
Childbut does not kill the underlying process (research 3.1). - Dependencies: Step 5.
- Rollback: remove the single line.
Step 7. Document the Drop ordering invariant in compound.rs.
- File:
crates/terraphim_orchestrator/src/compound.rs. - Add a multi-line comment immediately above the guard / JoinSet declarations. Inverting the order recreates the bigbox storm race.
- Dependencies: Steps 5-6.
- Tests: review-time documentation; not testable.
- Rollback: remove the comment.
Step 8. Add cancellation-property integration test.
- New file:
crates/terraphim_orchestrator/tests/compound_cancellation_test.rs. - Test outline:
- Set up a real git repo in a
TempDir(re-usescope.rs::tests::setup_git_repo, promoted topub(crate)). - Construct a minimal
SwarmConfigwhose single review group invokescli_tool = "/bin/sleep". Gives a long-lived subprocess that does not exit on its own. - Spawn
workflow.run("HEAD", "HEAD~1")inside atokio::spawn; capture itsJoinHandle. - Wait 200 ms for the worktree to be created and the agent subprocess to be live.
- Capture the agent subprocess PID by listing PIDs whose
cwdis the worktree. - Call
handle.abort()on the outer task; awaithandle. - Within 2 s, assert:
- The worktree directory under
<base>/review-*is gone. - The
.git/worktrees/review-*admin entry is gone. - The agent PID is no longer alive (
kill(pid, 0)returnsESRCH).
- The worktree directory under
- Set up a real git repo in a
- Variant: deliberately reintroduce the Layer 0 cursor bug (call
check_cron_schedulestwice without advancinglast_tick_time); assert no worktree directory exists at the end. - Dependencies: Steps 5-7.
- Rollback: delete the file.
Layer 2 (issue #1570) -- startup sweep
Step 9. Add WorktreeManager::sweep_stale method.
- File:
crates/terraphim_orchestrator/src/scope.rs. - Synchronous signature:
pub fn sweep_stale(&self, extra_roots: &[PathBuf]) -> SweepReport. SweepReport:pub structwithswept_count,failed_count,root_owned_skipped,prune_succeeded,duration_ms.- Behaviour:
- For
self.worktree_base, walk direct children whose name starts withWORKTREE_REVIEW_PREFIX. For eachextra_root, walk all direct children. Invokegit -C <repo> worktree remove --force <child>; fall back tostd::fs::remove_dir_allon non-zero exit. - After walking, run
git -C <repo> worktree prune --verbose. EACCES/EPERMincrementsroot_owned_skippedwithout marking the sweep failed. Layer 3 catches these on the next service restart.
- For
- Emit a single
info!line at the end with allSweepReportfields plusbacklog_count = swept_count + root_owned_skipped. - Dependencies: Step 4.
- Tests:
test_sweep_stale_empty_dir.test_sweep_stale_no_base.test_sweep_stale_removes_review_prefix.test_sweep_stale_preserves_non_review_prefix.test_sweep_stale_runs_prune.test_sweep_stale_skips_root_owned(Linux + root only; gated by#[cfg_attr(not(target_os = "linux"), ignore)]and runtimegetuid() == 0check).
- Rollback: delete the method and struct.
Step 10. Wire sweep_stale into Orchestrator::new.
- File:
crates/terraphim_orchestrator/src/lib.rs. - Insert sweep call immediately after
compound_workflowis constructed, before theOk(Self { ... })block at:786-867. extra_roots = vec![PathBuf::from("/tmp/adf-worktrees")]. Literal matcheslib.rs:5393.- Add
pub fn worktree_manager(&self) -> &WorktreeManageraccessor onCompoundReviewWorkflowif it does not exist; trivial. - Dependencies: Step 9.
- Tests:
crates/terraphim_orchestrator/tests/sweep_on_startup_test.rspre-seeds threereview-*dirs under aTempDir, callsOrchestrator::new, asserts the directories are gone beforenewreturns. - Rollback: remove the call.
Layer 3 (issue #1571) -- root-privileged ExecStartPre
Step 11. Add scripts/adf-setup/adf-cleanup.sh.
- New file, POSIX
sh(not bash). Idempotent. Walk<WORKTREE_ROOT>/review-*and/tmp/adf-worktrees/*, invokegit worktree remove --forceper entry (fall back to recursive directory removal), thengit worktree pruneonce. Emit one summary line to stdout. - Env vars:
ADF_REPO_PATH,ADF_WORKTREE_ROOT,ADF_AGENT_TMP_ROOT(all overridable via systemdEnvironment=). - The literal prefix
review-cross-referencesWORKTREE_REVIEW_PREFIXinscope.rs. - Dependencies: Step 4.
- Tests: see Step 12.
- Rollback: delete the file.
Step 12. Add scripts/adf-setup/tests/test_adf_cleanup.sh.
- POSIX shell test driver. Pre-seeds three
review-*directories plus akeep-me/directory, runsadf-cleanup.sh, asserts:review-*gone.keep-me/preserved.git worktree listclean.- Second run is a no-op.
- Dependencies: Step 11.
- Rollback: delete the file.
Step 13. Add deployment doc snippet.
- New file:
docs/operations/adf-orchestrator-systemd.mdcontaining the[Service]fragment, install commands, and a note that an in-tree installer is not yet present. - Dependencies: Step 11.
- Rollback: delete the file.
Cross-cutting
Step 14. Add new tracing fields to existing sweep and cleanup events.
- Files:
crates/terraphim_orchestrator/src/scope.rs,crates/terraphim_orchestrator/src/worktree_guard.rs. - Fields per research 4.2:
swept_count,failed_count,root_owned_skipped,duration_mson sweep summary;cancellation_reason(optional) on guard's Drop. - Dependencies: Steps 3, 9.
- Rollback: remove the new fields.
Step 15. Update epic body in Gitea after approval.
- Not a code change. Comment on epic #1567 confirming the design is approved and link to this document.
- Dependencies: human approval.
4. Per-layer file change manifest
The diff sketches below are illustrative. Exact line numbers and imports may shift during implementation; the structural shape is the review surface.
4.1 Layer 0 (#1562) -- schedule cursor
// crates/terraphim_orchestrator/src/lib.rs -- struct, near :241
/// Per-agent last cron fire timestamp to prevent re-triggering
/// within same schedule window.
last_cron_fire: ,
+ /// Last compound-review fire time, used to gate the compound
+ /// schedule independently of `last_tick_time`. Mirrors the
+ /// `last_cron_fire` pattern for per-agent crons (#1562).
+ last_compound_review_fired_at: ,
/// Lazy-initialised Gitea tracker for gitea-issue pre-check.
pre_check_tracker: ,// crates/terraphim_orchestrator/src/lib.rs -- Self { ... }, near :817
last_cron_fire: new,
+ last_compound_review_fired_at: None,
pre_check_tracker: None,// crates/terraphim_orchestrator/src/lib.rs -- :7137-7161
// Also check compound review schedule
if let Some = self.scheduler.compound_review_schedule The cursor advance is synchronous and runs before any
.await. If handle_schedule_event is cancelled by the 90 s
reconcile-timeout wrapper, the cursor has already moved past the
fired occurrence; the next iteration's
take_while(|t| *t <= now) still returns the same fire_time but
the already_fired gate short-circuits the call.
4.2 Layer 1 (#1569) -- WorktreeGuard and compound.rs
// crates/terraphim_orchestrator/src/worktree_guard.rs -- struct
// crates/terraphim_orchestrator/src/worktree_guard.rs -- cleanup
// crates/terraphim_orchestrator/src/scope.rs -- module scope
use ;
use HashSet;
use ;
use ;
use Uuid;
+
+use crateWorktreeGuard;
+
+/// Directory name prefix for compound-review worktrees. Single
+/// source of truth for Layer 2 sweep and the Layer 3 shell script.
+/// Changes here must be mirrored in
+/// `scripts/adf-setup/adf-cleanup.sh`.
+pub const WORKTREE_REVIEW_PREFIX: &str = "review-";// crates/terraphim_orchestrator/src/scope.rs -- create_worktree
pub async // crates/terraphim_orchestrator/src/compound.rs -- run, replacing :296-370
// Order matters. `_guard` is declared BEFORE `tasks` so on
// `Drop` the JoinSet aborts every wrapping task (and via
// `kill_on_drop` every subprocess) BEFORE the guard's
// synchronous `git worktree remove` runs. Inverting this
// order recreates the storm race documented in the
// worktree-lifecycle research doc section 3.3.
- let worktree_name = format!;
- let worktree_path = self
+ let worktree_name = format!;
+ let _guard = self
.worktree_manager
.create_worktree
.await
.map_err?;
+ let worktree_path = _guard.path.to_path_buf;
+
+ let mut tasks: JoinSet =
+ new;
- // Channel for collecting agent outputs
- let = ;
-
- // Spawn agents in parallel
let mut spawned_count = 0;
for group in active_groups
- drop;
let mut agent_outputs = Vecnew;
let mut failed_count = 0;
let collect_deadline =
now + self.config.timeout + from_secs;
loop // crates/terraphim_orchestrator/src/compound.rs -- run_single_agent, near :490
let mut cmd = new;
+
+ // Ensure that dropping the `Child` handle kills the underlying
+ // subprocess. Without this, JoinSet::abort() + Child::drop()
+ // leaves the subprocess running until its own timeout (default
+ // 30 minutes; see SwarmConfig::timeout). Combined with the
+ // JoinSet abort, this gives cooperative-then-forceful shutdown
+ // on cancellation.
+ cmd.kill_on_drop;Note also: the use tokio::sync::mpsc; line at compound.rs:4 can
be removed once the channel-based collection is gone (clippy
unused-import).
4.3 Layer 2 (#1570) -- startup sweep
// crates/terraphim_orchestrator/src/scope.rs -- new public API
+ /// Summary of a sweep_stale invocation. Emitted via structured
+ /// tracing so Quickwit can compute backlog gauges and alert on
+ /// large residue.
+
+
// crates/terraphim_orchestrator/src/lib.rs -- Orchestrator::new, near :786
// ... existing construction code, including `compound_workflow` ...
+ // Sweep any worktree residue left by a previous instance
+ // before we start servicing schedules. Synchronous; finishes
+ // before any tick thread is spawned in `run()`.
+ // Per-agent root `/tmp/adf-worktrees` is hard-coded to match
+ // `create_agent_worktree` at lib.rs:5392-5438.
+ let sweep_report = compound_workflow
+ .worktree_manager
+ .sweep_stale;
+ if sweep_report.swept_count + sweep_report.root_owned_skipped > 10
Ok4.4 Layer 3 (#1571) -- root-privileged sweep
# scripts/adf-setup/adf-cleanup.sh
#!/bin/sh
# adf-cleanup.sh -- pre-start sweep of stale ADF worktrees.
#
# Invoked by systemd as `ExecStartPre=` for adf-orchestrator.service.
# Runs as root so it can reclaim worktree contents owned by
# sub-process container builds and other elevated agents.
#
# Cross-reference: WORKTREE_REVIEW_PREFIX in
# crates/terraphim_orchestrator/src/scope.rs. The literal "review-"
# below must stay in sync with that constant.
ADF_REPO_PATH=""
ADF_WORKTREE_ROOT=""
ADF_AGENT_TMP_ROOT=""
swept=0
failed=0
# 1. Compound review residue under ${ADF_WORKTREE_ROOT}/review-*.
if [; then
for; do
[ || continue
done
fi
# 2. Per-agent residue under /tmp/adf-worktrees/*.
if [; then
for; do
[ || continue
done
fi
# 3. Reconcile git's admin registry. Failure here is not fatal.
||
# scripts/adf-setup/tests/test_adf_cleanup.sh
#!/bin/sh
THIS_DIR=""
CLEANUP_SH="/../adf-cleanup.sh"
TMP=""
REPO="/repo"
WT_ROOT="/.worktrees"
for; do
done
[ || { ; ; }
ADF_REPO_PATH="" \
for; do
if [; then
fi
done
[ || { ; ; }
# Idempotency: second run.
ADF_REPO_PATH="" \
# docs/operations/adf-orchestrator-systemd.md (drop-in snippet)
# Add to /etc/systemd/system/adf-orchestrator.service.d/cleanup.conf:
[Service]
Environment=ADF_REPO_PATH=/data/projects/terraphim/terraphim-ai
Environment=ADF_WORKTREE_ROOT=/data/projects/terraphim/terraphim-ai/.worktrees
ExecStartPre=/opt/ai-dark-factory/bin/adf-cleanup.shDeployment commands (manual until a proper installer lands):
5. Test strategy
5.1 Layer 0 (#1562) test strategy
Two test surfaces:
- Unit test in
lib.rstest moduletest_compound_review_cursor_advances_on_cancellation. Uses the existingset_last_tick_timetest helper (lib.rs:7712) to plant alast_tick_timevalue in the past. Callscheck_cron_schedulesonce and verifieslast_compound_review_fired_atisSome(_). Calls again and verifies the field did not change. - Property test colocated with Step 8's
compound_cancellation_test.rs. Forces the storm shape: plant a pastlast_tick_time, call the schedule check, cancel the resulting future before the cursor advance would have happened under the buggy code path, verify the cursor still moved. This test deliberately re-introduces the bug shape so it acts as a regression bell.
5.2 Layer 1 (#1569) test strategy
Three test surfaces:
worktree_guard.rsunit tests, augmenting the existing four-test module:test_managed_guard_invokes_git_remove.test_managed_guard_fallback_on_git_failure.test_managed_guard_keep_disarms.
scope.rstest updates for the new return type. Existing tests at:609-786useworktree_pathdirectly; update to take the guard and.keep()when inspection follows.- Cancellation property test
(
compound_cancellation_test.rs, Step 8). The headline acceptance test. It encodes the full property:- Real git repo, real worktree, real subprocess (no mocks).
- Parent task aborted at an arbitrary
.awaitpoint. - Within 2 s: worktree dir gone,
.git/worktrees/<name>gone, subprocess PID dead.
Variant of the cancellation test with Layer 0 deliberately broken
(planted past last_tick_time without a cursor field) asserts the
storm shape -- repeated firing -- produces zero leaked
worktrees with Layer 1 in place. This is the property the user's
runbook will exercise on bigbox.
5.3 Layer 2 (#1570) test strategy
scope.rsunit tests forsweep_stale:test_sweep_stale_empty_dir.test_sweep_stale_no_base.test_sweep_stale_removes_review_prefix.test_sweep_stale_preserves_non_review_prefix.test_sweep_stale_runs_prune.test_sweep_stale_skips_root_owned(Linux + root only; gated by#[cfg_attr(not(target_os = "linux"), ignore)]and a runtimegetuid() == 0check that skips with explanation if not root).
- Integration test in
tests/sweep_on_startup_test.rs: pre-seed threereview-*dirs under aTempDir-rooted config; callOrchestrator::new; assert dirs are gone beforenewreturns.
5.4 Layer 3 (#1571) test strategy
scripts/adf-setup/tests/test_adf_cleanup.sh(Step 12): the shell test driver described above. Run from CI via a Makefile target or directly.- CI invocation: add the shell test to whatever CI step runs
the Python tests under
scripts/adf-setup/tests/(currentlytest_migrate.pyand friends). - Manual bigbox verification (section 8 runbook): pre-seed a
stale
review-*directory, runsystemctl restart adf-orchestrator, confirm theExecStartPreline in the unit's journald output includes a non-zerosweptcount.
5.5 Cross-cutting property tests
One acceptance property must hold across all four layers and is verified by the cancellation test in Step 8 plus the bigbox runbook:
With the schedule-cursor bug deliberately re-introduced (i.e. Layer 0's cursor field removed and the old
should_fireblock reverted), the parent task being cancelled at any.awaitpoint must leave zero worktree directories on disk and zero agent subprocesses alive.
This property is what the storm violated. It is encoded as a deterministic test rather than relying on bigbox to surface regressions days later.
6. Traceability matrix
| Step | Layer | Gitea issue | Acceptance criterion (epic #1567) | Verifying test(s) |
|-----:|------:|------------:|------------------------------------|--------------------|
| 1 | 0 | #1562 | "Cursor field exists on Orchestrator." | unit: field-init test |
| 2 | 0 | #1562 | "Compound schedule fires at most once per cron occurrence under reconcile-timeout cancellation." | unit: test_compound_review_cursor_advances_on_cancellation; property variant in Step 8 |
| 3 | 1 | #1569 | "WorktreeGuard invokes git worktree remove --force for managed worktrees." | unit: test_managed_guard_invokes_git_remove, test_managed_guard_fallback_on_git_failure |
| 4 | 1 | #1569 | "WorktreeManager::create_worktree returns the guard; constant WORKTREE_REVIEW_PREFIX exists." | existing scope tests updated; compile-time check on the constant |
| 5 | 1 | #1569 | "Compound swarm uses JoinSet; explicit cleanup removed; guard owned by run." | property: compound_cancellation_test::test_cancellation_leaves_no_worktree |
| 6 | 1 | #1569 | "kill_on_drop(true) on agent Command." | property: same test, subprocess-PID assertion |
| 7 | 1 | #1569 | "Drop ordering documented in code." | review-time only |
| 8 | 1 | #1569 | "Cancellation leaves zero worktrees and zero subprocesses." | compound_cancellation_test.rs |
| 9 | 2 | #1570 | "sweep_stale exists, covers both roots, returns a report." | unit: five-plus test_sweep_stale_* tests in scope.rs |
| 10 | 2 | #1570 | "Sweep runs from Orchestrator::new before any await." | integration: sweep_on_startup_test.rs |
| 11 | 3 | #1571 | "adf-cleanup.sh cleans review-* and /tmp/adf-worktrees/* idempotently." | shell: test_adf_cleanup.sh |
| 12 | 3 | #1571 | "Shell test covers happy path + idempotency." | shell: test_adf_cleanup.sh |
| 13 | 3 | #1571 | "Deployment doc exists; ops team has copy/paste commands." | review-time only |
| 14 | x-cut | #1567 | "Tracing fields present for Quickwit ingest." | review-time + manual log inspection |
| 15 | x-cut | #1567 | "Epic comment posted with approval link." | manual |
7. Rollout sequence and dependencies
Each layer is independently mergeable. Recommended commit / merge order, with rationale:
7.1 Order
- Layer 0 (#1562) -- first.
- Layer 3 (#1571) -- second.
- Layer 1 (#1569) -- third.
- Layer 2 (#1570) -- last.
7.2 Rationale
- Layer 0 first because it caps blast radius. From the moment Layer 0 is deployed to bigbox, even an unrelated regression that drops a guard prematurely leaks one worktree per cron occurrence (24 h for nightly) rather than one per 90 s. The on-call team gets 200x more time-to-detect for everything that follows.
- Layer 3 second because it is hot-shippable defence: a shell
script and a systemd
ExecStartPre=line. It does not depend on Layer 1 or Layer 2 code changes and survives a worst-case regression of those. The script alone reduces the recovery runbook from research section 4.3 to a singlesystemctl restart. - Layer 1 third because it is the primary technical fix and the one most worth careful review. Landing it after Layer 0 means its test surface (the cancellation property test in Step 8) operates in a regime where the storm cannot happen even if the test regresses.
- Layer 2 last because it depends on the prefix constant from
Layer 1 (Step 4) and on the guard cleaning up the registry
(Layers 1 and 2 share the
WorktreeGuardAPI and theWORKTREE_REVIEW_PREFIXliteral).
7.3 Inter-dependencies summary
| Layer | Depends on |
|------:|------------|
| 0 | nothing |
| 3 | nothing in tree; reads review- literal from the script's own comment |
| 1 | Layer 0 not required but recommended (test isolation); uses WORKTREE_REVIEW_PREFIX |
| 2 | Layer 1's WORKTREE_REVIEW_PREFIX constant; Layer 1's guard API |
The only hard in-tree dependency is Layer 2 on Layer 1's
WORKTREE_REVIEW_PREFIX. If schedule pressure forces an
out-of-order merge, Layer 2 can either inline the literal or land
the constant in a tiny prerequisite commit.
8. Verification on bigbox post-deploy
After each layer's PR is merged and the orchestrator is redeployed to bigbox, run the corresponding verification step.
8.1 Layer 0 verification
# 1. Confirm the new struct field is reachable in the binary.
# expected: >= 1
# 2. Tail the journal and look for the "compound review schedule
# fired" line with the new `fire_time` field.
|
# expected: at most ONE log line per cron occurrence, not per 90 s.
# 3. Confirm cursor is persisted across reconcile-timeout
# cancellations.8.2 Layer 1 verification
# 1. Pre-seed the system with a known-good condition: zero
# review-* directories on disk.
# expected: 0
# 2. Trigger a compound review manually (or wait for the nightly
# cron). Watch the journal.
|
# 3. Confirm the worktree is created AND removed. Look for
# `worktree cleaned up via git` at the end of the review, with
# `duration_ms` < 5000.
# expected: 1 created, 1 cleaned up
# 4. Forced-cancellation test: simulate a reconcile timeout by
# sending SIGSTOP to the orchestrator for 100 s during a review.
# Then check: 0 leaked worktrees on disk, 0 stray `claude` /
# `opencode` PIDs in the cgroup.
# expected: Tasks well below 100, not 2 413.8.3 Layer 2 verification
# 1. Pre-seed three fake review-* directories before restart.
# 2. Restart the orchestrator.
# 3. Within 5 s, confirm the directories are gone and the journal
# shows the `worktree sweep_stale complete` event with
# swept_count=3.
# expected: 0
# expected: one line with swept_count=3 prune_succeeded=true8.4 Layer 3 verification
# 1. Pre-seed a root-owned residue (simulates the bigbox failure
# mode that Layers 1 and 2 cannot reach).
# 2. Restart.
# 3. Confirm the `ExecStartPre` line in the journal.
# expected: "adf-cleanup: swept=1 failed=0 repo=/data/..."
# 4. Confirm the residue is gone.
# expected: empty8.5 Whole-epic acceptance
After all four layers are deployed, run the user's stated acceptance criterion:
# Manually introduce a bunch of stale review-* directories.
# expected: 0 (all reclaimed by a single systemctl restart, no
# human-typed cleanup commands)9. Risk register
Top risks ranked by severity. Each carries an explicit mitigation captured in the design.
| # | Risk | Severity | Mitigation |
|--:|------|----------|------------|
| 1 | Synchronous Drop invoking git worktree remove parks a tokio worker for ~1.5 s. With 692 leaked trees a startup sweep could park the whole pool. | Medium | Layer 2's sweep_stale is synchronous and runs from Orchestrator::new -- explicitly outside any tokio runtime. Layer 1's per-call Drop parks one worker per cancelled review at most. Sustained worker starvation is not possible under steady-state operation. Documented in research section 3.2. |
| 2 | git worktree remove --force races with a still-live subprocess writing into the worktree. Subprocess may see file-not-found errors or write into a now-anonymous inode. | Low (on Linux) | Drop ordering (Step 7): JoinSet aborts before the guard's git-remove runs. kill_on_drop(true) then sends SIGKILL to each subprocess when the runtime polls the aborted tasks. On Linux a recursive directory removal unlinks dentries; subprocesses with open file descriptors continue writing to anonymous inodes that are reclaimed when the descriptor closes. State this explicitly in code comments and PR description. |
| 3 | sweep_stale removes a worktree that another orchestrator instance is using (e.g. during a botched dual-deploy). | Low | The systemd unit has Type=simple with no instance multiplexing; on bigbox only one orchestrator runs at a time (mutex enforced by the unit). Documented in research section 3.5.3. If multi-instance is ever needed, sweep gains a pidfile-aware liveness check; not required today. |
| 4 | kill_on_drop(true) interacts badly with a deliberate worktree hand-off case (e.g. future PR-review agents that hand off the worktree to a follow-on subprocess). | Low | The hand-off pattern is not implemented today. When it is implemented, the relevant call site uses WorktreeGuard::keep() to disarm the guard and does not rely on Child being held in a JoinSet wrapper. Documented as a YAGNI-allowed surface (section 2.2). |
| 5 | Layer 3's shell script's literal review- drifts from WORKTREE_REVIEW_PREFIX in scope.rs. | Low | The constant carries a documentation comment requiring synchronised changes; the shell script carries the same comment in reverse direction. A future test could grep -q '"review-"' scope.rs from the shell test to catch drift mechanically; recommended but not in scope. |
| 6 | Orchestrator::new blocks longer than systemd's TimeoutStartSec if sweep_stale finds thousands of stale entries. | Low | The cardinality is bounded by disk capacity; even 1 000 entries at 1.5 s each is ~25 minutes -- worse than systemd's default 90 s TimeoutStartSec. Mitigation: emit a warn! if swept_count > 100 and document raising the unit's TimeoutStartSec for the first post-storm restart only. After the first restart, residue is bounded by Layer 1's effectiveness. |
10. Out of scope (deferred)
Carrying forward from research section 7 and adding items that emerged during design:
- Async
Drop(research 7.1). Unstable; revisit if Rust stabilisesAsyncDrop. - Switching the
croncrate (research 7.2). - Per-agent worktree namespacing (research 7.3).
sccachelifecycle hooks (research 7.4).- Compound-review backpressure beyond cursor advance (research 7.5).
- Existing per-agent guard registry leak (research 7.6; Step
3 rationale): folding the per-agent path at
lib.rs:5392-5438onto the newfor_managedconstructor would close the registry leak but expands review surface for #1569. Recommended as a sibling follow-up issue. - In-tree systemd installer:
scripts/adf-setup/deploys TOML and Python today; introducing a real install script is a larger ops investment. Deferred to a dedicated issue, documented in Step 13. - Sweep-vs-shell-script literal drift test: a grep-based check
that the shell script's
review-*andscope.rs'sWORKTREE_REVIEW_PREFIXmatch. Recommended but adds shell-test complexity; deferred. - CI sweep test as non-root: a robust permission-denied test
for
sweep_staleis hard to write in a portable CI environment without root. Manual verification on bigbox covers it (section 8.4).
11. Approval gate
Implementation (Phase 3) must not begin until human approval lands on epic terraphim/terraphim-ai#1567.
Approval is a comment on the epic confirming:
- The decisions in section 2.4 (research section 5 resolutions) are accepted.
- The rollout order in section 7 is accepted (Layer 0 -> Layer 3 -> Layer 1 -> Layer 2).
- The acceptance property in section 5.5 is the right contract.
- The risk register in section 9 is acceptable (in particular risk #2 on the unlink-vs-live-write race, which is benign on Linux but non-obvious).
On approval, Phase 3 (Implementation) proceeds step-by-step per section 3, one commit per step, with each commit referencing the relevant Gitea issue:
- Steps 1-2 ->
Refs #1562. - Steps 3-8 ->
Refs #1569. - Steps 9-10 ->
Refs #1570. - Steps 11-13 ->
Refs #1571. - Steps 14-15 ->
Refs #1567.
The cancellation property test (Step 8) is the headline acceptance artefact; reviewers should expect to spend most of their review time there.
Phase 2 deliverable
This document is the Phase 2 deliverable. It captures the four-layer design, file-level diffs, test plans, traceability, rollout order, risk register, and the post-deploy verification runbook. No code is changed; no PR is opened.
Next step: human review and approval comment on epic terraphim/terraphim-ai#1567 at https://git.terraphim.cloud.