ADF Worktree Lifecycle Hardening -- Research Document
Gitea epic: terraphim/terraphim-ai#1567
Phase: 1 of disciplined development (Research)
Author: claude-opus-4-7 (research agent)
Date: 2026-05-17
This document concludes Phase 1. No code is changed; no design is committed. The four sibling issues (#1562 Layer 0, #1569 Layer 1, #1570 Layer 2, #1571 Layer 3) are referenced as scope anchors only.
1. Problem framing and storm timeline
1.1 What happened on bigbox
On 2026-05-17 (UTC) the adf-orchestrator.service host on bigbox entered a
self-reinforcing fire loop in which the compound review schedule fired
roughly every 90 seconds, spawning a fresh six-agent review swarm before the
previous one had a chance to post a verdict.
Observed end-state at intervention:
- Memory: 99.9 GiB / 100 GiB cgroup limit (OOM imminent).
- CPU: roughly four cores pinned for more than three hours.
- Active tasks in the service cgroup: 2 413.
- Leaked git worktrees on disk: 692, occupying 48 GiB under
/data/projects/terraphim/terraphim-ai/.worktrees/review-*. - Compound review issue #514 silent: the last verdict comment was timestamped
2026-05-13T10:01:45Z; no posts appeared during the storm window.
Log signature in journalctl -u adf-orchestrator:
... compound review schedule fired, starting review
... reconcile_tick exceeded timeout, forcing continuation
... compound review schedule fired, starting review
... reconcile_tick exceeded timeout, forcing continuation
... (repeating ~every 90 s for >3 h)Each compound review schedule fired line corresponds to a brand-new
review-<uuid> worktree created on disk, followed by six spawned agent
subprocesses, followed by tokio cancellation of the parent future before
either the cleanup line or the verdict-posting code ran. The 90 s cadence
matches the reconcile timeout: Duration::from_secs(tick_interval_secs.max(30) * 3)
at crates/terraphim_orchestrator/src/lib.rs:1245 with the default
tick_interval_secs = 30 (crates/terraphim_orchestrator/src/config.rs:1096-1098).
1.2 The two distinct root causes
Two bugs compound into the storm. They are tracked as separate issues so that mitigation can land independently and so that each can be reasoned about in isolation.
Cause A -- schedule cursor not advanced on cancellation (#1562, Layer 0)
The cron check at crates/terraphim_orchestrator/src/lib.rs:7148-7152 reads:
let should_fire = compound_sched
.after
.take_while
.next
.is_some;The cursor self.last_tick_time is only updated at step 14 of
reconcile_tick, at crates/terraphim_orchestrator/src/lib.rs:5687-5689:
// 14. Update last_tick_time and increment tick counter
self.last_tick_time = now;
self.tick_count = self.tick_count.wrapping_add;The tokio safety wrapper at crates/terraphim_orchestrator/src/lib.rs:1288-1296
cancels reconcile_tick after reconcile_timeout (90 s by default) and
continues the outer loop. Because cancellation drops the future before line
5687 is reached, last_tick_time retains the value set by the previous
successful tick, so on the next iteration the same past cron occurrence
satisfies should_fire again. Result: the schedule re-fires every tick.
Proposed mitigation (in the body of #1562) is to advance the cursor inside
the fire branch, before the long-running .await on handle_schedule_event,
so it survives cancellation. This is documented here as Layer 0 but the
detailed design is owned by #1562.
Cause B -- worktree cleanup is a line after await, not a Drop (this work)
The compound-review entry point creates a worktree at
crates/terraphim_orchestrator/src/compound.rs:298-304:
let worktree_name = format!;
let worktree_path = self
.worktree_manager
.create_worktree
.await
.map_err?;Cleanup occurs at crates/terraphim_orchestrator/src/compound.rs:367-370,
after the collection loop has either drained the channel or hit the
collection deadline:
// Cleanup worktree
if let Err = self.worktree_manager.remove_worktree.await When the outer future CompoundReviewWorkflow::run is dropped at any of the
intermediate .await points -- and there are many: create_worktree,
get_changed_files, the inner tokio::time::timeout_at in the collection
loop -- the cleanup line is never reached. There is no Drop impl to run as
a fall-back. Cause A turns this from a once-per-storm leak into a
once-per-90s leak, which is how 692 worktrees accumulated in three hours.
1.3 Scope of this research
This document focuses on Cause B. It treats Cause A as a sequencing prerequisite (because without #1562, even a perfect Layer 1 still creates and tears down worktrees on every fire) and surfaces the layers that make the cleanup robust under cancellation, panic, SIGKILL, and root-owned artefacts.
1.4 Four-layer scope as established by the epic
| Layer | Issue | Purpose | Runs as |
|------|------|---------|---------|
| 0 | #1562 | Schedule cursor advances inside the fire branch before .await. | orchestrator process |
| 1 | #1569 | RAII guard so worktree cleanup runs from Drop on every exit path including future cancellation. | orchestrator process |
| 2 | #1570 | WorktreeManager::sweep_stale invoked from Orchestrator::new to clean residue from SIGKILL / OOM / panic-across-runtime. | orchestrator process (user alex) |
| 3 | #1571 | adf-cleanup.sh deployed as ExecStartPre to sweep stale review-* and *-sentinel-* worktrees as root, handling root-owned files Layers 1 and 2 cannot. | systemd, as root |
Layers 1, 2, 3 are defence-in-depth: each handles a failure mode the layer above cannot. Layer 0 is upstream of all three; it reduces blast radius and buys time-to-detect for any Layer 1 regression.
2. Existing implementation map
This section cites file paths and line numbers; it does not propose code.
2.1 WorktreeManager -- crates/terraphim_orchestrator/src/scope.rs
The manager is a thin wrapper around the git worktree CLI.
WorktreeManagerstruct:crates/terraphim_orchestrator/src/scope.rs:206-210.- Constructors
newandwith_base::216-242.with_baseresolves a relativeworktree_baseagainstrepo_pathto avoid CWD-dependent behaviour (see commit622a78e88). create_worktree(async)::260-301. Shells out togit -C <repo> worktree add <path> <ref>, withGIT_INDEX_FILEremoved from the environment to defend against pre-commit-hook contamination of the index lock. Returnsio::Erroron git failure with stderr included.remove_worktree(async)::306-356. Shells out togit -C <repo> worktree remove <path>; falls back togit worktree remove --forceon first failure. Also attempts to remove the empty parent directory, ignoring the error.cleanup_all(async)::361-375. Walkslist_worktrees()and callsremove_worktreeon each. Returns the count of successful removals.list_worktrees(sync)::380-402. Reads the worktree base directory and returns names of subdirectories that contain a.gitfile or directory. Crucially, it does not consultgit worktree listor.git/worktrees/<name>registry entries. Thus an empty dir without.gitis invisible, as is a stale registry entry whose worktree dir was manually removed by hand.worktree_exists(sync)::405-407. Checks whether<base>/<name>/.gitexists.
Test coverage is thorough for the happy path -- test_create_worktree,
test_remove_worktree, test_remove_nonexistent_worktree,
test_cleanup_all, test_list_worktrees, test_worktree_exists,
test_create_duplicate_worktree_fails -- all in
crates/terraphim_orchestrator/src/scope.rs:609-786. There is no test that
exercises a partially-created worktree (dir present, .git missing) or a
registry-only orphan (no dir, .git/worktrees/<name> present).
2.2 WorktreeGuard already exists -- crates/terraphim_orchestrator/src/worktree_guard.rs
The WorktreeGuard module landed in commit f324183e4 ("phase 1 - Exit
Classifier, worktree guard, ES bulk ingest") and is currently used only
for the per-agent worktree path in lib.rs, not for the compound-review
worktree in compound.rs. The full file is 183 lines.
Key facts:
- Struct:
crates/terraphim_orchestrator/src/worktree_guard.rs:25-29. Holds aPathBufand abool should_cleanup. keep(disarm)::48-51. Setsshould_cleanup = false. Takesselfby value, which is fine for the existing caller pattern but worth noting -- the guard becomes unusable afterkeep().cleanup(private)::59-81. Callsstd::fs::remove_dir_all, notgit worktree remove. This is the most consequential finding in the existing implementation: the worktree directory is wiped, but the git internal registry under.git/worktrees/<name>is not.git worktree pruneis required to reconcile.Drop::84-88. Callscleanup()synchronously fromDrop.- Helpers
with_worktree_guardandwith_worktree_guard_async::93-111. The async variant takes a future by value but does not thread the guard into it; the guard is created in the outer scope and dropped when the helper returns. A cancellation of the outer future at the await still drops the guard via stack unwinding -- the guarantee is the standard one.
Tests exist for cleanup, keep, already-removed, and
with_worktree_guard -- :113-182. There is no test of cancellation
behaviour, nor of git registry residue.
2.3 Compound review entry point -- crates/terraphim_orchestrator/src/compound.rs
CompoundReviewWorkflow (struct :231-235) owns a SwarmConfig and a
WorktreeManager. The constructor from_compound_config at :248-250
builds the manager via WorktreeManager::with_base(&config.repo_path, &config.worktree_root).
run is the entry point -- :261-402. Step by step:
:266-275Generatecorrelation_id, log start.:277-294Callget_changed_files(:424-461, shells out togit diff --name-only base_ref git_ref). Filter groups byvisual_only.:296-304Create worktreereview-<correlation_id>. If this await is cancelled the worktree either never materialises or its creation error propagates out ofrunbefore any sender exists, so this single await is not in itself a leak source.:306-307Channel for agent outputs.:309-332Spawn each agent viatokio::spawn. Each task receives a clonedtx, a clonedworktree_path, a clonedchanged_files, and a clonedgroup.spawned_countis incremented in the loop body. TheJoinHandlereturned bytokio::spawnis dropped immediately. This is critical for risk analysis (see 3.3).:335drop(tx). The outer holder oftxis dropped so that, when all spawned senders eventually go away, the receiver returnsOk(None)and the loop exits.:336-365Collect with a deadline oftimeout + Duration::from_secs(10). Thetokio::time::timeout_atreturnsErron deadline, which logs"collection deadline exceeded, using partial results"and breaks the loop.:367-370Cleanup. This is the line that does not run on cancellation.:373-401Deduplicate findings, compute pass/fail, return result.
Per-agent execution lives in run_single_agent (:476-554). It builds a
tokio::process::Command with current_dir(worktree_path) (:523), runs
it under tokio::time::timeout (:537), and returns
AgentResult::Success or AgentResult::Failed. The timeout cancels the
cmd.output() future and the Child is dropped, but kill_on_drop is
not currently set on the Command (see compound.rs:490-528), so the
subprocess is orphaned and continues running until it exits of its own
accord. This is a hidden cost of the current code path and a key
constraint for any Phase 2 design (see 3.3).
2.4 Reconcile-loop wrapper -- crates/terraphim_orchestrator/src/lib.rs:1280-1300
The reconciliation loop body, with the 90 s safety wrapper:
Ok => reconcile_timeout is constructed at :1245:
let reconcile_timeout = from_secs;With tick_interval_secs = 30, the timeout is 90 s. The storm cadence
matches exactly.
2.5 Cron schedule check -- crates/terraphim_orchestrator/src/lib.rs:7098-7162
This is the call site that fires the storm. The relevant chunks:
:7099-7100Readnow = Utc::now().:7104-7125Per-agent cron filter: skip ifactive_agents.contains_key(name), computenext_fireusingschedule.after(&self.last_tick_time).next(), short-circuit ifnext_fire > now, additionally short-circuit iflast_cron_fire[name] >= next_fire. The last protection is what saves the per-agent cron path from the same bug -- it keeps an explicit per-agent fire cursor independent oflast_tick_time.:7127-7134Fire loop.last_cron_fire.insert(name, fire_time)happens beforespawn_agentis awaited, which is the pattern Layer 0 needs to replicate for compound review.:7137-7161Compound review check. Unlike the per-agent loop, it has no per-event cursor -- it readscompound_sched.after(&self.last_tick_time)and fires if that yields any time<= now. The "skip if active" check is absent.
The schedule type comes from the cron
crate, used via crates/terraphim_orchestrator/src/scheduler.rs:3 (use cron::Schedule;). Iterator Schedule::after(&datetime).take(n) returns
the next n occurrences strictly after datetime. The cron expression
"runs nightly at 02:00" decomposes to occurrences spaced 24 h apart -- so
once a single past occurrence is on the wrong side of the cursor, every
subsequent tick (until the cursor advances past it) re-detects it.
2.6 Other call sites of create_worktree
Two distinct lifecycle paths exist:
- Compound-review path (
crates/terraphim_orchestrator/src/compound.rs:298-304). Worktree dir under<config.worktree_root>/review-<full-uuid>. Defaultworktree_rooton bigbox is/data/projects/terraphim/terraphim-ai/.worktrees/. - Per-agent path (
crates/terraphim_orchestrator/src/lib.rs:5392-5438definescreate_agent_worktree; called at:2086). Worktree dir under/tmp/adf-worktrees/<agent_name>-<8char-uuid>(the root is hardcoded atlib.rs:5393). The correspondingWorktreeGuardis stored onManagedAgent::worktree_guard(lib.rs:182) and disarmed on success atlib.rs:6694-6697.
The per-agent path already has guard coverage; the compound-review path
does not. Note also that the per-agent path produces worktree directory
names whose prefix is the agent's own name -- so a pr-security-sentinel
spawn produces /tmp/adf-worktrees/pr-security-sentinel-<8>. This is the
likely origin of the "sentinel-*" pattern in the user's brief: it is not a
separate code path; it is the set of /tmp/adf-worktrees/*sentinel*
directories.
2.7 Runtime mode
#[tokio::main] at crates/terraphim_orchestrator/src/bin/adf.rs:246
selects the multi-threaded runtime (the default). This affects the
viability of synchronous blocking inside Drop (see 3.2).
3. Constraints and risks
3.1 Tokio cancellation semantics
When an async function is dropped at an await point, the following hold:
- Stack-allocated values that the future owns are dropped in reverse
construction order, including any
Dropimpls. tokio::spawned background tasks are not cancelled. TheirJoinHandles become "abandoned" but the tasks continue executing on the runtime until they complete or panic.- A
tokio::process::Childheld in the cancelled future is dropped, but this does not terminate the subprocess.kill_on_dropisfalseby default ontokio::process::Command; the documented behaviour matches the standard library'sstd::process::Child. The child process continues running, detached from any parent handle, until it exits on its own. - Subprocess termination therefore requires either explicit
kill_on_drop(true)on theCommandat spawn time, an explicit.kill().awaitbeforeChild::drop, or a unix process-group approach (setpgid+killpg). The currentcompound.rs:490-528builder does none of these.
This means a Drop impl on CompoundReviewWorkflow::run's state -- or
more practically on a guard owned by that state -- will fire when the
reconcile-loop wrapper cancels the future at 90 s. The remaining question
is what such a Drop is allowed to do; see 3.2.
Async Drop is unstable (still in nightly-only RFC state per the project's
stable-Rust constraint). Layer 1 must therefore use a synchronous Drop.
3.2 Sync Drop calling blocking git CLI -- the tradeoff matrix
Layer 1's Drop needs to remove a git worktree. Four implementation
strategies are worth enumerating in Phase 2; this document does not
choose between them.
| Strategy | Mechanism | Throughput cost | Failure modes |
|----------|-----------|----------------|---------------|
| A. std::process::Command::new("git").status() | Blocks the calling thread until git worktree remove returns. | On the multi-thread runtime, parks one worker for the duration; with 692 leaked trees, a startup sweep could park the whole pool. Per-Drop is small (single git invocation). On current_thread runtime, would deadlock. | git failure is observable (non-zero exit). |
| B. tokio::task::block_in_place | Tells the multi-thread runtime to move the current task elsewhere, then blocks; the runtime keeps making progress. | Same blocking work but does not starve the scheduler. Not available on current_thread runtime; would panic. | Same as A. |
| C. Detached cleanup queue | Drop sends a PathBuf to a long-lived mpsc::UnboundedSender; a dedicated worker drains it and runs git worktree remove. | Non-blocking inside Drop. Adds a moving part. Messages can be lost on process shutdown unless drained explicitly. | git failure is observed in the worker, not the calling task; need separate visibility. |
| D. Fire-and-forget tokio::spawn from Drop | Drop builds the command and spawns it. | Inherits "tokio::spawn from sync context needs a handle" -- only works if a runtime handle is in scope (via tokio::runtime::Handle::try_current). On shutdown the handle may be gone. | git failure is logged in the spawned task. |
The existing WorktreeGuard::cleanup at
crates/terraphim_orchestrator/src/worktree_guard.rs:69 already uses a
relative of strategy A -- synchronous filesystem ops
(std::fs::remove_dir_all) -- so the precedent for "we block inside
Drop" is established. The filesystem variant differs from git worktree remove in two ways:
- It does not invoke git, so it cannot reconcile
.git/worktrees/<name>. - It is bounded by inode count rather than git's internal locking, so long durations are rare.
A purely-filesystem cleanup in Drop, accompanied by a separate
git worktree prune in Layer 2's sweep, is one design point. Phase 2 must
weigh whether the registry residue between Drop and the next sweep is
acceptable (it is read-only metadata; git refuses to create a new worktree
at the same path, but new worktrees use uuids).
3.3 The detached tokio::spawn race -- highest risk to address in Phase 2
crates/terraphim_orchestrator/src/compound.rs:319-330 spawns each agent
in a detached task:
spawn;The JoinHandle is dropped at the end of the spawn statement. Consequences
when the outer run future is cancelled by the 90 s wrapper:
- Spawned tasks continue executing. Their inner
tokio::time::timeout(timeout, cmd.output())at:537only cancels the child afterself.config.timeoutelapses; defaultmax_duration_secs = 1800(30 min) gives an upper bound of 30 minutes during which up to six child processes are still running. - Each running task still holds a clone of
worktree_path, and its subprocess hascurrent_dir(worktree_path)(:523). It may write into the worktree at any moment. - If Layer 1's
Dropdeletesworktree_pathwhile a spawned task is mid-write, the agent's subprocess will see file-not-found errors or, on some filesystems, may continue writing into a now-orphaned inode. - The spawned task will then
tx.send(...).awaiton a closed channel, log a_ = ...discarded error, and exit. No findings will be collected. The verdict for that swarm is lost. - The agent subprocess itself continues running for up to
max_duration_secs(30 minutes by default), holding CPU and memory in the orchestrator cgroup. This is the immediate consequence of 3.1'skill_on_drop = falsefinding and is the direct mechanism by which the bigbox 2 413 task-count and 4-core CPU pin were reached: agent subprocesses from earlier review fires were still alive when the next fire spawned six more.
This is the primary correctness concern for Phase 2. Options to enumerate (without choosing):
abort()the join handles held in a vector thatDropwalks. Aborts each task at its next await point. By itself this does not kill the underlying subprocess (see 3.1); it merely cancels the wrapping tokio task and drops theChild. For this option to clean up subprocesses, the spawn site must additionally setkill_on_drop(true)on theCommand.- Propagate a
CancellationToken([tokio_util::sync::CancellationToken]) intoSwarmConfigand into each spawned task. Each task selects on its work future vs. the token, and on receiving the cancel signal callschild.kill().awaitbefore exiting. More plumbing but cooperative and explicit about subprocess termination. - Accept the race. Rely on
git worktree remove --forceatomicity (it doesmv+rm) and on agents being tolerant of disappearing CWDs. Highest risk; documented for completeness. Note that this option does not address the runaway-subprocess load problem at all. - Hold the spawn handles in the guard. When the guard is dropped,
it
aborts the handles, awaits a short grace period to allow subprocess teardown, then runsgit worktree remove --force. This is a refinement of (1) and is the most defensible default, but again requireskill_on_drop(true)(or explicit.kill()) at the spawn site for the subprocess to actually die.
In all four cases, Layer 1's Drop must not depend on cooperative agent
shutdown for correctness; the cleanup must succeed even when agents are
hostile. Phase 2 must explicitly plumb kill_on_drop(true) into the
spawn site at compound.rs:490 or commit to an alternative kill
mechanism; relying on drop semantics alone is incorrect under tokio's
documented default.
3.4 File-ownership constraint -- why Layer 3 must exist
On bigbox the post-incident inspection showed mixed file ownership inside
the leaked review-* worktrees:
- Source-tree files owned by
alex(the orchestrator service user). target/artefacts owned byalexorsccache, depending on whetherRUSTC_WRAPPER=sccachewas active (lib.rs:670-679enables it whensccache --versionsucceeds at orchestrator startup).- Sub-process workspaces and container artefacts owned by
root. The exact origin is project-specific (some agents elevate into container builds, some shell out to docker), but the empirical observation is thatfind /data/.../worktrees/ -uid 0returns non-zero results.
Implication: Layer 1's Drop and Layer 2's sweep, both running as alex,
cannot remove root-owned files. git worktree remove --force will
fail with Operation not permitted partway through. The worktree directory
will remain on disk with a partial structure.
Layer 3 (adf-cleanup.sh as ExecStartPre) runs in the systemd unit
context with root privileges. It can recursively delete entries under
<worktree_root>/review-* (using find ... -mindepth 1 -delete or
equivalent) and then git worktree prune. This is the only layer that
closes the root-owned residue gap.
Two design choices that follow from this constraint and need Phase 2 attention:
- Layer 1 and Layer 2 should treat
EACCES/EPERMas a warning, not an error. The orchestrator should log and move on; Layer 3 will clean up on the next service restart. - Layer 3's source-of-truth path matters. It must live under version
control so the deployed
ExecStartPreline never drifts from the cleanup logic. See open question 5.1.
3.5 Race conditions in detail
Beyond the spawn race (3.3), several other races deserve enumeration.
3.5.1 Sweep vs. live review
If WorktreeManager::sweep_stale runs in Orchestrator::new and the
process restarts during an in-flight compound review (e.g. via a deploy),
sweep will find one review-* directory whose owner is the just-killed
process. Sweep claims it as stale and deletes it.
Mitigation options:
- Age threshold: only sweep worktrees older than a configurable
duration (e.g. 2 x
max_duration_secs). Caveat: a fresh restart after an OOM will leave a worktree younger than the threshold, which sweep then skips -- defeats the purpose. - Process-liveness probe: read the orchestrator's previous pidfile; if the previous instance is dead, sweep. Adds a pidfile.
- Sweep before scheduler starts: ensure sweep finishes before the
scheduler's tick thread is spawned. This is sufficient because the
scheduler is the only path that creates new
review-*trees.
3.5.2 Sweep vs. per-agent worktrees in /tmp/adf-worktrees/
If Layer 2 also sweeps /tmp/adf-worktrees/*, it must skip agents that
are still listed in active_agents. Phase 2 must decide whether sweep
scope is review-* only, or review-* plus /tmp/adf-worktrees/*. See
open question 5.2.
3.5.3 Multiple orchestrator instances
The schema does not prevent two orchestrator processes from sharing the
same worktree_root -- e.g. during a hot-restart with Type=notify or
during a botched migration. Layer 2 has no lockfile. If two instances
sweep concurrently, both will see the same trees; the second sweep will
return "remove failed" warnings but is otherwise idempotent.
This is not a correctness bug but it is worth flagging because Layer 3
runs as ExecStartPre, and ExecStartPre runs before the main
ExecStart, so any overlap is bounded by the unit's restart semantics.
3.6 Idempotency requirements
Both WorktreeManager::sweep_stale and adf-cleanup.sh must be safe to
invoke when there is nothing to clean. Concretely:
- Empty
worktree_base(no directory at all) is not an error. - Empty
worktree_base/is not an error. git worktree pruneon a repo with no registry entries is a no-op.find ... -mindepth 1 -deleteon an empty directory is a no-op.
Layer 1's Drop-fired cleanup is inherently idempotent because
Drop only runs once per guard instance; the concern there is correctness
under repeated process-level invocation, which is addressed by Layers 2
and 3.
3.7 WorktreeGuard's existing gap -- registry residue
As noted in 2.2, the existing WorktreeGuard uses
std::fs::remove_dir_all rather than git worktree remove. This means
that today, even on the per-agent path that is supposedly already guarded:
- An agent panic or OOM removes the worktree dir.
- The git registry entry
.git/worktrees/<agent_name>-<uuid>is not cleaned up. - Over time the repo's
.git/worktrees/directory grows. git worktree listshows zombie entries;git worktree addat the same path fails with "already exists" untilpruneruns.
The user has not reported this as a visible problem -- presumably because
agent worktrees use uuid::Uuid::new_v4() for uniqueness, so the
"already exists" path is never hit. But the registry size is a slow leak
and Phase 2 should address it.
3.8 Tracing and observability of cleanup
scope.rs already emits structured tracing on create/remove:
:272-277info!oncreate_worktreeentry.:299info!on success.:292error!ongit worktree addfailure.:314info!onremove_worktreeentry.:341error!ongit worktree removefailure.:354info!on success.
worktree_guard.rs emits at lower verbosity:
:38debug!on guard creation.:50debug!onkeep().:65debug!on "already removed".:71info!on cleanup success.:74warn!on filesystem-remove failure.
compound.rs:367-370 emits warn! on cleanup failure. There is no
info! on cleanup success in the compound path, and no metric tracking
the worktree backlog. Phase 2 should add at minimum:
- A counter or gauge of
worktrees_swept_total/worktrees_activeexposed to the Quickwitadf-logsindex. - A
warn!log line ifsweep_stalefinds more than N residual worktrees at startup (a smoking gun for a recent crash storm).
4. Operational and observability requirements
4.1 What gets logged today
Cataloguing every tracing:: call in the worktree lifecycle (see 3.8 for
the full list). The summary picture:
- Create: visible.
info!withrepo_path,worktree_path,git_ref. - Normal removal: visible.
info!with name and path. - Forced removal: visible. The second invocation of
git worktree remove --forceis logged atinfo!on success anderror!on failure. - Drop-fired cleanup: visible but only at
debug/info.warn!on filesystem failure. - Storm conditions: no aggregate-level signal. The 692 leaked
worktrees on bigbox were discovered by
du -shafter the fact, not by an alert.
4.2 Required new signals for Phase 2
Three minimum additions:
- Backlog gauge. On every
sweep_staleinvocation, emitworktree_backlog{kind="review"} = Nandworktree_backlog{kind="agent"} = M. Quickwit'sadf-logsindex can alert when either exceeds a threshold (e.g. 10). - Sweep summary event. After
sweep_stalefinishes, emit oneinfo!line withswept_count,failed_count,duration_ms,root_owned_skipped, etc. This lets storm postmortems compute "how long was the backlog growing" by reading sweep logs alone. - Drop-fired cleanup event. Each Layer 1
Dropcleanup should emit a single structured event including the worktree name, the agent (if bound), the cancellation reason if known, and the success status. Today'sinfo!"worktree cleaned up" suffices on the happy path; thewarn!on failure should also include the partial-progress state (which step failed:git worktree remove, filesystem delete, orgit worktree prune).
4.3 Storm-recovery runbook -- current vs. target
Today's recovery requires a human to:
- Stop the service:
sudo systemctl stop adf-orchestrator. - Manually delete the residual worktree directories under
/data/.../worktrees/review-*using elevated privileges. - Reconcile git's registry:
cd /data/.../terraphim-ai && git worktree prune. - Sweep the per-agent root at
/tmp/adf-worktrees/(typicallysudo find /tmp/adf-worktrees/ -mindepth 1 -delete). - Restart:
sudo systemctl start adf-orchestrator.
Target (post-Layer-3) recovery: a single sudo systemctl restart adf-orchestrator. The unit's ExecStartPre=/opt/adf/bin/adf-cleanup.sh
runs as root before the main binary; the main binary's
Orchestrator::new calls sweep_stale for any residue not in scope of
the cleanup script.
Acceptance criterion for the epic: a manually-introduced bunch of stale
review-* directories is fully reclaimed by a single
systemctl restart, with no human-typed cleanup commands.
4.4 Quickwit adf-logs integration
The orchestrator already streams structured tracing into Quickwit via
crates/terraphim_orchestrator/src/quickwit.rs (573 lines; full review
out of scope here). Adding new fields to existing info! lines is
backwards-compatible. Adding a new event type (e.g.
worktree_sweep_completed) requires a Quickwit dashboard update but no
schema migration -- Quickwit's index is schema-on-read for non-mapped
fields.
5. Assumptions and open questions
Open questions identified during research. Phase 2 should resolve each before committing to a design.
5.1 Where should the adf-cleanup.sh source of truth live?
The script is deployed to /opt/ai-dark-factory/bin/adf-cleanup.sh (path
inferred from the existing adf-orchestrator.service's working directory
convention; not in tree). The repository today has no deploy/,
systemd/, or ops/ directory (ls of these returns "No such file").
Three plausible locations:
scripts/adf-cleanup.sh-- consistent with other operational scripts underscripts/ci/,scripts/adf-setup/, etc.scripts/adf-setup/cleanup.sh-- consistent with the existingscripts/adf-setup/agents/andscripts/adf-setup/scripts/directories.- A new
deploy/systemd/adf-orchestrator/adf-cleanup.shalongside a fragment of the unit file -- captures the deployment intent in tree.
Assumption: the script lives at scripts/adf-cleanup.sh until the
ops team specifies otherwise. The deployed copy must be a literal copy
of this file (or a symlink during dev), with no in-place edits on the
host.
5.2 Sweep scope: review-* only, or also *-sentinel-*?
Section 2.6 establishes that the per-agent path produces worktrees at
/tmp/adf-worktrees/<agent_name>-<8>, so any agent whose name contains
sentinel (e.g. pr-security-sentinel, security-sentinel) yields a
"sentinel" path on disk. The brief mentions sentinel-* as a separate
prefix but no separate code path produces it.
Options:
A. Compound only. Sweep <worktree_root>/review-* only. Per-agent
trees in /tmp/adf-worktrees/ are left to the existing per-agent
WorktreeGuard and to systemd's RuntimeDirectory cleanup. Smaller
blast radius; relies on the per-agent guard already being correct.
B. Compound + per-agent. Sweep <worktree_root>/review-* AND
/tmp/adf-worktrees/*. Two roots, similar logic. Catches the
per-agent registry residue noted in 3.7.
C. All .worktrees/. Sweep everything under
<worktree_root>/, including hypothetical future prefixes. Most
defensible against drift but risks deleting in-use trees from
other tools (no known case but cannot be ruled out).
Assumption: option B for Layer 2 (compound + per-agent), restricted
to review-* and entries created by create_agent_worktree. Layer 3's
shell-script equivalent should match. Documented as an open question
because the brief specified review-* and sentinel-* explicitly,
which is option B by intent but uses a different prefix vocabulary.
5.3 Is WorktreeGuard::disarm (currently keep()) needed?
The current callers either keep or drop the guard at deterministic points. The disarm pattern is useful when:
- The agent succeeds and we want to inspect its worktree out-of-band.
- The agent's output is being post-processed by another tokio task that owns the worktree.
The current code path in lib.rs:6690-6697 disarms the guard on agent
success ("Disarm worktree guard on success so it doesn't conflict with
the explicit cleanup below") and then calls remove_agent_worktree
explicitly. The disarm is essentially "use the explicit cleanup instead
of the implicit one because we want to await the result". For the
compound-review path, no out-of-band inspection currently happens, so
the guard always fires.
Assumption: a disarm (or keep) method is kept for symmetry
with the existing module and for future use, but is not exercised by
the compound-review path.
5.4 Must sweep run before any agent dispatch, or just before the first cron check?
Layer 2 inserts sweep_stale somewhere in Orchestrator::new or in
Orchestrator::run. The choice matters because:
Orchestrator::newispub fn new(config: OrchestratorConfig) -> Result<Self, OrchestratorError>-- synchronous. Calling an async sweep here requirestokio::runtime::Handle::current().block_on(...)or apollster-style helper. Adds a runtime-coupling.Orchestrator::runis async (the long-running event loop). Sweeping at the start ofrun, before the tick thread is spawned, is natural and async-friendly.
Assumption: sweep runs at the start of run, before the tick
thread is spawned at crates/terraphim_orchestrator/src/lib.rs:1218.
The first cron check (compound or per-agent) cannot complete a fire
before sweep finishes.
5.5 Upper bound on worktree age for the sweep claim
Section 3.5.1 raised the conflict between sweep and a hypothetical running review. Phase 2 must specify:
- Either an age threshold (e.g.
> max_duration_secs * 2), or - A liveness probe via pidfile, or
- An ordering guarantee (sweep before any new spawn) that obviates the threshold.
Assumption: ordering guarantee per 5.4 is sufficient; no age threshold is required because sweep is one-shot at startup.
5.6 Layer 3 in-tree shape
Open: does adf-cleanup.sh belong with the unit file fragment, or
standalone? The repository has no unit-file fragments today. Two
non-binding options:
- Standalone
scripts/adf-cleanup.sh+ a documentation snippet for the systemdExecStartPre=line indocs/operations/adf.md. - A fully self-contained
deploy/systemd/adf-orchestrator/directory holding both the unit and the script, that the ops team copies to/etc/systemd/system/.
Assumption: standalone script + docs snippet. Reduces churn against
the current scripts/adf-setup/ convention.
5.7 Should Layer 1's Drop invoke git worktree remove, or just delete the directory?
Section 3.2's tradeoff table presented four strategies. The follow-up question is which of them aligns with the existing precedent.
Today: WorktreeGuard uses std::fs::remove_dir_all (strategy A in
filesystem form). It does not invoke git, so .git/worktrees/<name>
remains.
Possible Phase 2 direction: keep Drop filesystem-only (fast,
already-precedented), but have Layer 2's sweep_stale always invoke
git worktree prune before walking directories. This pushes the
registry reconciliation to the slower sweep path and keeps the hot path
fast.
Open: whether the hot-path performance matters in practice (each
git worktree remove is sub-second; with 692 trees that is roughly 12
minutes of CPU). Phase 2 needs an empirical measurement, not a guess.
5.8 Cancellation propagation contract for spawned agents
Section 3.3's spawn race needs a Phase 2 decision: do we abort the handles, propagate a token, or accept the race?
Open. No assumption is recorded; this is the single most consequential design decision in #1569.
6. Cross-cutting concerns with #1562
6.1 Why Layer 0 limits Layer 1's blast radius
If #1569 ships without #1562, an unrelated regression that drops a guard prematurely will leak one worktree per 90 s while the storm runs. With #1562 in place, the same regression leaks one worktree per scheduled fire window (24 h for the default nightly cron), giving an operator 200-fold more time to notice the alert before disk fills up.
If #1562 ships without #1569, the cursor fix prevents the storm from firing, but any single OOM or panic during a normal nightly review still leaks. The bleed rate drops from "one per 90 s" to "one per crash" -- typically one or two per week, which is below the threshold a human would notice for months. This is in fact the regime the system was in before 2026-05-17 (the 13 leaked trees from earlier dates that co-existed with the storm's 692). #1569 is the long-tail fix.
6.2 Sequencing dependency
Both fixes can land independently. Recommended order:
- Land #1562 first (small, well-scoped, contained in
lib.rs). - Land #1569 second (touches
compound.rs,worktree_guard.rs, and adds tests for cancellation). - Land #1570 third (touches
scope.rs, addssweep_stale). - Land #1571 last (out-of-tree systemd integration; can be simulated in tests but only takes full effect on bigbox).
Rationale: each layer's tests can be deterministic given the layers beneath. Without #1562, #1569's tests would need to simulate cron-storm behaviour to be useful; with #1562 the storm cannot start and #1569 can be tested in isolation.
6.3 Risk of regressions across layers
The four layers share two surfaces:
- The
WorktreeManagertype, which Layers 1 and 2 both extend. - The
<worktree_root>/review-*directory contract, which Layers 1, 2, and 3 all assume.
A change to the worktree path convention (e.g. embedding a process PID
in the name, or using a subdirectory per pid) would invalidate all three
layers simultaneously. Phase 2 should freeze the path convention and
codify it in a constant -- WORKTREE_REVIEW_PREFIX: &str = "review-"
in scope.rs -- so Layer 3's shell script can reference the same
literal via a versioned source.
7. Out of scope
The following items came up during research and are deferred. They are not blockers for the four-layer epic and would expand the change surface meaningfully.
7.1 Moving compound review out of reconcile_tick
The compound-review entry path currently runs inline inside
handle_schedule_event (crates/terraphim_orchestrator/src/lib.rs:7513-7567),
which itself is called from reconcile_tick. A long-running review thus
blocks the entire reconciliation cycle. A natural refactor would be to
detach the review into a tokio::spawn background task with a
dedicated channel back into the orchestrator.
Deferred because:
- The detached-spawn pattern reintroduces the cancellation problem that this epic is solving.
- The fix would interact badly with the existing concurrency gate at
crates/terraphim_orchestrator/src/lib.rs:2154-2168. - It is orthogonal to the leak: a detached review still creates a worktree and still needs Layer 1 guarantees on cancellation.
7.2 Switching scheduler crate
The cron crate has known limitations (no native time-zone support,
iterator semantics that bit us in #1562). [tokio-cron-scheduler] and
[apalis] are alternatives. Migrating is a multi-week task and not
required for the leak fix.
Deferred to a dedicated tech-debt issue if the scheduling subsystem accumulates further bugs.
7.3 Agent worktree namespacing
The per-agent path uses /tmp/adf-worktrees/<agent_name>-<8char-uuid>.
Two agents with the same name run by two orchestrator processes (e.g.
during deploys) would collide. The current uuid suffix avoids name
collision in practice but does not prevent it formally. A namespacing
scheme (<orchestrator-pid>/<agent>-<uuid> or
/run/adf-<pid>/worktrees/<agent>-<uuid>) would harden this, at the
cost of a different sweep contract.
Deferred because the practical collision probability is zero (uuids) and the operational benefit is small.
7.4 sccache cleanup on worktree removal
/data/.../worktrees/review-* directories on bigbox contained large
target/ trees (~70 MiB each) and the sccache daemon's mmap'd
caches were observed across them. git worktree remove --force will
delete the files, but if sccache is running it may hold open file
descriptors that delay disk reclamation. A dedicated sccache --stop-server
before the sweep would be belt-and-braces.
Deferred because Linux unlinked-file semantics handle this correctly on filesystem level (deleted files are reclaimed when the last fd closes). Disk-pressure alerts would catch the residue.
7.5 Compound review backpressure
If the compound review legitimately takes longer than the cron interval
(e.g. a daily review that takes >24 h on a large repo), Layer 0 advances
the cursor but the actual review still overruns. A proper fix needs a
"single-flight" guard akin to last_cron_fire for compound reviews.
This is adjacent to #1562 -- in fact #1562 likely should also
introduce a last_compound_review_fire field -- but Phase 1 cannot
make that call without seeing #1562's PR.
Deferred to be raised on #1562's PR if not already addressed.
7.6 Existing WorktreeGuard registry leak
Section 3.7 identifies that the existing per-agent path leaks git registry entries even on the happy path. Fixing this is a small change but is technically out-of-scope for #1569 (whose stated objective is compound-review coverage). Phase 2 may choose to fold it in or split it into a sibling issue.
Deferred with a recommendation to fold in if effort is < 1 day.
8. Essential questions check
Per the disciplined-research skill, the following three questions gate Phase 2 entry.
8.1 Is this work energising?
Yes, on three counts:
- Concrete production incident. The 692 leaked worktrees on bigbox are visible, measurable, and have a clear cost (48 GiB disk, OOM risk, lost reviews). Closing the loop on a real incident is motivating in a way that abstract refactors are not.
- Idiomatic Rust opportunity. RAII via
Dropis one of Rust's signature strengths; building a textbook-qualityWorktreeGuardthat survives tokio cancellation is the kind of work that has lasting pedagogical value for the team. - Tight feedback loop. Each layer has a clear, testable acceptance criterion. Phase 2 can demonstrate progress in days, not weeks.
8.2 Does this leverage existing strengths?
Yes:
- The Rust+tokio expertise on the team is the right substrate for the
cancellation-safety work. The team has prior experience with
kill_on_drop,block_in_place, and cancellation tokens (cf. the existingWorktreeGuarditself). - The orchestrator code is well-instrumented with
tracing::; adding the new sweep events is a one-line-per-event change. - The systemd
ExecStartPreintegration is standard Linux administration; no novel ops infrastructure is required.
8.3 Does it meet a real need?
Yes, by all four signals:
- Production incident as evidence. Section 1.1 cites bigbox state with specific numbers; there is no doubt the bug is real.
- Storm postmortem timeline. Compound review #514 silent from 2026-05-13 to 2026-05-17 -- four days of missed feedback to humans during which the system thought it was working.
- Recurrence likelihood. The bug class (RAII not applied to a
long-lived async future) is a category that will reproduce in any
new long-running orchestrator path unless the pattern is internalised.
Landing a documented, tested
WorktreeGuardfor compound review inoculates against the next case. - User-stated priority. The epic is named, the four issues are filed, and the brief explicitly requests the disciplined-research workflow rather than ad-hoc fixes.
The evidence is convergent: the work is well-scoped, well-motivated, and aligned with team capability.
Phase 1 deliverable
This document is the Phase 1 deliverable. It captures problem framing, existing-implementation map, constraints, observability gaps, open questions, sequencing dependencies, and out-of-scope items, with file:line citations throughout. No code has been changed.
Recommended next steps:
- Review this document. Resolve sections 5.7 (Drop strategy) and 5.8 (cancellation propagation) before Phase 2 begins -- they are the only blocking open questions.
- Land #1562 to stop the bleed.
- Enter Phase 2 (Design) for #1569 with the resolutions to 5.7 and 5.8 in hand.
- Phases 2 and 3 for #1570 and #1571 follow #1569.