ROC v1 Staged Rollout Runbook
ROC v1 Step L — Refs terraphim/adf-fleet#40
Staged rollout of the full ROC v1 agent fleet to production projects. Order: digital-twins (48h soak) → odilo (48h soak) → terraphim.
Pre-flight Checklist
Complete all items before touching any project configuration.
1. Verify orchestrator version
All ROC v1 Steps A–K must be merged into main on bigbox before this runbook is executed.
# Confirm the merge commit for ROC v1 Step K (cron cadence) is present.
2. Confirm orchestrator service is healthy
# Expected: Active: active (running)
# No FATAL or PANIC lines.
3. Verify webhook endpoint is reachable from bigbox
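A minimal sketch of how the status code from this check could be interpreted, assuming it was captured with curl's `-w '%{http_code}'` write-out option. The function name `classify_webhook_status` is hypothetical and not part of the orchestrator tooling.

```shell
# Hypothetical helper: interpret the HTTP status code printed by
#   curl -s -o /dev/null -w '%{http_code}' <webhook URL>
classify_webhook_status() {
  case "$1" in
    000) echo "unreachable" ;;  # curl got no HTTP response at all
    4??) echo "alive" ;;        # payload rejected, but the endpoint answered
    *)   echo "unexpected" ;;   # anything else deserves a closer look
  esac
}
```

Only the `alive` result matches the expectation stated for this step; the other two results warrant investigation before continuing.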
# Expected: 4xx (payload rejected, but endpoint alive). 000 = not reachable.
4. Confirm Quickwit indices exist
# Must include at minimum: adf-events-digital-twins, adf-events-odilo, adf-events-terraphim
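One way to compare the listing against the required set, sketched under the assumption that the existing index IDs have already been extracted one per line. The helper name `missing_indices` is hypothetical.

```shell
# Hypothetical helper: read index IDs on stdin (one per line) and print
# every required index that is absent. Prints nothing when all are present.
missing_indices() {
  existing=$(cat)
  for required in adf-events-digital-twins adf-events-odilo adf-events-terraphim; do
    # -x matches whole lines only, -F treats the name as a fixed string.
    printf '%s\n' "$existing" | grep -qxF "$required" || echo "$required"
  done
}
```

Empty output means all three required indices are present.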
# Create missing indices via the Quickwit UI or CLI before proceeding.
5. Read the webhook secret
# Note the `secret` value. Used in webhook registration commands below.
# Do NOT echo this secret into terminal history.
Phase 1: digital-twins (48h soak)
Step 1.1 — Register Gitea webhook
Run from bigbox (replace <WEBHOOK_SECRET> with the value from Pre-flight step 5):
# Note the returned hook id for rollback reference.
Step 1.2 — Update conf.d cron for digital-twins
Edit the live config on bigbox. This is operator work: do not apply via automation.
Find every implementation-swarm agent block and apply this diff:
# Before (hourly):
schedule = "0 * * * *"
# After (every 20 minutes):
schedule = "*/20 * * * *"
Reviewer (pr-reviewer) blocks have no schedule field; leave them unchanged.
Meta (project-meta, fleet-meta) blocks remain at */30 * * * *; leave them unchanged.
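A read-only check can help confirm that only the intended blocks changed. The sketch below lists each agent with its schedule so the result can be reviewed by eye. It assumes every `[[agents]]` block sets `name` and `schedule` as simple double-quoted strings; the function name `list_agent_schedules` is hypothetical. It does not modify the file.

```shell
# Hypothetical read-only helper: print "<agent name><TAB><schedule>" for every
# [[agents]] block in a conf.d file, or "(none)" when a block has no schedule.
list_agent_schedules() {
  awk '
    function emit() { if (name != "") print name "\t" (sched != "" ? sched : "(none)") }
    /^\[\[agents\]\]/                  { emit(); in_block = 1; name = ""; sched = ""; next }
    in_block && /^[ \t]*name[ \t]*=/     { split($0, q, "\""); name = q[2] }
    in_block && /^[ \t]*schedule[ \t]*=/ { split($0, q, "\""); sched = q[2] }
    END                                { emit() }
  ' "$1"
}
```

After the edit, implementation-swarm rows should show `*/20 * * * *`, meta rows `*/30 * * * *`, and pr-reviewer rows `(none)`.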
Validate the edited file:
# Expected: OK (no parse errors)
Step 1.3 — Restart orchestrator
# Expected: Active: active (running)
Step 1.4 — Verify webhook delivery (smoke test)
Open a test comment on any terraphim/digital-twins issue, then check:
# Expected: >= 1 within 2 minutes of the comment.
If 0 after 5 minutes, check orchestrator logs:
Step 1.5 — Monitor for 48 hours
Run acceptance queries at T+24h and T+48h (see Acceptance Queries section below).
Phase 1 Acceptance Gate
Both conditions must be true before proceeding to Phase 2.
# Revert rate must be 0:
# MUST be 0 to proceed.
# At least one verified auto-merge:
# MUST be >= 1 to proceed.
If either condition fails, execute the Rollback Procedure for digital-twins before touching odilo.
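The gate can be expressed as a single predicate over the two counts. This is a minimal sketch; the function name `gate_passes` is hypothetical, and both counts are assumed to have been read from the acceptance queries beforehand.

```shell
# Hypothetical helper: succeed only when both gate conditions hold.
#   $1 = pr_auto_reverted count (must be 0)
#   $2 = pr_auto_merged_verified count (must be >= 1)
gate_passes() {
  [ "$1" -eq 0 ] && [ "$2" -ge 1 ]
}

# Example: zero reverts and two verified auto-merges.
if gate_passes 0 2; then
  echo "gate passed: proceed to the next phase"
else
  echo "gate failed: run the Rollback Procedure"
fi
```

The same predicate applies unchanged to the Phase 2 and Phase 3 gates.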
Phase 2: odilo (48h soak)
Prerequisite: Phase 1 acceptance gate passed.
Step 2.1 — Register Gitea webhook
Step 2.2 — Update conf.d cron for odilo
Apply the same diff as Phase 1 Step 1.2 to all implementation-swarm blocks.
Step 2.3 — Restart orchestrator
Step 2.4 — Verify webhook delivery
Step 2.5 — Monitor for 48 hours
Phase 2 Acceptance Gate
# Revert rate must be 0:
# MUST be 0 to proceed.
# At least one verified auto-merge:
# MUST be >= 1 to proceed.
If either condition fails, execute the Rollback Procedure for odilo.
Phase 3: terraphim
Prerequisite: Phase 2 acceptance gate passed.
Step 3.1 — Register Gitea webhook
Step 3.2 — Update conf.d cron for terraphim-ai
Apply the same implementation-swarm schedule diff:
# Before:
schedule = "0 * * * *"
# After:
schedule = "*/20 * * * *"
Step 3.3 — Restart orchestrator
Step 3.4 — Verify webhook delivery
Step 3.5 — Monitor for 48 hours
Phase 3 Acceptance Gate
# Revert rate must be 0:
# MUST be 0 to proceed.
# At least one verified auto-merge:
# MUST be >= 1.
Rollback Procedure
Execute if revert rate > 0 in any phase, or if the orchestrator becomes unhealthy.
1. Disable the webhook for the affected project
# List hooks for the project:
# Disable (replace <HOOK_ID>):
2. Revert the cron bump in conf.d
# Restore: schedule = "0 * * * *" for all implementation-swarm blocks
3. Restart orchestrator
4. Open a retro issue on adf-fleet
Do not proceed to the next phase until the retro issue is resolved.
Observability: Quickwit Search Queries
Use these queries during and after each soak period to track fleet activity.
Replace <INDEX> with the project-specific index name.
| Event | Index suffix | Meaning |
|-------|-------------|---------|
| pr_reviewed | per-project | PR reviewer agent completed a review |
| pr_auto_merged | per-project | Merge-coordinator merged a PR automatically |
| pr_auto_reverted | per-project | A merged PR was subsequently reverted |
| pr_auto_merged_verified | per-project | Auto-merged PR passed post-merge test gate |
Active event count (last 48h)
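PROJECT=digital-twins # or: odilo, terraphim
INDEX="adf-events-${PROJECT}"
SINCE=<SINCE> # start of the 48h window
for EVENT in pr_reviewed pr_auto_merged pr_auto_reverted pr_auto_merged_verified; do
count=<COUNT> # number of EVENT hits in INDEX since SINCE
echo "${EVENT}: ${count}"
done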
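The two values the loop depends on, the window start and the per-event count query, might be derived as sketched below. This assumes Quickwit's REST search endpoint (`/api/v1/<index>/search`) is used for counting and that events carry a field named `event_type`. The field name, the base URL, and both function names are assumptions to be adjusted to the real deployment.

```shell
# Hypothetical helpers for the 48h count query.

# Start of the window: $2 hours before $1 (both in Unix epoch seconds).
window_start() {
  echo $(( $1 - $2 * 3600 ))
}

# Build a search URL that requests only the hit count (max_hits=0).
#   $1 = Quickwit base URL, $2 = index, $3 = event name, $4 = window start
count_url() {
  printf '%s/api/v1/%s/search?query=event_type:%s&start_timestamp=%s&max_hits=0\n' \
    "$1" "$2" "$3" "$4"
}

# Example: query URL for reverts in digital-twins over the last 48 hours.
since=$(window_start "$(date +%s)" 48)
count_url "http://localhost:7280" "adf-events-digital-twins" pr_auto_reverted "$since"
```

Fetching that URL should return a JSON document; the count would then be read from its `num_hits` field, assuming the standard Quickwit response shape.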
Expected healthy output after 48h soak:
pr_reviewed: <N> # >0
pr_auto_merged: <N> # >0
pr_auto_reverted: 0 # must be exactly 0
pr_auto_merged_verified: <N> # >=1
Tail recent events (troubleshooting)
Config Delta Reference
Shown for reference only. The operator applies these changes manually (see per-phase steps above). Do not apply these diffs via automation.
/opt/ai-dark-factory/conf.d/<PROJECT>.toml — implementation-swarm blocks
[[agents]]
name = "implementation-swarm"
layer = "Core"
-schedule = "0 * * * *"
+schedule = "*/20 * * * *"
All other agent blocks are unchanged:
- project-meta and fleet-meta remain at */30 * * * *.
- pr-reviewer has no schedule field (event-driven via webhook mention dispatch).
- Safety tier agents (security-sentinel, compliance-watchdog, drift-detector, test-guardian, spec-validator, documentation-generator) retain their own schedules.
Summary Checklist
| Step | digital-twins | odilo | terraphim |
|------|:---:|:---:|:---:|
| Webhook registered | [ ] | [ ] | [ ] |
| conf.d cron updated | [ ] | [ ] | [ ] |
| Orchestrator restarted | [ ] | [ ] | [ ] |
| Webhook delivery verified | [ ] | [ ] | [ ] |
| 48h soak complete | [ ] | [ ] | [ ] |
| pr_auto_reverted = 0 | [ ] | [ ] | [ ] |
| pr_auto_merged_verified >= 1 | [ ] | [ ] | [ ] |