ADF model selection and agent spawn — reference
How the AI Dark Factory orchestrator picks a model for each dispatch and starts the agent subprocess. This walkthrough covers the whole pipeline from the dispatch queue through to the child process, cross-referencing the code and the log lines you will see in the journal.
Stage map
tick ──► DispatchContext ──► RoutingDecisionEngine.decide_route
│
┌────────────────────────┼────────────────────────┐
│ │ │
KG match Keyword match Static config
│ │ │
└─────────── merge ──────┴────────── append ──────┘
│
C1/C3 allow-list filter
│
ProviderBudgetTracker filter
│
score + telemetry
│
pick best
│
▼
spawn_with_fallback
│
try primary ──► model_args ──► tokio::process::Command
│
on fail ──► try fallback (same path)
│
▼
AgentHandle
│
subprocess ──► exit classifier ──► OutputPoster

Stage 1: DispatchContext assembly
The reconcile loop ticks every `tick_interval_secs` (default 30, set in
`orchestrator.toml`). Each tick drains four sources:
- Cron-scheduled wakeups from `[[agents]].schedule` cron fields
- Mention-polled comments (repo-wide `issues/comments` API, cursor-tracked)
- Webhook-fed `issue_comment` and `pull_request` events
- The in-memory `DispatchTask` queue (ReviewPr, AutoMerge, PostMergeTestGate)
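The queue items named above might have roughly this shape (illustrative only; the real `DispatchTask` lives in the orchestrator crate and carries more payload than shown here):

```rust
/// Illustrative sketch of the in-memory dispatch queue items; the
/// field layout is an assumption, only the variant names come from
/// the walkthrough above.
#[derive(Debug, PartialEq)]
enum DispatchTask {
    ReviewPr { pr_number: u64 },
    AutoMerge { pr_number: u64 },
    PostMergeTestGate { merge_sha: String },
}

/// Helper used for logging-style labels in this sketch.
fn task_kind(t: &DispatchTask) -> &'static str {
    match t {
        DispatchTask::ReviewPr { .. } => "ReviewPr",
        DispatchTask::AutoMerge { .. } => "AutoMerge",
        DispatchTask::PostMergeTestGate { .. } => "PostMergeTestGate",
    }
}
```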
For each item about to dispatch, the orchestrator builds a `DispatchContext`,
plus a `BudgetVerdict` from the agent's per-month USD cost budget
(`ProviderBudgetTracker`). `BudgetPressure` comes out as `NoPressure`,
`NearExhaustion`, or `Exhausted`.
The SpawnContext also carries per-agent Gitea identity. When the
orchestrator's OutputPoster has loaded an entry for (project, agent)
from agent_tokens.json, build_spawn_context_for_agent injects
GITEA_TOKEN into env_overrides. This overrides the shared root token
from ~/.profile, so gtr / curl calls inside the agent's own task
shell authenticate as the agent's own Gitea user — not root. The
OutputPoster wrapper comment + any direct gtr comment the agent
makes both show author = <agent-name> in Gitea.
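The token-injection step can be sketched as a plain map override (a minimal illustration; the function name and signature are stand-ins for what `build_spawn_context_for_agent` does internally):

```rust
use std::collections::HashMap;

/// Sketch: layer a per-agent Gitea token over the base env overrides so the
/// child process authenticates as the agent's own Gitea user. When no
/// per-agent token was loaded, the shared root token inherited from
/// ~/.profile stays in effect.
fn inject_agent_token(
    mut env_overrides: HashMap<String, String>,
    agent_token: Option<&str>,
) -> HashMap<String, String> {
    if let Some(token) = agent_token {
        // Overrides whatever GITEA_TOKEN was already present.
        env_overrides.insert("GITEA_TOKEN".to_string(), token.to_string());
    }
    env_overrides
}
```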
Stage 2: RoutingDecisionEngine picks the model
crates/terraphim_orchestrator/src/control_plane/routing.rs —
decide_route(&self, ctx: &DispatchContext, budget_verdict: &BudgetVerdict).
2a. Short-circuit for non-routable CLIs
`if !Self::supports_model_flag(...)` returns early here. Unknown CLI binaries (anything other than `claude`, `opencode`, `codex`) fall through this check.
2b. Collect candidates from three sources
collect_all_candidates gathers independent candidates:
| Source | Data | Example route |
|---|---|---|
| Knowledge Graph | Aho-Corasick match of task_keywords against docs/taxonomy/routing_scenarios/adf/{planning,implementation,review}_tier.md — trigger::, synonyms::, route::, action:: directives | concept=review_tier → route:: anthropic, haiku |
| Keyword routing | KeywordRouter built from register_providers() in bin/adf.rs; each Provider declares a keywords list | "implement/code/generate" → kimi-for-coding/k2p6 |
| Static config | model = "..." on the agent TOML | security-sentinel sets no static → empty |
2c. Merge candidates
If KG and Keyword both pick the same model string, merge into a single
CombinedKgKeyword candidate whose confidence is the average of the two.
Otherwise the engine keeps them as distinct candidates for later scoring.
Static config always appends.
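The 2c merge rule can be sketched as follows (type and field names are illustrative, not the engine's real structs; only the averaging behaviour and the `CombinedKgKeyword` label come from the text above):

```rust
#[derive(Debug, Clone, PartialEq)]
enum CandidateSource {
    KnowledgeGraph,
    Keyword,
    CombinedKgKeyword,
    StaticConfig,
}

#[derive(Debug, Clone)]
struct Candidate {
    model: String,
    source: CandidateSource,
    confidence: f64,
}

/// Sketch of the merge: when KG and keyword routing agree on the same model
/// string, collapse them into one CombinedKgKeyword candidate whose
/// confidence is the average of the two; otherwise keep both for scoring.
fn merge_kg_keyword(kg: Candidate, kw: Candidate) -> Vec<Candidate> {
    if kg.model == kw.model {
        vec![Candidate {
            model: kg.model,
            source: CandidateSource::CombinedKgKeyword,
            confidence: (kg.confidence + kw.confidence) / 2.0,
        }]
    } else {
        vec![kg, kw]
    }
}
```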
2d. C1/C3 allow-list filter (defence in depth)
`all_candidates.retain(...)` drops anything outside the allow-list. Allowed prefixes live in `ALLOWED_PROVIDER_PREFIXES` (config.rs:794):
- `claude-code`
- `opencode-go`
- `kimi-for-coding`
- `minimax-coding-plan`
- `zai-coding-plan`
- `openai` (subscription-gated via the OpenAI Plus/Pro/Team plan)
Bare names sonnet, opus, haiku are recognised as claude CLI targets and
always allowed. Load-time validation (validate()) already enforces the same
rule on the TOML; this filter exists so a malformed KG file or a telemetry
artefact cannot surface a banned target at runtime.
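The filter predicate amounts to a prefix check plus the bare-alias carve-out (a minimal sketch; the real constant lives in `config.rs`, and the helper name here is illustrative):

```rust
/// Mirror of the allow-list named above (values copied from the text).
const ALLOWED_PROVIDER_PREFIXES: &[&str] = &[
    "claude-code",
    "opencode-go",
    "kimi-for-coding",
    "minimax-coding-plan",
    "zai-coding-plan",
    "openai",
];

/// Bare names recognised as claude CLI targets, always allowed.
const BARE_CLAUDE_ALIASES: &[&str] = &["sonnet", "opus", "haiku"];

/// Sketch of the C1/C3 runtime predicate used in
/// `all_candidates.retain(|c| is_allowed(&c.model))`.
fn is_allowed(target: &str) -> bool {
    BARE_CLAUDE_ALIASES.contains(&target)
        || ALLOWED_PROVIDER_PREFIXES.iter().any(|p| target.starts_with(p))
}
```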
2e. ProviderBudgetTracker filter
Each candidate's provider key is checked against the two-window tumbling
tracker (hour bucket id YYYYMMDDHH, day bucket id YYYYMMDD):
| Verdict | Action |
|---|---|
| Exhausted | drop + routing: dropped provider-budget-exhausted candidate |
| NearExhaustion | keep; later score ×0.6 |
| Ok | keep |
If every candidate got dropped by either filter, the engine returns
CliDefault with a descriptive rationale and flags primary_available=false
so downstream logging knows this is a degraded fallback.
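The two tumbling-window bucket ids named above are pure formatting over the current wall-clock components; a minimal illustration (not the tracker's real API, which derives the components from the system clock):

```rust
/// Sketch: hour bucket id YYYYMMDDHH and day bucket id YYYYMMDD. A tumbling
/// window needs no eviction logic: when the clock rolls into a new bucket id,
/// the counter keyed by the old id simply stops being consulted.
fn bucket_ids(year: u32, month: u32, day: u32, hour: u32) -> (String, String) {
    let day_id = format!("{:04}{:02}{:02}", year, month, day);
    let hour_id = format!("{}{:02}", day_id, hour);
    (hour_id, day_id)
}
```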
2f. Score candidates
`confidence` comes from the match strength (KG rule priority, keyword match count, or 1.0 for explicit static config). `cost_penalty` is 0.0 under `NoPressure`, rising under `NearExhaustion` and `Exhausted` so expensive providers lose out when the monthly budget is nearly spent.
Then the engine multiplies by 0.6 for any candidate whose provider is
flagged NearExhaustion in the per-provider budget tracker, so a provider
at 80 %+ of its quota is still eligible but loses to a healthy alternative.
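The 2f scoring can be sketched like this. The exact way `cost_penalty` combines with the weighted confidence is an assumption (subtraction), as are the function shape and names, but the ×0.6 near-exhaustion multiplier and the `source_weight × confidence` base match the worked trace later in this document:

```rust
#[derive(Clone, Copy, PartialEq)]
enum BudgetVerdict {
    Ok,
    NearExhaustion,
    Exhausted,
}

/// Sketch of candidate scoring: weighted confidence minus the
/// budget-pressure cost penalty, then a 0.6 multiplier when the candidate's
/// provider is flagged NearExhaustion by the per-provider tracker.
fn pressured_score(
    source_weight: f64,
    confidence: f64,
    cost_penalty: f64,
    provider_verdict: BudgetVerdict,
) -> f64 {
    let base = source_weight * confidence - cost_penalty;
    match provider_verdict {
        BudgetVerdict::NearExhaustion => base * 0.6,
        _ => base,
    }
}
```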
2g. Telemetry adjustment
If TelemetryStore is wired (adf-fleet#15 path), each candidate's historical
(success_rate, latency_p95) is fetched and folded into the pressured score.
Low success rate or high latency further deprioritises.
2h. Pick the winner
Sort by pressured score descending and take [0]. The engine emits:
INFO model selected via KG tier routing agent=security-sentinel
concept=review_tier provider=anthropic model=haiku confidence=0.6

When the primary was dropped by unhealthy-providers, the orchestrator emits a
separate log from kg_router:
INFO KG routed to fallback (primary unhealthy)
agent=merge-coordinator concept=review_tier
provider=minimax model=minimax-coding-plan/MiniMax-M2.5
skipped_unhealthy=["anthropic", "openai"]

Stage 3: spawn_with_fallback
crates/terraphim_spawner/src/lib.rs —
spawn_with_fallback(&self, request: &SpawnRequest, ctx: SpawnContext).
SpawnRequest bundles:
- `primary_provider`, `primary_model` (selected in Stage 2)
- `fallback_provider`, `fallback_model` (from agent TOML)
- `task` (composed prompt: persona + skill chain + task envelope)
- `resource_limits` (RLIMIT_AS, RLIMIT_CPU, ulimit walls)
- `use_stdin` (whether to pipe the prompt in via stdin — true for long prompts to avoid ARG_MAX)
`match self.spawn_with_options(...).await { ... }` drives the attempt. Fallback selection is per-agent, not per-routing-decision. The routing
engine selects the best primary; the TOML-declared fallback is used only if
the primary provider physically fails to start (binary missing, arg error,
permission denied, etc.). For per-candidate fallback across the KG route
list, see kg_router::route_agent which iterates the route:: entries in
order and emits KG routed to fallback (primary unhealthy).
Stage 4: model_args picks the right CLI flag
crates/terraphim_spawner/src/config.rs:139:
The normaliser for claude CLI prepends claude- when a version suffix is
present. Bare aliases stay bare so they track the latest Anthropic release.
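The normaliser's behaviour as stated can be sketched like this. `normalise_claude_model` is the real function name (see Further reading), but its exact logic is paraphrased here from the two sentences above, and the version strings in the examples are hypothetical:

```rust
/// Sketch of the claude-model normaliser: bare aliases pass through
/// unchanged so they track the latest Anthropic release; an alias carrying
/// a version suffix gets the "claude-" prefix the CLI expects; anything
/// already prefixed is left alone.
fn normalise_claude_model(model: &str) -> String {
    let bare = ["sonnet", "opus", "haiku"];
    if bare.contains(&model) || model.starts_with("claude-") {
        model.to_string()
    } else if bare.iter().any(|b| model.starts_with(b)) {
        format!("claude-{}", model)
    } else {
        model.to_string()
    }
}
```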
Stage 5: subprocess launch
Tokio `Command` builder (reconstructed; local names are illustrative):

```rust
let mut cmd = Command::new(&cli_path);
cmd.args(model_args)              // --model haiku, ...
    .args(&extra_args)            // --allowedTools, --print, ...
    .envs(&ctx.env_overrides)     // ADF_PR_NUMBER, GITEA_TOKEN, ...
    .stdin(Stdio::piped())
    .stdout(Stdio::piped())
    .stderr(Stdio::piped())
    .spawn()?;
```

The ctx carries per-dispatch env overrides: GITEA_WORKING_DIR,
GITEA_OWNER, GITEA_REPO, GITEA_TOKEN, and for ReviewPr dispatches the
ADF_PR_NUMBER / ADF_PR_HEAD_SHA / ADF_PR_PROJECT / ADF_PR_AUTHOR /
ADF_PR_DIFF_LOC set. The composed prompt (persona + skill chain + task
envelope + mention context) is written to stdin by the spawner.
Trace line:
INFO spawning agent agent=security-sentinel layer=Safety
cli=/home/alex/.local/bin/claude model=Some("haiku")

Stage 6: exit classification feeds back into the next decision
When the child exits, ExitClassifier folds together the exit code and the
last 200 lines of stderr against ProviderErrorSignatures:
| Verdict | Signal | Effect |
|---|---|---|
| success | exit 0 | nothing |
| throttle | stderr regex matched from the provider's throttle list | trip CircuitBreaker for that provider; call ProviderBudgetTracker::force_exhaust() so routing drops this provider until the next window |
| flake | stderr regex matched from the flake list | no breaker trip; next dispatch retries |
| resource_exhaustion | matched_patterns=["oom"] / "killed" / "signal: 9" | count as soft failure; raises concern flag for fleet-meta |
| unknown | no pattern matched | count as soft failure; escalate pattern to fleet-meta for a human to classify |
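The fold in the table above can be sketched as follows. The real classifier matches regexes drawn from `ProviderErrorSignatures`; this sketch substitutes plain substring search for brevity, and the function signature is illustrative:

```rust
#[derive(Debug, PartialEq)]
enum ExitClass {
    Success,
    Throttle,
    Flake,
    ResourceExhaustion,
    Unknown,
}

/// Sketch of exit classification: exit code first, then the stderr tail
/// against the provider's throttle and flake signature lists, then the
/// resource-exhaustion markers, else Unknown.
fn classify_exit(
    exit_code: Option<i32>,
    stderr_tail: &str,
    throttle_sigs: &[&str],
    flake_sigs: &[&str],
) -> ExitClass {
    if exit_code == Some(0) {
        return ExitClass::Success;
    }
    if throttle_sigs.iter().any(|s| stderr_tail.contains(s)) {
        return ExitClass::Throttle;
    }
    if flake_sigs.iter().any(|s| stderr_tail.contains(s)) {
        return ExitClass::Flake;
    }
    for marker in ["oom", "killed", "signal: 9"] {
        if stderr_tail.contains(marker) {
            return ExitClass::ResourceExhaustion;
        }
    }
    ExitClass::Unknown
}
```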
Trace line:
INFO agent exit classified agent=security-sentinel exit_code=Some(0)
exit_class=success confidence=1.0 wall_time_secs=90.0

Stage 7: OutputPoster writes back to Gitea
crates/terraphim_orchestrator/src/output_poster.rs —
post_agent_output_for_project(project, agent_name, issue_number, output_lines, exit_code).
Resolves the per-project GiteaTracker, looks up the per-agent Gitea token
(agent_tokens.json) so the comment lands under the agent's own login, and
POSTs to /api/v1/repos/{owner}/{repo}/issues/{issue_number}/comments.
Known bug recently fixed (adf-fleet#44, PR #738)
Before commit 7cf60d2c, RepoComment::issue_number was extracted only
from issue_url. For comments on pull requests Gitea returns
pull_request_url instead, so every PR comment arrived with
issue_number = 0 and OutputPoster tried to post to /issues/0/comments —
500 from Gitea. The fix reads pull_request_url as a fallback. PRs share
the issue numeric namespace so the same trailing-segment extraction works
for both URLs.
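The trailing-segment extraction with the `pull_request_url` fallback can be sketched as below (signature and URL strings are illustrative, not the actual `RepoComment` fields):

```rust
/// Sketch of the adf-fleet#44 fix: prefer issue_url, fall back to
/// pull_request_url (Gitea sends the latter for PR comments), then parse
/// the trailing path segment. PRs share the issue numeric namespace, so
/// the same extraction works for both URL shapes.
fn extract_issue_number(
    issue_url: Option<&str>,
    pull_request_url: Option<&str>,
) -> Option<u64> {
    issue_url
        .or(pull_request_url)
        .and_then(|url| url.rsplit('/').next())
        .and_then(|segment| segment.parse().ok())
}
```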
Per-agent identity end-to-end (PR #741)
OutputPoster::has_own_token controls the wrapper comment's author.
OutputPoster::agent_token(project, name) exposes the raw token string
so build_spawn_context_for_agent can inject GITEA_TOKEN into the
spawned child's env. Together these close the attribution loop:
| Path | Token used | Gitea author |
|---|---|---|
| OutputPoster wrapper: "Agent X completed" | per-agent | X |
| Agent's own gtr comment in task shell | per-agent (via env override) | X |
| Agent lookup misses agent_tokens.json | project root token | root |
agent_tokens.json maps every agent name listed in conf.d/*.toml to a
Gitea personal access token. If the map is empty or the path is not set
on [projects.gitea], every agent on that project falls back to root.
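The lookup-with-root-fallback in the table above reduces to a map get with a default (a minimal sketch; the real lookup is keyed by (project, agent) and lives in `OutputPoster`):

```rust
use std::collections::HashMap;

/// Sketch: resolve the token a comment is posted with. A hit in the
/// agent_tokens.json map attributes the comment to the agent's own login;
/// a miss degrades to the project root token, so the author shows as root.
fn resolve_token<'a>(
    agent_tokens: &'a HashMap<String, String>,
    agent_name: &str,
    root_token: &'a str,
) -> &'a str {
    agent_tokens
        .get(agent_name)
        .map(String::as_str)
        .unwrap_or(root_token)
}
```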
Meta-coordinator as work dispatcher
On both digital-twins and terraphim-ai the meta-coordinator's task
envelope now follows the canonical scope-gate + dispatch pattern from
scripts/adf-setup/agents/meta-coordinator.toml:
- `gtr ready` → highest-PageRank unblocked issue
- Haiku scope-clarity check via `claude -p --model haiku --allowedTools ""` (pure text, no tool surface — prompt-injection safe)
- If unclear → `gtr comment` with a "needs more detail" note; skip if already posted in the last 24 h (idempotency)
- If clear → Haiku role classifier picks one of implementation-swarm, quality-coordinator, security-sentinel, compliance-watchdog, spec-validator, test-guardian, documentation-generator
- `gtr comment "@adf:<role> please pick up issue #N"` on the ready issue → the mention parser dispatches the named role at the next poll tick
Terraphim's previous fleet-health-report pattern (writing to
/opt/ai-dark-factory/reports/ and posting to issue #107) has been
replaced. Fleet-health reporting is now the job of fleet-meta
(cross-project) and the journal + Quickwit indices.
Example: trace from the journal
Security-sentinel got a model at 17:33:08 CEST on 2026-04-21:
17:33:08.232 provider probe complete providers_probed=11 healthy=11
17:33:08.244 model selected via KG tier routing agent=security-sentinel
concept=review_tier provider=anthropic model=haiku confidence=0.6
17:33:08.244 spawning agent agent=security-sentinel layer=Safety
cli=/home/alex/.local/bin/claude model=Some("haiku")
17:34:38.258 agent exit classified agent=security-sentinel exit_code=Some(0)
exit_class=success wall_time_secs=90.0

Reading the trace:
- Probe round completed, all 11 providers healthy, including the three `openai/*` variants (restored by PR #737) and `kimi-for-coding/k2p6`.
- Task keywords matched the `review_tier` KG concept (`trigger:: verify, validate, ...` among others). The `route::` list for review_tier has `anthropic, haiku` as the primary entry.
- C1 filter: `haiku` is bare, maps to the claude CLI, allowed.
- Budget filter: no near-exhaustion this hour.
- Score: `source_weight(KnowledgeGraph)=1.0 × confidence(0.6) = 0.6`. No telemetry penalty, no pressure penalty.
- Spawn: `claude --model haiku --allowedTools ... --print` with the composed prompt on stdin.
- 90 s later exit 0 → OutputPoster posts the verdict to the security-sentinel standing log on Gitea.
Knobs
| Config | File | Effect |
|---|---|---|
| tick_interval_secs | orchestrator.toml | reconcile cadence (default 30 s) |
| probe_ttl_secs | orchestrator.toml | how often ProviderHealthMap re-probes (default 1800 s) |
| [projects.mentions].poll_modulo | conf.d/<project>.toml | poll mentions every N ticks (default 2 → 60 s) |
| [projects.mentions].max_dispatches_per_tick | conf.d/<project>.toml | cap on dispatches per reconcile (default 3) |
| [post_merge_gate].max_test_duration_secs | orchestrator.toml | wall-clock budget for cargo test --workspace (default 600 s) |
| agent schedule | conf.d/<project>.toml | per-agent cron expression |
| agent model | conf.d/<project>.toml | static-config routing candidate |
| agent fallback_provider + fallback_model | conf.d/<project>.toml | spawn_with_fallback target when primary fails |
| C1 allow-list | config.rs:794 ALLOWED_PROVIDER_PREFIXES | recompile required |
| KG routing table | docs/taxonomy/routing_scenarios/adf/*.md | hot-reloaded each orchestrator start |
Further reading
- `crates/terraphim_orchestrator/src/control_plane/routing.rs` — the full decision engine
- `crates/terraphim_orchestrator/src/kg_router.rs` — KG table parser + per-route health fall-through
- `crates/terraphim_orchestrator/src/provider_probe.rs` — probe cadence and `ProviderHealthMap`
- `crates/terraphim_orchestrator/src/provider_budget.rs` — tumbling-window budgets + `BudgetVerdict`
- `crates/terraphim_orchestrator/src/error_signatures.rs` — per-provider stderr classifier
- `crates/terraphim_spawner/src/lib.rs` — spawn + fallback + resource limits
- `crates/terraphim_spawner/src/config.rs` — `model_args`, `normalise_claude_model`, API-key inference per CLI
- `crates/terraphim_orchestrator/src/output_poster.rs` — Gitea write-back
- `docs/runbooks/roc-v1-rollout.md` — operator view of the auto-review + auto-merge lifecycle these decisions feed into