Hybrid Code Search That Knows When to Ask an LLM
Published 2026-05-24
Terraphim Grep now does end-to-end hybrid search: ripgrep-fast code retrieval through fff-search, parallel knowledge-graph concept extraction, and -- when the local retrievers do not return enough signal -- an LLM synthesises a cited answer.
This post walks through the design decisions in the PR landing on
task/1743-terraphim-grep, why we did not write a single mock, and what the first
benchmark numbers tell us.
The Problem We Hit
The existing code search pipeline answered "where does this string appear" well. It did not answer "where is retry configured" -- a question that bridges a user's intent and the code structure. A grep hit list helps only if the user already knows the symbol.
Our V-model verification phase surfaced two defects when we tried to claim completion:
- D001: the CLI binary instantiated TerraphimGrep without ever wiring an LLM client. Any query that landed in the NeedsSynthesis branch -- which is most of them, because the default sufficiency thresholds require both KG hits and chunk diversity -- hard-errored with LlmNotConfigured.
- D005: even when no LLM was available, the grep refused to return partial results. Search-only mode existed on paper but not in code.
Both defects existed because we had been testing the data structures, not the user-visible behaviour. The verification report turned them up in the first 30 minutes.
Why We Wired build_llm_from_role, Not RouterBridgeLlmClient
Terraphim ships a capability-based router in terraphim_router. Its keyword router
extracts a Capability (CodeGeneration, Explanation, DeepThinking, ...) from a prompt;
its strategies (CostOptimized, CapabilityFirst, LatencyOptimized) pick a provider from a
registry. RouterBridgeLlmClient wraps the router so that any consumer of the
LlmClient trait gets routing transparently.
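To make the two stages concrete, here is a minimal sketch of keyword-based capability extraction followed by a cost-optimised provider pick. The enum variants reuse the names above, but the keyword lists, the Provider struct, the pricing field, and both function names are illustrative assumptions, not the terraphim_router API.

```rust
// Illustrative stand-ins only: none of these types come from terraphim_router.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Capability {
    CodeGeneration,
    Explanation,
    DeepThinking,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Provider {
    pub name: &'static str,
    pub capabilities: Vec<Capability>,
    pub cost_per_1k_tokens: f64, // hypothetical pricing field
}

/// Keyword stage: map a prompt to a capability. The keyword lists are assumptions.
pub fn extract_capability(prompt: &str) -> Option<Capability> {
    let p = prompt.to_lowercase();
    if ["implement", "write a function", "generate"].iter().any(|k| p.contains(*k)) {
        Some(Capability::CodeGeneration)
    } else if ["explain", "what does", "how does"].iter().any(|k| p.contains(*k)) {
        Some(Capability::Explanation)
    } else if ["prove", "trade-off", "reason about"].iter().any(|k| p.contains(*k)) {
        Some(Capability::DeepThinking)
    } else {
        None // no keyword matched: the caller decides the fallback
    }
}

/// Strategy stage, cost-optimised flavour: the cheapest provider that
/// advertises the requested capability.
pub fn pick_cost_optimized(registry: &[Provider], cap: Capability) -> Option<&Provider> {
    registry
        .iter()
        .filter(|p| p.capabilities.contains(&cap))
        .min_by(|a, b| a.cost_per_1k_tokens.total_cmp(&b.cost_per_1k_tokens))
}
```

Keeping the two stages as separate functions means a change in prompt wording can only affect the first one, which makes routing regressions easy to localise.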
There were two ways to plug grep into that:
1. Wire RouterBridgeLlmClient directly. Construct the provider registry inside grep.
2. Wire terraphim_service::llm::build_llm_from_role(&role). Let it decide whether to return a direct provider or a RouterBridgeLlmClient based on role.llm_router_enabled.
We chose (2). build_llm_from_role already owns the precedence rules -- explicit
provider, nested extra, genai-with-model, OpenRouter config, Ollama hints -- and aligns
grep with how the server, TUI, and RLM consume providers. Whether routing kicks in is a
role config decision, not a grep code decision. Grep itself never knows whether it
got back an OllamaClient or a RouterBridgeLlmClient. It just sees Arc<dyn LlmClient>.
The wiring is six lines:
let llm = build_llm_from_role;
let grep = new;
let grep = match llm ;
Everything else -- which model, which routing strategy, which fallback -- lives in the role JSON.
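For readers who want to see the shape of that wiring end to end, here is a self-contained sketch. Everything in it is a stand-in: the trait, the two client structs, the reduced Role, the Option return type of build_llm_from_role, and the with_llm builder are assumptions made for illustration, and the real Terraphim signatures may differ.

```rust
use std::sync::Arc;

// Stand-in trait and types: the real definitions live in the Terraphim crates
// and their signatures may differ from what is assumed here.
pub trait LlmClient: Send + Sync {
    fn name(&self) -> &'static str;
}

struct OllamaClient;
impl LlmClient for OllamaClient {
    fn name(&self) -> &'static str { "ollama" }
}

struct RouterBridgeLlmClient;
impl LlmClient for RouterBridgeLlmClient {
    fn name(&self) -> &'static str { "router-bridge" }
}

/// Hypothetical, heavily reduced role config.
pub struct Role {
    pub llm_provider: Option<String>,
    pub llm_router_enabled: bool,
}

/// Assumed shape: None when the role configures no LLM at all.
pub fn build_llm_from_role(role: &Role) -> Option<Arc<dyn LlmClient>> {
    if role.llm_router_enabled {
        return Some(Arc::new(RouterBridgeLlmClient));
    }
    match role.llm_provider.as_deref() {
        Some("ollama") => Some(Arc::new(OllamaClient)),
        _ => None,
    }
}

pub struct TerraphimGrep {
    llm: Option<Arc<dyn LlmClient>>,
}

impl TerraphimGrep {
    pub fn new() -> Self {
        Self { llm: None }
    }

    /// Hypothetical builder method.
    pub fn with_llm(mut self, llm: Arc<dyn LlmClient>) -> Self {
        self.llm = Some(llm);
        self
    }

    pub fn llm_name(&self) -> Option<&'static str> {
        self.llm.as_ref().map(|c| c.name())
    }
}

/// The wiring itself: grep only ever sees Arc<dyn LlmClient>.
pub fn wire_grep(role: &Role) -> TerraphimGrep {
    let llm = build_llm_from_role(role);
    let grep = TerraphimGrep::new();
    match llm {
        Some(client) => grep.with_llm(client),
        None => grep, // search-only mode
    }
}
```

Under these assumptions, flipping llm_router_enabled in the role changes which client grep receives without touching wire_grep.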
Graceful Degradation Is a Feature
Before D005, calling grep without an LLM produced this:
Error: Search failed
Caused by: LLM not configured: LLM client not configured
After D005, the same call returns the chunks the local retrievers found.
The principle: if you have the data, return the data. Synthesis is the enrichment, not the entry barrier. The CLI is now usable on a laptop with no API key and no Ollama.
force_rlm = true still fails fast when no LLM is configured -- if you explicitly asked
for synthesis, silently dropping back is worse than an error.
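The decision described above fits in one match. This is a sketch under stated assumptions: finish_search, SearchOutput, and the closure standing in for the LLM call are hypothetical names, and only the LlmNotConfigured error and the force_rlm flag come from the text.

```rust
#[derive(Debug, PartialEq)]
pub enum SearchError {
    LlmNotConfigured,
}

#[derive(Debug, PartialEq)]
pub struct SearchOutput {
    pub chunks: Vec<String>,
    pub answer: Option<String>, // None in search-only mode
}

/// `synthesize` stands in for the LLM call; None means no client is configured.
pub fn finish_search(
    chunks: Vec<String>,
    needs_synthesis: bool,
    force_rlm: bool,
    synthesize: Option<&dyn Fn(&[String]) -> String>,
) -> Result<SearchOutput, SearchError> {
    match synthesize {
        // An LLM is available: enrich when the judge asks for it or the caller forces it.
        Some(llm) if needs_synthesis || force_rlm => {
            let answer = llm(chunks.as_slice());
            Ok(SearchOutput { chunks, answer: Some(answer) })
        }
        // Synthesis was explicitly requested but cannot happen: fail fast.
        None if force_rlm => Err(SearchError::LlmNotConfigured),
        // Otherwise return whatever the local retrievers found.
        _ => Ok(SearchOutput { chunks, answer: None }),
    }
}
```

The only error path is the one the caller opted into; every other combination returns the chunks.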
No Mocks: Four Test Layers, All Real
The project's no-mocks rule (in CLAUDE.md: "never use mocks in tests") forced a more
honest test pyramid:
- L1 unit (inline, no network). Prompt assembly, signature parsing, sufficiency routing decisions, and the new graceful-degrade path against a real fff-search scan of a tempdir corpus.
- L2 router-capability (no network). Feed real grep synthesis prompts to a real terraphim_router::Router with two registered providers. Assert which capability was extracted and which provider won. Three tests cover explanation queries, code implementation queries, and the no-keyword fallback. This catches the case where changing grep's prompt wording silently breaks routing.
- L3 e2e smoke (#[ignore], live OpenRouter free model). Full fff → KG → sufficiency → LLM → citations against liquid/lfm-2.5-1.2b-instruct:free -- 1.2B parameters, sub-second responses, zero cost. Borrowed the is_account_issue() pattern from docs/OPENROUTER_TESTING_PLAN.md so 401/403/429 errors degrade to a skip rather than a failure.
- L4 quality (#[ignore], manual). Same shape as L3 but pointed at qwen/qwen3-coder:free for stronger answers on the code-specialised use case. Run before release, not on every CI pass.
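The skip-on-account-issue idea from the L3 layer can be sketched as a small classifier. The signature of is_account_issue is an assumption (the helper in the testing plan may take an error type rather than a bare status code), and LiveOutcome and classify_live_call are hypothetical names introduced for illustration.

```rust
/// Assumed signature: the helper in the testing plan may take an error type
/// rather than a bare HTTP status code.
pub fn is_account_issue(status: u16) -> bool {
    matches!(status, 401 | 403 | 429)
}

#[derive(Debug, PartialEq)]
pub enum LiveOutcome {
    Passed,
    Skipped(u16),
    Failed(String),
}

/// Classify the result of one live call.
/// `Err((status, body))` is a stand-in for an HTTP error from the provider.
pub fn classify_live_call(result: Result<String, (u16, String)>) -> LiveOutcome {
    match result {
        Ok(answer) if !answer.trim().is_empty() => LiveOutcome::Passed,
        Ok(_) => LiveOutcome::Failed("empty answer".to_string()),
        // Missing key, forbidden, or rate limited: not a defect in grep, so skip.
        Err((status, _)) if is_account_issue(status) => LiveOutcome::Skipped(status),
        Err((status, body)) => LiveOutcome::Failed(format!("HTTP {status}: {body}")),
    }
}
```

A live test would return early on Skipped and panic only on Failed, so an exhausted free-tier quota does not turn the suite red.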
The free OpenRouter models we settled on:
| Capability target | Model | Reason |
|---|---|---|
| Code-aware synthesis | qwen/qwen3-coder:free | Code-specialised |
| Multi-chunk reasoning | meta-llama/llama-3.3-70b-instruct:free | Strong general |
| Fast CI smoke | liquid/lfm-2.5-1.2b-instruct:free | 1.2B params, lowest quota burn |
OpenRouter's free tier caps at 20 req/min and 200 req/day (1000 if you have ever deposited
$10), so we keep live tests behind #[ignore] and run one live call per capability rather
than repeating the same call in multiple tests.
What we explicitly did not do:
- No hand-rolled MockLlmClient. It would only prove the trait wiring, which cargo check already proves.
- No wiremock. The recorded responses go stale and the dependency adds nothing.
- No deterministic-temperature assertions on response text. Even at temperature 0, model upgrades break those tests.
Benchmarks: Where Does the Time Go?
Three criterion groups in crates/terraphim_grep/benches/hybrid_search.rs:
- code_only -- fff-search alone, no KG, against 10/100/500 file corpora
- hybrid_with_kg -- parallel fff + KG concept extraction, varying the thesaurus from 10 to 10,000 terms
- fuse_and_rank -- isolated sort/rank cost across chunk batches from 10 to 10,000
The first finding from hybrid_with_kg:
hybrid_with_kg/thesaurus_terms/10 3.2664 ms 30.6 Kelem/s
hybrid_with_kg/thesaurus_terms/100 3.4457 ms 29.0 Kelem/s
hybrid_with_kg/thesaurus_terms/1000 3.2413 ms 30.9 Kelem/s
hybrid_with_kg/thesaurus_terms/10000 3.2126 ms 31.1 Kelem/s
Hybrid latency stays flat at ~3.2 ms across three orders of magnitude of thesaurus size. The parallel fff scan dominates wall-clock; the KG search is fast enough to vanish in the noise. This is a useful baseline: if anyone proposes KG-pruning as a perf win, the bench will tell them whether the optimisation actually moves the needle.
fuse_and_rank scales slightly worse than linearly with chunk count (roughly 12-13x more
time per 10x more chunks), as expected for an O(n log n) comparison sort:
fuse_and_rank/chunks/10 445 ns 22.4 Melem/s
fuse_and_rank/chunks/100 5.38 us 18.6 Melem/s
fuse_and_rank/chunks/1000 65.0 us 15.4 Melem/s
fuse_and_rank/chunks/10000 849 us 11.8 Melem/s
Throughput drops at scale because the sort fights cache locality, but absolute latency stays well below the network round-trip budget.
Your Knowledge Tops the Results
fff-search returns a uniform relevance_score = 1.0 per match. Sorting by that score
is meaningless; without an ordering signal you would be back to "thirty hits, read them
all." We added a KG-aware boost so the chunks whose source path or content matches
your thesaurus concepts move to the top.
The shape: for each chunk we lowercase the source path and content once, then for each matched
concept we check whether its name (or display_value) appears in either. Matching
concepts contribute their normalised score; unmatched ones do not. The boost is added
to the chunk's existing relevance_score and the JSON output shows the boosted number,
so downstream tools can see why a chunk ranked where it did.
Default weight is 1.0: a fully-KG-matched chunk roughly doubles its score versus an
unmatched chunk. The unit tests pin this contract down.
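A minimal sketch of the boost as described, not the code in the PR. The Chunk and Concept structs and the apply_kg_boost name are hypothetical, and "normalised" is assumed to mean each concept's score divided by the total score of all matched concepts, so that a chunk matching every concept gains exactly the weight.

```rust
#[derive(Debug, Clone)]
pub struct Chunk {
    pub source_path: String,
    pub content: String,
    pub relevance_score: f64,
}

#[derive(Debug, Clone)]
pub struct Concept {
    pub name: String,
    pub display_value: Option<String>,
    pub score: f64,
}

/// Adds `weight * sum(normalised score of concepts found in the chunk)` to each
/// chunk, then sorts by descending score.
/// Assumption: normalised = concept score / total score of all matched concepts.
pub fn apply_kg_boost(chunks: &mut [Chunk], concepts: &[Concept], weight: f64) {
    let total: f64 = concepts.iter().map(|c| c.score).sum();
    // Lowercase the needles once, outside the per-chunk loop.
    let needles: Vec<(String, Option<String>, f64)> = concepts
        .iter()
        .map(|c| {
            let norm = if total > 0.0 { c.score / total } else { 0.0 };
            (
                c.name.to_lowercase(),
                c.display_value.as_ref().map(|d| d.to_lowercase()),
                norm,
            )
        })
        .collect();

    for chunk in chunks.iter_mut() {
        // Lowercase path and content once per chunk.
        let path = chunk.source_path.to_lowercase();
        let body = chunk.content.to_lowercase();
        let hit = |needle: &str| {
            !needle.is_empty() && (path.contains(needle) || body.contains(needle))
        };
        let mut boost = 0.0;
        for (name, display, norm) in &needles {
            let matched = hit(name.as_str()) || display.as_deref().map_or(false, |d| hit(d));
            if matched {
                boost += *norm; // unmatched concepts contribute nothing
            }
        }
        chunk.relevance_score += weight * boost;
    }
    // Stable sort: chunks with equal scores keep their retrieval order.
    chunks.sort_by(|a, b| b.relevance_score.total_cmp(&a.relevance_score));
}
```

With weight 1.0 and a uniform base score of 1.0, a chunk that matches every concept ends at 2.0 and an unmatched chunk stays at 1.0.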
And the kg_boost_overhead benchmark group quantifies the cost. First numbers, 1000
chunks against varying concept counts:
kg_boost_overhead/concepts/0 49.7 us sort only, no concepts
kg_boost_overhead/concepts/10 478 us
kg_boost_overhead/concepts/100 3.85 ms
kg_boost_overhead/concepts/1000 36.0 ms
The cost grows linearly with chunks * concepts (one substring search per pair). At
typical grep scale -- say 50 chunks and a handful of KG concepts matched per query --
the boost adds under 25 microseconds to a 3.2 ms hybrid search. Less than 1% overhead.
The 1000-concept case (36 ms) is pathological -- it would require every term in the thesaurus to have matched the query, which does not happen in practice. The bench exists so future regressions on the algorithm get caught.
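The overhead estimate can be reproduced from the bench rows above. The sketch below subtracts the sort-only baseline, derives a per-pair cost, and extrapolates linearly. Reading "a handful" as five concepts is an assumption, and both function names are illustrative.

```rust
/// Per-(chunk, concept) cost in nanoseconds implied by one bench row,
/// after subtracting the sort-only baseline. Inputs are in microseconds.
pub fn per_pair_ns(total_us: f64, baseline_us: f64, chunks: u64, concepts: u64) -> f64 {
    (total_us - baseline_us) * 1_000.0 / (chunks * concepts) as f64
}

/// Linear extrapolation of the boost cost, in microseconds.
/// Assumes cost is proportional to chunks * concepts, as the bench suggests.
pub fn estimated_boost_us(per_pair_ns: f64, chunks: u64, concepts: u64) -> f64 {
    per_pair_ns * (chunks * concepts) as f64 / 1_000.0
}
```

Feeding in the 10-concept row (478 us total, 49.7 us baseline, 1000 chunks) gives about 43 ns per pair, so 50 chunks against 5 concepts costs roughly 11 us, comfortably inside the 25 us figure and well under 1% of a 3.2 ms search.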
What Did Not Change
- The fff-search integration itself. It was already correct -- proven by fff_search::file_picker log lines on every run.
- The sufficiency judge thresholds. The defaults (min_coverage 0.7, min_kg_confidence 0.5, min_diversity 2, min_results 3) still steer most queries into NeedsSynthesis. Tuning those is a separate conversation -- the bench gives us the latency data to do it on.
What This Unlocks
A grep that knows when it is not sure and asks for help. The CLI works without any LLM configured (search-only mode). With OpenRouter or Ollama in the role config, it routes the synthesis call by capability through the existing router infrastructure -- same strategy as the rest of the platform. No new orchestration code, no new model hardcoding, no mocks in the test pyramid.
The architectural payoff is small but real: every consumer of LlmClient in the
codebase now goes through the same build_llm_from_role entry point. When someone adds
a new provider (Claude, Gemini, a self-hosted model), it shows up in grep automatically.
When someone changes the routing strategy, it changes everywhere at once.
That is the test that matters: not "does this PR work" but "did we make the next PR easier."
Try It
# Search-only mode (no LLM required)
# With OpenRouter synthesis