Research Document: terraphim-grep — Intelligent Hybrid Grep
Status: Draft
Author: Agent
Date: 2026-05-23
Gitea Issue: #1743
Research Phase: Phase 1 of Disciplined Development
Executive Summary
terraphim-grep is an intelligent grep tool that uses hybrid search (FFF + ripgrep + KG) for fast deterministic results, falling back to RLM only when needed. The system learns from interactions by writing new concepts back to the knowledge graph, improving future searches. This is the inverse of rlmgrep's "brute force LLM" approach.
Essential Questions Check
| Question | Answer | Evidence |
|----------|--------|----------|
| Energizing? | Yes | Leverages existing Terraphim infrastructure (RoleGraph, HaystackProvider, KgPathScorer) |
| Leverages strengths? | Yes | Terraphim has fast Aho-Corasick search, KG curation, RLM sandboxing already built |
| Meets real need? | Yes | rlmgrep is expensive ($0.01-0.05/query) and slow (15-30s); terraphim-grep targets $0.001/query at 0.1-5s |
Proceed: Yes (3/3 YES)
Problem Statement
Description
Code search tools either use simple regex (no semantic understanding) or load entire corpuses into LLM context (expensive, slow). Neither approach learns from interactions.
Impact
- Developers waste time crafting regex patterns
- LLM-based search is prohibitively expensive at scale ($0.01-0.05 per query)
- No learning between queries means repeated work
Success Criteria
- 80-90% of queries answered by hybrid search alone (< 0.5s, $0.0001)
- RLM fallback only when search insufficient (< 5s, $0.005)
- KG learns new concepts from RLM interactions
- rlmgrep-compatible CLI interface
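The cost criteria above can be combined into a single expected cost per query. The sketch below assumes the per-path figures ($0.0001 for search-only, $0.005 for an RLM fallback) are upper bounds and that the RLM figure already includes the preceding search; both function names are illustrative, not part of any existing crate.

```rust
/// Expected cost per query when a fraction `hit_rate` of queries is answered
/// by hybrid search alone and the remainder falls back to the RLM.
fn blended_cost(hit_rate: f64, search_cost: f64, rlm_cost: f64) -> f64 {
    hit_rate * search_cost + (1.0 - hit_rate) * rlm_cost
}

/// Hit rate at which the blended cost equals `target`.
/// Assumes `rlm_cost > search_cost`, otherwise the result is meaningless.
fn break_even_hit_rate(target: f64, search_cost: f64, rlm_cost: f64) -> f64 {
    (rlm_cost - target) / (rlm_cost - search_cost)
}
```

Under these assumptions an 80% hit rate gives roughly $0.00108 per query and a 90% hit rate gives roughly $0.00059, so the $0.001 average target is only met once about 82% of queries are resolved without the RLM.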
Current State Analysis
Existing Implementation
| Component | Location | Purpose |
|-----------|----------|---------|
| LLM Bridge (stubbed) | crates/terraphim_rlm/src/llm_bridge.rs | HTTP bridge for VM-to-host LLM calls; returns LlmNotConfigured without llm feature |
| RoleGraph | crates/terraphim_rolegraph/src/lib.rs | Knowledge graph with Aho-Corasick O(n) concept detection + TF-IDF fallback |
| HaystackProvider trait | crates/haystack_core/src/lib.rs | Uniform async search interface over heterogeneous backends |
| KgPathScorer | crates/terraphim_file_search/src/kg_scorer.rs | Scores files by KG concept matches in path; implements ExternalScorer for fff-search |
| LLM Client trait | crates/terraphim_service/src/llm.rs | LlmClient trait with chat_completion, summarize methods |
| CLI structure | crates/terraphim_cli/src/main.rs | Existing clap-based CLI with Search, Find, Extract commands |
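To illustrate the idea behind path-based scoring, here is a minimal sketch that counts how many thesaurus terms occur in a file path. It is not the KgPathScorer implementation: plain substring matching stands in for the Aho-Corasick automaton, and `score_path` and `rank_paths` are hypothetical names.

```rust
/// Count how many thesaurus terms occur in a file path (case-insensitive).
/// Substring search is a stand-in for an Aho-Corasick automaton.
fn score_path(path: &str, terms: &[&str]) -> usize {
    let lower = path.to_lowercase();
    terms
        .iter()
        .filter(|t| lower.contains(&t.to_lowercase()))
        .count()
}

/// Pair each path with its score and order by descending score.
/// The sort is stable, so ties keep their input order.
fn rank_paths<'a>(paths: &[&'a str], terms: &[&str]) -> Vec<(&'a str, usize)> {
    let mut scored: Vec<(&'a str, usize)> = paths
        .iter()
        .map(|p| (*p, score_path(p, terms)))
        .collect();
    scored.sort_by(|a, b| b.1.cmp(&a.1));
    scored
}
```

A real scorer would build the automaton once and reuse it across paths; the repeated lowercasing here is only acceptable for a sketch.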
Integration Points
- terraphim_rlm uses terraphim_service::llm::LlmClient when llm feature enabled
- fff-search (external crate) provides file discovery with ExternalScorer trait
- KgPathScorer reads thesaurus and runs Aho-Corasick via terraphim_automata::find_matches
Data Flow (Current)
Query → terraphim_cli → CliService → RoleGraph.search() → Ranked results
→ HaystackProvider.search() (optional)
Constraints
Technical Constraints
- LLM Bridge is stubbed: crates/terraphim_rlm/src/llm_bridge.rs:233 returns LlmNotConfigured without llm feature
- Edition 2024: Workspace uses Rust edition 2024
- No existing hybrid orchestrator: Need to build HybridSearcher from scratch
- FFF external dependency: fff-search is an external git dependency
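The stubbed-bridge constraint can be pictured as a feature-gated function. This is a sketch only: `query_llm` and `answer` are hypothetical, and only the `LlmNotConfigured` name and the `llm` feature name come from this document. Compiled without the feature, the call returns the error, and a caller can degrade to search-only output.

```rust
#[derive(Debug, PartialEq)]
enum BridgeError {
    /// Name taken from the document; the real definition may differ.
    LlmNotConfigured,
}

/// Hypothetical stand-in for the bridge's query method.
fn query_llm(prompt: &str) -> Result<String, BridgeError> {
    if cfg!(feature = "llm") {
        // A real build would delegate to an LlmClient here.
        Ok(format!("(reply to: {prompt})"))
    } else {
        Err(BridgeError::LlmNotConfigured)
    }
}

/// Fall back to the raw search hits while the bridge is not configured.
fn answer(prompt: &str, search_hits: &[&str]) -> String {
    match query_llm(prompt) {
        Ok(reply) => reply,
        Err(BridgeError::LlmNotConfigured) => search_hits.join("\n"),
    }
}
```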
Business Constraints
- 12-day implementation roadmap (from issue #1743)
- Must maintain rlmgrep compatibility for CLI interface
Non-Functional Requirements
| Requirement | Target | Current |
|-------------|--------|---------|
| Cost per query | < $0.001 avg | N/A (doesn't exist) |
| Latency (search-only) | < 0.5s | N/A |
| Latency (with RLM) | < 5s | N/A |
| Learning | KG updates after RLM | N/A |
Vital Few (Essentialism)
Essential Constraints (Max 3)
| Constraint | Why Vital | Evidence |
|------------|-----------|----------|
| LLM bridge must work | RLM fallback requires real LLM calls | llm_bridge.rs:233 currently returns error |
| Hybrid search must be fast | Core value prop is cost/latency | Need parallel search across haystacks |
| KG must update after RLM | Learning is the key differentiator | RLM extracts → RoleGraph.add_concept() |
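The "parallel search across haystacks" constraint amounts to a fan-out and merge. The sketch below uses scoped threads from the standard library as a stand-in for the async backends, which would more likely be combined with tokio::join!; `search_all` and the function-pointer backend type are assumptions made for illustration.

```rust
use std::thread;

/// Run every backend on its own scoped thread, then merge, sort, and
/// de-duplicate the results. Backends are plain functions in this sketch.
fn search_all(query: &str, backends: &[fn(&str) -> Vec<String>]) -> Vec<String> {
    let mut merged: Vec<String> = thread::scope(|s| {
        // Spawn all threads first so the backends actually run concurrently.
        let handles: Vec<_> = backends
            .iter()
            .map(|backend| s.spawn(move || backend(query)))
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("backend thread panicked"))
            .collect()
    });
    merged.sort();
    merged.dedup();
    merged
}
```

With this shape, total latency is bounded by the slowest backend instead of the sum of all backends, which is the property the latency target depends on.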
Eliminated from Scope
| Item | Why Eliminated |
|------|----------------|
| Firecracker VM integration | Not needed for CLI tool; Docker backend sufficient |
| MCP server integration | Future phase |
| Multi-modal ingestion (PDF/images) | Port from rlmgrep later |
| Streaming output | Not in initial scope |
Dependencies
Internal Dependencies
| Dependency | Impact | Risk |
|------------|--------|------|
| terraphim_service::llm::LlmClient | RLM fallback needs chat_completion | Low - trait exists |
| terraphim_rolegraph::RoleGraph | KG reads/writes | Low - well-tested |
| terraphim_file_search::KgPathScorer | Path-based scoring | Low - implements ExternalScorer |
| fff_search | File discovery | Medium - external crate |
External Dependencies
| Dependency | Version | Risk | Alternative |
|------------|---------|------|-------------|
| fff-search | git branch | Medium - external | ripgrep directly |
| genai (LLM) | git fork | Medium | OpenRouter direct, Ollama |
Risks and Unknowns
Known Risks
| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| LLM bridge integration complexity | Medium | High | Phase 1 focuses on this; reuse terraphim_service LlmClient |
| External fff-search crate instability | Low | Medium | Implement fallback to ripgrep directly |
| KG automaton rebuild cost | Low | Low | Background task, not blocking |
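One way to keep the automaton rebuild cost low is to batch new concepts and rebuild only when enough have accumulated. The sketch below is a hypothetical `CurationBuffer`, not existing code; moving concepts into `indexed` stands in for the actual rebuild, and the batch size is a tuning parameter that would need measuring.

```rust
/// Hypothetical curation buffer: concepts accumulate in `pending` and are
/// indexed together once `batch_size` new concepts are queued.
struct CurationBuffer {
    pending: Vec<String>,
    indexed: Vec<String>,
    batch_size: usize,
    rebuilds: usize,
}

impl CurationBuffer {
    fn new(batch_size: usize) -> Self {
        Self { pending: Vec::new(), indexed: Vec::new(), batch_size, rebuilds: 0 }
    }

    /// Queue a concept; returns true when this call triggered a rebuild.
    /// Concepts that are already indexed or pending are ignored.
    fn push(&mut self, concept: &str) -> bool {
        let known = self
            .indexed
            .iter()
            .chain(self.pending.iter())
            .any(|c| c == concept);
        if known {
            return false;
        }
        self.pending.push(concept.to_string());
        if self.pending.len() >= self.batch_size {
            self.flush();
            return true;
        }
        false
    }

    /// Move pending concepts into the index; stands in for the rebuild.
    fn flush(&mut self) {
        if self.pending.is_empty() {
            return;
        }
        self.indexed.append(&mut self.pending);
        self.rebuilds += 1;
    }
}
```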
Open Questions
- Sufficiency Judge threshold tuning - How to determine "sufficient" vs "needs RLM"? Need empirical tuning.
- Context window management - How to limit retrieved chunks to avoid LLM context overflow?
- KG curation rate limiting - How often to update automata? (Every interaction vs batched)
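For the context window question, one possible starting point is a greedy selection of the highest-scoring chunks under a token budget. The sketch assumes roughly four characters per token, which is a crude estimate and not a tokenizer; `Chunk`, `estimate_tokens`, and `select_within_budget` are illustrative names.

```rust
/// A retrieved chunk with a relevance score (higher is better).
#[derive(Debug, Clone, PartialEq)]
struct Chunk {
    text: String,
    score: f64,
}

/// Rough token estimate: about 4 characters per token, rounded up.
fn estimate_tokens(text: &str) -> usize {
    (text.chars().count() + 3) / 4
}

/// Greedily keep the best-scoring chunks that still fit within `budget`.
/// A chunk that does not fit is skipped, and smaller ones may still be added.
fn select_within_budget(mut chunks: Vec<Chunk>, budget: usize) -> Vec<Chunk> {
    chunks.sort_by(|a, b| b.score.total_cmp(&a.score));
    let mut used = 0;
    let mut kept = Vec::new();
    for chunk in chunks {
        let cost = estimate_tokens(&chunk.text);
        if used + cost <= budget {
            used += cost;
            kept.push(chunk);
        }
    }
    kept
}
```

Greedy selection is not optimal in the knapsack sense, but it is predictable and cheap, which may matter more on the search-only fast path.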
Assumptions Explicitly Stated
| Assumption | Basis | Risk if Wrong | Verified? |
|------------|-------|---------------|-----------|
| terraphim_service::llm::LlmClient.chat_completion is sufficient | It's the existing LLM interface | RLM needs different interface | No - may need wrapper |
| RoleGraph.add_concept() is safe for concurrent access | Uses Arc<Mutex<RoleGraph>> in existing code | Data corruption | Partially - existing patterns suggest it's safe |
| FFF search can be called in parallel with KG search | Both are async | Performance not gained | Yes - tokio::join! works |
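The concurrency assumption can be checked in miniature. The sketch below uses a hypothetical `ToyGraph` in place of RoleGraph and shows that writers serialized through `Arc<Mutex<_>>` cannot corrupt the concept set; it says nothing about lock contention or about updates that span several calls, which would still need verifying against the real type.

```rust
use std::collections::BTreeSet;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for RoleGraph: only the concept set is modeled.
#[derive(Default)]
struct ToyGraph {
    concepts: BTreeSet<String>,
}

impl ToyGraph {
    /// Returns true if the concept was not present before.
    fn add_concept(&mut self, name: &str) -> bool {
        self.concepts.insert(name.to_string())
    }
}

/// Add each name from its own thread and return the final concept count.
fn add_concurrently(names: Vec<String>) -> usize {
    let graph = Arc::new(Mutex::new(ToyGraph::default()));
    let handles: Vec<_> = names
        .into_iter()
        .map(|name| {
            let graph = Arc::clone(&graph);
            thread::spawn(move || {
                // The guard is held only for the duration of one insert.
                let mut guard = graph.lock().expect("graph mutex poisoned");
                guard.add_concept(&name);
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("writer thread panicked");
    }
    let count = graph.lock().expect("graph mutex poisoned").concepts.len();
    count
}
```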
Research Findings
Key Insights
- LLM bridge is the critical path - Without it, RLM fallback cannot work. Must enable llm feature and configure real client.
- Hybrid search already has components - KgPathScorer, HaystackProvider, RoleGraph all exist; need orchestration.
- CLI structure exists - terraphim_cli can be extended with new Grep command.
- Structured signatures don't exist - Need new RlmSignature trait for typed outputs.
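Since the RlmSignature trait does not exist yet, the following is only one possible shape for it: a typed input that renders a prompt and an associated output type parsed from the raw reply. Every name other than `RlmSignature` is hypothetical, and the `path:line` reply format is an assumption chosen to keep the example small.

```rust
/// One possible shape for the proposed trait (not an existing API).
trait RlmSignature {
    type Output;
    fn render_prompt(&self) -> String;
    fn parse_output(raw: &str) -> Result<Self::Output, String>;
}

/// Hypothetical signature: ask where a symbol is defined.
struct FindDefinitions {
    symbol: String,
}

#[derive(Debug, PartialEq)]
struct Location {
    path: String,
    line: u32,
}

impl RlmSignature for FindDefinitions {
    type Output = Vec<Location>;

    fn render_prompt(&self) -> String {
        format!("List definitions of {} as one path:line per line.", self.symbol)
    }

    /// Parse non-empty lines of the form `path:line`; any bad line is an error.
    fn parse_output(raw: &str) -> Result<Vec<Location>, String> {
        raw.lines()
            .map(str::trim)
            .filter(|l| !l.is_empty())
            .map(|l| -> Result<Location, String> {
                let (path, line) = l
                    .rsplit_once(':')
                    .ok_or_else(|| format!("missing ':' in {l:?}"))?;
                let line = line
                    .parse::<u32>()
                    .map_err(|e| format!("bad line number in {l:?}: {e}"))?;
                Ok(Location { path: path.to_string(), line })
            })
            .collect()
    }
}
```

Failing the whole parse on one malformed line is a deliberate choice in this sketch: it makes an unusable reply visible to the caller instead of silently returning partial results.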
Relevant Prior Art
- rlmgrep: Loads ALL files into LLM context; $0.01-0.05/query; no learning
- ripgrep: Fast regex search; no semantic understanding
- GitHub Copilot Chat: LLM-based but not searchable; no learning
Technical Spikes Needed
| Spike | Purpose | Estimated Effort |
|-------|---------|------------------|
| LLM bridge with terraphim_service | Verify chat_completion works for RLM queries | 4 hours |
| Hybrid search parallelization | Verify tokio::join! scales across 3 haystacks | 2 hours |
| Sufficiency judge heuristics | Test coverage/diversity thresholds | 2 hours |
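As a starting point for the sufficiency judge spike, coverage and diversity could be defined as below. These definitions are assumptions: coverage is the fraction of query terms found in at least one snippet, diversity is the number of distinct files with hits, and the thresholds are parameters to be tuned empirically.

```rust
use std::collections::HashSet;

/// A search hit reduced to what the heuristic needs.
struct Hit<'a> {
    path: &'a str,
    snippet: &'a str,
}

#[derive(Debug, PartialEq)]
enum Verdict {
    Sufficient,
    NeedsRlm,
}

/// Fraction of query terms found in at least one snippet (case-insensitive).
fn coverage(query_terms: &[&str], hits: &[Hit<'_>]) -> f64 {
    if query_terms.is_empty() {
        return 0.0;
    }
    let covered = query_terms
        .iter()
        .filter(|t| {
            let t = t.to_lowercase();
            hits.iter().any(|h| h.snippet.to_lowercase().contains(&t))
        })
        .count();
    covered as f64 / query_terms.len() as f64
}

/// Number of distinct files contributing hits.
fn diversity(hits: &[Hit<'_>]) -> usize {
    hits.iter().map(|h| h.path).collect::<HashSet<_>>().len()
}

/// Search results are sufficient only when both thresholds are met.
fn judge(query_terms: &[&str], hits: &[Hit<'_>], min_coverage: f64, min_files: usize) -> Verdict {
    if coverage(query_terms, hits) >= min_coverage && diversity(hits) >= min_files {
        Verdict::Sufficient
    } else {
        Verdict::NeedsRlm
    }
}
```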
Recommendations
Proceed/No-Proceed
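Proceed - The core building blocks (RoleGraph, HaystackProvider, KgPathScorer, LlmClient) exist; the remaining work is orchestration, integration, and enabling the stubbed LLM bridge.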
Scope Recommendations
- Create new crate terraphim_grep for the hybrid search + RLM logic
- Extend existing terraphim_cli with grep subcommand
- Phase 1 (LLM bridge) is prerequisite for all other phases
Risk Mitigation Recommendations
- Start with LLM bridge verification - Don't build hybrid search on broken LLM
- Use existing LlmClient from terraphim_service - Don't reinvent HTTP client
- Test sufficiency judge with real queries - Heuristics may need tuning
Next Steps
If approved:
- Phase 2: Create implementation plan with file-level changes
- Step 1: Add grep subcommand to terraphim_cli
- Step 2: Create HybridSearcher orchestrating existing components
- Step 3: Implement SufficiencyJudge
- Step 4: Wire RLM via terraphim_service::llm::LlmClient
- Step 5: Add RlmSignature trait and implementations
- Step 6: Implement KG curation loop
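The steps above imply a control flow of search, judge, optional RLM fallback, then curation. The sketch below makes that flow concrete by passing each stage in as a closure; `run_query`, `Answer`, and the closure signatures are assumptions for illustration and do not reflect the planned HybridSearcher or SufficiencyJudge interfaces.

```rust
/// Outcome of one query, recording which path produced the answer.
#[derive(Debug, PartialEq)]
enum Answer {
    FromSearch(Vec<String>),
    FromRlm(String),
}

/// Hypothetical orchestration of the four stages.
/// `rlm` returns the answer plus any new concepts it extracted.
fn run_query(
    query: &str,
    search: impl Fn(&str) -> Vec<String>,
    is_sufficient: impl Fn(&[String]) -> bool,
    rlm: impl Fn(&str, &[String]) -> (String, Vec<String>),
    learned: &mut Vec<String>,
) -> Answer {
    let hits = search(query);
    if is_sufficient(&hits) {
        // Fast path: no LLM call and nothing new to learn.
        return Answer::FromSearch(hits);
    }
    let (answer, new_concepts) = rlm(query, &hits);
    // Curation: concepts found by the RLM are written back for future searches.
    learned.extend(new_concepts);
    Answer::FromRlm(answer)
}
```

Keeping the curation step inside the fallback branch reflects the stated goal that the knowledge graph learns only from RLM interactions.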
Appendix
Code Location Map
terraphim-grep/
├── CLI Extension
│   └── crates/terraphim_cli/src/main.rs  # Add Grep subcommand
│
├── Core Logic (new crate: terraphim_grep)
│   └── crates/terraphim_grep/src/
│       ├── lib.rs  # Module root
│       ├── hybrid_searcher.rs  # Parallel search orchestration
│       ├── sufficiency_judge.rs  # Heuristic + LLM judge
│       ├── rlm_context.rs  # Context building for RLM
│       ├── signatures.rs  # RlmSignature trait
│       ├── kg_curation.rs  # RLM → KG feedback loop
│       └── error.rs  # Error types
│
├── Integration Points
│   ├── crates/terraphim_rlm/src/llm_bridge.rs  # LLM bridge (needs llm feature)
│   ├── crates/terraphim_rolegraph/src/lib.rs  # RoleGraph (reads/writes)
│   ├── crates/terraphim_file_search/src/kg_scorer.rs  # KgPathScorer
│   └── crates/terraphim_service/src/llm.rs  # LlmClient trait
│
└── Research Reference
    └── docs/research-terraphim-intelligent-grep.md  # Original research
Reference Materials
- crates/terraphim_rlm/src/llm_bridge.rs:199-261 - LLM query method
- crates/terraphim_rolegraph/src/lib.rs:1-200 - RoleGraph architecture
- crates/haystack_core/src/lib.rs:8-13 - HaystackProvider trait
- crates/terraphim_file_search/src/kg_scorer.rs:45-70 - KgPathScorer scoring
- crates/terraphim_service/src/llm.rs:32-67 - LlmClient trait