# Duplicate Handling in Terraphim AI

## Overview
This document explains how Terraphim AI handles duplicate results when searching across multiple haystacks (data sources), particularly when using combinations like QueryRs and GrepApp for code search.
## Current Behavior

### How Results Are Merged

When a role has multiple haystacks configured, Terraphim AI searches each haystack independently and merges the results into a single index. The merging logic is implemented in `crates/terraphim_middleware/src/indexer/mod.rs`.

### Key Points
- **HashMap-Based Merging**: Results are stored in a `HashMap<String, Document>` where the key is the document ID
- **Document ID Uniqueness**: Each haystack generates its own document IDs:
  - QueryRs: uses URLs from API responses (e.g., `https://docs.rs/tokio/...`)
  - GrepApp: uses the format `grepapp:{repo}:{branch}:{path}` (e.g., `grepapp:tokio_tokio_main_src_lib.rs`)
- **Last-Wins Strategy**: When merging with `HashMap::extend()`, if two documents have the same ID, the last one (from the most recently processed haystack) overwrites the previous one
- **No Explicit Deduplication**: There is no automatic deduplication based on URL normalization or content similarity
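The last-wins behavior of `HashMap::extend()` described above can be sketched as follows. The `Document` struct and `merge_results` function here are simplified stand-ins for illustration, not the actual Terraphim types:

```rust
use std::collections::HashMap;

// Simplified stand-in for Terraphim's Document type.
#[derive(Debug, Clone, PartialEq)]
struct Document {
    id: String,
    source_haystack: String,
}

// Merge per-haystack result maps into one index. extend() keeps the
// last value inserted for any duplicate key (last-wins).
fn merge_results(haystack_results: Vec<HashMap<String, Document>>) -> HashMap<String, Document> {
    let mut index = HashMap::new();
    for results in haystack_results {
        index.extend(results);
    }
    index
}

fn main() {
    let mut first = HashMap::new();
    first.insert(
        "doc-1".to_string(),
        Document { id: "doc-1".to_string(), source_haystack: "https://query.rs".to_string() },
    );
    let mut second = HashMap::new();
    second.insert(
        "doc-1".to_string(),
        Document { id: "doc-1".to_string(), source_haystack: "https://grep.app".to_string() },
    );

    let merged = merge_results(vec![first, second]);
    // Same ID from two haystacks: only the last-processed copy survives.
    assert_eq!(merged.len(), 1);
    assert_eq!(merged["doc-1"].source_haystack, "https://grep.app");
}
```

In practice this means the merge order of haystacks determines which copy of a colliding document ID is kept.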
### Source Tracking

Every document is tagged with its source haystack via the `source_haystack` field:
- Example: `"source_haystack": "https://query.rs"`
- Example: `"source_haystack": "https://grep.app"`
This allows users to see which haystack provided each result.
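Because every merged result carries this field, a consumer can group or filter results by origin. A minimal sketch, with field names simplified from the actual Terraphim types:

```rust
use std::collections::HashMap;

// Simplified document carrying only the fields used here.
struct Document {
    url: String,
    source_haystack: Option<String>,
}

// Group documents by the haystack that produced them; documents with
// no source tag are collected under "unknown".
fn group_by_source(docs: &[Document]) -> HashMap<String, Vec<&Document>> {
    let mut groups: HashMap<String, Vec<&Document>> = HashMap::new();
    for doc in docs {
        let key = doc
            .source_haystack
            .clone()
            .unwrap_or_else(|| "unknown".to_string());
        groups.entry(key).or_default().push(doc);
    }
    groups
}

fn main() {
    let docs = vec![
        Document { url: "https://docs.rs/tokio".into(), source_haystack: Some("https://query.rs".into()) },
        Document { url: "https://github.com/tokio-rs/tokio".into(), source_haystack: Some("https://grep.app".into()) },
        Document { url: "https://docs.rs/mio".into(), source_haystack: Some("https://query.rs".into()) },
    ];
    let groups = group_by_source(&docs);
    assert_eq!(groups["https://query.rs"].len(), 2);
    assert_eq!(groups["https://grep.app"].len(), 1);
}
```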
## Duplicate Scenarios

### Scenario 1: Same File from Different Sources

Example: Searching for "tokio spawn" with the Rust Engineer role (QueryRs + GrepApp)
- QueryRs returns: `https://github.com/tokio-rs/tokio/blob/master/src/task/mod.rs`
- GrepApp returns: the same file with ID `grepapp:tokio-rs_tokio_master_src_task_mod.rs`
Current Behavior: Both results appear in search results as separate documents because they have different document IDs.
User Impact:
- ✅ Users can see the same result from multiple sources
- ⚠️ Results may appear duplicated in the UI
- ℹ️ The `source_haystack` field distinguishes the sources
### Scenario 2: URL Duplicates

Example: The same URL returned by two haystacks with different parameters
Current Behavior: If the exact same URL is returned but with different document IDs, both results appear.
User Impact:
- May see identical links
- Different snippets or metadata might provide additional context
- Relevance scoring is applied to each result independently
### Scenario 3: Content Duplicates
Example: Different URLs pointing to the same content (mirrors, forks, copies)
Current Behavior: No content-based deduplication. Each unique URL is treated as a separate document.
## Relevance Function Behavior
All relevance functions tested show the same duplicate handling behavior since deduplication happens (or doesn't happen) at the indexing level before relevance scoring:
| Relevance Function | Duplicate Handling Behavior |
|--------------------|-----------------------------|
| TitleScorer | No deduplication; all results scored independently |
| BM25 | No deduplication; TF-IDF scoring applied to all results |
| BM25F | No deduplication; field-weighted scoring applied to all |
| BM25Plus | No deduplication; enhanced BM25 applied to all |
| TerraphimGraph | Not applicable; uses a single KG source, not multiple remote haystacks |
### Test Results Example

Query: "tokio spawn"

TitleScorer:

```text
Total: 18, Unique URLs: 16, Duplicates: 2
QueryRs: 9, GrepApp: 9
```

BM25:

```text
Total: 18, Unique URLs: 16, Duplicates: 2
QueryRs: 9, GrepApp: 9
```

## Implementation Details
### Document ID Generation

#### QueryRs Haystack

```rust
// Uses the URL from the API response as the document ID
let doc_id = doc.url.clone();
```

#### GrepApp Haystack

```rust
// Constructs the ID from repo, branch, and path
// (following the grepapp:{repo}:{branch}:{path} format described above)
let doc_id = format!("grepapp:{}:{}:{}", repo, branch, path);
```

### Source Tagging

```rust
// Illustrative: the value is the haystack's URL, e.g. "https://query.rs"
document.source_haystack = Some(haystack_url.clone());
```

## Known Limitations
- **No URL Normalization**: URLs with different query parameters or fragments are treated as different documents
- **No Content Hashing**: Identical content at different URLs appears as separate results
- **No Fuzzy Matching**: Similar titles or snippets don't trigger deduplication
- **Last-Wins Overwriting**: If two haystacks generate the same document ID, only the last one is kept
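The missing content hashing could in principle be added as a post-processing step. A sketch using Rust's standard hasher, illustrative only and not existing Terraphim code:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

// Hash a document body to detect identical content at different URLs.
fn content_fingerprint(body: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    // Normalize whitespace so trivial formatting differences don't
    // defeat the comparison.
    for token in body.split_whitespace() {
        token.hash(&mut hasher);
    }
    hasher.finish()
}

// Keep only the first document seen for each content fingerprint.
fn dedup_by_content(bodies: Vec<String>) -> Vec<String> {
    let mut seen = HashSet::new();
    bodies
        .into_iter()
        .filter(|b| seen.insert(content_fingerprint(b)))
        .collect()
}

fn main() {
    let results = vec![
        "fn main() { println!(\"hi\"); }".to_string(),
        // Same content as above except for whitespace.
        "fn main() {   println!(\"hi\"); }".to_string(),
        "fn other() {}".to_string(),
    ];
    let unique = dedup_by_content(results);
    assert_eq!(unique.len(), 2);
}
```

A production version would hash the full document (title, body, metadata) with a stable hash rather than `DefaultHasher`, whose output is only deterministic within a single process.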
## User Recommendations

### For Users
- **Use Source Filters**: Check the `source_haystack` field to understand where results come from
- **Limit Results**: Use the `limit` parameter to control result set size
- **Review All Sources**: Duplicates from different sources may have different snippets or context
- **URL Comparison**: Manually compare URLs to identify duplicates
### For Developers

- **Unique ID Generation**: Ensure each haystack generates truly unique document IDs
- **Consistent Formatting**: Maintain consistent URL formatting across haystacks
- **Source-Specific Metadata**: Leverage the `source_haystack` field for filtering and grouping
## Future Enhancement Opportunities

### Potential Deduplication Strategies
- URL Normalization
- Content-Based Hashing
- GitHub URL Detection
- Fuzzy Title Matching
- Post-Processing Deduplication
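As an illustration of combining the first and last strategies, a post-processing pass could deduplicate merged results on a normalized URL key. This is a sketch of a possible enhancement, not code that exists in Terraphim today; the normalization rules shown are deliberately crude:

```rust
use std::collections::HashSet;

// Crude URL normalization: lowercase, drop the query string and
// fragment, and strip a trailing slash. A real implementation would
// use a proper URL parser.
fn normalize_url(url: &str) -> String {
    let lowered = url.to_lowercase();
    let base = lowered
        .split(|c| c == '?' || c == '#')
        .next()
        .unwrap_or(&lowered);
    base.trim_end_matches('/').to_string()
}

// Keep the first result seen for each normalized URL.
fn dedup_by_url(urls: Vec<String>) -> Vec<String> {
    let mut seen = HashSet::new();
    urls.into_iter()
        .filter(|u| seen.insert(normalize_url(u)))
        .collect()
}

fn main() {
    let results = vec![
        "https://github.com/tokio-rs/tokio/blob/master/src/task/mod.rs".to_string(),
        // Same page, different query parameters: collapses to one entry.
        "https://github.com/tokio-rs/tokio/blob/master/src/task/mod.rs?plain=1".to_string(),
        "https://docs.rs/tokio/".to_string(),
    ];
    let unique = dedup_by_url(results);
    assert_eq!(unique.len(), 2);
}
```

Keeping the first occurrence (rather than the last) would also change the current last-wins semantics, so the merge order of haystacks would need to be made explicit.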
## Testing

Comprehensive tests for duplicate handling are available in:

- `terraphim_server/tests/relevance_functions_duplicate_test.rs`
- `terraphim_server/tests/rust_engineer_enhanced_integration_test.rs`
Run the tests with:

```bash
# Run duplicate handling tests
cargo test --test relevance_functions_duplicate_test

# Run Rust Engineer dual haystack test
cargo test --test rust_engineer_enhanced_integration_test
```
## Examples

### Configuration Example

Rust Engineer role with multiple haystacks:

### Search Result Example
## Conclusion
Terraphim AI's current duplicate handling strategy prioritizes:
- **Transparency**: All results are shown with source attribution
- **Completeness**: No potentially valuable results are filtered out
- **Simplicity**: Straightforward merging logic without complex deduplication
This approach allows users to see all relevant results from multiple sources, with the trade-off of potential duplication. The `source_haystack` field enables users to understand and manage duplicates as needed.
For applications requiring strict deduplication, custom post-processing or configuration adjustments (limiting to a single haystack) may be appropriate.
**Last Updated**: 2025-11-14
**Test Coverage**: ✅ Comprehensive integration tests available
**Status**: Current behavior documented and tested