Query.rs and Reddit Content Fetching Fix Plan
Problem Statement
The test_query_rs_crates_search test is failing because:
- Query.rs search returns mixed results (actual crates + Reddit posts about crates)
- Reddit posts don't have crates.io API data fields (Description, Downloads, etc.)
- Content may be fetched multiple times for the same URL
- There is no configuration option to control whether full content is fetched
Solution Architecture
1. Add Fetch Content Configuration Parameter
Location: terraphim_types::Haystack
pub struct Haystack {
pub location: String,
pub service: ServiceType,
pub read_only: Option<bool>,
pub atomic_server_secret: Option<String>,
pub extra_parameters: Option<Map<String, String>>,
pub fetch_content: Option<bool>,
}
Default Behavior: fetch_content is None, which is treated as false (lightweight, metadata only)
2. Implement URL Deduplication with HashSet
Location: crates/terraphim_middleware/src/haystack/query_rs.rs
Add a URL tracking cache to the indexer:
use std::collections::HashSet;
use std::sync::{Arc, Mutex};
pub struct QueryRsHaystackIndexer {
client: reqwest::Client,
fetched_urls: Arc<Mutex<HashSet<String>>>,
}
impl QueryRsHaystackIndexer {
pub fn new() -> Self {
Self {
client: reqwest::Client::new(),
fetched_urls: Arc::new(Mutex::new(HashSet::new())),
}
}
fn should_fetch(&self, url: &str) -> bool {
let mut cache = self.fetched_urls.lock().unwrap();
if cache.contains(url) {
log::debug!("⏭️ Skipping already fetched URL: {}", url);
false
} else {
cache.insert(url.to_string());
true
}
}
}
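The check-then-insert in should_fetch() can be collapsed into one call, because `HashSet::insert` returns `true` only when the value was not already present. The following is a minimal std-only sketch of that idea, not the indexer's actual code: `UrlCache` is a hypothetical stand-in for the `fetched_urls` field, and recovering from a poisoned lock (instead of calling `unwrap()`) is an assumption about the desired behavior.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the indexer's `fetched_urls` field.
#[derive(Clone, Default)]
pub struct UrlCache {
    seen: Arc<Mutex<HashSet<String>>>,
}

impl UrlCache {
    /// Returns `true` the first time a URL is seen and `false` afterwards.
    pub fn should_fetch(&self, url: &str) -> bool {
        // Recover the set if another thread panicked while holding the lock,
        // rather than turning the poison into a second panic.
        let mut seen = self
            .seen
            .lock()
            .unwrap_or_else(|poisoned| poisoned.into_inner());
        // `insert` reports whether the value was newly added.
        seen.insert(url.to_owned())
    }

    /// Forgets all URLs, e.g. at the start of a new indexing run.
    pub fn clear(&self) {
        self.seen
            .lock()
            .unwrap_or_else(|poisoned| poisoned.into_inner())
            .clear();
    }
}
```

Clones of `UrlCache` share the same underlying set through the `Arc`. URLs are compared as exact strings, so two spellings of the same page (for example with and without a trailing slash) count as different entries.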
3. Differentiate Between Crates.io and Reddit Results
Strategy: Detect result type and apply appropriate fetching logic
enum QueryRsResultType {
CratesIo { crate_name: String },
Reddit { post_id: String },
Other { url: String },
}
impl QueryRsHaystackIndexer {
fn classify_result(&self, url: &str, title: &str) -> QueryRsResultType {
if url.contains("crates.io/crates/") {
let crate_name = extract_crate_name(url);
QueryRsResultType::CratesIo { crate_name }
} else if url.contains("reddit.com") || title.starts_with("[Reddit]") {
let post_id = extract_reddit_post_id(url);
QueryRsResultType::Reddit { post_id }
} else {
QueryRsResultType::Other { url: url.to_string() }
}
}
}
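classify_result() relies on two helpers that are not defined in this plan. One possible shape for them is sketched below using only std string operations. Both function bodies are assumptions for illustration. They return `Option<String>` so the caller can decide what to do with a malformed URL (for example, fall back to the `Other` variant), which differs from the call sites above that expect a plain `String`.

```rust
/// Hypothetical helper: returns the crate name from a URL such as
/// `https://crates.io/crates/serde` or `https://crates.io/crates/serde/1.0.0`.
pub fn extract_crate_name(url: &str) -> Option<String> {
    let (_, rest) = url.split_once("crates.io/crates/")?;
    // Stop at the next path separator, query string, or fragment.
    let name = rest.split(|c: char| matches!(c, '/' | '?' | '#')).next()?;
    (!name.is_empty()).then(|| name.to_owned())
}

/// Hypothetical helper: returns the post ID from a URL such as
/// `https://www.reddit.com/r/rust/comments/abc123/some_title/`.
pub fn extract_reddit_post_id(url: &str) -> Option<String> {
    let (_, rest) = url.split_once("/comments/")?;
    // The ID is the path segment immediately after `/comments/`.
    let id = rest.split(|c: char| matches!(c, '/' | '?' | '#')).next()?;
    (!id.is_empty()).then(|| id.to_owned())
}
```

A Reddit URL without a `/comments/` segment (a subreddit front page, for instance) yields `None`, in which case fetching by post ID is not possible and the title-only fallback is the natural choice.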
4. Conditional Content Fetching Logic
Implementation Flow:
async fn fetch_document_content(
&self,
url: &str,
title: &str,
result_type: QueryRsResultType,
fetch_content: bool,
) -> Result<Document> {
if !self.should_fetch(url) {
return Ok(create_cached_placeholder(url, title));
}
let body = match result_type {
QueryRsResultType::CratesIo { crate_name } => {
if fetch_content {
self.fetch_crate_api_data(&crate_name).await
.unwrap_or_else(|e| {
log::warn!("Failed to fetch crate API for {}: {}", crate_name, e);
format!("Crate: {} (API unavailable)", crate_name)
})
} else {
format!("Crate: {} - {}", crate_name, url)
}
}
QueryRsResultType::Reddit { post_id } => {
if fetch_content {
self.fetch_reddit_post(&post_id).await
.unwrap_or_else(|e| {
log::warn!("Failed to fetch Reddit post {}: {}", post_id, e);
format!("Reddit discussion: {}", title)
})
} else {
format!("Reddit: {}", title)
}
}
QueryRsResultType::Other { url } => {
if fetch_content {
self.fetch_generic_content(&url).await
.unwrap_or_else(|_| title.to_string())
} else {
title.to_string()
}
}
};
Ok(Document {
id: generate_document_id(&url),
url: url.to_string(),
title: title.to_string(),
body,
description: None,
stub: None,
tags: None,
rank: None,
})
}
5. Update Configuration Handling
Location: crates/terraphim_middleware/src/haystack/query_rs.rs
async fn index(&self, haystack: &Haystack) -> Result<BTreeMap<String, Document>> {
let fetch_content = haystack.fetch_content.unwrap_or(false);
log::info!(
"🔍 Indexing query.rs haystack (fetch_content: {})",
fetch_content
);
{
let mut cache = self.fetched_urls.lock().unwrap();
cache.clear();
}
let query = extract_query_from_location(&haystack.location);
let search_results = self.search_query_rs(&query).await?;
let mut documents = BTreeMap::new();
for result in search_results {
let result_type = self.classify_result(&result.url, &result.title);
let doc = self.fetch_document_content(
&result.url,
&result.title,
result_type,
fetch_content,
).await?;
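// Keep the first document for an ID so that a placeholder returned for a
// duplicate URL cannot overwrite content that was already fetched.
documents.entry(doc.id.clone()).or_insert(doc);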
}
log::info!("✅ Indexed {} documents from query.rs", documents.len());
Ok(documents)
}
6. Reddit-Specific API Integration
New Module: crates/terraphim_middleware/src/haystack/reddit.rs
use serde::Deserialize;
use serde_json::Value;
#[derive(Debug, Deserialize)]
struct RedditPostData {
title: String,
selftext: String,
author: String,
score: i32,
num_comments: i32,
created_utc: f64,
}
impl QueryRsHaystackIndexer {
async fn fetch_reddit_post(&self, post_id: &str) -> Result<String> {
let url = format!("https://www.reddit.com/comments/{}.json", post_id);
let response: Value = self.client
.get(&url)
.header("User-Agent", "terraphim-indexer/0.2.0")
.send()
.await?
.json()
.await?;
let post_data = parse_reddit_response(response)?;
Ok(format!(
"Title: {}\n\nAuthor: u/{}\nScore: {} | Comments: {}\n\n{}",
post_data.title,
post_data.author,
post_data.score,
post_data.num_comments,
post_data.selftext
))
}
}
7. Update Test to Use Configuration
Location: crates/terraphim_middleware/src/tests/query_rs_haystack_test.rs
#[tokio::test]
async fn test_query_rs_crates_search() {
let indexer = QueryRsHaystackIndexer::default();
let haystack = Haystack {
location: "https://query.rs/graph".to_string(),
service: ServiceType::QueryRs,
read_only: Some(true),
atomic_server_secret: None,
extra_parameters: None,
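fetch_content: Some(true),
};
// Full-content mode: crates.io results are enriched with API data, while
// Reddit results carry post text and no crates.io fields.
let index = indexer
.index(&haystack)
.await
.expect("indexing query.rs should succeed");
println!("Found {} documents (full content mode)", index.len());
}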
#[tokio::test]
async fn test_query_rs_lightweight_mode() {
let indexer = QueryRsHaystackIndexer::default();
let haystack = Haystack {
location: "https://query.rs/graph".to_string(),
service: ServiceType::QueryRs,
read_only: Some(true),
atomic_server_secret: None,
extra_parameters: None,
fetch_content: Some(false),
};
match indexer.index(&haystack).await {
Ok(index) => {
println!("Found {} documents (lightweight mode)", index.len());
for (_id, doc) in index.iter() {
assert!(
doc.body.len() < 200,
"Lightweight mode should have minimal body content"
);
}
}
Err(e) => panic!("Failed to index: {:?}", e),
}
}
8. Crates.io API Enhancement
Existing Behavior: the indexer already fetches crates.io API data; the function is shown here for reference
async fn fetch_crate_api_data(&self, crate_name: &str) -> Result<String> {
let api_url = format!("https://crates.io/api/v1/crates/{}", crate_name);
let response: Value = self.client
.get(&api_url)
.header("User-Agent", "terraphim-indexer/0.2.0")
.send()
.await?
.json()
.await?;
let crate_data = response["crate"].as_object()
.ok_or_else(|| anyhow::anyhow!("Invalid crate data"))?;
let description = crate_data["description"]
.as_str()
.unwrap_or("No description");
let downloads = crate_data["downloads"]
.as_i64()
.unwrap_or(0);
let version = crate_data["max_version"]
.as_str()
.unwrap_or("unknown");
Ok(format!(
"Description: {}\nVersion: {}\nDownloads: {}\n",
description, version, downloads
))
}
Implementation Checklist
Phase 1: Core Infrastructure ✅
- [ ] Add fetch_content field to Haystack struct in terraphim_types
- [ ] Add URL deduplication cache to QueryRsHaystackIndexer
- [ ] Implement should_fetch() method with HashSet tracking
- [ ] Update TypeScript bindings if needed
Phase 2: Result Classification 🔄
- [ ] Create QueryRsResultType enum
- [ ] Implement classify_result() method
- [ ] Add extract_reddit_post_id() helper (if not exists)
- [ ] Add extract_crate_name() helper (if not exists)
Phase 3: Conditional Fetching Logic 📝
- [ ] Refactor fetch_document_content() to use fetch_content parameter
- [ ] Implement Reddit API fetching in separate module
- [ ] Add error handling with graceful degradation
- [ ] Add logging for fetch decisions
Phase 4: Testing & Validation ✅
- [ ] Update test_query_rs_crates_search with fetch_content: true
- [ ] Add test_query_rs_lightweight_mode for fetch_content: false
- [ ] Add test_url_deduplication to verify HashSet works
- [ ] Add test_mixed_results_classification for crates + Reddit
Phase 5: Documentation 📚
- [ ] Document fetch_content parameter in README
- [ ] Add examples of both modes to user guide
- [ ] Update API documentation
- [ ] Add performance notes about caching
Configuration Examples
Lightweight Mode (Default)
{
"location": "https://query.rs/graph",
"service": "QueryRs",
"fetch_content": false
}
Result: Fast indexing, minimal content, just titles and URLs
Full Content Mode
{
"location": "https://query.rs/graph",
"service": "QueryRs",
"fetch_content": true
}
Result: Comprehensive indexing with API data, Reddit content, etc.
Performance Considerations
- URL Deduplication: expected O(1) lookup and insert with HashSet
- Memory: HashSet grows with unique URLs (typically < 1000 per query)
- Network: Controlled by fetch_content flag
- Rate Limiting: Add exponential backoff for API calls
- Caching: Consider persisting URL cache across runs (future enhancement)
Error Handling Strategy
- API Failures: Log warning, return minimal content, don't fail entire index
- Network Errors: Retry with backoff, then graceful degradation
- Parsing Errors: Log error, use title as fallback content
- Rate Limits: Respect 429 responses, implement exponential backoff
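The retry-then-degrade behavior described above could look roughly like the sketch below. It is std-only and blocking so that it stays self-contained. The names `backoff_delay` and `retry_with_backoff` and all parameter values are assumptions, not existing terraphim APIs, and inside the async indexer the blocking sleep would be replaced by an awaited timer.

```rust
use std::thread;
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): `base * 2^attempt`, capped at `cap`.
pub fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    // Saturate instead of overflowing for very large attempt numbers.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(cap).min(cap)
}

/// Runs `op` until it succeeds or `max_attempts` calls have failed
/// (`op` is always called at least once). Returns the first success
/// or the last error.
pub fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base: Duration,
    cap: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error so the caller can
            // log it and fall back to minimal content.
            Err(err) if attempt + 1 >= max_attempts => return Err(err),
            Err(_) => {
                // Blocking sleep keeps the sketch std-only; async code
                // would await a timer here instead.
                thread::sleep(backoff_delay(attempt, base, cap));
                attempt += 1;
            }
        }
    }
}
```

A caller would wrap a single fetch in `retry_with_backoff` and, on the final `Err`, log a warning and use the title as the document body. That keeps one failing URL from aborting the whole index.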
Testing Strategy
- Unit Tests: Each component (classification, fetching, deduplication)
- Integration Tests: Full indexing with mocked HTTP responses
- E2E Tests: Real API calls (marked with #[ignore] by default)
- Performance Tests: Measure deduplication effectiveness
Migration Path
- Backward Compatibility: fetch_content: None defaults to false
- Existing Configs: Continue to work without changes
- Opt-in Enhancement: Users enable fetch_content: true when needed
- Deprecation: None required (additive change)
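The backward-compatibility rule amounts to resolving the optional flag in one place. A minimal sketch follows, where `HaystackSettings` is a trimmed-down, hypothetical stand-in for `Haystack` that shows only the new field:

```rust
/// Trimmed-down, hypothetical stand-in for `Haystack`; only the new field is shown.
#[derive(Debug, Default)]
pub struct HaystackSettings {
    pub fetch_content: Option<bool>,
}

impl HaystackSettings {
    /// `None` (the field is absent from an existing config) resolves to
    /// lightweight mode, so older configs behave exactly as before.
    pub fn fetch_content_enabled(&self) -> bool {
        self.fetch_content.unwrap_or(false)
    }
}
```

If the real struct is deserialized with serde's derive, a missing `Option` field becomes `None` without extra attributes, which is what lets existing config files load unchanged.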
Files to Modify
crates/terraphim_types/src/lib.rs - Add fetch_content field
crates/terraphim_middleware/src/haystack/query_rs.rs - Main logic
crates/terraphim_middleware/src/haystack/reddit.rs - New module
crates/terraphim_middleware/src/tests/query_rs_haystack_test.rs - Tests
crates/terraphim_config/src/lib.rs - Config handling
desktop/src/lib/generated/types.ts - TypeScript bindings update
Success Criteria
✅ All tests pass including the currently failing crates search test
✅ URL deduplication prevents duplicate fetches (verified in logs)
✅ fetch_content: false mode is fast (< 1s for typical query)
✅ fetch_content: true mode provides comprehensive data
✅ Reddit posts are properly handled and formatted
✅ Crates.io API data is correctly fetched and formatted
✅ No breaking changes to existing configurations