terraphim_automata - Text Matching and Autocomplete Engine
Overview
terraphim_automata provides high-performance text processing using Aho-Corasick automata and finite state transducers (FST). It powers Terraphim's autocomplete and knowledge graph linking features with support for fuzzy matching, link generation, and paragraph extraction.
Domain Model
Core Concepts
AutomataPath
Path to a thesaurus/automata file, either local or remote.
pub enum AutomataPath {
Local(PathBuf),
Remote(String),
}
Key Responsibilities:
- Abstract file location (local or remote)
- Support HTTP/HTTPS URLs
- Provide consistent interface
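The local/remote split can be sketched with plain std code. This is an illustrative helper, not part of the crate's API: it classifies a raw string by URL scheme, mirroring the enum above.

```rust
use std::path::PathBuf;

#[derive(Debug, PartialEq)]
enum AutomataPath {
    Local(PathBuf),
    Remote(String),
}

// Hypothetical helper: treat http(s) URLs as remote, everything else as local.
fn classify(raw: &str) -> AutomataPath {
    if raw.starts_with("http://") || raw.starts_with("https://") {
        AutomataPath::Remote(raw.to_string())
    } else {
        AutomataPath::Local(PathBuf::from(raw))
    }
}
```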
Thesaurus
Mapping from normalised terms to concepts with metadata.
pub struct Thesaurus {
pub name: String,
pub terms: AHashMap<NormalisedTermValue, NormalisedTerm>,
}
Key Responsibilities:
- Store term-to-concept mappings
- Support synonym lookups
- Provide iteration interface
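The normalised-lookup idea can be shown in isolation (std `HashMap` in place of `AHashMap`, and a bare concept id in place of `NormalisedTerm`): several surface forms map to the same concept after normalisation.

```rust
use std::collections::HashMap;

// Illustrative normalisation: trim and lowercase. The crate's actual
// normalisation rules may differ.
fn normalise(raw: &str) -> String {
    raw.trim().to_lowercase()
}

// Build a term -> concept-id map from (surface form, id) pairs.
fn build_terms(pairs: &[(&str, u64)]) -> HashMap<String, u64> {
    pairs.iter().map(|(t, id)| (normalise(t), *id)).collect()
}
```

Because keys are stored normalised, lookups succeed regardless of the caller's casing or surrounding whitespace.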
LinkType
Format for link generation in text replacement.
pub enum LinkType {
MarkdownLinks,
HTMLLinks,
WikiLinks,
}
Key Responsibilities:
- Define output format for matches
- Support multiple link types
- Enable custom formatting
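A minimal, self-contained sketch of how each variant could render (the variant names mirror the enum above; the crate's actual formatting helper may differ):

```rust
enum LinkType {
    MarkdownLinks,
    HTMLLinks,
    WikiLinks,
}

// Render a display value and optional URL in the requested format.
fn render(display: &str, url: Option<&str>, link_type: &LinkType) -> String {
    match link_type {
        LinkType::MarkdownLinks => match url {
            Some(u) => format!("[{display}]({u})"),
            None => format!("[{display}]"),
        },
        LinkType::HTMLLinks => match url {
            Some(u) => format!("<a href=\"{u}\">{display}</a>"),
            None => format!("<span>{display}</span>"),
        },
        LinkType::WikiLinks => format!("[[{display}]]"),
    }
}
```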
Autocomplete
AutocompleteIndex
FST-based prefix search index with metadata.
pub struct AutocompleteIndex {
pub fst: fst::Map<Vec<u8>>,
pub metadata: AHashMap<String, AutocompleteMetadata>,
}
Key Responsibilities:
- Fast prefix-based search
- Store term metadata
- Support fuzzy matching
AutocompleteMetadata
Metadata associated with autocomplete terms.
pub struct AutocompleteMetadata {
pub id: u64,
pub url: Option<String>,
pub display_value: Option<String>,
}
Key Responsibilities:
- Store term identifiers
- Link to external resources
- Preserve display values
AutocompleteResult
Result from autocomplete search.
pub struct AutocompleteResult {
pub term: String,
pub score: f64,
pub metadata: AutocompleteMetadata,
}
Key Responsibilities:
- Return matched term
- Provide relevance score
- Include metadata
Text Matching
Matched
Single text match with location information.
pub struct Matched {
pub matched: String,
pub start: usize,
pub end: usize,
pub metadata: NormalisedTerm,
}
Key Responsibilities:
- Store matched text
- Track position in source
- Provide term metadata
MatchedIterator
Iterator over all matches in text.
pub struct MatchedIterator<'a> {
    /* iterator state (automaton matches and a borrow of the source text) elided */
}
Key Responsibilities:
- Iterate over matches
- Preserve order
- Efficient traversal
Data Models
Thesaurus Builder
ThesaurusBuilder
Builder pattern for constructing thesaurus from sources.
pub trait ThesaurusBuilder: Send + Sync {
async fn build(
&self,
role: String,
path: String
) -> Result<Thesaurus>;
}
Implementations:
Logseq: Build from Logseq markdown files
JsonThesaurusBuilder: Build from JSON files
Logseq
Logseq-specific thesaurus builder.
pub struct Logseq {
pub path: String,
}
Key Responsibilities:
- Parse Logseq markdown files
- Extract term definitions
- Build thesaurus structure
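Logseq pages use `key:: value` property lines, so a builder along these lines could collect synonyms for a concept. Note the `synonyms::` property key is an assumption for illustration, not confirmed from the crate:

```rust
// Hypothetical extraction step: collect comma-separated values from
// "synonyms::" property lines in a Logseq markdown page.
fn extract_synonyms(markdown: &str) -> Vec<String> {
    markdown
        .lines()
        .filter_map(|l| l.trim().strip_prefix("synonyms::"))
        .flat_map(|rest| rest.split(','))
        .map(|s| s.trim().to_string())
        .filter(|s| !s.is_empty())
        .collect()
}
```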
Implementation Patterns
Thesaurus Loading
Local File Loading
pub fn load_thesaurus(automata_path: &AutomataPath) -> Result<Thesaurus> {
let contents = match automata_path {
AutomataPath::Local(path) => fs::read_to_string(path)?,
AutomataPath::Remote(_) => {
return Err(TerraphimAutomataError::InvalidThesaurus(
"Remote loading is not supported. Enable 'remote-loading' feature.".to_string(),
));
}
};
parse_thesaurus_json(&contents)
}
Pattern:
- Match on path type
- Read file contents
- Parse JSON structure
- Handle errors gracefully
Remote Loading (feature: "remote-loading")
pub async fn load_thesaurus(automata_path: &AutomataPath) -> Result<Thesaurus> {
async fn read_url(url: String) -> Result<String> {
log::debug!("Reading thesaurus from remote: {url}");
let response = reqwest::Client::builder()
.timeout(std::time::Duration::from_secs(30))
.user_agent("Terraphim-Automata/1.0")
.build()
.unwrap_or_else(|_| reqwest::Client::new())
.get(url.clone())
.header("Accept", "application/json")
.send()
.await
.map_err(|e| {
TerraphimAutomataError::InvalidThesaurus(format!(
"Failed to fetch thesaurus from remote {url}. Error: {e:#?}",
))
})?;
let status = response.status();
let headers = response.headers().clone();
let body = response.text().await;
match body {
Ok(text) => Ok(text),
Err(e) => {
let error_info = format!(
"Failed to read thesaurus from remote {url}. Status: {status}. Headers: {headers:#?}. Error: {e:#?}",
);
Err(TerraphimAutomataError::InvalidThesaurus(error_info))
}
}
}
let contents = match automata_path {
AutomataPath::Local(path) => {
if !path.exists() {
return Err(TerraphimAutomataError::InvalidThesaurus(format!(
"Thesaurus file not found: {}",
path.display()
)));
}
fs::read_to_string(path)?
}
AutomataPath::Remote(url) => read_url(url.clone()).await?,
};
parse_thesaurus_json(&contents)
}
Pattern:
- Async HTTP request with timeout
- Error handling with context
- Status and header logging
- Graceful degradation
Autocomplete Building
Index Construction
pub fn build_autocomplete_index(
    thesaurus: &Thesaurus,
    config: Option<AutocompleteConfig>
) -> Result<AutocompleteIndex> {
    // FST keys must be inserted in lexicographic order, so encode all
    // entries first, sort them, then feed the builder.
    let mut entries = Vec::new();
    let mut metadata = AHashMap::new();
    for (key, term) in thesaurus {
        let meta = AutocompleteMetadata {
            id: term.id,
            url: term.url.clone(),
            display_value: term.display_value.clone(),
        };
        let metadata_json = serde_json::to_string(&meta)?;
        let mut entry = Vec::new();
        entry.extend_from_slice(key.as_str().as_bytes());
        entry.extend_from_slice(b"\0"); // null byte separates term from metadata
        entry.extend_from_slice(metadata_json.as_bytes());
        entries.push((entry, term.id));
        metadata.insert(key.as_str().to_string(), meta);
    }
    entries.sort();
    let mut builder = fst::MapBuilder::memory();
    for (entry, id) in entries {
        builder.insert(entry, id)?;
    }
    let fst_bytes = builder.into_inner()?;
    let fst = fst::Map::new(fst_bytes)?;
    Ok(AutocompleteIndex { fst, metadata })
}
Pattern:
- Create FST builder
- Insert term + metadata pairs
- Use null byte separator
- Build separate metadata map
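The key layout can be exercised with plain std code: a key is `<term>\0<metadata-json>`, split back apart at the first null byte (a null byte is safe as a separator because it cannot occur inside UTF-8 text or JSON).

```rust
// Encode a term and its metadata JSON into one FST key.
fn encode(term: &str, metadata_json: &str) -> Vec<u8> {
    let mut entry = Vec::with_capacity(term.len() + 1 + metadata_json.len());
    entry.extend_from_slice(term.as_bytes());
    entry.push(0); // separator
    entry.extend_from_slice(metadata_json.as_bytes());
    entry
}

// Split the key back at the first null byte; None if no separator exists.
fn decode(entry: &[u8]) -> Option<(String, String)> {
    let pos = entry.iter().position(|&b| b == 0)?;
    Some((
        String::from_utf8_lossy(&entry[..pos]).into_owned(),
        String::from_utf8_lossy(&entry[pos + 1..]).into_owned(),
    ))
}
```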
Prefix Search
use fst::{IntoStreamer, Streamer};

pub fn autocomplete_search(
    index: &AutocompleteIndex,
    prefix: &str,
    limit: Option<usize>
) -> Vec<AutocompleteResult> {
    let mut results = Vec::new();
    // Seek to the first key >= prefix, then stream forward in order.
    let mut stream = index.fst.range().ge(prefix.as_bytes()).into_stream();
    while let Some((key, _value)) = stream.next() {
        if !key.starts_with(prefix.as_bytes()) {
            break;
        }
        // Keys are "<term>\0<metadata-json>"; split at the first null byte.
        let mut parts = key.splitn(2, |&b| b == 0);
        let (Some(term_bytes), Some(meta_bytes)) = (parts.next(), parts.next()) else {
            continue;
        };
        let term_str = String::from_utf8_lossy(term_bytes).into_owned();
        let metadata_json = String::from_utf8_lossy(meta_bytes);
        if let Ok(metadata) = serde_json::from_str::<AutocompleteMetadata>(&metadata_json) {
            results.push(AutocompleteResult {
                term: term_str,
                score: 1.0,
                metadata,
            });
            if let Some(limit) = limit {
                if results.len() >= limit {
                    break;
                }
            }
        }
    }
    results
}
Pattern:
- Seek to prefix in FST
- Stream matching entries
- Parse metadata from value
- Apply result limit
- Handle UTF-8 loss gracefully
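The seek-then-scan idea can be sketched over a sorted `Vec<String>` instead of an FST: binary-search to the first key greater than or equal to the prefix, then scan forward while keys still start with it. This is the same access pattern, minus the FST's compression.

```rust
// Return up to `limit` keys from `sorted` that start with `prefix`.
// `sorted` must be in ascending order for partition_point to be valid.
fn prefix_scan<'a>(sorted: &'a [String], prefix: &str, limit: usize) -> Vec<&'a str> {
    let start = sorted.partition_point(|k| k.as_str() < prefix);
    sorted[start..]
        .iter()
        .take_while(|k| k.starts_with(prefix))
        .take(limit)
        .map(|k| k.as_str())
        .collect()
}
```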
Fuzzy Matching
Jaro-Winkler Similarity
pub fn fuzzy_autocomplete_search_jaro_winkler(
    index: &AutocompleteIndex,
    query: &str,
    threshold: f64,
    limit: Option<usize>
) -> Result<Vec<AutocompleteResult>> {
    let mut results = Vec::new();
    for (term_str, metadata) in &index.metadata {
        let similarity = jaro_winkler_similarity(query, term_str);
        if similarity >= threshold {
            results.push(AutocompleteResult {
                term: term_str.clone(),
                score: similarity,
                metadata: metadata.clone(),
            });
        }
    }
    // Sort best-first, then apply the limit; truncating before sorting
    // would drop high-scoring matches.
    results.sort_by(|a, b| b.score.partial_cmp(&a.score).unwrap_or(std::cmp::Ordering::Equal));
    if let Some(limit) = limit {
        results.truncate(limit);
    }
    Ok(results)
}
fn jaro_winkler_similarity(s1: &str, s2: &str) -> f64 {
    // Delegates to an off-the-shelf implementation such as the `strsim` crate.
    strsim::jaro_winkler(s1, s2)
}
Pattern:
- Iterate over all terms
- Compute Jaro-Winkler similarity
- Filter by threshold
- Sort by score
- Apply limit
Levenshtein Distance
pub fn fuzzy_autocomplete_search_levenshtein(
    index: &AutocompleteIndex,
    query: &str,
    threshold: f64,
    limit: Option<usize>
) -> Result<Vec<AutocompleteResult>> {
    let mut results = Vec::new();
    for (term_str, metadata) in &index.metadata {
        let distance = levenshtein_distance(query, term_str);
        // Normalise by character count, not byte length, so multi-byte
        // text is scored consistently.
        let max_len = query.chars().count().max(term_str.chars().count());
        if max_len == 0 {
            continue;
        }
        let similarity = 1.0 - (distance as f64 / max_len as f64);
        if similarity >= threshold {
            results.push(AutocompleteResult {
                term: term_str.clone(),
                score: similarity,
                metadata: metadata.clone(),
            });
        }
    }
    // As above: sort best-first, then truncate to the limit.
    results.sort_by(|a, b| b.score.partial_cmp(&a.score).unwrap_or(std::cmp::Ordering::Equal));
    if let Some(limit) = limit {
        results.truncate(limit);
    }
    Ok(results)
}
Pattern:
- Compute Levenshtein distance
- Normalise by maximum length
- Filter and sort by similarity
- Apply limit
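The `levenshtein_distance` helper is referenced above but not shown; a standard single-row dynamic-programming version over chars (so multi-byte text is counted per character) looks like this:

```rust
// Edit distance between two strings using one reusable DP row.
// row[j] holds the distance between the processed prefix of `a`
// and b[..j]; `prev` carries the diagonal (match/substitute) cell.
fn levenshtein_distance(a: &str, b: &str) -> usize {
    let b_chars: Vec<char> = b.chars().collect();
    let mut row: Vec<usize> = (0..=b_chars.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut prev = row[0];
        row[0] = i + 1;
        for (j, &cb) in b_chars.iter().enumerate() {
            let cur = row[j + 1];
            row[j + 1] = if ca == cb {
                prev // characters match: no extra cost
            } else {
                1 + prev.min(cur).min(row[j]) // substitute / delete / insert
            };
            prev = cur;
        }
    }
    row[b_chars.len()]
}
```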
Text Matching and Replacement
Finding Matches
pub fn find_matches(
    text: &str,
    thesaurus: &Thesaurus
) -> Result<Vec<Matched>> {
    let (ac, values, ac_reverse_nterm) = build_aho_corasick(thesaurus)?;
    let mut matches = Vec::new();
    for mat in ac.find_iter(text) {
        let term_id = values[mat.pattern().as_usize()];
        let nterm = ac_reverse_nterm.get(&term_id)
            .ok_or_else(|| {
                TerraphimAutomataError::Dict(format!(
                    "No reverse lookup for term ID {term_id}"
                ))
            })?;
        matches.push(Matched {
            // Store the text that actually matched, not the pattern ID.
            matched: text[mat.start()..mat.end()].to_string(),
            start: mat.start(),
            end: mat.end(),
            metadata: NormalisedTerm {
                id: term_id,
                value: nterm.clone(),
                display_value: None,
                url: None,
            },
        });
    }
    Ok(matches)
}
Pattern:
- Build Aho-Corasick automata
- Iterate over matches
- Reverse lookup term metadata
- Return structured results
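What the automaton computes can be sketched naively: every non-overlapping `(start, end, pattern)` occurrence of any term in the text. Aho-Corasick finds these in a single pass over the text; the scan below is O(text × patterns) and for illustration only (patterns are assumed non-empty).

```rust
// Naive multi-pattern search: per-pattern substring scans, merged and
// sorted by position. Occurrences of the same pattern do not overlap.
fn naive_find(text: &str, patterns: &[&str]) -> Vec<(usize, usize, String)> {
    let mut out = Vec::new();
    for pat in patterns {
        let mut from = 0;
        while let Some(pos) = text[from..].find(pat) {
            let start = from + pos;
            out.push((start, start + pat.len(), pat.to_string()));
            from = start + pat.len(); // resume after this occurrence
        }
    }
    out.sort();
    out
}
```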
Replacing Matches
pub fn replace_matches(
content: &str,
thesaurus: &Thesaurus,
link_type: LinkType
) -> Result<Vec<u8>> {
let matches = find_matches(content, thesaurus)?;
if matches.is_empty() {
return Ok(content.as_bytes().to_vec());
}
let mut result = Vec::new();
let mut last_end = 0;
for mat in &matches {
result.extend_from_slice(content[last_end..mat.start].as_bytes());
let link = generate_link(mat, &link_type);
result.extend_from_slice(link.as_bytes());
last_end = mat.end;
}
result.extend_from_slice(content[last_end..].as_bytes());
Ok(result)
}
fn generate_link(mat: &Matched, link_type: &LinkType) -> String {
let display = mat.metadata.display();
let url = mat.metadata.url.as_deref().unwrap_or("");
match link_type {
LinkType::MarkdownLinks => {
if url.is_empty() {
format!("[{}]", display)
} else {
format!("[{}]({})", display, url)
}
}
LinkType::HTMLLinks => {
if url.is_empty() {
format!("<span>{}</span>", display)
} else {
format!("<a href=\"{url}\">{display}</a>")
}
}
LinkType::WikiLinks => {
format!("[[{}]]", display)
}
}
}
Pattern:
- Find all matches first
- Iterate in order
- Replace with generated links
- Preserve unmatched text
- Handle empty URLs gracefully
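The splice loop can be isolated into a small function: given non-overlapping `(start, end, replacement)` triples sorted by start, rebuild the text around them, copying unmatched spans verbatim.

```rust
// Replace each (start, end) byte range of `content` with its link text.
fn splice(content: &str, repls: &[(usize, usize, String)]) -> String {
    let mut out = String::with_capacity(content.len());
    let mut last_end = 0;
    for (start, end, link) in repls {
        out.push_str(&content[last_end..*start]); // unmatched text verbatim
        out.push_str(link);                        // generated link
        last_end = *end;
    }
    out.push_str(&content[last_end..]); // trailing text after the last match
    out
}
```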
Paragraph Extraction
Extracting Context
pub fn extract_paragraphs_from_automata(
content: &str,
thesaurus: &Thesaurus,
context_lines: usize
) -> Result<Vec<String>> {
let matches = find_matches(content, thesaurus)?;
let mut paragraphs = Vec::new();
for mat in &matches {
let paragraph = extract_paragraph_around_match(content, mat, context_lines);
paragraphs.push(paragraph);
}
Ok(paragraphs)
}
fn extract_paragraph_around_match(
    content: &str,
    mat: &Matched,
    context_lines: usize
) -> String {
    let lines: Vec<&str> = content.lines().collect();
    // The match's line index equals the number of newlines before it.
    let match_line_idx = content[..mat.start].matches('\n').count();
    let start_idx = match_line_idx.saturating_sub(context_lines);
    let end_idx = (match_line_idx + context_lines + 1).min(lines.len());
    lines[start_idx..end_idx].join("\n")
}
Pattern:
- Find all matches
- Extract surrounding context
- Use line-based extraction
- Apply context window
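The line-window step can be sketched in isolation: count the newlines before the match's byte offset to get its line index, then slice a symmetric window of context lines around it.

```rust
// Return `context` lines on each side of the line containing `byte_offset`.
fn line_window(content: &str, byte_offset: usize, context: usize) -> String {
    let lines: Vec<&str> = content.lines().collect();
    let line_idx = content[..byte_offset].matches('\n').count();
    let start = line_idx.saturating_sub(context);          // clamp at top
    let end = (line_idx + context + 1).min(lines.len());   // clamp at bottom
    lines[start..end].join("\n")
}
```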
Error Handling
Error Types
#[derive(thiserror::Error, Debug)]
pub enum TerraphimAutomataError {
#[error("Invalid thesaurus: {0}")]
InvalidThesaurus(String),
#[error("Serde deserialization error: {0}")]
Serde(#[from] serde_json::Error),
#[error("Dict error: {0}")]
Dict(String),
#[error("IO error: {0}")]
Io(#[from] std::io::Error),
#[error("Aho-Corasick build error: {0}")]
AhoCorasick(#[from] aho_corasick::BuildError),
#[error("FST error: {0}")]
Fst(#[from] fst::Error),
}
Categories:
- Validation: Invalid thesaurus format
- Serialisation: JSON parsing errors
- I/O: File system errors
- Automata: Aho-Corasick and FST errors
Performance Optimisations
FST Indexing
Memory Layout
let mut entry = Vec::new();
entry.extend_from_slice(key.as_str().as_bytes());
entry.extend_from_slice(b"\0");
entry.extend_from_slice(metadata_json.as_bytes());
Optimisation:
- Store term and metadata together
- Use null byte as separator
- Fast prefix matching via FST
- Compact storage format
Caching
Metadata Storage
pub struct AutocompleteIndex {
pub fst: fst::Map<Vec<u8>>,
pub metadata: AHashMap<String, AutocompleteMetadata>,
}
Optimisation:
- FST for fast prefix search
- Separate map for metadata
- Fast lookup by term
- Avoid JSON parsing during search
Lazy Evaluation
Streaming Iterator
pub fn autocomplete_search(
index: &AutocompleteIndex,
prefix: &str,
limit: Option<usize>
) -> Vec<AutocompleteResult> {
let mut results = Vec::new();
    let mut stream = index.fst.range().ge(prefix.as_bytes()).into_stream();
    while let Some((key, _value)) = stream.next() {
        // decode the entry, push a result, and break early once the limit is hit
    }
results
}
Optimisation:
- Stream results from FST
- Process incrementally
- Early exit on limit
- Avoid full iteration
Testing Patterns
Unit Tests
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_load_thesaurus_from_json() {
let json_str = r#"{
"name": "Engineering",
"data": {
"rust": {
"id": 1,
"nterm": "rust programming",
"url": "https://rust-lang.org"
}
}
}"#;
let thesaurus = load_thesaurus_from_json(json_str).unwrap();
assert_eq!(thesaurus.len(), 1);
}
#[test]
fn test_autocomplete_search() {
let mut thesaurus = Thesaurus::new("test".to_string());
thesaurus.insert(
NormalisedTermValue::from("rust"),
NormalisedTerm::new(1, NormalisedTermValue::from("rust"))
);
let index = build_autocomplete_index(&thesaurus, None).unwrap();
let results = autocomplete_search(&index, "ru", None);
assert!(!results.is_empty());
assert!(results[0].term.starts_with("ru"));
}
#[test]
fn test_fuzzy_search() {
let mut thesaurus = Thesaurus::new("test".to_string());
thesaurus.insert(
NormalisedTermValue::from("programming"),
NormalisedTerm::new(1, NormalisedTermValue::from("programming"))
);
let index = build_autocomplete_index(&thesaurus, None).unwrap();
let results = fuzzy_autocomplete_search_jaro_winkler(
&index,
"programin",
0.8,
None
).unwrap();
assert!(!results.is_empty());
assert!(results[0].score >= 0.8);
}
#[test]
fn test_replace_matches() {
let mut thesaurus = Thesaurus::new("test".to_string());
thesaurus.insert(
NormalisedTermValue::from("rust"),
NormalisedTerm::with_auto_id(
NormalisedTermValue::from("rust")
).with_url("https://rust-lang.org".to_string())
);
let text = "I love rust!";
let replaced = replace_matches(
text,
&thesaurus,
LinkType::MarkdownLinks
).unwrap();
let result_str = String::from_utf8(replaced).unwrap();
assert!(result_str.contains("[rust](https://rust-lang.org)"));
}
}
Best Practices
Thesaurus Design
- Use unique IDs consistently
- Normalise terms case-insensitively
- Provide display values for output
- Support external resource links
Autocomplete
- Use FST for prefix matching
- Separate metadata storage
- Support fuzzy matching
- Apply sensible limits
Text Matching
- Use Aho-Corasick for efficiency
- Handle overlapping matches correctly
- Preserve original text structure
- Support multiple link formats
Error Handling
- Provide context in errors
- Categorise error types
- Use thiserror for consistency
- Handle recoverable errors gracefully
Future Enhancements
Planned Features
Stemming Support
pub fn autocomplete_search_stemmed(
index: &AutocompleteIndex,
prefix: &str,
stemmer: &dyn Stemmer
) -> Vec<AutocompleteResult> {
}
Phonetic Matching
pub fn fuzzy_autocomplete_search_phonetic(
index: &AutocompleteIndex,
query: &str,
limit: Option<usize>
) -> Vec<AutocompleteResult> {
}
Context-Aware Matching
pub fn find_matches_with_context(
content: &str,
thesaurus: &Thesaurus,
window_size: usize
) -> Result<Vec<MatchedWithContext>> {
}