Terraphim Knowledge Graph System

Overview

The Terraphim Knowledge Graph (KG) system provides semantic search capabilities by building thesauri from markdown files and using graph-based ranking algorithms. The system converts synonym relationships into graph structures that dramatically improve search relevance and discoverability.

Architecture Components

Core Components

Logseq Builder - Extracts synonyms from markdown files using synonyms:: syntax
Thesaurus - Maps synonyms to normalized concept terms with unique IDs
RoleGraph - Graph structure with nodes, edges, and documents for ranking
TerraphimGraph Relevance Function - Graph-based scoring algorithm
Knowledge Graph Local - Local markdown file processing for KG construction

Knowledge Graph Construction

Source Files

Knowledge graphs are built from markdown files in docs/src/kg/:

docs/src/kg/
├── terraphim-graph.md    # Graph architecture concepts
├── service.md           # Service definitions
├── haystack.md          # Haystack integration
├── bug-reporting.md     # Bug reporting terminology and structured analysis
├── issue-tracking.md    # Domain-specific issue tracking terminology
└── [additional KG files]

Synonym Syntax

Markdown files use the synonyms:: syntax to define concept relationships:

# Terraphim-graph

## Terraphim Graph scorer

Terraphim Graph (scorer) is using unique graph embeddings.

[example] synonyms:: graph embeddings, graph, knowledge graph based embeddings

Now we will have a concept "Terraphim Graph Scorer" with synonyms.

Thesaurus Construction

The Logseq builder processes markdown files to create thesaurus mappings:

let logseq_builder = Logseq::default();
let thesaurus = logseq_builder
    .build("Terraphim Engineer".to_string(), kg_path)
    .await?;

Example thesaurus output:

'terraphim-graph' -> 'terraphim-graph' (ID: 3)
'graph embeddings' -> 'terraphim-graph' (ID: 3)
'graph' -> 'terraphim-graph' (ID: 3)
'knowledge graph based embeddings' -> 'terraphim-graph' (ID: 3)
'haystack' -> 'haystack' (ID: 1)
'service' -> 'service' (ID: 2)

Graph Structure

RoleGraph Components

The RoleGraph converts thesaurus data into searchable graph structures:

pub struct RoleGraph {
    pub role: RoleName,
    nodes: AHashMap<u64, Node>,           // Concept nodes
    edges: AHashMap<u64, Edge>,           // Connections between concepts
    documents: AHashMap<String, IndexedDocument>, // Indexed content
    pub thesaurus: Thesaurus,             // Synonym mappings
    pub ac: AhoCorasick,                  // Fast pattern matching
}

Node Structure

Each node represents a concept with connections:

pub struct Node {
    pub id: u64,                    // Unique concept ID
    pub rank: u64,                  // Importance score
    pub connected_with: HashSet<u64>, // Edge IDs connecting this node
}

Edge Structure

Edges connect concepts and track document associations:

pub struct Edge {
    pub id: u64,                           // Unique edge ID
    pub rank: u64,                         // Connection strength
    pub doc_hash: HashMap<String, u64>,    // Documents referencing this edge
}

Search and Ranking Algorithm

TerraphimGraph Relevance Function

The TerraphimGraph relevance function uses graph structure for ranking:

Pattern Matching - Find synonym matches in query text using Aho-Corasick
Node Discovery - Map matched terms to concept nodes via thesaurus
Edge Traversal - Follow connections between related concepts
Rank Calculation - Combine node rank + edge rank + document rank
Result Aggregation - Sort by total rank and return top results

Ranking Formula

let total_rank = node.rank + edge.rank + document_rank;

The ranking rewards:

Concept Importance (node.rank) - How central the concept is
Connection Strength (edge.rank) - How strongly concepts are related
Document Relevance (document_rank) - How relevant the document is

Query Processing

pub fn query_graph(
    &self,
    query_string: &str,
    offset: Option<usize>,
    limit: Option<usize>,
) -> Result<Vec<(String, IndexedDocument)>> {
    // 1. Find matching node IDs using Aho-Corasick
    let node_ids = self.find_matching_node_ids(query_string);

    // 2. Traverse graph structure for each matched node
    for node_id in node_ids {
        let node = self.nodes.get(&node_id)?;

        // 3. Follow edges to find connected documents
        for edge_id in &node.connected_with {
            let edge = self.edges.get(edge_id)?;

            // 4. Calculate combined ranking
            for (document_id, document_rank) in &edge.doc_hash {
                let total_rank = node.rank + edge.rank + document_rank;
                // Aggregate results...
            }
        }
    }

    // 5. Sort by rank and return top results
    ranked_documents.sort_by_key(|(_, doc)| std::cmp::Reverse(doc.rank));
    Ok(documents)
}

Performance Characteristics

Search Performance

Based on comprehensive testing:

Initial KG State: 10 terms, 3 nodes, 5 edges
Query Response: Consistent rank 34 for "terraphim-graph"
Search Speed: Fast pattern matching with Aho-Corasick
Memory Efficiency: Compact graph representation

Ranking Improvement

Adding synonyms creates dramatic ranking improvements:

| Metric | Before | After | Improvement | |--------|--------|-------|-------------| | Thesaurus Terms | 10 | 16 | +60% | | Graph Nodes | 3 | 4 | +33% | | Graph Edges | 5 | 8 | +60% | | "terraphim-graph" Rank | 28 | 117 | +318% |

Role Configuration

Terraphim Engineer Role

The Terraphim Engineer role uses local KG with TerraphimGraph relevance:

{
  "name": "Terraphim Engineer",
  "relevance_function": "terraphim-graph",
  "kg": {
    "knowledge_graph_local": {
      "input_type": "markdown",
      "path": "docs/src/kg"
    }
  }
}

Local vs Remote Thesaurus

Local KG (Recommended):

Built from docs/src/kg markdown files
10-16 terms from local content
Domain-specific, highly relevant
Fast building (~10 seconds)

Remote Thesaurus:

Downloaded from external URL
1,725+ terms from general content
May miss local domain terms
Network dependency

Implementation Examples

Building Knowledge Graph

use terraphim_middleware::thesaurus::{Logseq, ThesaurusBuilder};
use terraphim_rolegraph::RoleGraph;
use terraphim_types::RoleName;

// 1. Build thesaurus from local KG files
let logseq_builder = Logseq::default();
let thesaurus = logseq_builder
    .build("Terraphim Engineer".to_string(), kg_path)
    .await?;

// 2. Create rolegraph with thesaurus
let role_name = RoleName::new("Terraphim Engineer");
let mut rolegraph = RoleGraph::new(role_name, thesaurus).await?;

// 3. Index documents into rolegraph
rolegraph.insert_document(&document.id, document);

// 4. Search with graph-based ranking
let results = rolegraph.query_graph("terraphim-graph", Some(0), Some(10))?;

Adding New Knowledge

// Create new KG file with synonyms
let new_kg_content = r#"
# Graph Analysis

## Advanced Graph Processing

Graph Analysis provides deep insights into data relationships.

[example] synonyms:: data analysis, network analysis, graph processing,
          relationship mapping, connectivity analysis,
          terraphim-graph, graph embeddings

This enhances graph-based system capabilities.
"#;

// Write to KG directory
fs::write(&kg_path.join("graph-analysis.md"), new_kg_content).await?;

// Rebuild thesaurus to include new terms
let expanded_thesaurus = logseq_builder
    .build("Terraphim Engineer".to_string(), &kg_path)
    .await?;

Measuring Graph Growth

// Measure initial state
let initial_nodes = rolegraph.nodes_map().len();
let initial_edges = rolegraph.edges_map().len();
let initial_terms = thesaurus.len();

// ... add new content and rebuild ...

// Measure growth
let node_growth = expanded_nodes - initial_nodes;
let edge_growth = expanded_edges - initial_edges;
let term_growth = expanded_terms - initial_terms;

println!("Growth: +{} terms, +{} nodes, +{} edges",
    term_growth, node_growth, edge_growth);

Best Practices

Content Strategy

Domain-Specific Terms - Use terminology relevant to your domain
Synonym Research - Include terms users actually search for
Concept Mapping - Group related terms under common concepts
Strategic Placement - Add important synonyms to boost key terms

Performance Optimization

Local KG Preferred - Use local markdown files for domain relevance
Measured Growth - Track thesaurus and graph expansion metrics
Test-Driven - Validate ranking improvements with tests
Incremental Building - Add synonyms gradually and measure impact

Testing and Validation

Isolated Testing - Use temporary directories for safe testing
Baseline Measurement - Record initial state before changes
Impact Validation - Verify ranking improvements after additions
Regression Testing - Ensure changes don't break existing functionality

Troubleshooting

Common Issues

No Search Results:

Check if thesaurus contains expected terms
Verify role uses TerraphimGraph relevance function
Ensure KG path points to correct directory

Low Search Rankings:

Add more relevant synonyms to target concepts
Check synonym syntax in markdown files
Verify graph structure has sufficient connections

Build Failures:

Validate markdown file syntax
Check file permissions in KG directory
Ensure Logseq builder has access to files

Debug Information

// Print thesaurus contents
for (term, normalized_term) in &thesaurus {
    println!("'{}' -> '{}' (ID: {})",
        term.as_str(),
        normalized_term.value.as_str(),
        normalized_term.id);
}

// Check graph structure
println!("Nodes: {}, Edges: {}, Documents: {}",
    rolegraph.nodes_map().len(),
    rolegraph.edges_map().len(),
    rolegraph.get_all_documents().count());

// Test search functionality
let results = rolegraph.query_graph("test-term", Some(0), Some(5))?;
println!("Search results: {} found", results.len());

Bug Reporting and Issue Tracking Enhancement (2025-01-31)

Domain-Specific Knowledge Graph Files

The Terraphim KG system has been enhanced with comprehensive bug reporting and issue tracking terminology:

bug-reporting.md - Core bug reporting concepts:

Steps to Reproduce - Comprehensive synonyms for reproduction procedures
Expected Behaviour - Terminology for intended system behavior
Actual Behaviour - Variations for describing observed problems
Impact Analysis - Business and operational impact terminology
Bug Classification - Issue categorization and severity terms
Quality Assurance - QA processes and testing terminology

issue-tracking.md - Domain-specific terminology:

Payroll System Issues - Salary calculation and compensation problems
Data Consistency Problems - Synchronization and integrity issues
HR System Integration - Human resources system connectivity
System Integration Failures - Cross-system communication problems
Performance Degradation - System slowdown and bottleneck terminology
User Experience Issues - UI/UX problem descriptions

MCP Integration Testing

Comprehensive test suite validates bug reporting functionality:

test_bug_report_extraction.rs - Core functionality testing:

Extracts 2,615 paragraphs from comprehensive bug reports
Extracts 165 paragraphs from short content scenarios
Tests all four bug report sections systematically
Validates connectivity analysis across related terms

test_kg_term_verification.rs - Knowledge graph validation:

Payroll terms: 3 suggestions (provider, service, middleware)
Data consistency terms: 9 suggestions (data analysis, network analysis, etc.)
Quality assurance terms: 9 suggestions (connectivity analysis, graph processing, etc.)

Performance Improvements

The enhanced knowledge graph demonstrates significant improvements in structured document analysis:

Semantic Understanding: Enhanced ability to process structured bug reports using semantic understanding rather than keyword matching
Domain Coverage: Comprehensive terminology coverage for technical documentation and issue tracking
Extraction Performance: Robust paragraph extraction across different content types and sizes
Term Recognition: Effective autocomplete functionality with expanded terminology

Future Enhancements

Planned Features

Dynamic KG Updates - Hot-reload KG changes without restart
Graph Visualization - Visual representation of concept relationships
Advanced Ranking - Machine learning-enhanced relevance scoring
Multi-Language Support - Synonym support for multiple languages
Performance Optimization - Caching and incremental updates
Domain Expansion - Additional specialized terminology for specific industries and use cases

Integration Opportunities

External Ontologies - Import from RDF/OWL knowledge bases
Collaborative Editing - Multi-user KG development workflows
Analytics Dashboard - Search analytics and KG health monitoring
API Extensions - RESTful APIs for KG management

Conclusion

The Terraphim Knowledge Graph system provides powerful semantic search capabilities through graph-based ranking. By converting synonym relationships into graph structures, the system dramatically improves search relevance and provides a framework for continuous improvement through strategic content additions.

Key Benefits:

🔍 Semantic Search - Find content by meaning, not just keywords
📈 Ranking Improvement - Up to 318% ranking boost from synonyms
🎯 Domain Relevance - Local KG ensures domain-specific accuracy
🔧 Easy Expansion - Simple markdown syntax for adding knowledge
📊 Measurable Impact - Comprehensive testing framework for validation

The knowledge graph system forms the foundation for intelligent, context-aware search in the Terraphim AI platform.