Terraphim Graph Embeddings vs Cleora
1. Terraphim Graph (Current Implementation)
1.1 Data structures
RoleGraph(crateterraphim_rolegraph) maintains:nodes: HashMap<u64, Node>β concepts (unique KG terms)edges: HashMap<u64, Edge>β co-occurrences between concepts found while indexing documentsdocuments: HashMap<String, IndexedDocument>β corpus metadata- Aho-Corasick automaton for ultra-fast exact synonym matching (O(text) streaming)
1.2 Indexing / "Training"
- Parse each document β split sentences β run AC β list of matched node-IDs.
- For every consecutive pair
(x,y)update edge weight.<br/>edge.rank += 1;node.rank += 1;doc.rank += 1. - Ranks therefore equal frequency counts (integer, monotonic).
1.3 Query & Ranking
score(document) = Ξ£(node.rank + edge.rank + doc.rank) for all matches- Results are sorted by the summed integer score and returned.
- All operations are in-memory, lock-free (
ahash) and extremely fast for < ~1e5 nodes. - π TF-IDF Enhancement: For
TerraphimGraphrelevance function, documents are now re-scored using TF-IDF weighting (30% weight) combined with the original graph ranking, providing better semantic relevance while maintaining the fast graph-based foundation.
1.4 Strengths
β Zero-cost online updates (plain counters). β Deterministic and explainable (frequency == relevance). β Small memory footprint (no float vectors). β π Hybrid TF-IDF scoring improves semantic relevance while preserving graph structure benefits.
1.5 Limitations
β No semantic smoothing β synonyms outside thesaurus are unseen. β Ties & coarse granularity β many docs share identical sums (partially addressed by TF-IDF). β Ignores global topology (2-hop, motifs, community structure). β Ranking deteriorates for long-tail terms with low frequency (partially addressed by TF-IDF).
2. Cleora (State-of-the-art Rust Graph Embeddings)
Cleora https://github.com/Synerise/cleora is an industrial-scale algorithm that:
- Treats each relation as a separate sparse adjacency matrix.
- Initialises node vectors with random values on the unit sphere.
- Iteratively updates embeddings via Chebyshev polynomial mixed context (T iterations).
- Provides L2-normalised dense vectors (128-512 dims) that capture multi-hop semantics.
Key properties:
- Linear in number of edges, multi-threaded (Rayon) β can embed 1B edges & 100M nodes.<br/>
- Produces relation-specific or joint embeddings.
- Written in pure Rust β easy to vendor / build for WASM & native.
Performance (from paper / benchmarks):
- Hits@10 β 5-25 % over Node2Vec / DeepWalk on standard datasets.
- Training 10Γ faster than PyTorch-GPU baselines on CPU-only hardware.
3. Comparative Analysis
| Aspect | Terraphim Graph | Cleora |
| --- | --- | --- |
| Representation | Integer rank counters | Float32 dense vectors |
| Captures multi-hop semantics | β | β
(Chebyshev context) |
| Online updates | β
O(1) | β² requires re-embedding (mini-batch possible) |
| Memory | O(N+E) ints | O(NΒ·d) floats (dβ128) |
| Search | Exact match + sum | ANN (cosine / dot) |
| Explainability | High (counts) | Medium (vector) |
| WASM compatibility | β
| β
(no std::sync::atomic in wasm32) |
4. Proposed Improvement Roadmap
- Hybrid Scoring (Low effort)
- Keep current frequency rank as prior.
- Add optional Cleora cosine-similarity score
s_emb. - Final score:
s = Ξ± Β· z_norm(rank) + (1-Ξ±) Β· s_emb(Ξ±β0.3 tuned on dev set).
- Offline Embedding Build Step
- New bin
cargo run -p terraphim_rolegraph --bin build-embeddings. - Reads
thesaurus.jsonβ generatesembeddings.bin(NDArray). - Store in role config (
automata_pathsibling).
- New bin
- Runtime Integration
- Load embeddings into
HashMap<NodeId, Vec<f32>>. - On query:
- Fetch matched node IDs using AC (unchanged).
- Average their vectors β query embedding.
- HNSW/ANNOY search over pre-built vector index to retrieve top-k docs.
- Merge with prior ranks.
- Load embeddings into
- Incremental Updates (Optional)
- Schedule nightly re-embedding for changed subsets.
- Or investigate Cleora-streaming branch (supports warm-start).
- Benchmark Suite
- Extend
rolegraph_knowledge_graph_ranking_test.rswith MAP@10, nDCG using labelled queries. - Compare old vs hybrid vs pure Cleora.
- Extend
Quick Win
β
IMPLEMENTED: TF-IDF weighted edge rank has been integrated into RoleGraph::query_graph improving differentiation and semantic relevance. The hybrid approach combines traditional graph ranking (70% weight) with TF-IDF scoring (30% weight) for better results.
5. Conclusion
Terraphim Graph offers lightning-fast, explainable ranking, but lacks deeper semantic awareness. Cleora complements this by encoding global graph structure into vectors. A hybrid approach retains Terraphim's speed while boosting recall & semantic relevance.
β
IMPLEMENTATION STATUS: The first improvement from the roadmap has been successfully implemented. TF-IDF scoring is now integrated into the TerraphimGraph relevance function, providing:
- Maintained Performance: Original graph ranking logic preserved
- Enhanced Relevance: TF-IDF scoring (30% weight) improves semantic matching
- Backward Compatibility: No breaking changes to existing functionality
- Validated: All tests pass including the knowledge graph ranking tests
The implementation leverages the existing TFIDFScorer from crates/terraphim_service/src/score/bm25_additional.rs, demonstrating proper architecture and reuse of existing components.
"Counts get you so far; embeddings get you the rest."