Terraphim Graph Embeddings vs Cleora

1. Terraphim Graph (Current Implementation)

1.1 Data structures

  • RoleGraph (crate terraphim_rolegraph) maintains:
    • nodes: HashMap<u64, Node> – concepts (unique KG terms)
    • edges: HashMap<u64, Edge> – co-occurrences between concepts found while indexing documents
    • documents: HashMap<String, IndexedDocument> – corpus metadata
    • Aho-Corasick automaton for ultra-fast exact synonym matching (O(text) streaming)

1.2 Indexing / "Training"

  1. Parse each document ➜ split sentences ➜ run AC β‡’ list of matched node-IDs.
  2. For every consecutive pair (x,y) update edge weight.<br/> edge.rank += 1 ; node.rank += 1 ; doc.rank += 1.
  3. Ranks therefore equal frequency counts (integer, monotonic).

1.3 Query & Ranking

score(document) = Ξ£(node.rank + edge.rank + doc.rank)  for all matches
  • Results are sorted by the summed integer score and returned.
  • All operations are in-memory, lock-free (ahash) and extremely fast for < ~1e5 nodes.
  • πŸ†• TF-IDF Enhancement: For TerraphimGraph relevance function, documents are now re-scored using TF-IDF weighting (30% weight) combined with the original graph ranking, providing better semantic relevance while maintaining the fast graph-based foundation.

1.4 Strengths

βœ” Zero-cost online updates (plain counters). βœ” Deterministic and explainable (frequency == relevance). βœ” Small memory footprint (no float vectors). βœ” πŸ†• Hybrid TF-IDF scoring improves semantic relevance while preserving graph structure benefits.

1.5 Limitations

βœ– No semantic smoothing – synonyms outside thesaurus are unseen. βœ– Ties & coarse granularity – many docs share identical sums (partially addressed by TF-IDF). βœ– Ignores global topology (2-hop, motifs, community structure). βœ– Ranking deteriorates for long-tail terms with low frequency (partially addressed by TF-IDF).


2. Cleora (State-of-the-art Rust Graph Embeddings)

Cleora https://github.com/Synerise/cleora is an industrial-scale algorithm that:

  1. Treats each relation as a separate sparse adjacency matrix.
  2. Initialises node vectors with random values on the unit sphere.
  3. Iteratively updates embeddings via Chebyshev polynomial mixed context (T iterations).
  4. Provides L2-normalised dense vectors (128-512 dims) that capture multi-hop semantics.

Key properties:

  • Linear in number of edges, multi-threaded (Rayon) – can embed 1B edges & 100M nodes.<br/>
  • Produces relation-specific or joint embeddings.
  • Written in pure Rust ➜ easy to vendor / build for WASM & native.

Performance (from paper / benchmarks):

  • Hits@10 ↑ 5-25 % over Node2Vec / DeepWalk on standard datasets.
  • Training 10Γ— faster than PyTorch-GPU baselines on CPU-only hardware.

3. Comparative Analysis

| Aspect | Terraphim Graph | Cleora | | --- | --- | --- | | Representation | Integer rank counters | Float32 dense vectors | | Captures multi-hop semantics | ❌ | βœ… (Chebyshev context) | | Online updates | βœ… O(1) | β–² requires re-embedding (mini-batch possible) | | Memory | O(N+E) ints | O(NΒ·d) floats (dβ‰ˆ128) | | Search | Exact match + sum | ANN (cosine / dot) | | Explainability | High (counts) | Medium (vector) | | WASM compatibility | βœ… | βœ… (no std::sync::atomic in wasm32) |


4. Proposed Improvement Roadmap

  1. Hybrid Scoring (Low effort)
    • Keep current frequency rank as prior.
    • Add optional Cleora cosine-similarity score s_emb.
    • Final score: s = Ξ± Β· z_norm(rank) + (1-Ξ±) Β· s_emb (Ξ±β‰ˆ0.3 tuned on dev set).
  2. Offline Embedding Build Step
    • New bin cargo run -p terraphim_rolegraph --bin build-embeddings.
    • Reads thesaurus.json ➜ generates embeddings.bin (NDArray).
    • Store in role config (automata_path sibling).
  3. Runtime Integration
    • Load embeddings into HashMap<NodeId, Vec<f32>>.
    • On query:
      1. Fetch matched node IDs using AC (unchanged).
      2. Average their vectors ➜ query embedding.
      3. HNSW/ANNOY search over pre-built vector index to retrieve top-k docs.
    • Merge with prior ranks.
  4. Incremental Updates (Optional)
    • Schedule nightly re-embedding for changed subsets.
    • Or investigate Cleora-streaming branch (supports warm-start).
  5. Benchmark Suite
    • Extend rolegraph_knowledge_graph_ranking_test.rs with MAP@10, nDCG using labelled queries.
    • Compare old vs hybrid vs pure Cleora.

Quick Win

βœ… IMPLEMENTED: TF-IDF weighted edge rank has been integrated into RoleGraph::query_graph improving differentiation and semantic relevance. The hybrid approach combines traditional graph ranking (70% weight) with TF-IDF scoring (30% weight) for better results.


5. Conclusion

Terraphim Graph offers lightning-fast, explainable ranking, but lacks deeper semantic awareness. Cleora complements this by encoding global graph structure into vectors. A hybrid approach retains Terraphim's speed while boosting recall & semantic relevance.

βœ… IMPLEMENTATION STATUS: The first improvement from the roadmap has been successfully implemented. TF-IDF scoring is now integrated into the TerraphimGraph relevance function, providing:

  • Maintained Performance: Original graph ranking logic preserved
  • Enhanced Relevance: TF-IDF scoring (30% weight) improves semantic matching
  • Backward Compatibility: No breaking changes to existing functionality
  • Validated: All tests pass including the knowledge graph ranking tests

The implementation leverages the existing TFIDFScorer from crates/terraphim_service/src/score/bm25_additional.rs, demonstrating proper architecture and reuse of existing components.

"Counts get you so far; embeddings get you the rest."