Terraphim Graph Embeddings vs Cleora

1. Terraphim Graph (Current Implementation)

1.1 Data structures

RoleGraph (crate terraphim_rolegraph) maintains:
- nodes: HashMap<u64, Node> – concepts (unique KG terms)
- edges: HashMap<u64, Edge> – co-occurrences between concepts found while indexing documents
- documents: HashMap<String, IndexedDocument> – corpus metadata
- Aho-Corasick automaton for ultra-fast exact synonym matching (O(text) streaming)

1.2 Indexing / "Training"

Parse each document ➜ split sentences ➜ run AC ⇒ list of matched node-IDs.
For every consecutive pair (x,y) update edge weight.<br/> edge.rank += 1 ; node.rank += 1 ; doc.rank += 1.
Ranks therefore equal frequency counts (integer, monotonic).

1.3 Query & Ranking

score(document) = Σ(node.rank + edge.rank + doc.rank)  for all matches

Results are sorted by the summed integer score and returned.
All operations are in-memory, lock-free (ahash) and extremely fast for < ~1e5 nodes.
🆕 TF-IDF Enhancement: For TerraphimGraph relevance function, documents are now re-scored using TF-IDF weighting (30% weight) combined with the original graph ranking, providing better semantic relevance while maintaining the fast graph-based foundation.

1.4 Strengths

✔ Zero-cost online updates (plain counters). ✔ Deterministic and explainable (frequency == relevance). ✔ Small memory footprint (no float vectors). ✔ 🆕 Hybrid TF-IDF scoring improves semantic relevance while preserving graph structure benefits.

1.5 Limitations

✖ No semantic smoothing – synonyms outside thesaurus are unseen. ✖ Ties & coarse granularity – many docs share identical sums (partially addressed by TF-IDF). ✖ Ignores global topology (2-hop, motifs, community structure). ✖ Ranking deteriorates for long-tail terms with low frequency (partially addressed by TF-IDF).

2. Cleora (State-of-the-art Rust Graph Embeddings)

Cleora https://github.com/Synerise/cleora is an industrial-scale algorithm that:

Treats each relation as a separate sparse adjacency matrix.
Initialises node vectors with random values on the unit sphere.
Iteratively updates embeddings via Chebyshev polynomial mixed context (T iterations).
Provides L2-normalised dense vectors (128-512 dims) that capture multi-hop semantics.

Key properties:

Linear in number of edges, multi-threaded (Rayon) – can embed 1B edges & 100M nodes.<br/>
Produces relation-specific or joint embeddings.
Written in pure Rust ➜ easy to vendor / build for WASM & native.

Performance (from paper / benchmarks):

Hits@10 ↑ 5-25 % over Node2Vec / DeepWalk on standard datasets.
Training 10× faster than PyTorch-GPU baselines on CPU-only hardware.

3. Comparative Analysis

| Aspect | Terraphim Graph | Cleora | | --- | --- | --- | | Representation | Integer rank counters | Float32 dense vectors | | Captures multi-hop semantics | ❌ | ✅ (Chebyshev context) | | Online updates | ✅ O(1) | ▲ requires re-embedding (mini-batch possible) | | Memory | O(N+E) ints | O(N·d) floats (d≈128) | | Search | Exact match + sum | ANN (cosine / dot) | | Explainability | High (counts) | Medium (vector) | | WASM compatibility | ✅ | ✅ (no std::sync::atomic in wasm32) |

4. Proposed Improvement Roadmap

Hybrid Scoring (Low effort)
- Keep current frequency rank as prior.
- Add optional Cleora cosine-similarity score s_emb.
- Final score: s = α · z_norm(rank) + (1-α) · s_emb (α≈0.3 tuned on dev set).
Offline Embedding Build Step
- New bin cargo run -p terraphim_rolegraph --bin build-embeddings.
- Reads thesaurus.json ➜ generates embeddings.bin (NDArray).
- Store in role config (automata_path sibling).
Runtime Integration
- Load embeddings into HashMap<NodeId, Vec<f32>>.
- On query:
  1. Fetch matched node IDs using AC (unchanged).
  2. Average their vectors ➜ query embedding.
  3. HNSW/ANNOY search over pre-built vector index to retrieve top-k docs.
- Merge with prior ranks.
Incremental Updates (Optional)
- Schedule nightly re-embedding for changed subsets.
- Or investigate Cleora-streaming branch (supports warm-start).
Benchmark Suite
- Extend rolegraph_knowledge_graph_ranking_test.rs with MAP@10, nDCG using labelled queries.
- Compare old vs hybrid vs pure Cleora.

Quick Win

✅ IMPLEMENTED: TF-IDF weighted edge rank has been integrated into RoleGraph::query_graph improving differentiation and semantic relevance. The hybrid approach combines traditional graph ranking (70% weight) with TF-IDF scoring (30% weight) for better results.

5. Conclusion

Terraphim Graph offers lightning-fast, explainable ranking, but lacks deeper semantic awareness. Cleora complements this by encoding global graph structure into vectors. A hybrid approach retains Terraphim's speed while boosting recall & semantic relevance.

✅ IMPLEMENTATION STATUS: The first improvement from the roadmap has been successfully implemented. TF-IDF scoring is now integrated into the TerraphimGraph relevance function, providing:

Maintained Performance: Original graph ranking logic preserved
Enhanced Relevance: TF-IDF scoring (30% weight) improves semantic matching
Backward Compatibility: No breaking changes to existing functionality
Validated: All tests pass including the knowledge graph ranking tests

The implementation leverages the existing TFIDFScorer from crates/terraphim_service/src/score/bm25_additional.rs, demonstrating proper architecture and reuse of existing components.

"Counts get you so far; embeddings get you the rest."