Automata Paragraph Extraction
Goal: Given a thesaurus-backed automaton, extract the paragraphs of a text that start at matched automata terms.
Plan
- Build/obtain a Thesaurus for the automata
- Use find_matches(text, thesaurus, return_positions=true) to get term matches with positions
- For each match:
  - Determine paragraph start: either at the start of the term, or right after the term
  - Determine paragraph end: the earliest blank-line separator (\n\n, \r\n\r\n, or \r\r) after the term; fallback to end-of-text
  - Slice the substring and return alongside the Matched metadata
- Return Vec<(Matched, String)> for downstream consumers
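The boundary rules in the plan can be sketched with the standard library alone. This is a minimal illustration, not the code in matcher.rs: the helper slice_paragraph and its parameters are hypothetical names, and the sketch assumes match positions are byte offsets that fall on UTF-8 character boundaries.

```rust
/// Returns the byte index at which the paragraph ends: the start of the
/// earliest blank-line separator found at or after `from_index`, or
/// `text.len()` when no separator follows.
/// Assumption: `from_index` is a valid char boundary within `text`.
fn find_paragraph_end(text: &str, from_index: usize) -> usize {
    const SEPARATORS: [&str; 3] = ["\r\n\r\n", "\n\n", "\r\r"];
    let tail = &text[from_index..];
    SEPARATORS
        .iter()
        .filter_map(|sep| tail.find(*sep)) // offset of each separator, if present
        .min() // the earliest one wins
        .map(|offset| from_index + offset)
        .unwrap_or(text.len()) // fallback: end of text
}

/// Hypothetical helper: slices one paragraph for a term occupying
/// `term_start..term_end`. With `include_term` the slice starts at the
/// term, otherwise right after it.
fn slice_paragraph(text: &str, term_start: usize, term_end: usize, include_term: bool) -> &str {
    let start = if include_term { term_start } else { term_end };
    let end = find_paragraph_end(text, term_end);
    &text[start..end]
}
```

Searching for the end from the end of the term (rather than its start) keeps the slice bounds ordered in both modes, since the paragraph end can never precede the chosen start.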
Function
Implemented in crates/terraphim_automata/src/matcher.rs:
- extract_paragraphs_from_automata(text, thesaurus, include_term)
- find_paragraph_end(text, from_index)
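To show how the two functions could fit together, here is a rough end-to-end sketch using only the standard library. Everything in it is a stand-in: SketchMatch, find_matches_sketch, paragraph_end, and extract_paragraphs_sketch are hypothetical names, a plain list of terms replaces the Thesaurus, and str::match_indices replaces the automata matcher. It illustrates the shape of the result, a list of (match, paragraph) pairs, rather than the crate's actual implementation.

```rust
/// Hypothetical stand-in for the crate's match metadata.
#[derive(Debug, Clone, PartialEq)]
struct SketchMatch {
    term: String,
    pos: (usize, usize), // byte offsets: start inclusive, end exclusive
}

/// Stand-in matcher: plain substring search instead of an automaton.
fn find_matches_sketch(text: &str, terms: &[&str]) -> Vec<SketchMatch> {
    let mut matches = Vec::new();
    // Empty terms are skipped because they would match at every position.
    for term in terms.iter().filter(|t| !t.is_empty()) {
        for (start, found) in text.match_indices(*term) {
            matches.push(SketchMatch {
                term: found.to_string(),
                pos: (start, start + found.len()),
            });
        }
    }
    matches.sort_by_key(|m| m.pos); // report matches in text order
    matches
}

/// Earliest blank-line separator at or after `from`, else end of text.
fn paragraph_end(text: &str, from: usize) -> usize {
    ["\r\n\r\n", "\n\n", "\r\r"]
        .iter()
        .filter_map(|sep| text[from..].find(*sep))
        .min()
        .map_or(text.len(), |offset| from + offset)
}

/// For each match, slice from the term (or just after it) to the paragraph end.
fn extract_paragraphs_sketch(
    text: &str,
    terms: &[&str],
    include_term: bool,
) -> Vec<(SketchMatch, String)> {
    find_matches_sketch(text, terms)
        .into_iter()
        .map(|m| {
            let (term_start, term_end) = m.pos;
            let start = if include_term { term_start } else { term_end };
            let end = paragraph_end(text, term_end);
            let paragraph = text[start..end].to_string();
            (m, paragraph)
        })
        .collect()
}
```

Returning owned strings next to the match metadata means consumers do not have to keep the input text alive, at the cost of one allocation per match.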
Example
use terraphim_automata::matcher::extract_paragraphs_from_automata;
use ;
let mut thesaurus = new;
let norm = new;
thesaurus.insert;
let text = "Intro\n\nlorem ipsum dolor sit amet,\nconsectetur adipiscing elit.\n\nNext paragraph starts here.";
let results = extract_paragraphs_from_automata(text, thesaurus, /* include_term */ true)?;
assert!;
Tests
Run crate tests (tokio is only needed for remote loading; these tests do not require async):
Benchmarks
A paragraph extraction benchmark has been added to the existing Criterion suite:
Look for "extract_paragraphs_from_automata_small_text" in the output.