A multi-paradigm, formally-verified language-modeling toolkit in Rust β a state-of-the-art Modified Kneser-Ney n-gram core that scales losslessly to the entire Google Books n-gram corpus, layered up through subword/neural embeddings, retrieval, topic discovery, and structured-domain (code & LaTeX) correction.
- A rigorous statistical core. Orders 1β5 with Modified Kneser-Ney (MKN) smoothing β the most accurate count-based smoothing known β over compact varint-indexed trie storage. β jump
- It eats the Google Books n-gram corpus. A sharded, lock-free importer ingests billions of n-grams under a hard heap bound (a naΓ―ve build exhausts memory at 33.79 GB; the bounded build stays < 16 GB, losslessly). β jump
- It is formally verified. The importer's concurrency, crash-recovery, and durability invariants are machine-checked with TLA+ / TLAPS / Apalache and Rocq β not just tested. β jump
- One stack, statistics β neural β retrieval β code. Hybrid n-gramβembedding scoring, ModernBERT rescoring, HNSW retrieval, BERTopic-style topic modeling, and multi-language code correction over Code Property Graphs β composable behind Cargo feature flags. β jump
- Built for concurrency. Lock-free overlays, atomics, persistent (copy-on-write) tries,
Send + Syncthroughout, and amimallocglobal allocator that erases syscall overhead.
libgrammstein is organized as layers: a persistent-storage foundation, a statistical +
embedding core, a hybrid scorer, and a fan of higher-level capabilities (neural, retrieval,
topic, code, LaTeX) that a developer enables Γ la carte via Cargo features. Everything is
Send + Sync; the heavy data structures live in memory-mapped, crash-safe tries.
Four components get a full pedagogical treatment below (math, pseudocode, diagram,
citations); the rest are summarized in the Capability Matrix and covered
in depth under docs/. The color legend above is used consistently in every
diagram.
Every symbol below is reused across sections; component-local terms are defined where they first appear. (Acronyms are expanded on first prose use even if listed here.)
| Symbol | Meaning | Symbol | Meaning |
|---|---|---|---|
w |
word (token) being predicted | [x]βΊ |
max(x, 0) |
h |
history / context (preceding words) | Nββ(Β·) |
continuation count β number of distinct contexts |
hβ² |
backed-off history (h minus its oldest word) |
Ξ±, Ξ» |
interpolation weight(s), β [0,1] |
c(Β·) |
raw training count of an n-gram | Ο |
temperature (softmax sharpness) |
β£Vβ£ |
vocabulary size | Ξ΅ |
approximate-search recall tolerance |
D, Dβ/Dβ/Dββ |
absolute discount(s) | d |
embedding dimensionality |
Ξ³(h) |
backoff weight for history h |
β, Ξ£ |
probability, summation |
| Acronym | Expansion | Acronym | Expansion |
|---|---|---|---|
| MKN | Modified Kneser-Ney smoothing | CPG | Code Property Graph |
| MLE | Maximum-Likelihood Estimate | AST/CFG/DFG | Abstract-Syntax-Tree / Control-Flow-Graph / Data-Flow-Graph |
| OOV | Out-Of-Vocabulary | PCFG | Probabilistic Context-Free Grammar |
| PP | Perplexity | GNN | Graph Neural Network |
| WFST | Weighted Finite-State Transducer | MLM | Masked Language Model |
| RAG | Retrieval-Augmented Generation | MMR | Maximal Marginal Relevance |
| ANN | Approximate Nearest Neighbor | BPE | Byte-Pair Encoding |
| HNSW | Hierarchical Navigable Small-World graph | MFCC | Mel-Frequency Cepstral Coefficients |
| HAC | Hierarchical Agglomerative Clustering | ARTrie | Adaptive Radix Trie (persistent) |
| c-TF-IDF | class-based TF-IDF | WAL | Write-Ahead Log |
Maturity legend (used throughout): π¦ Flagship (battle-tested) Β· π© Stable Β· π¨ Experimental / brings-your-own-weights Β· β¬ Deprecated.
libgrammstein is a workspace crate that depends on sibling repositories via local paths (it is not yet published to crates.io). Clone it alongside its siblings and depend on it by path or git:
[dependencies]
# Required siblings: ../liblevenshtein-rust, ../libdictenstein, ../PathMap, ../lling-llang
libgrammstein = { path = "../libgrammstein", features = ["cli", "rag", "code-mainstream"] }The default build is the always-on statistical + embedding + hybrid core. Everything else is feature-gated, so you compile only what you use:
| Feature | Unlocks |
|---|---|
| (default) | n-gram (MKN), subword embeddings, hybrid model, perplexity, corpus streaming |
google-books |
the sharded, checkpointed Google Books importer (+ mimalloc) |
neural-rescore |
ModernBERT embeddings, MLM rescoring, MMR summarization (Candle) |
rag / rag-hnsw |
retrieval index β exact cosine / HNSW; topic modeling |
code Β· code-{python,rust,javascript,rholang,metta} Β· code-neural |
code correction |
latex / latex-neural / latex-rag |
mode-aware LaTeX modeling |
acoustic / candle-model / gpu |
MFCC features / neural acoustic models / GPU kernels |
lling-llang-integration |
LanguageModel trait + WFST export for lattice rescoring |
cli |
the grammstein binary + terminal UI |
The full surface is in the Capability Matrix.
1 Β· Train an n-gram model and measure perplexity. Perplexity PP(W) = exp(β(1/N)Β·Ξ£α΅’ log β(wα΅’ β£ hα΅’)) is the standard fit metric β lower is better.
use libgrammstein::ngram::{NgramEntry, TrainerBuilder};
use libgrammstein::scoring::Perplexity;
use libgrammstein::corpus::PlaintextReader;
use liblevenshtein::dictionary::dynamic_dawg_char::DynamicDawgChar;
let train = PlaintextReader::from_file("corpus.txt")?;
// A 5-gram model with Modified Kneser-Ney smoothing, over a serializable trie backend.
let model = TrainerBuilder::new(DynamicDawgChar::<NgramEntry>::new())
.order(5)
.train(train)?;
let log_p = model.log_prob("fox", &["quick", "brown"]); // log β(fox β£ quick brown)
let dev = PlaintextReader::from_file("dev.txt")?;
let ppl = Perplexity::new(&model).corpus_perplexity(&dev)?;
println!("log β = {log_p:.3} Β· perplexity = {:.1}", ppl.perplexity);2 Β· Combine n-gram precision with embedding coverage (the hybrid model).
use libgrammstein::hybrid::{HybridConfig, HybridLanguageModel};
// `ngram` + `embedding` trained as above; see "Hybrid Interpolation" for the embedding builder.
let hybrid = HybridLanguageModel::new(ngram, embedding, HybridConfig::default());
let s = hybrid.score("brown", &["the", "quick"]); // robust even if "brown" is rare3 Β· Ingest the Google Books n-gram corpus under a heap bound (one command).
grammstein train import-google-books english.artrie \
--language en --min-order 1 --max-order 5 \
--parallel 8 --overlay-budget-gib 12 --cache-filesFour components earn a full treatment. The rest are mapped in the Capability Matrix.
What & why. An n-gram model estimates β(w β£ h) β the probability of the next word w
given a history h. The naΓ―ve Maximum-Likelihood Estimate (MLE)
β_MLE(w β£ h) = c(hΒ·w) / c(h) (M1)
assigns zero probability to any n-gram never seen in training, which makes log β = ββ for
a whole sentence the moment one unseen n-gram appears. Smoothing fixes this by stealing a
little probability mass from seen events and redistributing it to unseen ones. Modified
Kneser-Ney [1, 2] is the most accurate count-based smoother known, and is libgrammstein's
always-on core.
How. MKN subtracts a count-dependent absolute discount and backs off to a
shorter context, recursively. The discounts are estimated from the corpus's count-of-counts
nα΅’ = #{n-grams occurring exactly i times}:
Y = nβ / (nβ + 2Β·nβ)
Dβ = 1 β 2YΒ·(nβ/nβ) , Dβ = 2 β 3YΒ·(nβ/nβ) , Dββ = 3 β 4YΒ·(nβ/nβ) (M2)
(typically Dβ β 0.6, Dβ β 0.8, Dββ β 0.9). The highest-order estimate discounts then backs
off with weight Ξ³(h):
β_MKN(w β£ h) = [c(hΒ·w) β D(c(hΒ·w))]βΊ / c(h) + Ξ³(h)Β·β_MKN(w β£ hβ²) (M3)
where D(c) = Dβ if c=1, Dβ if c=2, Dββ if cβ₯3 ; hβ² = h without its oldest word
Ξ³(h) = (DβΒ·Nβ(h) + DβΒ·Nβ(h) + DββΒ·Nββ(h)) / c(h) (M4)
Lower orders use continuation counts Nββ β how many distinct contexts a word completes
β instead of raw counts, bottoming out at a uniform base:
Nββ(β’Β·hΒ·w) = β£{ v : c(vΒ·hΒ·w) > 0 }β£ (M5)
β_MKN(w) = [Nββ(β’Β·w) β D]βΊ / Nββ(β’Β·β’) + Ξ³_unifΒ·(1/β£Vβ£)
This is the "San Francisco" intuition: Francisco is frequent but follows only San, so its continuation count is ~1; city follows many words, so it earns a higher fallback probability. Continuation counts measure versatility, not raw frequency.
The algorithm, literately β scoring is a single recursion from the longest matching context down to the uniform base:
function MKN_log_prob(w, h): βΈ public entry point; returns log β
return ln( prob(w, h, highest_order = true) )
function prob(w, h, highest_order):
if h not found in trie: βΈ no evidence at this order β¦
return prob(w, hβ², highest_order = false) βΈ β¦ so back off (drop oldest word of h)
c β count(hΒ·w)
D β discount(c) βΈ Dβ / Dβ / Dββ per (M2)
top β [c β D]βΊ / count(h) βΈ discounted higher-order mass (M3)
Ξ³ β backoff_weight(h) βΈ exactly the mass removed by discounting (M4)
return top + Ξ³ Β· prob(w, hβ², highest_order = false)
βΈ invariant: Ξ³(h) equals the discounted mass, so Ξ£_w β_MKN(w β£ h) = 1 at every order.
Engineering. Keys are vocabulary-indexed varints: each word maps to a u64, LEB128-
encoded into compact Latin-1 trie keys (so a 13 M-word vocabulary costs 1β3 bytes/word, not a
string). Training streams the corpus with Rayon data-parallelism and atomic continuation-count
accumulation. β deep dive: docs/components/ngram/modified-kneser-ney.md.
What & why. N-gram models are razor-sharp on seen local context but blind to unseen words; embeddings are the opposite β they generalize by meaning but lack precise local order. Their failure modes are complementary, so libgrammstein interpolates them. The embedding side is FastText-style subword [3, 4]: a word's vector is the sum of its character-n-gram vectors, so even an OOV word like splendiferous has a meaningful representation.
How. score(w β£ h) blends an n-gram probability ββ = β_MKN(w β£ h) with an embedding
probability ββ β cos(v_w, v_h)/Ο (cosine similarity of the word vector to the context vector,
temperature-scaled). Four strategies are available:
Linear: β = Ξ±Β·ββ + (1βΞ±)Β·ββ
Log-Linear: β = exp( Ξ±Β·log ββ + (1βΞ±)Β·log ββ ) βΈ for differing score scales
Fallback: β = ββ if w β V_ngram else ββ βΈ n-gram when known, embed for OOV
Dynamic: β = Ξ±(h)Β·ββ + (1βΞ±(h))Β·ββ , Ξ±(h) = min(Ξ±β + ΞΊΒ·β£hβ£, Ξ±_max) (M6)
The scorer dispatches on strategy and memoizes in a lock-free cache:
function hybrid_score(w, h):
if (h, w) in cache: return cache[(h, w)] βΈ DashMap probe β no lock on the hot path
s β match strategy: βΈ branch per (M6)
Linear{Ξ±} β Ξ±Β·Pβ(w,h) + (1βΞ±)Β·Pβ(w,h)
LogLinear{Ξ±} β exp(Ξ±Β·ln Pβ + (1βΞ±)Β·ln Pβ)
Fallback β Pβ if known(w) else Pβ βΈ exact local stats when available β¦
Dynamic{β¦} β linear with Ξ± growing in β£hβ£ βΈ β¦ trust n-gram more as context grows
cache.insert((h, w), s); return s βΈ LRU-bounded; lock taken only on eviction
Log-linear is the right default when ββ and ββ live on different scales; Dynamic leans on
the n-gram as the context lengthens (where local statistics are most reliable). β deep dive:
docs/components/hybrid/interpolation.md.
The war story. The Google Books n-gram corpus is enormous β a single 2-gram prefix file
holds 50β100 M entries. A naΓ―ve importer exhausts memory at ~33.79 GB peak heap and burns
~49 % of CPU in __mprotect syscalls. libgrammstein's importer ingests the whole corpus while holding
peak heap under 16 GB, losslessly. Here is how. (feature: google-books)
The mechanism is a small stack of cooperating ideas:
- Sharding β n-grams route to 26 (1-gram) or 676 (2β5-gram) independent persistent ARTrie shards by first-token prefix, so writers rarely contend.
- Per-shard lock-free overlay β a
DashMapwrite-buffer fronts each shard's memory-mapped trie; ingestion is a lock-free compare-and-swap (count = max, idempotent). - Chunked transactions with SET semantics β a prefix commits in fixed-size chunks
(
--tx-chunk-size, default 500 K); re-importing a prefix after a crash overwrites with identical values, so recovery is idempotent. - Crash-safe periodic checkpoints (a cron loop) snapshot progress and flush overlays.
- Overlay-heap eviction β the load-bearing fix. After each checkpoint, the tail evicts a shard's coldest resident overlay nodes down to its slice of the budget. Eviction is lossless: an evicted node faults back from the durable on-disk image on the next read.
- Supporting actors β a single-merge vocabulary WAL (avoids a doubled reverse-index
rebuild), CX path-compressed checkpoints, and the
mimallocallocator (which alone drops__mprotectfrom ~49 % to background noise).
The eviction tail, literately:
function checkpoint_tail(shard, G, resident_shards):
publish_durable(shard.overlay) βΈ write the overlay snapshot to the durable image
budget β max(G / resident_shards, 64 MiB) βΈ this shard's slice of --overlay-budget-gib (M10)
while resident_bytes(shard) > budget:
node β coldest_resident(shard.overlay)
assert node β durable_image(shard) βΈ SAFETY: never evict an un-persisted node β¦
drop_resident(node) βΈ β¦ so the drop is lossless (faults back on read)
resident_bytes(shard) β= sizeof(node)
βΈ invariant: evicted β durable. This is exactly what the TLA+ specs
βΈ CheckpointStateMachine / PersistentStorageBridge machine-check (see Formal Verification).
These are not just "well-tested" invariants β they are machine-checked (Formal
Verification). β deep dive:
docs/architecture/memory-optimization.md Β·
docs/cli/import-google-books.md.
What & why. Semantic retrieval (the R in RAG, Retrieval-Augmented Generation) reduces to nearest-neighbor search in embedding space: rank documents by cosine similarity to a query vector. For unit-normalized vectors this is just a dot product:
cos(a, b) = (aΒ·b) / (βaβΒ·βbβ) = Γ’ Β· bΜ (M7)
libgrammstein offers two backends with a clean accuracy/latency trade-off (features: rag,
rag-hnsw):
| Backend | Query cost | Build cost | Recall | Use when |
|---|---|---|---|---|
| Exact cosine (ndarray/BLAS) | O(nΒ·d) |
O(n) |
100 % | n β² 10βΆ documents |
| HNSW [7] | O(log n) |
O(nΒ·log n) |
tunable via Ξ΅ |
n β³ 10βΆ documents |
How HNSW works. A Hierarchical Navigable Small-World graph stacks proximity graphs in layers β sparse at the top, dense at the bottom (a skip-list for vectors). Search starts coarse and refines downward:
function hnsw_search(q, k, ef): βΈ ef = breadth knob (recall vs speed)
v β entry_point(top_layer)
for layer from top down to 1: βΈ coarse-to-fine descent
v β greedy_nearest(q, from = v, in = layer)βΈ hop to the closest neighbor until no improvement
C β beam_search(q, from = v, in = layer_0, breadth = ef) βΈ widen at the base layer
return top_k(C, k)
βΈ log-time because each upper layer halves the remaining search horizon.
ef_search, M (neighbors/node), and ef_construction trade recall for speed/memory; the
backend builds lazily on first query and is thread-safe. β deep dive:
docs/components/rag/backend.md.
What & why. Correcting source code needs syntax, control flow, and data flow at
once. A Code Property Graph (CPG) [9] fuses three classic representations into one joint
structure β the AST (Abstract Syntax Tree: nesting), the CFG (Control-Flow Graph:
branch/loop edges), and the DFG (Data-Flow Graph: definitionβuse edges) β so a single
traversal sees structure and behavior together. (features: code, code-{python,rust,javascript,rholang,metta})
How. A tree-sitter incremental parse builds the CPG; an ensemble of correctors then
proposes ranked fixes:
function correct(source):
cpg β build_cpg(tree_sitter_parse(source)) βΈ AST βͺ CFG βͺ DFG in one graph
candidates β lexical(cpg) βΈ fuzzy match vs dictionary (liblevenshtein)
βͺ grammar(cpg) βΈ PCFG + Earley: is this a likely production?
βͺ semantic(cpg) βΈ π¨ GNN over the CPG: variable misuse, dead code
return rank_by_confidence(candidates)
The grammar corrector scores edits under a Probabilistic Context-Free Grammar (PCFG) with
an Earley parser; the semantic corrector runs a Graph Neural Network (GNN) [10] over the
CPG. The GNN path is π¨ experimental β a natural contribution surface. The same machinery
powers grammar-constrained decoding and WFST export, plus paradigm detection (PrefixSpan [13]),
subtree mining (TreeMiner [14]), and pretrained code embeddings (CodeT5+ / UniXcoder /
GraphCodeBERT, via ONNX). β deep dive: docs/components/code/cpg.md.
What & why. Given a corpus of document embeddings, recover human-readable topics
without a fixed topic count. libgrammstein follows the BERTopic recipe [8]: embed β
cluster β describe. (feature: rag)
How.
- Cluster embeddings with Hierarchical Agglomerative Clustering (HAC) β iteratively merge the closest clusters under a Lance-Williams linkage (single / complete / average / Ward), yielding a dendrogram you can cut at any granularity.
- Describe each cluster
cwith class-based TF-IDF (c-TF-IDF) β TF-IDF computed per-cluster rather than per-document, so the top-scoring terms are the cluster's keywords:
c-TF-IDF(t, c) = tf(t, c) Β· log( 1 + A / f(t) ) (M8)
tf(t, c) = frequency of term t in cluster c A = average words per cluster
f(t) = number of clusters containing term t
A term scores high when it is frequent in its cluster yet rare across clusters β exactly
the words that name a topic. β deep dive:
docs/components/topic/ctfidf.md.
Beyond the flagships, libgrammstein spans the full NLP/ML pipeline. Each item below is a one-line intuition + a deep-doc link; the Capability Matrix is the at-a-glance view.
FastText subword embeddings [4] (skip-gram [3] + negative sampling) with a fluent builder; BPE subword tokenization [5]; phonetic embeddings (Zompist-style sound-change rules for sound-alike similarity); and acoustic word embeddings from MFCC/mel-filterbank features [6] (optionally a Candle neural encoder).
use libgrammstein::embedding::EmbeddingTrainerBuilder;
let embedding = EmbeddingTrainerBuilder::new()
.dim(100).window_size(5).min_count(5).epochs(10)
.train(reader)?; // or .train_streaming(|| reopen_reader()) for >500 MB corpora
let neighbors = embedding.most_similar("running", 10);β docs/components/embedding/overview.md Β·
docs/components/acoustic/features.md
A ModernBERT [11] wrapper (Candle + Hugging Face Hub) that rescores n-best beam paths by
MLM pseudo-perplexity (mask each token, score the prediction), produces document
embeddings, and runs extractive summarization via Maximal Marginal Relevance (MMR) [12] β a
relevance/novelty trade-off that suppresses redundancy. Bring your own weights (pulled from
HF Hub at runtime). (feature: neural-rescore) β
docs/components/neural/overview.md
A mode-aware tokenizer distinguishes math mode from text mode, feeding mode-weighted n-gram
models, command embeddings, optional ModernBERT rescoring, and equation retrieval. (features:
latex, latex-neural, latex-rag) β docs/components/
libgrammstein implements lling-llang's
LanguageModel trait so any model here can rescore a WFST lattice [15] (lazy or eager
transducer export). Its storage rests on two sibling crates:
liblevenshtein (trie dictionaries + fuzzy
matching) and libdictenstein (the persistent ARTrie: lock-free overlay, CX path-compressed
checkpoints, and the overlay-heap eviction the importer relies on).
use libgrammstein::integration::GrammsteinLanguageModel;
use lling_llang::layers::{LanguageModelLayer, SpellingCorrectionLayer, LayerPipelineBuilder};
let lm = GrammsteinLanguageModel::from_hybrid(hybrid); // wrap any model
let pipeline = LayerPipelineBuilder::new()
.add_layer(SpellingCorrectionLayer::new(dictionary, 2)) // fuzzy candidates (liblevenshtein)
.add_layer(LanguageModelLayer::new(Box::new(lm), 1.0)) // rescore with libgrammstein
.build();
// "teh quikc brwon fox" β "the quick brown fox"β docs/integration/lling-llang/overview.md
The importer's correctness-sensitive machinery β concurrency, crash recovery, checkpoint durability, and query semantics β is machine-checked, not merely tested. This is a core differentiator: the lossless-eviction invariant from the importer is a proved property.
-
TLA+ (Temporal Logic of Actions) β 7 live specifications model the protocols and are checked by TLC (bounded model checking) and typechecked by Apalache:
Concern Specification(s) Verifies (examples) Concurrency AsyncShardSyncat most one syncer; clean β zero dirty Lifecycle ImporterLifecycle,WorkerShutdown,CronStateMachinephase ordering; no job lost; termination Durability CheckpointStateMachine,PersistentStorageBridge,QuerySemanticsBridgeno-loss publish; lossless eviction; no metadata leak -
TLAPS β machine-checked inductive proofs of the specs (e.g.
AsyncShardSync: 249 proof obligations discharged). -
Rocq β
MknStatistics/MknFloatBoundsbound the MKN discounts and theirbinary64evaluation;FrequencyCountsMergeproves count-merge associativity/commutativity. -
loom β exhaustively explores concurrent memory-ordering interleavings in the Rust code.
-
Dependency bridges connect these specs to libdictenstein/liblevenshtein durability contracts, so the high-level proofs rest on verified storage.
Reproduce the whole gate:
make -C formal complete-with-dependencies # TLC + TLAPS + Apalache + Rocq + loom + alignmentβ formal/README.md
The grammstein binary (feature cli) drives training, evaluation, model inspection, corpus
processing, and the Google Books import β the last with a π¨ live ratatui terminal UI
showing per-worker progress, checkpoint state, and vocabulary growth.
grammstein train ngram corpus.txt --order 5 -o model.bin # train
grammstein eval perplexity model.bin dev.txt # evaluate
grammstein train import-google-books en.artrie --language en --parallel 8 # import (+ TUI)Every subsystem, its Cargo feature(s), core algorithm, maturity, and deep doc.
| Subsystem | What it does | Cargo feature(s) | Key algorithm / tech | Maturity | Deep doc |
|---|---|---|---|---|---|
| N-gram LM | next-word probability | default | Modified Kneser-Ney [1,2] | π¦ | ngram |
| Hybrid model | n-gram β embeddings | default | 4 interpolation strategies | π© | hybrid |
| Embeddings | dense word vectors | default | FastText subword [3,4] | π© | embedding |
| BPE | subword tokenizer | subword |
Byte-Pair Encoding [5] | π© | bpe |
| Acoustic | audio features / encoders | acoustic, candle-model |
MFCC / mel-filterbank [6] | π© | acoustic |
| Neural rescore | MLM rescoring, embeddings | neural-rescore |
ModernBERT [11] | π¨ | neural |
| Summarization | extractive summaries | neural-rescore |
MMR [12] | π¨ | summarizer |
| RAG | semantic retrieval | rag, rag-hnsw |
cosine / HNSW [7] | π© | rag |
| Topic modeling | unsupervised topics | rag |
HAC + c-TF-IDF [8] | π© | topic |
| Code correction | multi-language fixes | code, code-* |
CPG [9] + PCFG + GNN [10] | π© / π¨ | code |
| Code embeddings | neural code vectors | code-neural |
CodeT5+ / UniXcoder / GraphCodeBERT (ONNX) | π¨ | code-emb |
| Paradigm detection | OOP/FP/β¦ inference | code |
PrefixSpan [13] | π¨ | paradigm |
| Subtree mining | frequent code patterns | code |
TreeMiner [14] | π¨ | subtree |
| LaTeX | math-aware modeling | latex, latex-* |
mode-aware n-gram + neural | π¨ | latex |
| Importer | Google Books ingestion | google-books |
sharded ARTrie + eviction | π¦ | memory |
| Persistent storage | crash-safe tries | (via libdictenstein) | overlay + CX + eviction | π¦ | data-flow |
| WFST integration | lattice rescoring | lling-llang-integration |
WFST export [15] | π© | lling-llang |
| CLI / TUI | command-line tooling | cli |
clap + ratatui | π© / π¨ | cli |
| Formal verification | machine-checked proofs | (in formal/) |
TLA+ Β· TLAPS Β· Apalache Β· Rocq | π¦ | formal |
Indicative single-thread microbenchmarks (hardware-specific; reproduce with
benches/ β ngram_query, mkn_frequency_counts, overlay_eviction,
checkpoint_ops, varint_encoding, hash_comparison):
| Operation | Time |
|---|---|
| N-gram probability query | ~100 ns |
| Embedding lookup (cached) | ~10 ns |
| Hybrid score | ~1 Β΅s |
| Sentence score (10 tokens) | ~10 Β΅s |
The Google Books importer's headline result is bounded peak heap (33.79 GB β < 16 GB) and
near-zero __mprotect overhead β see the importer.
src/
βββ ngram/ # Modified Kneser-Ney n-gram model, varint vocabulary, trie storage
βββ embedding/ # FastText subword Β· BPE Β· phonetic Β· acoustic Β· GPU
βββ hybrid/ # n-gram β embedding interpolation + OOV handling
βββ scoring/ # perplexity, sentence scoring Β· generation/ # sampling
βββ corpus/ # streaming readers (Wikipedia, Gutenberg, plaintext), dedup, quality
βββ dictionary/ # word extraction, spelling dictionaries
βββ sources/ # Google Books importer: sharding Β· storage Β· checkpoint Β· eviction [google-books]
βββ neural/ # ModernBERT embedder Β· rescorer Β· summarizer [neural-rescore]
βββ rag/ # retrieval index: exact cosine Β· HNSW [rag]
βββ topic/ # HAC clustering Β· c-TF-IDF Β· dendrogram [rag]
βββ code/ # tree-sitter Β· CPG Β· PCFG Β· GNN Β· constrained decoding [code]
βββ latex/ # mode-aware tokenizer Β· n-gram Β· embeddings Β· equation RAG [latex]
βββ language/ # whatlang detection + language-aware tokenization
βββ integration/ # lling-llang LanguageModel trait + WFST export [lling-llang-integration]
βββ cli/ # the `grammstein` binary + ratatui TUI [cli]
formal/ # TLA+ / TLAPS / Apalache specs + Rocq proofs + loom tests
docs/ # deep component documentation (see docs/README.md)
DOIs are linked and verified; arXiv-only works link their abstract page.
- R. Kneser & H. Ney (1995). Improved backing-off for M-gram language modeling. ICASSP '95, 181β184. doi:10.1109/ICASSP.1995.479394
- S. F. Chen & J. Goodman (1999). An empirical study of smoothing techniques for language modeling. Computer Speech & Language 13(4), 359β393. doi:10.1006/csla.1999.0128
- T. Mikolov, I. Sutskever, K. Chen, G. Corrado & J. Dean (2013). Distributed Representations of Words and Phrases and their Compositionality. NeurIPS. arXiv:1310.4546
- P. Bojanowski, E. Grave, A. Joulin & T. Mikolov (2017). Enriching Word Vectors with Subword Information (FastText). TACL 5, 135β146. doi:10.1162/tacl_a_00051
- R. Sennrich, B. Haddow & A. Birch (2016). Neural Machine Translation of Rare Words with Subword Units (BPE). ACL, 1715β1725. doi:10.18653/v1/P16-1162
- S. Davis & P. Mermelstein (1980). Comparison of parametric representations for monosyllabic word recognitionβ¦ (MFCC). IEEE TASSP 28(4), 357β366. doi:10.1109/TASSP.1980.1163420
- Yu. A. Malkov & D. A. Yashunin (2020). Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. IEEE TPAMI 42(4), 824β836. doi:10.1109/TPAMI.2018.2889473
- M. Grootendorst (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv:2203.05794
- F. Yamaguchi, N. Golde, D. Arp & K. Rieck (2014). Modeling and Discovering Vulnerabilities with Code Property Graphs. IEEE S&P, 590β604. doi:10.1109/SP.2014.44
- T. N. Kipf & M. Welling (2017). Semi-Supervised Classification with Graph Convolutional Networks. ICLR. arXiv:1609.02907
- B. Warner et al. (2024). Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder⦠(ModernBERT). arXiv:2412.13663
- J. Carbonell & J. Goldstein (1998). The use of MMR, diversity-based reranking for reordering documents and producing summaries. ACM SIGIR, 335β336. doi:10.1145/290941.291025
- J. Pei et al. (2004). Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach. IEEE TKDE 16(11), 1424β1440. doi:10.1109/TKDE.2004.77
- M. J. Zaki (2005). Efficiently Mining Frequent Trees in a Forest: Algorithms and Applications (TreeMiner). IEEE TKDE 17(8), 1021β1035. doi:10.1109/TKDE.2005.125
- M. Mohri, F. Pereira & M. Riley (2002). Weighted finite-state transducers in speech recognition. Computer Speech & Language 16(1), 69β88. doi:10.1006/csla.2001.0184
The RAG retrieval layer is inspired by retrieval-augmented architectures (Lewis et al., 2020, Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, arXiv:2005.11401); libgrammstein implements the retrieval/indexing half, not end-to-end generation.
Licensed under MIT OR Apache-2.0.
- liblevenshtein-rust β fuzzy matching & trie dictionaries
- libdictenstein β persistent adaptive-radix trie with lock-free overlay & eviction
- lling-llang β WFST framework for text correction
- F1R3FLY.io β distributed computing platform