Skip to content

vinary-tree/libgrammstein

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

81 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

libgrammstein

A multi-paradigm, formally-verified language-modeling toolkit in Rust β€” a state-of-the-art Modified Kneser-Ney n-gram core that scales losslessly to the entire Google Books n-gram corpus, layered up through subword/neural embeddings, retrieval, topic discovery, and structured-domain (code & LaTeX) correction.

License Rust Formally verified Tests Concurrency


Why libgrammstein?

  • A rigorous statistical core. Orders 1–5 with Modified Kneser-Ney (MKN) smoothing β€” the most accurate count-based smoothing known β€” over compact varint-indexed trie storage. β†’ jump
  • It eats the Google Books n-gram corpus. A sharded, lock-free importer ingests billions of n-grams under a hard heap bound (a naΓ―ve build exhausts memory at 33.79 GB; the bounded build stays < 16 GB, losslessly). β†’ jump
  • It is formally verified. The importer's concurrency, crash-recovery, and durability invariants are machine-checked with TLA+ / TLAPS / Apalache and Rocq β€” not just tested. β†’ jump
  • One stack, statistics β†’ neural β†’ retrieval β†’ code. Hybrid n-gramβŠ•embedding scoring, ModernBERT rescoring, HNSW retrieval, BERTopic-style topic modeling, and multi-language code correction over Code Property Graphs β€” composable behind Cargo feature flags. β†’ jump
  • Built for concurrency. Lock-free overlays, atomics, persistent (copy-on-write) tries, Send + Sync throughout, and a mimalloc global allocator that erases syscall overhead.

At a Glance

libgrammstein is organized as layers: a persistent-storage foundation, a statistical + embedding core, a hybrid scorer, and a fan of higher-level capabilities (neural, retrieval, topic, code, LaTeX) that a developer enables Γ  la carte via Cargo features. Everything is Send + Sync; the heavy data structures live in memory-mapped, crash-safe tries.

Layered architecture

Four components get a full pedagogical treatment below (math, pseudocode, diagram, citations); the rest are summarized in the Capability Matrix and covered in depth under docs/. The color legend above is used consistently in every diagram.


Core Concepts & Notation

Every symbol below is reused across sections; component-local terms are defined where they first appear. (Acronyms are expanded on first prose use even if listed here.)

Symbol Meaning Symbol Meaning
w word (token) being predicted [x]⁺ max(x, 0)
h history / context (preceding words) Nβ‚β‚Š(Β·) continuation count β€” number of distinct contexts
hβ€² backed-off history (h minus its oldest word) Ξ±, Ξ» interpolation weight(s), ∈ [0,1]
c(Β·) raw training count of an n-gram Ο„ temperature (softmax sharpness)
∣V∣ vocabulary size Ρ approximate-search recall tolerance
D, D₁/Dβ‚‚/Dβ‚ƒβ‚Š absolute discount(s) d embedding dimensionality
Ξ³(h) backoff weight for history h β„™, Ξ£ probability, summation
Acronym Expansion Acronym Expansion
MKN Modified Kneser-Ney smoothing CPG Code Property Graph
MLE Maximum-Likelihood Estimate AST/CFG/DFG Abstract-Syntax-Tree / Control-Flow-Graph / Data-Flow-Graph
OOV Out-Of-Vocabulary PCFG Probabilistic Context-Free Grammar
PP Perplexity GNN Graph Neural Network
WFST Weighted Finite-State Transducer MLM Masked Language Model
RAG Retrieval-Augmented Generation MMR Maximal Marginal Relevance
ANN Approximate Nearest Neighbor BPE Byte-Pair Encoding
HNSW Hierarchical Navigable Small-World graph MFCC Mel-Frequency Cepstral Coefficients
HAC Hierarchical Agglomerative Clustering ARTrie Adaptive Radix Trie (persistent)
c-TF-IDF class-based TF-IDF WAL Write-Ahead Log

Maturity legend (used throughout): 🟦 Flagship (battle-tested) · 🟩 Stable · 🟨 Experimental / brings-your-own-weights · ⬜ Deprecated.


Installation & Feature Map

libgrammstein is a workspace crate that depends on sibling repositories via local paths (it is not yet published to crates.io). Clone it alongside its siblings and depend on it by path or git:

[dependencies]
# Required siblings: ../liblevenshtein-rust, ../libdictenstein, ../PathMap, ../lling-llang
libgrammstein = { path = "../libgrammstein", features = ["cli", "rag", "code-mainstream"] }

The default build is the always-on statistical + embedding + hybrid core. Everything else is feature-gated, so you compile only what you use:

Feature Unlocks
(default) n-gram (MKN), subword embeddings, hybrid model, perplexity, corpus streaming
google-books the sharded, checkpointed Google Books importer (+ mimalloc)
neural-rescore ModernBERT embeddings, MLM rescoring, MMR summarization (Candle)
rag / rag-hnsw retrieval index β€” exact cosine / HNSW; topic modeling
code Β· code-{python,rust,javascript,rholang,metta} Β· code-neural code correction
latex / latex-neural / latex-rag mode-aware LaTeX modeling
acoustic / candle-model / gpu MFCC features / neural acoustic models / GPU kernels
lling-llang-integration LanguageModel trait + WFST export for lattice rescoring
cli the grammstein binary + terminal UI

The full surface is in the Capability Matrix.


Quick Start

1 Β· Train an n-gram model and measure perplexity. Perplexity PP(W) = exp(βˆ’(1/N)Β·Ξ£α΅’ log β„™(wα΅’ ∣ hα΅’)) is the standard fit metric β€” lower is better.

use libgrammstein::ngram::{NgramEntry, TrainerBuilder};
use libgrammstein::scoring::Perplexity;
use libgrammstein::corpus::PlaintextReader;
use liblevenshtein::dictionary::dynamic_dawg_char::DynamicDawgChar;

let train = PlaintextReader::from_file("corpus.txt")?;
// A 5-gram model with Modified Kneser-Ney smoothing, over a serializable trie backend.
let model = TrainerBuilder::new(DynamicDawgChar::<NgramEntry>::new())
    .order(5)
    .train(train)?;

let log_p = model.log_prob("fox", &["quick", "brown"]);     // log β„™(fox ∣ quick brown)
let dev   = PlaintextReader::from_file("dev.txt")?;
let ppl   = Perplexity::new(&model).corpus_perplexity(&dev)?;
println!("log β„™ = {log_p:.3} Β· perplexity = {:.1}", ppl.perplexity);

2 Β· Combine n-gram precision with embedding coverage (the hybrid model).

use libgrammstein::hybrid::{HybridConfig, HybridLanguageModel};

// `ngram` + `embedding` trained as above; see "Hybrid Interpolation" for the embedding builder.
let hybrid = HybridLanguageModel::new(ngram, embedding, HybridConfig::default());
let s = hybrid.score("brown", &["the", "quick"]);           // robust even if "brown" is rare

3 Β· Ingest the Google Books n-gram corpus under a heap bound (one command).

grammstein train import-google-books english.artrie \
    --language en --min-order 1 --max-order 5 \
    --parallel 8 --overlay-budget-gib 12 --cache-files

The Flagships

Four components earn a full treatment. The rest are mapped in the Capability Matrix.

Statistical Core: Modified Kneser-Ney

What & why. An n-gram model estimates β„™(w ∣ h) β€” the probability of the next word w given a history h. The naΓ―ve Maximum-Likelihood Estimate (MLE)

β„™_MLE(w ∣ h) = c(hΒ·w) / c(h)          (M1)

assigns zero probability to any n-gram never seen in training, which makes log β„™ = βˆ’βˆž for a whole sentence the moment one unseen n-gram appears. Smoothing fixes this by stealing a little probability mass from seen events and redistributing it to unseen ones. Modified Kneser-Ney [1, 2] is the most accurate count-based smoother known, and is libgrammstein's always-on core.

How. MKN subtracts a count-dependent absolute discount and backs off to a shorter context, recursively. The discounts are estimated from the corpus's count-of-counts nα΅’ = #{n-grams occurring exactly i times}:

Y  = n₁ / (n₁ + 2Β·nβ‚‚)
D₁ = 1 βˆ’ 2YΒ·(nβ‚‚/n₁) ,  Dβ‚‚ = 2 βˆ’ 3YΒ·(n₃/nβ‚‚) ,  Dβ‚ƒβ‚Š = 3 βˆ’ 4YΒ·(nβ‚„/n₃)      (M2)

(typically D₁ β‰ˆ 0.6, Dβ‚‚ β‰ˆ 0.8, Dβ‚ƒβ‚Š β‰ˆ 0.9). The highest-order estimate discounts then backs off with weight Ξ³(h):

β„™_MKN(w ∣ h) = [c(hΒ·w) βˆ’ D(c(hΒ·w))]⁺ / c(h)  +  Ξ³(h)Β·β„™_MKN(w ∣ hβ€²)        (M3)
   where  D(c) = D₁ if c=1,  Dβ‚‚ if c=2,  Dβ‚ƒβ‚Š if cβ‰₯3 ;   hβ€² = h without its oldest word
Ξ³(h) = (D₁·N₁(h) + Dβ‚‚Β·Nβ‚‚(h) + Dβ‚ƒβ‚ŠΒ·Nβ‚ƒβ‚Š(h)) / c(h)                          (M4)

Lower orders use continuation counts Nβ‚β‚Š β€” how many distinct contexts a word completes β€” instead of raw counts, bottoming out at a uniform base:

Nβ‚β‚Š(β€’Β·hΒ·w) = ∣{ v : c(vΒ·hΒ·w) > 0 }∣                                       (M5)
β„™_MKN(w)   = [Nβ‚β‚Š(β€’Β·w) βˆ’ D]⁺ / Nβ‚β‚Š(β€’Β·β€’)  +  Ξ³_unifΒ·(1/∣V∣)

This is the "San Francisco" intuition: Francisco is frequent but follows only San, so its continuation count is ~1; city follows many words, so it earns a higher fallback probability. Continuation counts measure versatility, not raw frequency.

MKN backoff recursion

The algorithm, literately β€” scoring is a single recursion from the longest matching context down to the uniform base:

function MKN_log_prob(w, h):                       β–Έ public entry point; returns log β„™
    return ln( prob(w, h, highest_order = true) )

function prob(w, h, highest_order):
    if h not found in trie:                        β–Έ no evidence at this order …
        return prob(w, hβ€², highest_order = false)  β–Έ … so back off (drop oldest word of h)
    c   ← count(hΒ·w)
    D   ← discount(c)                              β–Έ D₁ / Dβ‚‚ / Dβ‚ƒβ‚Š  per (M2)
    top ← [c βˆ’ D]⁺ / count(h)                      β–Έ discounted higher-order mass (M3)
    Ξ³   ← backoff_weight(h)                        β–Έ exactly the mass removed by discounting (M4)
    return top + Ξ³ Β· prob(w, hβ€², highest_order = false)
    β–Έ invariant: Ξ³(h) equals the discounted mass, so Ξ£_w β„™_MKN(w ∣ h) = 1 at every order.

Engineering. Keys are vocabulary-indexed varints: each word maps to a u64, LEB128- encoded into compact Latin-1 trie keys (so a 13 M-word vocabulary costs 1–3 bytes/word, not a string). Training streams the corpus with Rayon data-parallelism and atomic continuation-count accumulation. β†’ deep dive: docs/components/ngram/modified-kneser-ney.md.

Hybrid Interpolation: Statistics βŠ• Semantics

What & why. N-gram models are razor-sharp on seen local context but blind to unseen words; embeddings are the opposite β€” they generalize by meaning but lack precise local order. Their failure modes are complementary, so libgrammstein interpolates them. The embedding side is FastText-style subword [3, 4]: a word's vector is the sum of its character-n-gram vectors, so even an OOV word like splendiferous has a meaningful representation.

How. score(w ∣ h) blends an n-gram probability β„™β‚™ = β„™_MKN(w ∣ h) with an embedding probability β„™β‚‘ ∝ cos(v_w, v_h)/Ο„ (cosine similarity of the word vector to the context vector, temperature-scaled). Four strategies are available:

Linear:      β„™ = Ξ±Β·β„™β‚™ + (1βˆ’Ξ±)Β·β„™β‚‘
Log-Linear:  β„™ = exp( Ξ±Β·log β„™β‚™ + (1βˆ’Ξ±)Β·log β„™β‚‘ )           β–Έ for differing score scales
Fallback:    β„™ = β„™β‚™  if w ∈ V_ngram  else  β„™β‚‘              β–Έ n-gram when known, embed for OOV
Dynamic:     β„™ = Ξ±(h)Β·β„™β‚™ + (1βˆ’Ξ±(h))Β·β„™β‚‘ ,  Ξ±(h) = min(Ξ±β‚€ + κ·∣h∣, Ξ±_max)   (M6)

Hybrid scoring flow

The scorer dispatches on strategy and memoizes in a lock-free cache:

function hybrid_score(w, h):
    if (h, w) in cache: return cache[(h, w)]       β–Έ DashMap probe β€” no lock on the hot path
    s ← match strategy:                            β–Έ branch per (M6)
          Linear{Ξ±}    β†’ Ξ±Β·Pβ‚™(w,h) + (1βˆ’Ξ±)Β·Pβ‚‘(w,h)
          LogLinear{Ξ±} β†’ exp(Ξ±Β·ln Pβ‚™ + (1βˆ’Ξ±)Β·ln Pβ‚‘)
          Fallback     β†’ Pβ‚™ if known(w) else Pβ‚‘    β–Έ exact local stats when available …
          Dynamic{…}   β†’ linear with Ξ± growing in ∣h∣   β–Έ … trust n-gram more as context grows
    cache.insert((h, w), s); return s              β–Έ LRU-bounded; lock taken only on eviction

Log-linear is the right default when β„™β‚™ and β„™β‚‘ live on different scales; Dynamic leans on the n-gram as the context lengthens (where local statistics are most reliable). β†’ deep dive: docs/components/hybrid/interpolation.md.

Systems Flagship: The Google Books Importer

The war story. The Google Books n-gram corpus is enormous β€” a single 2-gram prefix file holds 50–100 M entries. A naΓ―ve importer exhausts memory at ~33.79 GB peak heap and burns ~49 % of CPU in __mprotect syscalls. libgrammstein's importer ingests the whole corpus while holding peak heap under 16 GB, losslessly. Here is how. (feature: google-books)

Importer dataflow + eviction

The mechanism is a small stack of cooperating ideas:

  1. Sharding β€” n-grams route to 26 (1-gram) or 676 (2–5-gram) independent persistent ARTrie shards by first-token prefix, so writers rarely contend.
  2. Per-shard lock-free overlay β€” a DashMap write-buffer fronts each shard's memory-mapped trie; ingestion is a lock-free compare-and-swap (count = max, idempotent).
  3. Chunked transactions with SET semantics β€” a prefix commits in fixed-size chunks (--tx-chunk-size, default 500 K); re-importing a prefix after a crash overwrites with identical values, so recovery is idempotent.
  4. Crash-safe periodic checkpoints (a cron loop) snapshot progress and flush overlays.
  5. Overlay-heap eviction β€” the load-bearing fix. After each checkpoint, the tail evicts a shard's coldest resident overlay nodes down to its slice of the budget. Eviction is lossless: an evicted node faults back from the durable on-disk image on the next read.
  6. Supporting actors β€” a single-merge vocabulary WAL (avoids a doubled reverse-index rebuild), CX path-compressed checkpoints, and the mimalloc allocator (which alone drops __mprotect from ~49 % to background noise).

The eviction tail, literately:

function checkpoint_tail(shard, G, resident_shards):
    publish_durable(shard.overlay)                 β–Έ write the overlay snapshot to the durable image
    budget ← max(G / resident_shards, 64 MiB)      β–Έ this shard's slice of --overlay-budget-gib (M10)
    while resident_bytes(shard) > budget:
        node ← coldest_resident(shard.overlay)
        assert node ∈ durable_image(shard)         β–Έ SAFETY: never evict an un-persisted node …
        drop_resident(node)                        β–Έ … so the drop is lossless (faults back on read)
        resident_bytes(shard) βˆ’= sizeof(node)
    β–Έ invariant: evicted ⇔ durable. This is exactly what the TLA+ specs
    β–Έ CheckpointStateMachine / PersistentStorageBridge machine-check (see Formal Verification).

These are not just "well-tested" invariants β€” they are machine-checked (Formal Verification). β†’ deep dive: docs/architecture/memory-optimization.md Β· docs/cli/import-google-books.md.

Approximate Retrieval: HNSW vs Exact Cosine

What & why. Semantic retrieval (the R in RAG, Retrieval-Augmented Generation) reduces to nearest-neighbor search in embedding space: rank documents by cosine similarity to a query vector. For unit-normalized vectors this is just a dot product:

cos(a, b) = (aΒ·b) / (β€–aβ€–Β·β€–bβ€–) = Γ’ Β· bΜ‚                                    (M7)

libgrammstein offers two backends with a clean accuracy/latency trade-off (features: rag, rag-hnsw):

Backend Query cost Build cost Recall Use when
Exact cosine (ndarray/BLAS) O(nΒ·d) O(n) 100 % n ≲ 10⁢ documents
HNSW [7] O(log n) O(nΒ·log n) tunable via Ξ΅ n ≳ 10⁢ documents

How HNSW works. A Hierarchical Navigable Small-World graph stacks proximity graphs in layers β€” sparse at the top, dense at the bottom (a skip-list for vectors). Search starts coarse and refines downward:

function hnsw_search(q, k, ef):                    β–Έ ef = breadth knob (recall vs speed)
    v ← entry_point(top_layer)
    for layer from top down to 1:                  β–Έ coarse-to-fine descent
        v ← greedy_nearest(q, from = v, in = layer)β–Έ hop to the closest neighbor until no improvement
    C ← beam_search(q, from = v, in = layer_0, breadth = ef)   β–Έ widen at the base layer
    return top_k(C, k)
    β–Έ log-time because each upper layer halves the remaining search horizon.

ef_search, M (neighbors/node), and ef_construction trade recall for speed/memory; the backend builds lazily on first query and is thread-safe. β†’ deep dive: docs/components/rag/backend.md.

Code Property Graphs & Correction

What & why. Correcting source code needs syntax, control flow, and data flow at once. A Code Property Graph (CPG) [9] fuses three classic representations into one joint structure — the AST (Abstract Syntax Tree: nesting), the CFG (Control-Flow Graph: branch/loop edges), and the DFG (Data-Flow Graph: definition→use edges) — so a single traversal sees structure and behavior together. (features: code, code-{python,rust,javascript,rholang,metta})

Code Property Graph + correction

How. A tree-sitter incremental parse builds the CPG; an ensemble of correctors then proposes ranked fixes:

function correct(source):
    cpg ← build_cpg(tree_sitter_parse(source))     β–Έ AST βˆͺ CFG βˆͺ DFG in one graph
    candidates ←  lexical(cpg)                      β–Έ fuzzy match vs dictionary (liblevenshtein)
               βˆͺ  grammar(cpg)                      β–Έ PCFG + Earley: is this a likely production?
               βˆͺ  semantic(cpg)                     β–Έ 🟨 GNN over the CPG: variable misuse, dead code
    return rank_by_confidence(candidates)

The grammar corrector scores edits under a Probabilistic Context-Free Grammar (PCFG) with an Earley parser; the semantic corrector runs a Graph Neural Network (GNN) [10] over the CPG. The GNN path is 🟨 experimental β€” a natural contribution surface. The same machinery powers grammar-constrained decoding and WFST export, plus paradigm detection (PrefixSpan [13]), subtree mining (TreeMiner [14]), and pretrained code embeddings (CodeT5+ / UniXcoder / GraphCodeBERT, via ONNX). β†’ deep dive: docs/components/code/cpg.md.

Topic Modeling (BERTopic-style)

What & why. Given a corpus of document embeddings, recover human-readable topics without a fixed topic count. libgrammstein follows the BERTopic recipe [8]: embed β†’ cluster β†’ describe. (feature: rag)

How.

  1. Cluster embeddings with Hierarchical Agglomerative Clustering (HAC) β€” iteratively merge the closest clusters under a Lance-Williams linkage (single / complete / average / Ward), yielding a dendrogram you can cut at any granularity.
  2. Describe each cluster c with class-based TF-IDF (c-TF-IDF) β€” TF-IDF computed per-cluster rather than per-document, so the top-scoring terms are the cluster's keywords:
c-TF-IDF(t, c) = tf(t, c) Β· log( 1 + A / f(t) )                          (M8)
   tf(t, c) = frequency of term t in cluster c      A = average words per cluster
   f(t)     = number of clusters containing term t

A term scores high when it is frequent in its cluster yet rare across clusters β€” exactly the words that name a topic. β†’ deep dive: docs/components/topic/ctfidf.md.


The Wider Toolkit

Beyond the flagships, libgrammstein spans the full NLP/ML pipeline. Each item below is a one-line intuition + a deep-doc link; the Capability Matrix is the at-a-glance view.

Embeddings & Subword Models 🟩

FastText subword embeddings [4] (skip-gram [3] + negative sampling) with a fluent builder; BPE subword tokenization [5]; phonetic embeddings (Zompist-style sound-change rules for sound-alike similarity); and acoustic word embeddings from MFCC/mel-filterbank features [6] (optionally a Candle neural encoder).

use libgrammstein::embedding::EmbeddingTrainerBuilder;
let embedding = EmbeddingTrainerBuilder::new()
    .dim(100).window_size(5).min_count(5).epochs(10)
    .train(reader)?;                 // or .train_streaming(|| reopen_reader()) for >500 MB corpora
let neighbors = embedding.most_similar("running", 10);

β†’ docs/components/embedding/overview.md Β· docs/components/acoustic/features.md

Neural Rescoring & Summarization (ModernBERT) 🟨

A ModernBERT [11] wrapper (Candle + Hugging Face Hub) that rescores n-best beam paths by MLM pseudo-perplexity (mask each token, score the prediction), produces document embeddings, and runs extractive summarization via Maximal Marginal Relevance (MMR) [12] β€” a relevance/novelty trade-off that suppresses redundancy. Bring your own weights (pulled from HF Hub at runtime). (feature: neural-rescore) β†’ docs/components/neural/overview.md

LaTeX-Aware Modeling 🟨

A mode-aware tokenizer distinguishes math mode from text mode, feeding mode-weighted n-gram models, command embeddings, optional ModernBERT rescoring, and equation retrieval. (features: latex, latex-neural, latex-rag) β†’ docs/components/


Integrations

libgrammstein implements lling-llang's LanguageModel trait so any model here can rescore a WFST lattice [15] (lazy or eager transducer export). Its storage rests on two sibling crates: liblevenshtein (trie dictionaries + fuzzy matching) and libdictenstein (the persistent ARTrie: lock-free overlay, CX path-compressed checkpoints, and the overlay-heap eviction the importer relies on).

use libgrammstein::integration::GrammsteinLanguageModel;
use lling_llang::layers::{LanguageModelLayer, SpellingCorrectionLayer, LayerPipelineBuilder};

let lm = GrammsteinLanguageModel::from_hybrid(hybrid);          // wrap any model
let pipeline = LayerPipelineBuilder::new()
    .add_layer(SpellingCorrectionLayer::new(dictionary, 2))     // fuzzy candidates (liblevenshtein)
    .add_layer(LanguageModelLayer::new(Box::new(lm), 1.0))      // rescore with libgrammstein
    .build();
// "teh quikc brwon fox" β†’ "the quick brown fox"

β†’ docs/integration/lling-llang/overview.md


Formal Verification

The importer's correctness-sensitive machinery β€” concurrency, crash recovery, checkpoint durability, and query semantics β€” is machine-checked, not merely tested. This is a core differentiator: the lossless-eviction invariant from the importer is a proved property.

Formal verification map

  • TLA+ (Temporal Logic of Actions) β€” 7 live specifications model the protocols and are checked by TLC (bounded model checking) and typechecked by Apalache:

    Concern Specification(s) Verifies (examples)
    Concurrency AsyncShardSync at most one syncer; clean β‡’ zero dirty
    Lifecycle ImporterLifecycle, WorkerShutdown, CronStateMachine phase ordering; no job lost; termination
    Durability CheckpointStateMachine, PersistentStorageBridge, QuerySemanticsBridge no-loss publish; lossless eviction; no metadata leak
  • TLAPS β€” machine-checked inductive proofs of the specs (e.g. AsyncShardSync: 249 proof obligations discharged).

  • Rocq β€” MknStatistics / MknFloatBounds bound the MKN discounts and their binary64 evaluation; FrequencyCountsMerge proves count-merge associativity/commutativity.

  • loom β€” exhaustively explores concurrent memory-ordering interleavings in the Rust code.

  • Dependency bridges connect these specs to libdictenstein/liblevenshtein durability contracts, so the high-level proofs rest on verified storage.

Reproduce the whole gate:

make -C formal complete-with-dependencies          # TLC + TLAPS + Apalache + Rocq + loom + alignment

β†’ formal/README.md


CLI & TUI

The grammstein binary (feature cli) drives training, evaluation, model inspection, corpus processing, and the Google Books import β€” the last with a 🟨 live ratatui terminal UI showing per-worker progress, checkpoint state, and vocabulary growth.

grammstein train ngram corpus.txt --order 5 -o model.bin   # train
grammstein eval perplexity model.bin dev.txt               # evaluate
grammstein train import-google-books en.artrie --language en --parallel 8   # import (+ TUI)

Capability Matrix

Every subsystem, its Cargo feature(s), core algorithm, maturity, and deep doc.

Subsystem What it does Cargo feature(s) Key algorithm / tech Maturity Deep doc
N-gram LM next-word probability default Modified Kneser-Ney [1,2] 🟦 ngram
Hybrid model n-gram βŠ• embeddings default 4 interpolation strategies 🟩 hybrid
Embeddings dense word vectors default FastText subword [3,4] 🟩 embedding
BPE subword tokenizer subword Byte-Pair Encoding [5] 🟩 bpe
Acoustic audio features / encoders acoustic, candle-model MFCC / mel-filterbank [6] 🟩 acoustic
Neural rescore MLM rescoring, embeddings neural-rescore ModernBERT [11] 🟨 neural
Summarization extractive summaries neural-rescore MMR [12] 🟨 summarizer
RAG semantic retrieval rag, rag-hnsw cosine / HNSW [7] 🟩 rag
Topic modeling unsupervised topics rag HAC + c-TF-IDF [8] 🟩 topic
Code correction multi-language fixes code, code-* CPG [9] + PCFG + GNN [10] 🟩 / 🟨 code
Code embeddings neural code vectors code-neural CodeT5+ / UniXcoder / GraphCodeBERT (ONNX) 🟨 code-emb
Paradigm detection OOP/FP/… inference code PrefixSpan [13] 🟨 paradigm
Subtree mining frequent code patterns code TreeMiner [14] 🟨 subtree
LaTeX math-aware modeling latex, latex-* mode-aware n-gram + neural 🟨 latex
Importer Google Books ingestion google-books sharded ARTrie + eviction 🟦 memory
Persistent storage crash-safe tries (via libdictenstein) overlay + CX + eviction 🟦 data-flow
WFST integration lattice rescoring lling-llang-integration WFST export [15] 🟩 lling-llang
CLI / TUI command-line tooling cli clap + ratatui 🟩 / 🟨 cli
Formal verification machine-checked proofs (in formal/) TLA+ · TLAPS · Apalache · Rocq 🟦 formal

Performance

Indicative single-thread microbenchmarks (hardware-specific; reproduce with benches/ β€” ngram_query, mkn_frequency_counts, overlay_eviction, checkpoint_ops, varint_encoding, hash_comparison):

Operation Time
N-gram probability query ~100 ns
Embedding lookup (cached) ~10 ns
Hybrid score ~1 Β΅s
Sentence score (10 tokens) ~10 Β΅s

The Google Books importer's headline result is bounded peak heap (33.79 GB β†’ < 16 GB) and near-zero __mprotect overhead β€” see the importer.


Project Layout

src/
β”œβ”€β”€ ngram/        # Modified Kneser-Ney n-gram model, varint vocabulary, trie storage
β”œβ”€β”€ embedding/    # FastText subword Β· BPE Β· phonetic Β· acoustic Β· GPU
β”œβ”€β”€ hybrid/       # n-gram βŠ• embedding interpolation + OOV handling
β”œβ”€β”€ scoring/      # perplexity, sentence scoring        Β· generation/  # sampling
β”œβ”€β”€ corpus/       # streaming readers (Wikipedia, Gutenberg, plaintext), dedup, quality
β”œβ”€β”€ dictionary/   # word extraction, spelling dictionaries
β”œβ”€β”€ sources/      # Google Books importer: sharding Β· storage Β· checkpoint Β· eviction  [google-books]
β”œβ”€β”€ neural/       # ModernBERT embedder Β· rescorer Β· summarizer                        [neural-rescore]
β”œβ”€β”€ rag/          # retrieval index: exact cosine Β· HNSW                               [rag]
β”œβ”€β”€ topic/        # HAC clustering Β· c-TF-IDF Β· dendrogram                             [rag]
β”œβ”€β”€ code/         # tree-sitter Β· CPG Β· PCFG Β· GNN Β· constrained decoding              [code]
β”œβ”€β”€ latex/        # mode-aware tokenizer Β· n-gram Β· embeddings Β· equation RAG          [latex]
β”œβ”€β”€ language/     # whatlang detection + language-aware tokenization
β”œβ”€β”€ integration/  # lling-llang LanguageModel trait + WFST export                      [lling-llang-integration]
└── cli/          # the `grammstein` binary + ratatui TUI                              [cli]
formal/           # TLA+ / TLAPS / Apalache specs + Rocq proofs + loom tests
docs/             # deep component documentation (see docs/README.md)

References

DOIs are linked and verified; arXiv-only works link their abstract page.

  1. R. Kneser & H. Ney (1995). Improved backing-off for M-gram language modeling. ICASSP '95, 181–184. doi:10.1109/ICASSP.1995.479394
  2. S. F. Chen & J. Goodman (1999). An empirical study of smoothing techniques for language modeling. Computer Speech & Language 13(4), 359–393. doi:10.1006/csla.1999.0128
  3. T. Mikolov, I. Sutskever, K. Chen, G. Corrado & J. Dean (2013). Distributed Representations of Words and Phrases and their Compositionality. NeurIPS. arXiv:1310.4546
  4. P. Bojanowski, E. Grave, A. Joulin & T. Mikolov (2017). Enriching Word Vectors with Subword Information (FastText). TACL 5, 135–146. doi:10.1162/tacl_a_00051
  5. R. Sennrich, B. Haddow & A. Birch (2016). Neural Machine Translation of Rare Words with Subword Units (BPE). ACL, 1715–1725. doi:10.18653/v1/P16-1162
  6. S. Davis & P. Mermelstein (1980). Comparison of parametric representations for monosyllabic word recognition… (MFCC). IEEE TASSP 28(4), 357–366. doi:10.1109/TASSP.1980.1163420
  7. Yu. A. Malkov & D. A. Yashunin (2020). Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. IEEE TPAMI 42(4), 824–836. doi:10.1109/TPAMI.2018.2889473
  8. M. Grootendorst (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv:2203.05794
  9. F. Yamaguchi, N. Golde, D. Arp & K. Rieck (2014). Modeling and Discovering Vulnerabilities with Code Property Graphs. IEEE S&P, 590–604. doi:10.1109/SP.2014.44
  10. T. N. Kipf & M. Welling (2017). Semi-Supervised Classification with Graph Convolutional Networks. ICLR. arXiv:1609.02907
  11. B. Warner et al. (2024). Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder… (ModernBERT). arXiv:2412.13663
  12. J. Carbonell & J. Goldstein (1998). The use of MMR, diversity-based reranking for reordering documents and producing summaries. ACM SIGIR, 335–336. doi:10.1145/290941.291025
  13. J. Pei et al. (2004). Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach. IEEE TKDE 16(11), 1424–1440. doi:10.1109/TKDE.2004.77
  14. M. J. Zaki (2005). Efficiently Mining Frequent Trees in a Forest: Algorithms and Applications (TreeMiner). IEEE TKDE 17(8), 1021–1035. doi:10.1109/TKDE.2005.125
  15. M. Mohri, F. Pereira & M. Riley (2002). Weighted finite-state transducers in speech recognition. Computer Speech & Language 16(1), 69–88. doi:10.1006/csla.2001.0184

The RAG retrieval layer is inspired by retrieval-augmented architectures (Lewis et al., 2020, Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, arXiv:2005.11401); libgrammstein implements the retrieval/indexing half, not end-to-end generation.


License & Related

Licensed under MIT OR Apache-2.0.

  • liblevenshtein-rust β€” fuzzy matching & trie dictionaries
  • libdictenstein β€” persistent adaptive-radix trie with lock-free overlay & eviction
  • lling-llang β€” WFST framework for text correction
  • F1R3FLY.io β€” distributed computing platform

About

A multi-paradigm, formally-verified language-modeling toolkit in Rust

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors