snipsBench: SNIPS slot-filling benchmark for action-grammar #2462
Open
steveluc wants to merge 1 commit into
Adds examples/snipsBench, a benchmark of the action-grammar engine on the SNIPS NLU task, used to test whether cheap finite-state "parsing" (coarse POS / NP chunking) improves grammar-based slot filling.

- Verified harness: scorer self-test + oracle (gold-as-pred scores 100/100) before any number is trusted. CoNLL entity-level slot F1 + intent accuracy.
- Three arms over an identical grammar: plain `wildcard`, strict `NP` (function-word-bounded, registered as an engine entity type), and a positional `title-aware` refinement.
- M2: seven hand-authored intent grammars + scoreboard.
- M3: grammar induction by delexicalization (train carrier-phrase templates), which beats hand-authoring (pooled slot F1 23.4 -> 35.3) and learns slot labels from context.

Finding: POS/NP typing does not help slot boundaries in either regime. Boundaries are pinned by anchors (hand-written or learned carrier phrases), so the signal is redundant. The lever is carrier-phrase coverage, obtained cheaply via induction. Full writeup in the package README.

Also surfaces an action-grammar bug (optional rule-reference `<Rule>?` fails to match; inline optional group works); see #2461.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What
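Adds `examples/snipsBench`, a benchmark of the action-grammar engine on the SNIPS NLU task (intent + slot filling). It was built to answer a focused question for TypeAgent: does cheap finite-state "parsing" (coarse POS / NP chunking) improve grammar-based slot filling?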
Answer: no. Boundaries are pinned by anchors (hand-written carrier words,
or learned carrier phrases), so POS typing is redundant. The lever that matters
is carrier-phrase coverage, obtained far more cheaply by inducing the
grammar from data than by hand-authoring it.
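To make "inducing the grammar from data" concrete, here is a minimal sketch of delexicalization: each labelled slot span in a training utterance is replaced by a placeholder, and the remaining carrier-phrase templates are counted and filtered by a frequency threshold. This is an illustration only, not the snipsBench code. The function names, the BIO tag layout, the example utterances, and the slot labels are all assumptions, and `min_freq` merely mirrors the idea of a `minFreq` threshold.

```python
from collections import Counter

def delexicalize(tokens, tags):
    """Turn one BIO-tagged utterance into a carrier-phrase template.

    Assumes well-formed BIO tags: every slot span starts with a B- tag.
    """
    template = []
    for token, tag in zip(tokens, tags):
        if tag == "O":
            template.append(token.lower())        # carrier word: keep as a literal anchor
        elif tag.startswith("B-"):
            template.append("<" + tag[2:] + ">")  # slot start: emit one placeholder
        # I- tags extend the current slot, which is already represented
    return tuple(template)

def induce_templates(examples, min_freq=1):
    """Count templates over the training set and keep the frequent ones."""
    counts = Counter(delexicalize(tokens, tags) for tokens, tags in examples)
    return {tpl: n for tpl, n in counts.items() if n >= min_freq}

# Hypothetical training utterances (not taken from the SNIPS data).
examples = [
    (["play", "Night", "Drive", "by", "Ana", "Ruiz"],
     ["O", "B-track", "I-track", "O", "B-artist", "I-artist"]),
    (["play", "Harbor", "by", "Lumen"],
     ["O", "B-track", "O", "B-artist"]),
    (["play", "some", "jazz"],
     ["O", "O", "B-genre"]),
]

templates = induce_templates(examples, min_freq=2)
print(templates)  # {('play', '<track>', 'by', '<artist>'): 2}
```

Raising `min_freq` trades coverage for precision: rare carrier phrases are dropped, so fewer utterances match any template.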
Method
The harness is self-verifying: a scorer self-test (hand-computed P/R/F1) and an
oracle (gold-as-pred) that must score 100/100 before any number is trusted.
Metric: CoNLL entity-level slot F1, reported per intent (intent given) and micro-pooled.
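The metric and the two sanity checks can be sketched as follows. This is a minimal illustration under assumptions, not the harness in this PR: it assumes BIO-tagged sequences, counts a predicted slot as correct only when label and both boundaries match exactly, and micro-pools counts over all utterances. The names `extract_spans` and `slot_prf` are hypothetical.

```python
def extract_spans(tags):
    """Return the set of (label, start, end) slot spans in a BIO sequence (end exclusive)."""
    spans = set()
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if label is not None and not continues:
            spans.add((label, start, i))          # close the open span
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and label is None:
            start, label = i, tag[2:]             # stray I- opens a span
    return spans

def slot_prf(gold_seqs, pred_seqs):
    """Micro-pooled entity-level precision, recall and F1."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = extract_spans(gold), extract_spans(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Scorer self-test: one of two gold slots is predicted with a wrong boundary,
# so the hand-computed result is P = R = F1 = 0.5.
gold = [["O", "B-track", "I-track", "O", "B-artist"]]
pred = [["O", "B-track", "O", "O", "B-artist"]]
assert slot_prf(gold, pred) == (0.5, 0.5, 0.5)

# Oracle: scoring gold against itself must be perfect.
assert slot_prf(gold, gold) == (1.0, 1.0, 1.0)
```

The self-test shows why the entity-level metric is strict: a slot that is one token short scores as both a false positive and a false negative.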
Three arms over an identical grammar, varying only the slot wildcard type:
- `wildcard`: greedy, stops at the next literal anchor
- `NP`: greedy, stops at the first function word (a first-class entity type registered on the engine, backed by a closed-class lexicon + suffix tagger)
- `title-aware`: positional refinement of the `wildcard` capture

Results (pooled slot F1)
[Table: pooled slot F1 by arm (`wildcard`, `NP`, `title-aware`)]
- Induction learns slot labels from carrier context (GetWeather location F1 4.0 → 26.7).
- `NP` collapses recall in every setting (highest precision, but real slots are titles/names full of function words).
- `title-aware` is neutral-to-negative.
- `wildcard > title-aware > NP` holds at every `minFreq` threshold.

Full writeup, per-intent tables, and implications for a lightweight learned translation model are in the package README.
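The difference between the `wildcard` and `NP` arms can be illustrated with a small sketch. It is not the engine's implementation: the function names, the tiny `FUNCTION_WORDS` set (standing in for the closed-class lexicon, with no suffix tagger), and the example utterance are all assumptions made for illustration.

```python
# Hypothetical stand-in for a closed-class (function-word) lexicon.
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "at", "by", "for", "to", "and"}

def capture_wildcard(tokens, start, next_anchor):
    """Greedy capture: consume tokens until the next literal anchor (or end of input)."""
    end = start
    while end < len(tokens) and tokens[end].lower() != next_anchor:
        end += 1
    return tokens[start:end]

def capture_np(tokens, start, next_anchor):
    """Greedy capture that additionally stops at the first function word."""
    end = start
    while (end < len(tokens)
           and tokens[end].lower() != next_anchor
           and tokens[end].lower() not in FUNCTION_WORDS):
        end += 1
    return tokens[start:end]

# Template "play <track> on <service>": the slot starts at index 1 and
# the next literal anchor is "on".
tokens = "play Lord of the Rings soundtrack on spotify".split()

print(capture_wildcard(tokens, 1, "on"))  # ['Lord', 'of', 'the', 'Rings', 'soundtrack']
print(capture_np(tokens, 1, "on"))        # ['Lord']
```

In this example the anchor already pins the right boundary for the `wildcard` capture, while the `NP` capture truncates a title that contains function words, which is consistent with the recall collapse reported above.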
Notes
- `pnpm-lock.yaml` updated for the new workspace member.
- `data/` with provenance in `data/SOURCE.md`.
- Optional rule-reference `<Rule>?` silently fails to match (inline optional group works); filed as #2461.

🤖 Generated with Claude Code