snipsBench: SNIPS slot-filling benchmark for action-grammar by steveluc · Pull Request #2462 · microsoft/TypeAgent

steveluc · 2026-06-08T22:40:40Z

What

Adds examples/snipsBench, a benchmark of the action-grammar engine on the
SNIPS NLU task (intent + slot filling). It was built to answer a focused
question for TypeAgent:

Does cheap, finite-state "parsing" — coarse POS / noun-phrase chunking —
improve grammar-based slot filling?

Answer: no. Boundaries are pinned by anchors (hand-written carrier words,
or learned carrier phrases), so POS typing is redundant. The lever that matters
is carrier-phrase coverage, obtained far more cheaply by inducing the
grammar from data than by hand-authoring it.

Method

The harness is self-verifying: a scorer self-test (hand-computed P/R/F1) and an
oracle (gold-as-pred) that must score 100/100 before any number is trusted.
Metric: CoNLL entity-level slot F1, computed per intent with the gold intent
given, then micro-pooled across intents.
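
For readers unfamiliar with the metric, here is a minimal sketch of CoNLL-style entity-level scoring with an oracle check. It is not the scorer shipped in the package: the names `bioToSpans`, `countSentence`, and `prf` and the sample tags are hypothetical, and it assumes the usual convention that a predicted span counts only if label, start, and end all match a gold span exactly.

```typescript
// Illustrative sketch only; not the scorer shipped in examples/snipsBench.
type Span = { label: string; start: number; end: number }; // end is exclusive
type Counts = { tp: number; pred: number; gold: number };

// Convert a BIO tag sequence into labelled spans.
function bioToSpans(tags: string[]): Span[] {
  const spans: Span[] = [];
  let label: string | null = null;
  let start = 0;
  // Iterate one step past the end so a trailing span is closed by a virtual "O".
  for (let i = 0; i <= tags.length; i++) {
    const tag = tags[i] ?? "O";
    if (label !== null && tag === `I-${label}`) continue; // current span continues
    if (label !== null) {
      spans.push({ label, start, end: i });
      label = null;
    }
    // A stray I- tag (no matching open span) is treated as starting a new span.
    if (tag.startsWith("B-") || tag.startsWith("I-")) {
      label = tag.slice(2);
      start = i;
    }
  }
  return spans;
}

// Count exact-match true positives for one sentence.
function countSentence(gold: string[], pred: string[]): Counts {
  const key = (s: Span) => `${s.label}:${s.start}:${s.end}`;
  const goldKeys = new Set(bioToSpans(gold).map(key));
  const predSpans = bioToSpans(pred);
  const tp = predSpans.filter((s) => goldKeys.has(key(s))).length;
  return { tp, pred: predSpans.length, gold: goldKeys.size };
}

// Precision, recall, and F1 from counts. Micro-pooling sums counts first.
function prf(c: Counts): { precision: number; recall: number; f1: number } {
  const precision = c.pred === 0 ? 0 : c.tp / c.pred;
  const recall = c.gold === 0 ? 0 : c.tp / c.gold;
  const sum = precision + recall;
  const f1 = sum === 0 ? 0 : (2 * precision * recall) / sum;
  return { precision, recall, f1 };
}

const gold = ["O", "B-city", "I-city", "O", "B-date"];
const pred = ["O", "B-city", "O", "O", "B-date"]; // city span truncated

const partial = prf(countSentence(gold, pred)); // truncated span is a miss
const oracle = prf(countSentence(gold, gold)); // gold-as-pred must be perfect
console.log(partial.f1, oracle.f1);
```

The truncated `city` span earns no partial credit, which is why boundary errors are so costly under this metric.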

Three arms over an identical grammar, varying only the slot wildcard type:

wildcard — greedy, stops at the next literal anchor
NP — greedy, stops at the first function word (a first-class entity type
registered on the engine, backed by a closed-class lexicon + suffix tagger)
title-aware — positional refinement of the wildcard capture
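
The difference between the first two arms can be sketched as follows. This is an illustration of the stopping rules as described above, not the engine's implementation: `captureWildcard`, `captureNP`, and the tiny `FUNCTION_WORDS` set are hypothetical, and the suffix tagger behind the real NP type is omitted.

```typescript
// Illustrative sketch only; function names and lexicon are assumptions.
// Hypothetical closed-class lexicon (the real NP arm also uses a suffix tagger).
const FUNCTION_WORDS = new Set(["the", "of", "in", "on", "to", "a", "and", "for", "by"]);

// wildcard: consume tokens greedily until the next literal anchor or end of input.
function captureWildcard(tokens: string[], start: number, nextAnchor: string | null): string[] {
  const out: string[] = [];
  for (const tok of tokens.slice(start)) {
    if (nextAnchor !== null && tok === nextAnchor) break;
    out.push(tok);
  }
  return out;
}

// NP: consume tokens greedily until the first function word or end of input.
function captureNP(tokens: string[], start: number): string[] {
  const out: string[] = [];
  for (const tok of tokens.slice(start)) {
    if (FUNCTION_WORDS.has(tok)) break;
    out.push(tok);
  }
  return out;
}

// Hypothetical rule: "play" <slot> "on" <service>
const tokens = "play lord of the rings on spotify".split(" ");
const wildcardCapture = captureWildcard(tokens, 1, "on").join(" ");
const npCapture = captureNP(tokens, 1).join(" ");
console.log(wildcardCapture); // lord of the rings
console.log(npCapture); // lord
```

In this made-up utterance the anchor `on` already fixes the right boundary, while the NP rule cuts the title at its first function word.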

Results (pooled slot F1)

| arm | M2 hand-authored | M3 induced (minFreq=2) |
| --- | --- | --- |
| `wildcard` | 23.4 | 35.3 |
| `NP` | 11.6 | 13.5 |
| `title-aware` | 23.5 | 34.0 |

Induction beats hand-authoring by about 12 F1 points (wildcard arm:
23.4 → 35.3) and learns slot labels from carrier context (GetWeather
location F1 4.0 → 26.7).
NP collapses recall in every setting (highest precision, but real slots
are titles/names full of function words). title-aware is neutral-to-negative.
wildcard > title-aware > NP holds at every minFreq threshold.
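
The commit message describes M3 as grammar induction by delexicalization. A minimal sketch of one common reading of that idea: replace each gold slot span with a placeholder, count the resulting carrier-phrase templates, and keep those seen at least `minFreq` times. The names `delexicalize` and `induceTemplates` and the training examples are hypothetical, and the package's actual induction may differ.

```typescript
// Illustrative sketch only; not the induction code in examples/snipsBench.
type Example = { tokens: string[]; tags: string[] }; // tags are BIO slot labels

// Replace each slot span with a single <label> placeholder; keep carrier words.
function delexicalize(ex: Example): string {
  const out: string[] = [];
  ex.tokens.forEach((tok, i) => {
    const tag = ex.tags[i] ?? "O";
    if (tag === "O") {
      out.push(tok.toLowerCase());
    } else if (tag.startsWith("B-")) {
      out.push(`<${tag.slice(2)}>`);
    }
    // I- tokens are absorbed into the placeholder opened by the B- tag.
  });
  return out.join(" ");
}

// Count templates over the training set and keep those seen at least minFreq times.
function induceTemplates(train: Example[], minFreq: number): Map<string, number> {
  const counts = new Map<string, number>();
  for (const ex of train) {
    const template = delexicalize(ex);
    counts.set(template, (counts.get(template) ?? 0) + 1);
  }
  const kept = new Map<string, number>();
  counts.forEach((n, template) => {
    if (n >= minFreq) kept.set(template, n);
  });
  return kept;
}

// Made-up training examples, not taken from the SNIPS data.
const train: Example[] = [
  { tokens: ["weather", "in", "boston"], tags: ["O", "O", "B-city"] },
  { tokens: ["weather", "in", "new", "york"], tags: ["O", "O", "B-city", "I-city"] },
  { tokens: ["forecast", "for", "paris", "tomorrow"], tags: ["O", "O", "B-city", "B-timeRange"] },
];

const templates = induceTemplates(train, 2);
templates.forEach((n, template) => console.log(n, template));
```

With `minFreq` set to 2, only the template shared by the first two examples survives. The slot label comes from the carrier phrase around the placeholder rather than from the slot's own words.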

Full writeup, per-intent tables, and implications for a lightweight learned
translation model are in the package README.

Notes

New standalone example package; no changes to existing packages. pnpm-lock.yaml
updated for the new workspace member.
Vendors the SNIPS BIO split (Goo et al. 2018) under data/ with provenance in
data/SOURCE.md.
Surfaces an action-grammar bug: an optional rule reference `<Rule>?` silently
fails to match, while an inline optional group works. Filed as #2461.

🤖 Generated with Claude Code

Adds examples/snipsBench, a benchmark of the action-grammar engine on the SNIPS NLU task, used to test whether cheap finite-state "parsing" (coarse POS / NP chunking) improves grammar-based slot filling.

- Verified harness: scorer self-test + oracle (gold-as-pred scores 100/100) before any number is trusted. CoNLL entity-level slot F1 + intent accuracy.
- Three arms over an identical grammar: plain `wildcard`, strict `NP` (function-word-bounded, registered as an engine entity type), and a positional `title-aware` refinement.
- M2: seven hand-authored intent grammars + scoreboard.
- M3: grammar induction by delexicalization (train carrier-phrase templates), which beats hand-authoring (pooled slot F1 23.4 -> 35.3) and learns slot labels from context.

Finding: POS/NP typing does not help slot boundaries in either regime. Boundaries are pinned by anchors (hand-written or learned carrier phrases), so the signal is redundant. The lever is carrier-phrase coverage, obtained cheaply via induction. Full writeup in the package README.

Also surfaces an action-grammar bug (optional rule-reference `<Rule>?` fails to match; inline optional group works); see #2461.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

steveluc wants to merge 1 commit into main from snips-action-grammar-benchmark
