snipsBench: SNIPS slot-filling benchmark for action-grammar #2462

Open: steveluc wants to merge 1 commit into main from snips-action-grammar-benchmark


steveluc commented Jun 8, 2026

Contributor

What

Adds examples/snipsBench, a benchmark of the action-grammar engine on the
SNIPS NLU task (intent + slot filling). It was built to answer a focused
question for TypeAgent:

Does cheap, finite-state "parsing" — coarse POS / noun-phrase chunking —
improve grammar-based slot filling?

Answer: no. Boundaries are pinned by anchors (hand-written carrier words,
or learned carrier phrases), so POS typing is redundant. The lever that matters
is carrier-phrase coverage, obtained far more cheaply by inducing the
grammar from data than by hand-authoring it.

Method

The harness is self-verifying: a scorer self-test (checked against hand-computed
P/R/F1) and an oracle run (gold labels fed back as predictions) must both pass,
with the oracle scoring 100/100, before any number is trusted.
Metric: CoNLL entity-level slot F1, computed per intent (with the intent given)
and micro-pooled across intents.
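
As an illustration of the checks described above, here is a minimal Python sketch of entity-level slot scoring with a self-test and an oracle run. It is not the package's scorer: the function names (`spans_from_bio`, `counts`, `prf`), the BIO tag encoding, and the toy data are assumptions made for this example.

```python
def spans_from_bio(tags):
    """Extract (label, start, end) spans from a BIO sequence; end is exclusive."""
    spans, start, label = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes a trailing span
        inside = tag.startswith("I-") and label == tag[2:]
        if not inside:
            if label is not None:
                spans.append((label, start, i))
            start, label = (i, tag[2:]) if tag != "O" else (None, None)
    return spans


def counts(gold_seqs, pred_seqs):
    """Return (true positives, gold spans, predicted spans); a span must match exactly."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = set(spans_from_bio(gold)), set(spans_from_bio(pred))
        tp, n_gold, n_pred = tp + len(g & p), n_gold + len(g), n_pred + len(p)
    return tp, n_gold, n_pred


def prf(tp, n_gold, n_pred):
    """Precision, recall, and F1 from raw counts."""
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    return precision, recall, (2 * precision * recall / total if total else 0.0)


# Toy data (hypothetical): the prediction truncates the two-token city span.
gold = ["O", "B-city", "I-city", "O", "B-date"]
pred = ["O", "B-city", "O", "O", "B-date"]

self_test = prf(*counts([gold], [pred]))  # hand-computed: 1 of 2 spans correct
oracle = prf(*counts([gold], [gold]))     # gold-as-pred must be perfect

# Micro-pooling: sum the raw counts over intents, then compute P/R/F1 once.
per_intent = {"IntentA": counts([gold], [pred]), "IntentB": counts([gold], [gold])}
pooled = prf(*(sum(c) for c in zip(*per_intent.values())))
```

Summing counts before dividing means intents with more slots weigh more in the pooled score, which is what distinguishes micro-pooling from averaging the per-intent F1 values.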

Three arms over an identical grammar, varying only the slot wildcard type:

  • wildcard — greedy, stops at the next literal anchor
  • NP — greedy, stops at the first function word (a first-class entity type
    registered on the engine, backed by a closed-class lexicon + suffix tagger)
  • title-aware — positional refinement of the wildcard capture
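
To make the difference between the first two arms concrete, the following Python sketch contrasts the two stopping rules on one utterance. It is a simplification under stated assumptions: the tiny `FUNCTION_WORDS` set stands in for the closed-class lexicon, the suffix tagger is omitted, the anchor is a single token, and the function names are hypothetical rather than the engine's API.

```python
# Toy stand-in for a closed-class lexicon (assumption, not the engine's list).
FUNCTION_WORDS = {"a", "an", "the", "of", "in", "on", "to", "for", "and", "by", "at"}


def capture_wildcard(tokens, start, next_anchor):
    """Greedy capture: consume tokens until the next literal anchor or end of input."""
    end = start
    while end < len(tokens) and tokens[end] != next_anchor:
        end += 1
    return tokens[start:end]


def capture_np(tokens, start, next_anchor):
    """Same greedy capture, but also stop at the first function word."""
    end = start
    while (
        end < len(tokens)
        and tokens[end] != next_anchor
        and tokens[end] not in FUNCTION_WORDS
    ):
        end += 1
    return tokens[start:end]


# Hypothetical rule "play <title> on <service>": the slot starts at index 1
# and the next literal anchor is "on".
tokens = "play lord of the rings on netflix".split()
wildcard_slot = capture_wildcard(tokens, 1, "on")
np_slot = capture_np(tokens, 1, "on")
```

Here the wildcard rule recovers the full title (`lord of the rings`) because the anchor already pins the right boundary, while the NP rule stops at `of` and returns only `lord`. An entity-level scorer counts that truncated span as a miss.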

Results (pooled slot F1)

| arm | M2 hand-authored | M3 induced (minFreq=2) |
| --- | --- | --- |
| wildcard | 23.4 | 35.3 |
| NP | 11.6 | 13.5 |
| title-aware | 23.5 | 34.0 |

  • Induction beats hand-authoring by about 12 F1 points on the wildcard arm
    (23.4 → 35.3) and learns slot labels from carrier context
    (GetWeather location F1 4.0 → 26.7).
  • NP collapses recall in every setting (highest precision, but real slots
    are titles/names full of function words). title-aware is neutral-to-negative.
  • In the induced (M3) regime, wildcard > title-aware > NP holds at every
    minFreq threshold.

Full writeup, per-intent tables, and implications for a lightweight learned
translation model are in the package README.

Notes

🤖 Generated with Claude Code

Adds examples/snipsBench, a benchmark of the action-grammar engine on the
SNIPS NLU task, used to test whether cheap finite-state "parsing" (coarse
POS / NP chunking) improves grammar-based slot filling.

- Verified harness: scorer self-test + oracle (gold-as-pred scores 100/100)
  before any number is trusted. CoNLL entity-level slot F1 + intent accuracy.
- Three arms over an identical grammar: plain `wildcard`, strict `NP`
  (function-word-bounded, registered as an engine entity type), and a
  positional `title-aware` refinement.
- M2: seven hand-authored intent grammars + scoreboard.
- M3: grammar induction by delexicalization (train carrier-phrase templates),
  which beats hand-authoring (pooled slot F1 23.4 -> 35.3) and learns slot
  labels from context.
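
The induction step in M3 can be pictured with a short Python sketch: replace each gold slot span with a placeholder, count the resulting carrier-phrase templates, and keep the frequent ones. This is an illustration under assumptions, not the package's implementation: the function names, the `(label, start, end)` span format, the toy utterances, and the reading of minFreq as "keep templates seen at least that many times" are all hypothetical.

```python
from collections import Counter


def delexicalize(tokens, slots):
    """Replace each slot span (label, start, end-exclusive) with a <label> placeholder."""
    by_start = {start: (label, end) for label, start, end in slots}
    out, i = [], 0
    while i < len(tokens):
        if i in by_start:
            label, end = by_start[i]
            out.append(f"<{label}>")
            i = end  # skip the slot's own tokens
        else:
            out.append(tokens[i])
            i += 1
    return tuple(out)


def induce_templates(examples, min_freq=2):
    """Count delexicalized templates and keep those seen at least min_freq times."""
    template_counts = Counter(delexicalize(toks, slots) for toks, slots in examples)
    return {tpl: n for tpl, n in template_counts.items() if n >= min_freq}


# Toy training data (hypothetical utterances with gold slot spans).
examples = [
    ("weather in paris".split(), [("location", 2, 3)]),
    ("weather in new york".split(), [("location", 2, 4)]),
    ("will it rain in boston".split(), [("location", 4, 5)]),
]
templates = induce_templates(examples, min_freq=2)
```

With these three utterances only `weather in <location>` survives the threshold. The surviving template carries both the anchors (`weather in`) and the slot label, which is one way to read the claim that induction learns slot labels from context.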

Finding: POS/NP typing does not help slot boundaries in either regime —
boundaries are pinned by anchors (hand-written or learned carrier phrases),
so the signal is redundant. The lever is carrier-phrase coverage, obtained
cheaply via induction. Full writeup in the package README.

Also surfaces an action-grammar bug (optional rule-reference `<Rule>?` fails
to match; inline optional group works) — see #2461.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
steveluc requested a deployment to development-fork with GitHub Actions on June 8, 2026 22:40 (status: Waiting)