ExaForce · will-exaforce · Jun 19, 2026 · smoy · Jun 22, 2026
diff --git a/benchmark/.gitignore b/benchmark/.gitignore
@@ -0,0 +1,6 @@
+# Benchmark project: local venv, caches, and harness output DBs.
+.venv/
+.ruff_cache/
+__pycache__/
+# DuckDB result files written by the harness (benchmark_<id>.duckdb, -o targets).
+*.duckdb
diff --git a/benchmark/README.md b/benchmark/README.md
@@ -0,0 +1,128 @@
+# SkillSpector Benchmark
+
+A standalone harness that runs SkillSpector over a benchmark dataset
+([MalSkillBench](https://github.com/NVIDIA/MalSkillBench) today) and records its
+scored classifications in DuckDB, so you can measure precision/recall/accuracy.
+
+## Why this is a separate project
+
+This is a **dev/eval tool**, not part of the shipped `skillspector` package. It
+lives in its own uv project with its own `pyproject.toml`, lockfile, and venv,
+and depends on `skillspector` as an **editable path dependency** (`../`). That
+means:
+
+- Nothing here touches the root `pyproject.toml` / lockfile, so upstream merges
+  from `NVIDIA/skillspector` stay conflict-free.
+- The harness-only dependencies (`duckdb`, `filelock`, `tqdm`,
+  `aws-bedrock-token-generator`) never ship to `skillspector` end users.
+- Edits to `src/skillspector` are picked up immediately (editable install), so
+  you always benchmark your working tree.
+
+## Running
+
+From the repo root (no `cd` needed):
+
+```bash
+uv run --directory benchmark benchmark /path/to/MalSkillBench/Dataset
+```
+
+Or from inside this directory:
+
+```bash
+cd benchmark
+uv run benchmark /path/to/MalSkillBench/Dataset
+uv run benchmark .../Dataset/Prompts/indirect-injection -o out.duckdb
+uv run benchmark .../Dataset/Skills/malware --limit 50 --workers 8
+uv run benchmark .../Dataset --no-llm        # static analysis only
+```
+
+> Note: the console-script name (`benchmark`) intentionally matches the package
+> directory, so there is **no** `__main__.py` — adding one would make `uv run
+> benchmark` execute the directory instead of the installed entry point. Invoke
+> it via the `benchmark` command (as above) or `.venv/bin/benchmark` directly.
+
+### Key flags
+
+| Flag | Meaning |
+|------|---------|
+| `-o, --output` | DuckDB output file (default `benchmark_<id>.duckdb`) |
+| `--no-llm` | static analysis only (no LLM calls) |
+| `--categories` | comma-separated unit categories to scan; default `skill,code`. **Prompts are excluded by default** — they're raw prompt-injection samples, not the repo config files this tool classifies (PI embedded in a skill is still covered via the `skill` units). Pass `--categories skill,code,prompt` to include them. |
+| `--limit N` | cap units **per group**, where a group is a unit's parent directory (e.g. `--limit 20` on `Dataset/Skills` keeps 20 from `Skills/malware` *and* 20 from `Skills/benign`); 0 = no cap |
+| `--workers N` | concurrent scan processes (default 8) |
+| `--overwrite` | start fresh instead of resuming an existing DB |
+| `--auth-wait-seconds` | how long to pause for `aws sso login` on a mid-run SSO expiry |
+
+A run is **resumable**: re-running the same command against an existing output
+DB skips already-classified units and re-scans only failures.
+
+## Output
+
+DuckDB tables `runs`, `units` (ground truth), `classifications` (SkillSpector
+verdict), `issues`, `components`, plus an `evaluation` view labeling every scan
+`TP`/`FP`/`TN`/`FN`/`ERROR`:
+
+```bash
+duckdb benchmark_<id>.duckdb "SELECT outcome, count(*) FROM evaluation GROUP BY 1"
+```
+
+## Inspecting classifier performance
+
+The `queries/` directory holds ready-made DuckDB queries for evaluating how well
+SkillSpector classified — precision/recall/F1 breakdowns, the false-negative and
+false-positive lists worth eyeballing, run-health checks, and tuning aids. Run
+one against an output DB with `.read`:
+
+```bash
+duckdb -readonly benchmark_<id>.duckdb ".read queries/01_overview.sql"
+```
+
+Each file is self-documenting (header comment explains what it shows) and, except
+`01_overview.sql`, defaults to the **most recent run** in the DB — edit the `run`
+CTE at the top to target a different run or span all of them.
+
+| Query | What it answers |
+|-------|-----------------|
+| `01_overview.sql` | Per-run config + confusion matrix + precision/recall/F1/accuracy (compare runs) |
+| `02_metrics_by_category.sql` | Metrics split by skill / code / prompt |
+| `03_metrics_by_attack_vector.sql` | Metrics split by CI / PI / MIXED |
+| `04_metrics_by_corpus.sql` | Metrics split by source corpus |
+| `05_recall_by_behavior.sql` | Which attack behaviors (B1..B15) evade detection, worst first |
+| `06_false_negatives.sql` | Malware that was cleared — the critical misses |
+| `07_false_positives.sql` | Benign units flagged, with the rules that fired |
+| `08_errors.sql` | Scans that failed (invalidate metrics if high) |
+| `09_classification_status.sql` | LLM vs static vs **fallback** (catches "LLM didn't actually run") |
+| `10_risk_score_distribution.sql` | Score histogram by ground truth — does the score separate classes? |
+| `11_threshold_sweep.sql` | P/R/F1 across hypothetical `risk_score` cutoffs (tune the verdict) |
+| `12_scan_timing.sql` | Timing percentiles + slowest scans |
+| `13_label_coverage.sql` | How ground-truth labels were resolved (bounds trust in 03/05) |
+| `14_top_rules.sql` | Which rules fire, on malicious vs benign (false-positive drivers) |
+
+## Layout
+
+```
+benchmark/
+  main.py                  # argparse + run orchestration
+  config.py                # provider/run constants + parent-process env wiring
+  models.py                # Unit, ScanResult
+  auth.py                  # Bedrock bearer-token manager (cross-process cache)
+  db.py                    # DuckDB schema + result persistence
+  runner.py                # the classifier seam: prepare a unit, run SkillSpector
+  utils.py                 # cross-cutting helpers
+  dataset_handler/
+    base.py                # DatasetHandler abstraction
+    malskillbench.py       # MalSkillBench implementation
+    __init__.py            # handler registry + discover() dispatch
+```
+
+## Adding another dataset
+
+Discovery is behind an interface so the harness isn't locked to MalSkillBench:
+
+1. Subclass `DatasetHandler` (in `dataset_handler/base.py`), implementing
+   `matches(root)` and `discover(root) -> list[Unit]`.
+2. Register it in `dataset_handler/__init__.py`'s `_HANDLERS` list (most
+   specific first; MalSkillBench is the default fallback).
+
+Everything downstream (runner, DB, scoring) works in terms of `Unit` /
+`ScanResult`, so a new handler is the only code a new benchmark needs.
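The README's `evaluation` view labels each scan `TP`/`FP`/`TN`/`FN`/`ERROR`. As a rough illustration of how those labels turn into the metrics the README mentions, the sketch below computes precision, recall, F1, and accuracy from a list of outcome labels. The function name `score` and the choice to leave `ERROR` rows out of every ratio are assumptions of this sketch, not something the diff confirms about the shipped queries.

```python
from collections import Counter


def score(outcomes):
    """Compute precision/recall/F1/accuracy from TP/FP/TN/FN labels.

    ERROR rows are counted separately and excluded from every ratio;
    that exclusion is an assumption of this sketch.
    """
    counts = Counter(outcomes)
    tp, fp, tn, fn = (counts[k] for k in ("TP", "FP", "TN", "FN"))

    def ratio(num, den):
        # A ratio is undefined when its denominator is zero.
        return num / den if den else None

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "errors": counts["ERROR"],
    }


if __name__ == "__main__":
    # Hypothetical outcome labels, not real benchmark results.
    sample = ["TP"] * 8 + ["FN"] * 2 + ["FP"] + ["TN"] * 9 + ["ERROR"]
    for name, value in score(sample).items():
        print(name, value)
```

Returning `None` for an undefined ratio keeps an empty or all-error run from looking like a perfect or zero score.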
diff --git a/benchmark/benchmark/__init__.py b/benchmark/benchmark/__init__.py
@@ -0,0 +1,5 @@
+"""SkillSpector benchmark harness.
+
+Runs SkillSpector over a benchmark dataset (MalSkillBench today) and scores its
+classifications into DuckDB. See ``main.py`` for the CLI entrypoint.
+"""
diff --git a/benchmark/benchmark/auth.py b/benchmark/benchmark/auth.py
@@ -0,0 +1,161 @@
+"""Bedrock bearer-token management, shared across worker processes.
+
+A Bedrock bearer token expires (~12h); a long run can outlast it. Workers can't
+share an AWS session object (botocore credentials don't survive ``spawn``), but
+they CAN share the minted token string through a small 0600 cache file. The
+first worker to find it missing/stale re-mints under a file lock (double-
+checked) and writes it; everyone else just reads the cached value. botocore
+itself already shares/refreshes the underlying AWS credentials cross-process
+via its own on-disk cache, so this layer only dedupes the token signing.
+"""
+
+from __future__ import annotations
+
+import json
+import os
+import pathlib
+import sys
+import tempfile
+import time
+from datetime import UTC, datetime, timedelta
+
+_TOKEN_TTL = timedelta(hours=12)
+_TOKEN_REFRESH_MARGIN = timedelta(minutes=30)
+_AUTH_RETRY_INTERVAL = 15  # seconds between re-mint attempts while paused
+
+
+class AuthAbortedError(Exception):
+    """AWS credentials could not be restored within the pause window; give up."""
+
+
+class BedrockTokenManager:
+    """Cross-process cache for a Bedrock bearer token (file + file lock).
+
+    Re-minting near expiry resolves AWS credentials fresh, so the common case --
+    short-lived assume-role/STS creds aging out while the SSO *session* is still
+    valid -- self-heals (botocore refreshes via STS). When the SSO *session*
+    itself has expired, minting needs a human ``aws sso login``; ``token()``
+    then pauses (holding the lock so peers queue), retries until creds return,
+    and -- if the wait window is exceeded -- drops an abort marker so every
+    worker fails fast instead of each blocking for the full window.
+    """
+
+    def __init__(
+        self,
+        region: str,
+        cache_path: pathlib.Path | None = None,
+        ttl: timedelta = _TOKEN_TTL,
+        refresh_margin: timedelta = _TOKEN_REFRESH_MARGIN,
+    ) -> None:
+        self.region = region
+        self.ttl = ttl
+        self.refresh_margin = refresh_margin
+        self.auth_wait = float(os.environ.get("SKILLSPECTOR_BENCH_AUTH_WAIT", "1800"))
+        default = pathlib.Path(tempfile.gettempdir()) / f"skillspector_bench_token_{region}.json"
+        self.cache_path = pathlib.Path(
+            os.environ.get("SKILLSPECTOR_BENCH_TOKEN_CACHE", cache_path or default)
+        )
+        self.lock_path = self.cache_path.with_suffix(".lock")
+        self.abort_path = self.cache_path.with_suffix(".abort")
+        self._mem: tuple[str, datetime] | None = None  # in-process fast path
+
+    def token(self, wait: bool = False) -> str:
+        """Return a valid token, refreshing the shared cache if stale.
+
+        ``wait=True`` (workers) pauses and retries on an auth failure so a
+        mid-run SSO expiry can be fixed with ``aws sso login`` without losing
+        progress. ``wait=False`` (parent startup) fails fast so a bad-creds run
+        aborts immediately instead of hanging.
+        """
+        if self.abort_path.exists():
+            raise AuthAbortedError("credential refresh already abandoned this run")
+        now = datetime.now(UTC)
+        if self._mem and now < self._mem[1]:
+            return self._mem[0]
+        cached = self._read()
+        if cached and now < cached[1]:
+            self._mem = cached
+            return cached[0]
+        # Missing/stale: exactly one process refreshes; others queue on the lock.
+        from filelock import FileLock
+
+        with FileLock(str(self.lock_path)):
+            if self.abort_path.exists():
+                raise AuthAbortedError("credential refresh already abandoned this run")
+            cached = self._read()
+            if cached and datetime.now(UTC) < cached[1]:
+                self._mem = cached
+                return cached[0]
+            return self._mint_locked(wait)
+
+    def _mint_locked(self, wait: bool) -> str:
+        """Mint under the held lock; pause-retry on auth failure when waiting."""
+        deadline = time.monotonic() + self.auth_wait
+        announced = False
+        while True:
+            try:
+                fresh = self._mint()
+            except Exception as e:  # noqa: BLE001 - any mint failure is retryable
+                if not wait:
+                    raise  # startup: surface immediately to the caller
+                if time.monotonic() >= deadline:
+                    self.abort_path.touch()  # signal all workers to stop fast
+                    raise AuthAbortedError(
+                        f"AWS credentials not restored within {int(self.auth_wait)}s ({e})"
+                    ) from e
+                if not announced:
+                    print(
+                        f"\n[paused] Bedrock token mint failed: {e}\n"
+                        "  Run `aws sso login` to restore credentials; the run will "
+                        f"resume automatically.\n  Retrying every {_AUTH_RETRY_INTERVAL}s "
+                        f"(giving up in {int(self.auth_wait / 60)}m).",
+                        file=sys.stderr,
+                        flush=True,
+                    )
+                    announced = True
+                time.sleep(_AUTH_RETRY_INTERVAL)
+                continue
+            if announced:
+                print("[resumed] AWS credentials restored.", file=sys.stderr, flush=True)
+            self._write(*fresh)
+            self._mem = fresh
+            return fresh[0]
+
+    def _mint(self) -> tuple[str, datetime]:
+        from aws_bedrock_token_generator import provide_token
+
+        token = provide_token(region=self.region, expiry=self.ttl)
+        # Refresh a margin before the real 12h expiry for safety.
+        return token, datetime.now(UTC) + self.ttl - self.refresh_margin
+
+    def clear_abort(self) -> None:
+        """Remove any stale abort marker from a previous run."""
+        self.abort_path.unlink(missing_ok=True)
+
+    def _read(self) -> tuple[str, datetime] | None:
+        try:
+            data = json.loads(self.cache_path.read_text(encoding="utf-8"))
+            return data["token"], datetime.fromisoformat(data["expires_at"])
+        except (OSError, json.JSONDecodeError, KeyError, ValueError):
+            return None
+
+    def _write(self, token: str, expires_at: datetime) -> None:
+        self.cache_path.parent.mkdir(parents=True, exist_ok=True)
+        tmp = self.cache_path.with_suffix(".tmp")
+        tmp.write_text(
+            json.dumps({"token": token, "expires_at": expires_at.isoformat()}),
+            encoding="utf-8",
+        )
+        os.chmod(tmp, 0o600)
+        os.replace(tmp, self.cache_path)  # atomic publish
+
+
+_TOKEN_MANAGER: BedrockTokenManager | None = None
+
+
+def token_manager(region: str) -> BedrockTokenManager:
+    """Per-process singleton token manager (state is shared on disk)."""
+    global _TOKEN_MANAGER
+    if _TOKEN_MANAGER is None or _TOKEN_MANAGER.region != region:
+        _TOKEN_MANAGER = BedrockTokenManager(region)
+    return _TOKEN_MANAGER
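The double-checked refresh in `BedrockTokenManager.token()` can be sketched in isolation as below. This is a simplified illustration, not the harness code: `get_token`, `read_cache`, `write_cache`, and the `refresh_at` field are hypothetical names, the `mint` callable stands in for the real Bedrock call, and a `threading.Lock` stands in for the cross-process `filelock.FileLock` so the example runs with the standard library alone. The abort marker and pause-retry loop are left out.

```python
import json
import os
import tempfile
import threading
from datetime import datetime, timedelta, timezone
from pathlib import Path


def read_cache(path):
    """Return (token, refresh_at), or None if the file is missing or malformed."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
        return data["token"], datetime.fromisoformat(data["refresh_at"])
    except (OSError, ValueError, KeyError):
        return None


def write_cache(path, token, refresh_at):
    """Publish atomically: write a sibling temp file, then rename over the target."""
    tmp = path.with_suffix(".tmp")
    payload = {"token": token, "refresh_at": refresh_at.isoformat()}
    tmp.write_text(json.dumps(payload), encoding="utf-8")
    os.chmod(tmp, 0o600)
    os.replace(tmp, path)


def get_token(path, mint, lock, ttl=timedelta(hours=12), margin=timedelta(minutes=30)):
    """Cheap unlocked read first, re-check under the lock, mint only if still stale."""
    cached = read_cache(path)
    if cached and datetime.now(timezone.utc) < cached[1]:
        return cached[0]
    with lock:
        cached = read_cache(path)  # a peer may have refreshed while we waited
        if cached and datetime.now(timezone.utc) < cached[1]:
            return cached[0]
        token = mint()
        # Store an early deadline so the refresh happens before the real expiry.
        write_cache(path, token, datetime.now(timezone.utc) + ttl - margin)
        return token


if __name__ == "__main__":
    minted = []

    def fake_mint():
        minted.append(1)
        return f"token-{len(minted)}"

    with tempfile.TemporaryDirectory() as d:
        cache = Path(d) / "token.json"
        lock = threading.Lock()
        print(get_token(cache, fake_mint, lock))
        print(get_token(cache, fake_mint, lock))  # served from the cache file
        print("mint calls:", len(minted))
```

The second check under the lock is what keeps queued callers from each minting again after the first one has already published a fresh token.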
diff --git a/benchmark/benchmark/config.py b/benchmark/benchmark/config.py
@@ -0,0 +1,61 @@
+"""Provider/run configuration: stable constants and parent-process env wiring."""
+
+from __future__ import annotations
+
+import os
+
+from .auth import token_manager
+
+BEDROCK_BASE_URL = "https://bedrock-mantle.us-east-2.api.aws/v1"
+BEDROCK_REGION = "us-east-2"
+PROVIDER = "openai"
+MODEL_ID = "nvidia.nemotron-super-3-120b"
+SCAN_TIMEOUT_SECONDS = 300
+
+# SkillSpector's own "do not install" verdict == this recommendation string.
+MALICIOUS_RECOMMENDATION = "DO_NOT_INSTALL"
+
+
+def configure_run(no_llm: bool, timeout: float) -> tuple[dict, str, str, str]:
+    """Set stable env in the parent (inherited by spawned workers) and return
+    ``(worker_cfg, provider, model, region)``.
+
+    Stable, non-secret config (provider/model/base-url) is placed in the parent
+    ``os.environ`` so every spawned worker inherits it. The volatile Bedrock
+    token is NOT set here -- each worker fetches it on demand through the shared
+    token cache (see ``scan_worker``) so it can't go stale across a multi-hour run.
+    """
+    # Inherited by spawned workers (new interpreters read these at startup), so
+    # SkillSpector/langchain import-time warnings and logs stay out of the progress bar.
+    os.environ.setdefault("PYTHONWARNINGS", "ignore")
+    os.environ.setdefault("SKILLSPECTOR_LOG_LEVEL", "ERROR")
+    cfg = {"use_llm": not no_llm, "mint_bedrock": False, "region": "", "timeout": timeout}
+    if no_llm:
+        return cfg, "", "", ""
+    if os.environ.get("SKILLSPECTOR_PROVIDER"):
+        # Caller already configured a provider; trust their env (inherited).
+        cfg["region"] = os.environ.get("AWS_REGION", "")
+        return (
+            cfg,
+            os.environ["SKILLSPECTOR_PROVIDER"],
+            os.environ.get("SKILLSPECTOR_MODEL", ""),
+            cfg["region"],
+        )
+    # Default path: OpenAI-compatible Bedrock endpoint with a minted token.
+    model = os.environ.get("MODEL_OVERRIDE") or os.environ.get("AGENTGUARD_MODEL_ID", MODEL_ID)
+    region = os.environ.get("AWS_REGION") or BEDROCK_REGION
+    manager = token_manager(region)
+    manager.clear_abort()  # drop any stale abort marker from a previous run
+    try:
+        manager.token()  # fail fast (wait=False) + warm the shared cache
+    except Exception as e:  # noqa: BLE001 - surface a clear, actionable hint
+        raise SystemExit(
+            f"Failed to mint a Bedrock token ({type(e).__name__}: {e}).\n"
+            "Refresh AWS credentials (e.g. `aws sso login`), set an existing "
+            "SKILLSPECTOR_PROVIDER + key in the env, or pass --no-llm."
+        ) from e
+    os.environ["SKILLSPECTOR_PROVIDER"] = PROVIDER
+    os.environ["SKILLSPECTOR_MODEL"] = model
+    os.environ["OPENAI_BASE_URL"] = BEDROCK_BASE_URL
+    cfg.update(mint_bedrock=True, region=region)
+    return cfg, PROVIDER, model, region
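The default path in `configure_run` resolves the model and region from the environment with two slightly different idioms. The sketch below replays those two lookups against a plain dict so the precedence is easy to check; `resolve`, `DEFAULT_MODEL`, and `DEFAULT_REGION` are hypothetical names for this illustration. One consequence worth noticing: `x or default` treats an empty string as unset, whereas `.get(key, default)` returns the empty string if the variable is set but empty.

```python
DEFAULT_MODEL = "nvidia.nemotron-super-3-120b"
DEFAULT_REGION = "us-east-2"


def resolve(env):
    """Replay the model/region lookups from configure_run on a plain mapping."""
    # MODEL_OVERRIDE wins when non-empty; otherwise AGENTGUARD_MODEL_ID is used
    # if the key is present (even when empty), else the built-in default.
    model = env.get("MODEL_OVERRIDE") or env.get("AGENTGUARD_MODEL_ID", DEFAULT_MODEL)
    # An unset or empty AWS_REGION both fall back to the default region.
    region = env.get("AWS_REGION") or DEFAULT_REGION
    return model, region


if __name__ == "__main__":
    cases = [
        {},
        {"MODEL_OVERRIDE": ""},
        {"AGENTGUARD_MODEL_ID": ""},
        {"MODEL_OVERRIDE": "m1", "AGENTGUARD_MODEL_ID": "m2"},
        {"AWS_REGION": ""},
    ]
    for env in cases:
        print(env, "->", resolve(env))
```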
diff --git a/benchmark/benchmark/dataset_handler/__init__.py b/benchmark/benchmark/dataset_handler/__init__.py
@@ -0,0 +1,33 @@
+"""Dataset handler registry and dispatch.
+
+``discover(root)`` picks the first registered handler that recognizes the tree
+and delegates to it. Register additional benchmarks in ``_HANDLERS`` (most
+specific first); MalSkillBench is the default fallback.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+from ..models import Unit
+from .base import DatasetHandler
+from .malskillbench import MalSkillBenchHandler
+
+# Register dataset handlers here, most specific first.
+_HANDLERS: list[DatasetHandler] = [MalSkillBenchHandler()]
+
+
+def get_handler(root: Path) -> DatasetHandler:
+    """Return the handler that recognizes ``root`` (defaults to MalSkillBench)."""
+    for handler in _HANDLERS:
+        if handler.matches(root):
+            return handler
+    return _HANDLERS[-1]  # fallback: the least specific handler is registered last
+
+
+def discover(root: Path) -> list[Unit]:
+    """Collect all scannable units under ``root`` via the matching handler."""
+    return get_handler(root).discover(root)
+
+
+__all__ = ["DatasetHandler", "discover", "get_handler"]
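To make the handler interface concrete, here is a minimal sketch of a second dataset plugged into a first-match registry. Everything in it is a stand-in: `base.py` and `models.py` are not part of this excerpt, so the `Unit` fields, the `ToyBenchHandler` layout (a `toybench.marker` file with `malicious/` and `benign/` directories), and the `CatchAllHandler` are hypothetical. Only the `matches(root)` / `discover(root) -> list[Unit]` shape comes from the README.

```python
from __future__ import annotations

import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Unit:  # stand-in; the real Unit in models.py may carry other fields
    path: Path
    is_malicious: bool


class DatasetHandler(ABC):  # stand-in for dataset_handler/base.py
    @abstractmethod
    def matches(self, root: Path) -> bool: ...

    @abstractmethod
    def discover(self, root: Path) -> list[Unit]: ...


class ToyBenchHandler(DatasetHandler):
    """Hypothetical dataset: a marker file plus malicious/ and benign/ dirs."""

    def matches(self, root: Path) -> bool:
        return (root / "toybench.marker").is_file()

    def discover(self, root: Path) -> list[Unit]:
        units = []
        for label in ("malicious", "benign"):
            for p in sorted((root / label).glob("*")):
                units.append(Unit(path=p, is_malicious=(label == "malicious")))
        return units


class CatchAllHandler(DatasetHandler):
    """Hypothetical least-specific handler: accepts any tree, finds nothing."""

    def matches(self, root: Path) -> bool:
        return True

    def discover(self, root: Path) -> list[Unit]:
        return []


HANDLERS: list[DatasetHandler] = [ToyBenchHandler(), CatchAllHandler()]  # most specific first


def get_handler(root: Path) -> DatasetHandler:
    """First registered handler that recognizes the tree wins."""
    return next(h for h in HANDLERS if h.matches(root))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        root = Path(d)
        (root / "toybench.marker").write_text("")
        for label, name in (("malicious", "a.md"), ("benign", "b.md")):
            (root / label).mkdir()
            (root / label / name).write_text("sample")
        handler = get_handler(root)
        print(type(handler).__name__, len(handler.discover(root)))
```

Because the catch-all always matches, ordering alone decides which handler runs, which is why the more specific handler has to come first in the list.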