6 changes: 6 additions & 0 deletions benchmark/.gitignore
@@ -0,0 +1,6 @@
# Benchmark project: local venv, caches, and harness output DBs.
.venv/
.ruff_cache/
__pycache__/
# DuckDB result files written by the harness (benchmark_<id>.duckdb, -o targets).
*.duckdb
128 changes: 128 additions & 0 deletions benchmark/README.md
@@ -0,0 +1,128 @@
# SkillSpector Benchmark

A standalone harness that runs SkillSpector over a benchmark dataset
([MalSkillBench](https://github.com/NVIDIA/MalSkillBench) today) and scores its
classifications into DuckDB, so you can measure precision/recall/accuracy.

## Why this is a separate project

This is a **dev/eval tool**, not part of the shipped `skillspector` package. It
lives in its own uv project with its own `pyproject.toml`, lockfile, and venv,
and depends on `skillspector` as an **editable path dependency** (`../`). That
means:

- Nothing here touches the root `pyproject.toml` / lockfile, so upstream merges
from `NVIDIA/skillspector` stay conflict-free.
- The harness-only dependencies (`duckdb`, `filelock`, `tqdm`,
`aws-bedrock-token-generator`) never ship to `skillspector` end users.
- Edits to `src/skillspector` are picked up immediately (editable install), so
you always benchmark your working tree.

## Running

From the repo root (no `cd` needed):

```bash
uv run --directory benchmark benchmark /path/to/MalSkillBench/Dataset
```

Or from inside this directory:

```bash
cd benchmark
uv run benchmark /path/to/MalSkillBench/Dataset
uv run benchmark .../Dataset/Prompts/indirect-injection -o out.duckdb
uv run benchmark .../Dataset/Skills/malware --limit 50 --workers 8
uv run benchmark .../Dataset --no-llm # static analysis only
```

> Note: the console-script name (`benchmark`) intentionally matches the package
> directory, so there is **no** `__main__.py` — adding one would make `uv run
> benchmark` execute the directory instead of the installed entry point. Invoke
> it via the `benchmark` command (as above) or `.venv/bin/benchmark` directly.

### Key flags

| Flag | Meaning |
|------|---------|
| `-o, --output` | DuckDB output file (default `benchmark_<id>.duckdb`) |
| `--no-llm` | static analysis only (no LLM calls) |
| `--categories` | comma-separated unit categories to scan; default `skill,code`. **Prompts are excluded by default** — they're raw prompt-injection samples, not the repo config files this tool classifies (PI embedded in a skill is still covered via the `skill` units). Pass `--categories skill,code,prompt` to include them. |
| `--limit N` | cap units **per group**, where a group is a unit's parent directory (e.g. `--limit 20` on `Dataset/Skills` keeps 20 from `Skills/malware` *and* 20 from `Skills/benign`); 0 = no cap |
| `--workers N` | concurrent scan processes (default 8) |
| `--overwrite` | start fresh instead of resuming an existing DB |
| `--auth-wait-seconds` | how long to pause for `aws sso login` on a mid-run SSO expiry |
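
The per-group behavior of `--limit` can be pictured with a minimal sketch. This is not the harness's code: `cap_per_group` is a hypothetical helper, units are modeled as plain paths, and keeping the first N units in input order is an assumption (the README does not say which units survive the cap).

```python
from collections import defaultdict
from pathlib import Path


def cap_per_group(units: list[Path], limit: int) -> list[Path]:
    """Keep at most `limit` units per parent directory; 0 means no cap."""
    if limit <= 0:
        return list(units)
    kept_per_group: dict[Path, int] = defaultdict(int)
    kept: list[Path] = []
    for unit in units:
        group = unit.parent  # the group is the unit's parent directory
        if kept_per_group[group] < limit:
            kept_per_group[group] += 1
            kept.append(unit)
    return kept
```

With `limit=2`, a tree holding three units under `Skills/malware` and two under `Skills/benign` keeps two from each directory rather than two overall.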

A run is **resumable**: re-running the same command against an existing output
DB skips already-classified units and re-scans only failures.
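
As a rough illustration of the resume rule, the sketch below selects which units a re-run would scan. The function name, the status strings, and the in-memory mapping are all hypothetical; the real harness reads this state from the output DB.

```python
def units_needing_scan(unit_ids: list[str], previous_status: dict[str, str]) -> list[str]:
    """Return units that were never scanned or whose last scan did not succeed.

    `previous_status` maps unit id -> status from an earlier run; the value
    "classified" is an illustrative label, not the harness's actual one.
    """
    return [uid for uid in unit_ids if previous_status.get(uid) != "classified"]
```

A unit that is absent from the mapping (never scanned) and a unit recorded with any non-success status are both selected again.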

## Output

DuckDB tables `runs`, `units` (ground truth), `classifications` (SkillSpector
verdict), `issues`, `components`, plus an `evaluation` view labeling every scan
`TP`/`FP`/`TN`/`FN`/`ERROR`:

```bash
duckdb benchmark_<id>.duckdb "SELECT outcome, count(*) FROM evaluation GROUP BY 1"
```
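
The outcome counts from that query are enough to derive the headline metrics. A minimal sketch, assuming `ERROR` rows are left out of every metric (the README does not state how the bundled queries treat them); `score` is a hypothetical helper name:

```python
def score(counts: dict[str, int]) -> dict[str, float]:
    """Compute precision, recall, F1, and accuracy from outcome counts.

    `counts` maps the evaluation labels (TP/FP/TN/FN) to row counts.
    Any other label, such as ERROR, is ignored here by assumption.
    """
    tp, fp, tn, fn = (counts.get(k, 0) for k in ("TP", "FP", "TN", "FN"))

    def ratio(num: float, den: float) -> float:
        # Return 0.0 instead of raising when a class is empty.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }
```

For example, `score({"TP": 8, "FP": 2, "TN": 85, "FN": 5})` gives a precision of 0.8 and an accuracy of 0.93.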

## Inspecting classifier performance

The `queries/` directory holds ready-made DuckDB queries for evaluating how well
SkillSpector classified — precision/recall/F1 breakdowns, the false-negative and
false-positive lists worth eyeballing, run-health checks, and tuning aids. Run
one against an output DB with `.read`:

```bash
duckdb -readonly benchmark_<id>.duckdb ".read queries/01_overview.sql"
```

Each file is self-documenting (header comment explains what it shows) and, except
`01_overview.sql`, defaults to the **most recent run** in the DB — edit the `run`
CTE at the top to target a different run or span all of them.

| Query | What it answers |
|-------|-----------------|
| `01_overview.sql` | Per-run config + confusion matrix + precision/recall/F1/accuracy (compare runs) |
| `02_metrics_by_category.sql` | Metrics split by skill / code / prompt |
| `03_metrics_by_attack_vector.sql` | Metrics split by CI / PI / MIXED |
| `04_metrics_by_corpus.sql` | Metrics split by source corpus |
| `05_recall_by_behavior.sql` | Which attack behaviors (B1..B15) evade detection, worst first |
| `06_false_negatives.sql` | Malware that was cleared — the critical misses |
| `07_false_positives.sql` | Benign units flagged, with the rules that fired |
| `08_errors.sql` | Scans that failed (invalidate metrics if high) |
| `09_classification_status.sql` | LLM vs static vs **fallback** (catches "LLM didn't actually run") |
| `10_risk_score_distribution.sql` | Score histogram by ground truth — does the score separate classes? |
| `11_threshold_sweep.sql` | P/R/F1 across hypothetical `risk_score` cutoffs (tune the verdict) |
| `12_scan_timing.sql` | Timing percentiles + slowest scans |
| `13_label_coverage.sql` | How ground-truth labels were resolved (bounds trust in 03/05) |
| `14_top_rules.sql` | Which rules fire, on malicious vs benign (false-positive drivers) |
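
To make the idea behind a threshold sweep concrete, here is a small Python sketch. It does not reproduce `11_threshold_sweep.sql`: the `sweep` helper is hypothetical, and it assumes a unit is called malicious when its score is greater than or equal to the cutoff, with higher scores meaning more suspicious. The 0-100 scale in the usage note is illustrative only.

```python
def sweep(
    scored: list[tuple[float, bool]], cutoffs: list[float]
) -> list[tuple[float, float, float, float]]:
    """Return (cutoff, precision, recall, f1) for each hypothetical cutoff.

    `scored` holds (risk_score, is_malicious) pairs, where is_malicious is
    the ground-truth label.
    """
    rows = []
    for cutoff in cutoffs:
        tp = sum(1 for s, mal in scored if mal and s >= cutoff)
        fp = sum(1 for s, mal in scored if not mal and s >= cutoff)
        fn = sum(1 for s, mal in scored if mal and s < cutoff)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        rows.append((cutoff, precision, recall, f1))
    return rows
```

Calling `sweep([(90, True), (60, True), (40, False), (70, False)], [50, 80])` shows the usual trade-off: the lower cutoff catches both malicious units but flags one benign unit, while the higher cutoff flags nothing benign and misses one malicious unit.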

## Layout

```
benchmark/
  main.py              # argparse + run orchestration
  config.py            # provider/run constants + parent-process env wiring
  models.py            # Unit, ScanResult
  auth.py              # Bedrock bearer-token manager (cross-process cache)
  db.py                # DuckDB schema + result persistence
  runner.py            # the classifier seam: prepare a unit, run SkillSpector
  utils.py             # cross-cutting helpers
  dataset_handler/
    base.py            # DatasetHandler abstraction
    malskillbench.py   # MalSkillBench implementation
    __init__.py        # handler registry + discover() dispatch
```

## Adding another dataset

Discovery is behind an interface so the harness isn't locked to MalSkillBench:

1. Subclass `DatasetHandler` (in `dataset_handler/base.py`), implementing
`matches(root)` and `discover(root) -> list[Unit]`.
2. Register it in `dataset_handler/__init__.py`'s `_HANDLERS` list (most
specific first; MalSkillBench is the default fallback).

Everything downstream (runner, DB, scoring) works in terms of `Unit` /
`ScanResult`, so a new handler is the only code a new benchmark needs.
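
A new handler might look roughly like the sketch below. To keep it self-contained, `Unit` and `DatasetHandler` are re-declared here as hypothetical stand-ins: the real `Unit` fields and the exact base-class signatures live in `models.py` and `dataset_handler/base.py` and may differ. `ToyBenchHandler` and its `malicious/` and `benign/` directory layout are invented for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Unit:
    """Hypothetical stand-in for benchmark.models.Unit (fields are assumed)."""

    path: Path
    category: str
    is_malicious: bool


class DatasetHandler(ABC):
    """Hypothetical stand-in for the abstraction in dataset_handler/base.py."""

    @abstractmethod
    def matches(self, root: Path) -> bool: ...

    @abstractmethod
    def discover(self, root: Path) -> list[Unit]: ...


class ToyBenchHandler(DatasetHandler):
    """Recognizes a tree with `malicious/` and `benign/` subdirectories."""

    def matches(self, root: Path) -> bool:
        return (root / "malicious").is_dir() and (root / "benign").is_dir()

    def discover(self, root: Path) -> list[Unit]:
        units: list[Unit] = []
        for label, is_malicious in (("malicious", True), ("benign", False)):
            # One unit per immediate subdirectory, in a stable order.
            for child in sorted((root / label).iterdir()):
                if child.is_dir():
                    units.append(Unit(child, "skill", is_malicious))
        return units
```

Keeping `matches` cheap and side-effect free matters because dispatch may call it on every registered handler before picking one.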
5 changes: 5 additions & 0 deletions benchmark/benchmark/__init__.py
@@ -0,0 +1,5 @@
"""SkillSpector benchmark harness.

Runs SkillSpector over a benchmark dataset (MalSkillBench today) and scores its
classifications into DuckDB. See ``main.py`` for the CLI entrypoint.
"""
161 changes: 161 additions & 0 deletions benchmark/benchmark/auth.py
@@ -0,0 +1,161 @@
"""Bedrock bearer-token management, shared across worker processes.

A Bedrock bearer token expires (~12h); a long run outlasts it. Workers can't
share an AWS session object (botocore credentials don't survive ``spawn``), but
they CAN share the minted token string through a small 0600 cache file. The
first worker to find it missing/stale re-mints under a file lock (double-
checked) and writes it; everyone else just reads the cached value. botocore
itself already shares/refreshes the underlying AWS credentials cross-process
via its own on-disk cache, so this layer only dedupes the token signing.
"""

from __future__ import annotations

import json
import os
import pathlib
import sys
import tempfile
import time
from datetime import UTC, datetime, timedelta

_TOKEN_TTL = timedelta(hours=12)
_TOKEN_REFRESH_MARGIN = timedelta(minutes=30)
_AUTH_RETRY_INTERVAL = 15 # seconds between re-mint attempts while paused


class AuthAbortedError(Exception):
    """AWS credentials could not be restored within the pause window; give up."""


class BedrockTokenManager:
    """Cross-process cache for a Bedrock bearer token (file + file lock).

    Re-minting near expiry resolves AWS credentials fresh, so the common case --
    short-lived assume-role/STS creds aging out while the SSO *session* is still
    valid -- self-heals (botocore refreshes via STS). When the SSO *session*
    itself has expired, minting needs a human ``aws sso login``; ``token()``
    then pauses (holding the lock so peers queue), retries until creds return,
    and -- if the wait window is exceeded -- drops an abort marker so every
    worker fails fast instead of each blocking for the full window.
    """

    def __init__(
        self,
        region: str,
        cache_path: pathlib.Path | None = None,
        ttl: timedelta = _TOKEN_TTL,
        refresh_margin: timedelta = _TOKEN_REFRESH_MARGIN,
    ) -> None:
        self.region = region
        self.ttl = ttl
        self.refresh_margin = refresh_margin
        self.auth_wait = float(os.environ.get("SKILLSPECTOR_BENCH_AUTH_WAIT", "1800"))
        default = pathlib.Path(tempfile.gettempdir()) / f"skillspector_bench_token_{region}.json"
        self.cache_path = pathlib.Path(
            os.environ.get("SKILLSPECTOR_BENCH_TOKEN_CACHE", cache_path or default)
        )
        self.lock_path = self.cache_path.with_suffix(".lock")
        self.abort_path = self.cache_path.with_suffix(".abort")
        self._mem: tuple[str, datetime] | None = None  # in-process fast path

    def token(self, wait: bool = False) -> str:
        """Return a valid token, refreshing the shared cache if stale.

        ``wait=True`` (workers) pauses and retries on an auth failure so a
        mid-run SSO expiry can be fixed with ``aws sso login`` without losing
        progress. ``wait=False`` (parent startup) fails fast so a bad-creds run
        aborts immediately instead of hanging.
        """
        if self.abort_path.exists():
            raise AuthAbortedError("credential refresh already abandoned this run")
        now = datetime.now(UTC)
        if self._mem and now < self._mem[1]:
            return self._mem[0]
        cached = self._read()
        if cached and now < cached[1]:
            self._mem = cached
            return cached[0]
        # Missing/stale: exactly one process refreshes; others queue on the lock.
        from filelock import FileLock

        with FileLock(str(self.lock_path)):
            if self.abort_path.exists():
                raise AuthAbortedError("credential refresh already abandoned this run")
            cached = self._read()
            if cached and datetime.now(UTC) < cached[1]:
                self._mem = cached
                return cached[0]
            return self._mint_locked(wait)

    def _mint_locked(self, wait: bool) -> str:
        """Mint under the held lock; pause-retry on auth failure when waiting."""
        deadline = time.monotonic() + self.auth_wait
        announced = False
        while True:
            try:
                fresh = self._mint()
            except Exception as e:  # noqa: BLE001 - any mint failure is retryable
                if not wait:
                    raise  # startup: surface immediately to the caller
                if time.monotonic() >= deadline:
                    self.abort_path.touch()  # signal all workers to stop fast
                    raise AuthAbortedError(
                        f"AWS credentials not restored within {int(self.auth_wait)}s ({e})"
                    ) from e
                if not announced:
                    print(
                        f"\n[paused] Bedrock token mint failed: {e}\n"
                        " Run `aws sso login` to restore credentials; the run will "
                        f"resume automatically.\n Retrying every {_AUTH_RETRY_INTERVAL}s "
                        f"(giving up in {int(self.auth_wait / 60)}m).",
                        file=sys.stderr,
                        flush=True,
                    )
                    announced = True
                time.sleep(_AUTH_RETRY_INTERVAL)
                continue
            if announced:
                print("[resumed] AWS credentials restored.", file=sys.stderr, flush=True)
            self._write(*fresh)
            self._mem = fresh
            return fresh[0]

    def _mint(self) -> tuple[str, datetime]:
        from aws_bedrock_token_generator import provide_token

        token = provide_token(region=self.region, expiry=self.ttl)
        # Refresh a margin before the real 12h expiry for safety.
        return token, datetime.now(UTC) + self.ttl - self.refresh_margin

    def clear_abort(self) -> None:
        """Remove any stale abort marker from a previous run."""
        self.abort_path.unlink(missing_ok=True)

    def _read(self) -> tuple[str, datetime] | None:
        try:
            data = json.loads(self.cache_path.read_text(encoding="utf-8"))
            return data["token"], datetime.fromisoformat(data["expires_at"])
        except (OSError, json.JSONDecodeError, KeyError, ValueError):
            return None

    def _write(self, token: str, expires_at: datetime) -> None:
        self.cache_path.parent.mkdir(parents=True, exist_ok=True)
        tmp = self.cache_path.with_suffix(".tmp")
        tmp.write_text(
            json.dumps({"token": token, "expires_at": expires_at.isoformat()}),
            encoding="utf-8",
        )
        os.chmod(tmp, 0o600)
        os.replace(tmp, self.cache_path)  # atomic publish


_TOKEN_MANAGER: BedrockTokenManager | None = None


def token_manager(region: str) -> BedrockTokenManager:
    """Per-process singleton token manager (state is shared on disk)."""
    global _TOKEN_MANAGER
    if _TOKEN_MANAGER is None or _TOKEN_MANAGER.region != region:
        _TOKEN_MANAGER = BedrockTokenManager(region)
    return _TOKEN_MANAGER
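
A possible hardening of the cache write, offered as a sketch and not as part of this change: `_write` fills the temp file with `write_text` and only afterwards applies `chmod 0600`, so the file briefly carries umask-default permissions. Creating the temp file with `tempfile.mkstemp` makes it owner-only from the start. `write_private_json` is a hypothetical helper name.

```python
import json
import os
import tempfile
from pathlib import Path


def write_private_json(path: Path, payload: dict) -> None:
    """Atomically publish `payload` as JSON in a file that is never world-readable."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # mkstemp creates the file readable and writable only by the current user.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh)
        os.replace(tmp_name, path)  # atomic publish on the same filesystem
    except BaseException:
        # Clean up the temp file if writing or publishing failed.
        Path(tmp_name).unlink(missing_ok=True)
        raise
```

The unique temp name also means two writers can never clobber each other's half-written file, although in the manager above the file lock already serializes writers.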
61 changes: 61 additions & 0 deletions benchmark/benchmark/config.py
@@ -0,0 +1,61 @@
"""Provider/run configuration: stable constants and parent-process env wiring."""

from __future__ import annotations

import os

from .auth import token_manager

BEDROCK_BASE_URL = "https://bedrock-mantle.us-east-2.api.aws/v1"
BEDROCK_REGION = "us-east-2"
PROVIDER = "openai"
MODEL_ID = "nvidia.nemotron-super-3-120b"
SCAN_TIMEOUT_SECONDS = 300

# SkillSpector's own "do not install" verdict == this recommendation string.
MALICIOUS_RECOMMENDATION = "DO_NOT_INSTALL"


def configure_run(no_llm: bool, timeout: float) -> tuple[dict, str, str, str]:
    """Set stable env in the parent (inherited by spawned workers) and return
    ``(worker_cfg, provider, model, region)``.

    Stable, non-secret config (provider/model/base-url) is placed in the parent
    ``os.environ`` so every spawned worker inherits it. The volatile bedrock
    token is NOT set here -- each worker mints its own (see ``scan_worker``) so
    it can't go stale across a multi-hour run.
    """
    # Inherited by spawned workers (new interpreters read these at startup), so
    # SkillSpector/langchain import-time warnings and logs stay out of the bar.
    os.environ.setdefault("PYTHONWARNINGS", "ignore")
    os.environ.setdefault("SKILLSPECTOR_LOG_LEVEL", "ERROR")
    cfg = {"use_llm": not no_llm, "mint_bedrock": False, "region": "", "timeout": timeout}
    if no_llm:
        return cfg, "", "", ""
    if os.environ.get("SKILLSPECTOR_PROVIDER"):
        # Caller already configured a provider; trust their env (inherited).
        cfg["region"] = os.environ.get("AWS_REGION", "")
        return (
            cfg,
            os.environ["SKILLSPECTOR_PROVIDER"],
            os.environ.get("SKILLSPECTOR_MODEL", ""),
            cfg["region"],
        )
    # Default path: OpenAI-compatible Bedrock endpoint with a minted token.
    model = os.environ.get("MODEL_OVERRIDE") or os.environ.get("AGENTGUARD_MODEL_ID", MODEL_ID)
    region = os.environ.get("AWS_REGION") or BEDROCK_REGION

If we support overriding the region at runtime, we should build `BEDROCK_BASE_URL` from the region value that was read.
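
One way to act on this comment, as a sketch only: derive the base URL from the resolved region instead of hard-coding it. The host pattern is extrapolated from the single URL in this file, so whether every region exposes an endpoint under that name is an assumption to verify. `BEDROCK_BASE_URL_TEMPLATE` and `bedrock_base_url` are hypothetical names.

```python
# Assumed pattern, generalized from the hard-coded us-east-2 URL.
BEDROCK_BASE_URL_TEMPLATE = "https://bedrock-mantle.{region}.api.aws/v1"


def bedrock_base_url(region: str) -> str:
    """Build the OpenAI-compatible Bedrock base URL for `region`."""
    region = region.strip()
    if not region:
        raise ValueError("region must be a non-empty string")
    return BEDROCK_BASE_URL_TEMPLATE.format(region=region)
```

`configure_run` could then set `OPENAI_BASE_URL` from `bedrock_base_url(region)`, so the token and the endpoint always refer to the same region.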

    manager = token_manager(region)
    manager.clear_abort()  # drop any stale abort marker from a previous run
    try:
        manager.token()  # fail fast (wait=False) + warm the shared cache
    except Exception as e:  # noqa: BLE001 - surface a clear, actionable hint
        raise SystemExit(
            f"Failed to mint a Bedrock token ({type(e).__name__}: {e}).\n"
            "Refresh AWS credentials (e.g. `aws sso login`), set an existing "
            "SKILLSPECTOR_PROVIDER + key in the env, or pass --no-llm."
        ) from e
    os.environ["SKILLSPECTOR_PROVIDER"] = PROVIDER
    os.environ["SKILLSPECTOR_MODEL"] = model
    os.environ["OPENAI_BASE_URL"] = BEDROCK_BASE_URL
    cfg.update(mint_bedrock=True, region=region)
    return cfg, PROVIDER, model, region
33 changes: 33 additions & 0 deletions benchmark/benchmark/dataset_handler/__init__.py
@@ -0,0 +1,33 @@
"""Dataset handler registry and dispatch.

``discover(root)`` picks the first registered handler that recognizes the tree
and delegates to it. Register additional benchmarks in ``_HANDLERS`` (most
specific first); MalSkillBench is the default fallback.
"""

from __future__ import annotations

from pathlib import Path

from ..models import Unit
from .base import DatasetHandler
from .malskillbench import MalSkillBenchHandler

# Register dataset handlers here, most specific first.
_HANDLERS: list[DatasetHandler] = [MalSkillBenchHandler()]


def get_handler(root: Path) -> DatasetHandler:
    """Return the handler that recognizes ``root`` (defaults to MalSkillBench)."""
    for handler in _HANDLERS:
        if handler.matches(root):
            return handler
    # Handlers are ordered most specific first, so the default fallback
    # (MalSkillBench) is the last entry, not the first.
    return _HANDLERS[-1]


def discover(root: Path) -> list[Unit]:
    """Collect all scannable units under ``root`` via the matching handler."""
    return get_handler(root).discover(root)


__all__ = ["DatasetHandler", "discover", "get_handler"]