forked from NVIDIA/SkillSpector
feat(benchmark): add SkillSpector benchmark harness #5
Open: will-exaforce wants to merge 1 commit into main from wbeasley/add-benchmark
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,6 @@ | ||
| # Benchmark project: local venv, caches, and harness output DBs. | ||
| .venv/ | ||
| .ruff_cache/ | ||
| __pycache__/ | ||
| # DuckDB result files written by the harness (benchmark_<id>.duckdb, -o targets). | ||
| *.duckdb |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,128 @@ | ||
| # SkillSpector Benchmark | ||
| | ||
| A standalone harness that runs SkillSpector over a benchmark dataset | ||
| ([MalSkillBench](https://github.com/NVIDIA/MalSkillBench) today) and scores its | ||
| classifications into DuckDB, so you can measure precision/recall/accuracy. | ||
| | ||
| ## Why this is a separate project | ||
| | ||
| This is a **dev/eval tool**, not part of the shipped `skillspector` package. It | ||
| lives in its own uv project with its own `pyproject.toml`, lockfile, and venv, | ||
| and depends on `skillspector` as an **editable path dependency** (`../`). That | ||
| means: | ||
| | ||
| - Nothing here touches the root `pyproject.toml` / lockfile, so upstream merges | ||
|   from `NVIDIA/skillspector` stay conflict-free. | ||
| - The harness-only dependencies (`duckdb`, `filelock`, `tqdm`, | ||
|   `aws-bedrock-token-generator`) never ship to `skillspector` end users. | ||
| - Edits to `src/skillspector` are picked up immediately (editable install), so | ||
|   you always benchmark your working tree. | ||
| | ||
| ## Running | ||
| | ||
| From the repo root (no `cd` needed): | ||
| | ||
| ```bash | ||
| uv run --directory benchmark benchmark /path/to/MalSkillBench/Dataset | ||
| ``` | ||
| | ||
| Or from inside this directory: | ||
| | ||
| ```bash | ||
| cd benchmark | ||
| uv run benchmark /path/to/MalSkillBench/Dataset | ||
| uv run benchmark .../Dataset/Prompts/indirect-injection -o out.duckdb | ||
| uv run benchmark .../Dataset/Skills/malware --limit 50 --workers 8 | ||
| uv run benchmark .../Dataset --no-llm  # static analysis only | ||
| ``` | ||
| | ||
| > Note: the console-script name (`benchmark`) intentionally matches the package | ||
| > directory, so there is **no** `__main__.py` — adding one would make `uv run | ||
| > benchmark` execute the directory instead of the installed entry point. Invoke | ||
| > it via the `benchmark` command (as above) or `.venv/bin/benchmark` directly. | ||
| | ||
| ### Key flags | ||
| | ||
| | Flag | Meaning | | ||
| |------|---------| | ||
| | `-o, --output` | DuckDB output file (default `benchmark_<id>.duckdb`) | | ||
| | `--no-llm` | static analysis only (no LLM calls) | | ||
| | `--categories` | comma-separated unit categories to scan; default `skill,code`. **Prompts are excluded by default** — they're raw prompt-injection samples, not the repo config files this tool classifies (PI embedded in a skill is still covered via the `skill` units). Pass `--categories skill,code,prompt` to include them. | | ||
| | `--limit N` | cap units **per group** — a unit's parent directory (e.g. `--limit 20` on `Dataset/Skills` keeps 20 from `Skills/malware` *and* 20 from `Skills/benign`); 0 = no cap | | ||
| | `--workers N` | concurrent scan processes (default 8) | | ||
| | `--overwrite` | start fresh instead of resuming an existing DB | | ||
| | `--auth-wait-seconds` | how long to pause for `aws sso login` on a mid-run SSO expiry | | ||
| | ||
| A run is **resumable**: re-running the same command against an existing output | ||
| DB skips already-classified units and re-scans only failures. | ||
| | ||
| ## Output | ||
| | ||
| DuckDB tables `runs`, `units` (ground truth), `classifications` (SkillSpector | ||
| verdict), `issues`, `components`, plus an `evaluation` view labeling every scan | ||
| `TP`/`FP`/`TN`/`FN`/`ERROR`: | ||
| | ||
| ```bash | ||
| duckdb benchmark_<id>.duckdb "SELECT outcome, count(*) FROM evaluation GROUP BY 1" | ||
| ``` | ||
| | ||
| ## Inspecting classifier performance | ||
| | ||
| The `queries/` directory holds ready-made DuckDB queries for evaluating how well | ||
| SkillSpector classified — precision/recall/F1 breakdowns, the false-negative and | ||
| false-positive lists worth eyeballing, run-health checks, and tuning aids. Run | ||
| one against an output DB with `.read`: | ||
| | ||
| ```bash | ||
| duckdb -readonly benchmark_<id>.duckdb ".read queries/01_overview.sql" | ||
| ``` | ||
| | ||
| Each file is self-documenting (header comment explains what it shows) and, except | ||
| `01_overview.sql`, defaults to the **most recent run** in the DB — edit the `run` | ||
| CTE at the top to target a different run or span all of them. | ||
| | ||
| | Query | What it answers | | ||
| |-------|-----------------| | ||
| | `01_overview.sql` | Per-run config + confusion matrix + precision/recall/F1/accuracy (compare runs) | | ||
| | `02_metrics_by_category.sql` | Metrics split by skill / code / prompt | | ||
| | `03_metrics_by_attack_vector.sql` | Metrics split by CI / PI / MIXED | | ||
| | `04_metrics_by_corpus.sql` | Metrics split by source corpus | | ||
| | `05_recall_by_behavior.sql` | Which attack behaviors (B1..B15) evade detection, worst first | | ||
| | `06_false_negatives.sql` | Malware that was cleared — the critical misses | | ||
| | `07_false_positives.sql` | Benign units flagged, with the rules that fired | | ||
| | `08_errors.sql` | Scans that failed (invalidate metrics if high) | | ||
| | `09_classification_status.sql` | LLM vs static vs **fallback** (catches "LLM didn't actually run") | | ||
| | `10_risk_score_distribution.sql` | Score histogram by ground truth — does the score separate classes? | | ||
| | `11_threshold_sweep.sql` | P/R/F1 across hypothetical `risk_score` cutoffs (tune the verdict) | | ||
| | `12_scan_timing.sql` | Timing percentiles + slowest scans | | ||
| | `13_label_coverage.sql` | How ground-truth labels were resolved (bounds trust in 03/05) | | ||
| | `14_top_rules.sql` | Which rules fire, on malicious vs benign (false-positive drivers) | | ||
| | ||
| ## Layout | ||
| | ||
| ``` | ||
| benchmark/ | ||
|   main.py               # argparse + run orchestration | ||
|   config.py             # provider/run constants + parent-process env wiring | ||
|   models.py             # Unit, ScanResult | ||
|   auth.py               # Bedrock bearer-token manager (cross-process cache) | ||
|   db.py                 # DuckDB schema + result persistence | ||
|   runner.py             # the classifier seam: prepare a unit, run SkillSpector | ||
|   utils.py              # cross-cutting helpers | ||
|   dataset_handler/ | ||
|     base.py             # DatasetHandler abstraction | ||
|     malskillbench.py    # MalSkillBench implementation | ||
|     __init__.py         # handler registry + discover() dispatch | ||
| ``` | ||
| | ||
| ## Adding another dataset | ||
| | ||
| Discovery is behind an interface so the harness isn't locked to MalSkillBench: | ||
| | ||
| 1. Subclass `DatasetHandler` (in `dataset_handler/base.py`), implementing | ||
|    `matches(root)` and `discover(root) -> list[Unit]`. | ||
| 2. Register it in `dataset_handler/__init__.py`'s `_HANDLERS` list (most | ||
|    specific first; MalSkillBench is the default fallback). | ||
| | ||
| Everything downstream (runner, DB, scoring) works in terms of `Unit` / | ||
| `ScanResult`, so a new handler is the only code a new benchmark needs. |
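
The README's `evaluation` view labels every scan `TP`/`FP`/`TN`/`FN`/`ERROR`. As a rough illustration of how precision, recall, F1, and accuracy follow from those labels, here is a minimal Python sketch. The `metrics` helper and the sample counts are hypothetical and not part of the harness (the bundled `queries/*.sql` files are the PR's own way of computing these), and leaving `ERROR` rows out of the ratios is an assumption.

```python
from collections import Counter


def metrics(outcomes):
    """Compute precision/recall/F1/accuracy from TP/FP/TN/FN labels.

    ERROR rows are counted separately and excluded from the ratios
    (an assumption for this sketch, not confirmed by the PR).
    """
    counts = Counter(outcomes)
    tp, fp, tn, fn = counts["TP"], counts["FP"], counts["TN"], counts["FN"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    scored = tp + fp + tn + fn
    accuracy = (tp + tn) / scored if scored else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
        "errors": counts["ERROR"],
    }


# Made-up outcome labels, standing in for rows of the `evaluation` view.
sample = ["TP"] * 8 + ["FP"] * 2 + ["TN"] * 9 + ["FN"] * 1 + ["ERROR"]
result = metrics(sample)
print(result)
```

A high `errors` count would make the other numbers hard to trust, which is the same caveat the README attaches to `08_errors.sql`.
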
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| """SkillSpector benchmark harness. | ||
| | ||
| Runs SkillSpector over a benchmark dataset (MalSkillBench today) and scores its | ||
| classifications into DuckDB. See ``main.py`` for the CLI entrypoint. | ||
| """ |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,161 @@ | ||
| """Bedrock bearer-token management, shared across worker processes. | ||
| | ||
| A bedrock bearer token expires (~12h); a long run outlasts it. Workers can't | ||
| share an AWS session object (botocore credentials don't survive ``spawn``), but | ||
| they CAN share the minted token string through a small 0600 cache file. The | ||
| first worker to find it missing/stale re-mints under a file lock (double- | ||
| checked) and writes it; everyone else just reads the cached value. botocore | ||
| itself already shares/refreshes the underlying AWS credentials cross-process | ||
| via its own on-disk cache, so this layer only dedupes the token signing. | ||
| """ | ||
| | ||
| from __future__ import annotations | ||
| | ||
| import json | ||
| import os | ||
| import pathlib | ||
| import sys | ||
| import tempfile | ||
| import time | ||
| from datetime import UTC, datetime, timedelta | ||
| | ||
| _TOKEN_TTL = timedelta(hours=12) | ||
| _TOKEN_REFRESH_MARGIN = timedelta(minutes=30) | ||
| _AUTH_RETRY_INTERVAL = 15  # seconds between re-mint attempts while paused | ||
| | ||
| | ||
| class AuthAbortedError(Exception): | ||
|     """AWS credentials could not be restored within the pause window; give up.""" | ||
| | ||
| | ||
| class BedrockTokenManager: | ||
|     """Cross-process cache for a Bedrock bearer token (file + file lock). | ||
| | ||
|     Re-minting near expiry resolves AWS credentials fresh, so the common case -- | ||
|     short-lived assume-role/STS creds aging out while the SSO *session* is still | ||
|     valid -- self-heals (botocore refreshes via STS). When the SSO *session* | ||
|     itself has expired, minting needs a human ``aws sso login``; ``token()`` | ||
|     then pauses (holding the lock so peers queue), retries until creds return, | ||
|     and -- if the wait window is exceeded -- drops an abort marker so every | ||
|     worker fails fast instead of each blocking for the full window. | ||
|     """ | ||
| | ||
|     def __init__( | ||
|         self, | ||
|         region: str, | ||
|         cache_path: pathlib.Path | None = None, | ||
|         ttl: timedelta = _TOKEN_TTL, | ||
|         refresh_margin: timedelta = _TOKEN_REFRESH_MARGIN, | ||
|     ) -> None: | ||
|         self.region = region | ||
|         self.ttl = ttl | ||
|         self.refresh_margin = refresh_margin | ||
|         self.auth_wait = float(os.environ.get("SKILLSPECTOR_BENCH_AUTH_WAIT", "1800")) | ||
|         default = pathlib.Path(tempfile.gettempdir()) / f"skillspector_bench_token_{region}.json" | ||
|         self.cache_path = pathlib.Path( | ||
|             os.environ.get("SKILLSPECTOR_BENCH_TOKEN_CACHE", cache_path or default) | ||
|         ) | ||
|         self.lock_path = self.cache_path.with_suffix(".lock") | ||
|         self.abort_path = self.cache_path.with_suffix(".abort") | ||
|         self._mem: tuple[str, datetime] | None = None  # in-process fast path | ||
| | ||
|     def token(self, wait: bool = False) -> str: | ||
|         """Return a valid token, refreshing the shared cache if stale. | ||
| | ||
|         ``wait=True`` (workers) pauses and retries on an auth failure so a | ||
|         mid-run SSO expiry can be fixed with ``aws sso login`` without losing | ||
|         progress. ``wait=False`` (parent startup) fails fast so a bad-creds run | ||
|         aborts immediately instead of hanging. | ||
|         """ | ||
|         if self.abort_path.exists(): | ||
|             raise AuthAbortedError("credential refresh already abandoned this run") | ||
|         now = datetime.now(UTC) | ||
|         if self._mem and now < self._mem[1]: | ||
|             return self._mem[0] | ||
|         cached = self._read() | ||
|         if cached and now < cached[1]: | ||
|             self._mem = cached | ||
|             return cached[0] | ||
|         # Missing/stale: exactly one process refreshes; others queue on the lock. | ||
|         from filelock import FileLock | ||
| | ||
|         with FileLock(str(self.lock_path)): | ||
|             if self.abort_path.exists(): | ||
|                 raise AuthAbortedError("credential refresh already abandoned this run") | ||
|             cached = self._read() | ||
|             if cached and datetime.now(UTC) < cached[1]: | ||
|                 self._mem = cached | ||
|                 return cached[0] | ||
|             return self._mint_locked(wait) | ||
| | ||
|     def _mint_locked(self, wait: bool) -> str: | ||
|         """Mint under the held lock; pause-retry on auth failure when waiting.""" | ||
|         deadline = time.monotonic() + self.auth_wait | ||
|         announced = False | ||
|         while True: | ||
|             try: | ||
|                 fresh = self._mint() | ||
|             except Exception as e:  # noqa: BLE001 - any mint failure is retryable | ||
|                 if not wait: | ||
|                     raise  # startup: surface immediately to the caller | ||
|                 if time.monotonic() >= deadline: | ||
|                     self.abort_path.touch()  # signal all workers to stop fast | ||
|                     raise AuthAbortedError( | ||
|                         f"AWS credentials not restored within {int(self.auth_wait)}s ({e})" | ||
|                     ) from e | ||
|                 if not announced: | ||
|                     print( | ||
|                         f"\n[paused] Bedrock token mint failed: {e}\n" | ||
|                         " Run `aws sso login` to restore credentials; the run will " | ||
|                         f"resume automatically.\n Retrying every {_AUTH_RETRY_INTERVAL}s " | ||
|                         f"(giving up in {int(self.auth_wait / 60)}m).", | ||
|                         file=sys.stderr, | ||
|                         flush=True, | ||
|                     ) | ||
|                     announced = True | ||
|                 time.sleep(_AUTH_RETRY_INTERVAL) | ||
|                 continue | ||
|             if announced: | ||
|                 print("[resumed] AWS credentials restored.", file=sys.stderr, flush=True) | ||
|             self._write(*fresh) | ||
|             self._mem = fresh | ||
|             return fresh[0] | ||
| | ||
|     def _mint(self) -> tuple[str, datetime]: | ||
|         from aws_bedrock_token_generator import provide_token | ||
| | ||
|         token = provide_token(region=self.region, expiry=self.ttl) | ||
|         # Refresh a margin before the real 12h expiry for safety. | ||
|         return token, datetime.now(UTC) + self.ttl - self.refresh_margin | ||
| | ||
|     def clear_abort(self) -> None: | ||
|         """Remove any stale abort marker from a previous run.""" | ||
|         self.abort_path.unlink(missing_ok=True) | ||
| | ||
|     def _read(self) -> tuple[str, datetime] | None: | ||
|         try: | ||
|             data = json.loads(self.cache_path.read_text(encoding="utf-8")) | ||
|             return data["token"], datetime.fromisoformat(data["expires_at"]) | ||
|         except (OSError, json.JSONDecodeError, KeyError, ValueError): | ||
|             return None | ||
| | ||
|     def _write(self, token: str, expires_at: datetime) -> None: | ||
|         self.cache_path.parent.mkdir(parents=True, exist_ok=True) | ||
|         tmp = self.cache_path.with_suffix(".tmp") | ||
|         tmp.write_text( | ||
|             json.dumps({"token": token, "expires_at": expires_at.isoformat()}), | ||
|             encoding="utf-8", | ||
|         ) | ||
|         os.chmod(tmp, 0o600) | ||
|         os.replace(tmp, self.cache_path)  # atomic publish | ||
| | ||
| | ||
| _TOKEN_MANAGER: BedrockTokenManager | None = None | ||
| | ||
| | ||
| def token_manager(region: str) -> BedrockTokenManager: | ||
|     """Per-process singleton token manager (state is shared on disk).""" | ||
|     global _TOKEN_MANAGER | ||
|     if _TOKEN_MANAGER is None or _TOKEN_MANAGER.region != region: | ||
|         _TOKEN_MANAGER = BedrockTokenManager(region) | ||
|     return _TOKEN_MANAGER |
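
The file half of the token cache above (atomic publish plus an expiry check on read) can be tried in isolation. The following is a minimal stdlib-only sketch, not the PR's implementation: `write_cache` and `read_cache` are hypothetical names, the token string is a placeholder, and the cross-process `FileLock` step is left out. As a variation, the temp file is created with mode 0600 up front rather than chmod-ed after the write.

```python
from __future__ import annotations

import json
import os
import tempfile
from datetime import datetime, timedelta, timezone
from pathlib import Path


def write_cache(path: Path, token: str, expires_at: datetime) -> None:
    """Publish the token atomically: write a private temp file, then rename."""
    tmp = path.with_suffix(".tmp")
    # A newly created temp file gets mode 0600 (POSIX); an existing one keeps its mode.
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump({"token": token, "expires_at": expires_at.isoformat()}, fh)
    os.replace(tmp, path)  # readers see either the old file or the new one


def read_cache(path: Path, now: datetime) -> str | None:
    """Return the cached token if present, parseable, and not yet stale."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
        token = data["token"]
        expires_at = datetime.fromisoformat(data["expires_at"])
    except (OSError, json.JSONDecodeError, KeyError, ValueError):
        return None
    return token if now < expires_at else None


with tempfile.TemporaryDirectory() as d:
    cache = Path(d) / "token.json"
    now = datetime.now(timezone.utc)
    ttl, margin = timedelta(hours=12), timedelta(minutes=30)
    # Store the refresh deadline (expiry minus margin), as the diff does.
    write_cache(cache, "example-token", now + ttl - margin)
    fresh = read_cache(cache, now)          # within the window
    stale = read_cache(cache, now + ttl)    # past the refresh deadline
    missing = read_cache(Path(d) / "absent.json", now)

print(fresh, stale, missing)
```

Because a corrupt or missing file simply reads as `None`, a caller can treat "no cache", "bad cache", and "stale cache" as the same refresh case.
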
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,61 @@ | ||
| """Provider/run configuration: stable constants and parent-process env wiring.""" | ||
| | ||
| from __future__ import annotations | ||
| | ||
| import os | ||
| | ||
| from .auth import token_manager | ||
| | ||
| BEDROCK_BASE_URL = "https://bedrock-mantle.us-east-2.api.aws/v1" | ||
| BEDROCK_REGION = "us-east-2" | ||
| PROVIDER = "openai" | ||
| MODEL_ID = "nvidia.nemotron-super-3-120b" | ||
| SCAN_TIMEOUT_SECONDS = 300 | ||
| | ||
| # SkillSpector's own "do not install" verdict == this recommendation string. | ||
| MALICIOUS_RECOMMENDATION = "DO_NOT_INSTALL" | ||
| | ||
| | ||
| def configure_run(no_llm: bool, timeout: float) -> tuple[dict, str, str, str]: | ||
|     """Set stable env in the parent (inherited by spawned workers) and return | ||
|     ``(worker_cfg, provider, model, region)``. | ||
| | ||
|     Stable, non-secret config (provider/model/base-url) is placed in the parent | ||
|     ``os.environ`` so every spawned worker inherits it. The volatile bedrock | ||
|     token is NOT set here -- each worker mints its own (see ``scan_worker``) so | ||
|     it can't go stale across a multi-hour run. | ||
|     """ | ||
|     # Inherited by spawned workers (new interpreters read these at startup), so | ||
|     # SkillSpector/langchain import-time warnings and logs stay out of the bar. | ||
|     os.environ.setdefault("PYTHONWARNINGS", "ignore") | ||
|     os.environ.setdefault("SKILLSPECTOR_LOG_LEVEL", "ERROR") | ||
|     cfg = {"use_llm": not no_llm, "mint_bedrock": False, "region": "", "timeout": timeout} | ||
|     if no_llm: | ||
|         return cfg, "", "", "" | ||
|     if os.environ.get("SKILLSPECTOR_PROVIDER"): | ||
|         # Caller already configured a provider; trust their env (inherited). | ||
|         cfg["region"] = os.environ.get("AWS_REGION", "") | ||
|         return ( | ||
|             cfg, | ||
|             os.environ["SKILLSPECTOR_PROVIDER"], | ||
|             os.environ.get("SKILLSPECTOR_MODEL", ""), | ||
|             cfg["region"], | ||
|         ) | ||
|     # Default path: OpenAI-compatible Bedrock endpoint with a minted token. | ||
|     model = os.environ.get("MODEL_OVERRIDE") or os.environ.get("AGENTGUARD_MODEL_ID", MODEL_ID) | ||
|     region = os.environ.get("AWS_REGION") or BEDROCK_REGION | ||
|     manager = token_manager(region) | ||
|     manager.clear_abort()  # drop any stale abort marker from a previous run | ||
|     try: | ||
|         manager.token()  # fail fast (wait=False) + warm the shared cache | ||
|     except Exception as e:  # noqa: BLE001 - surface a clear, actionable hint | ||
|         raise SystemExit( | ||
|             f"Failed to mint a Bedrock token ({type(e).__name__}: {e}).\n" | ||
|             "Refresh AWS credentials (e.g. `aws sso login`), set an existing " | ||
|             "SKILLSPECTOR_PROVIDER + key in the env, or pass --no-llm." | ||
|         ) from e | ||
|     os.environ["SKILLSPECTOR_PROVIDER"] = PROVIDER | ||
|     os.environ["SKILLSPECTOR_MODEL"] = model | ||
|     os.environ["OPENAI_BASE_URL"] = BEDROCK_BASE_URL | ||
|     cfg.update(mint_bedrock=True, region=region) | ||
|     return cfg, PROVIDER, model, region | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,33 @@ | ||
| """Dataset handler registry and dispatch. | ||
| | ||
| ``discover(root)`` picks the first registered handler that recognizes the tree | ||
| and delegates to it. Register additional benchmarks in ``_HANDLERS`` (most | ||
| specific first); MalSkillBench is the default fallback. | ||
| """ | ||
| | ||
| from __future__ import annotations | ||
| | ||
| from pathlib import Path | ||
| | ||
| from ..models import Unit | ||
| from .base import DatasetHandler | ||
| from .malskillbench import MalSkillBenchHandler | ||
| | ||
| # Register dataset handlers here, most specific first. | ||
| _HANDLERS: list[DatasetHandler] = [MalSkillBenchHandler()] | ||
| | ||
| | ||
| def get_handler(root: Path) -> DatasetHandler: | ||
|     """Return the handler that recognizes ``root`` (defaults to MalSkillBench).""" | ||
|     for handler in _HANDLERS: | ||
|         if handler.matches(root): | ||
|             return handler | ||
|     return _HANDLERS[0] | ||
| | ||
| | ||
| def discover(root: Path) -> list[Unit]: | ||
|     """Collect all scannable units under ``root`` via the matching handler.""" | ||
|     return get_handler(root).discover(root) | ||
| | ||
| | ||
| __all__ = ["DatasetHandler", "discover", "get_handler"] |
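
To make the "first matching handler wins, with a default fallback" dispatch concrete, here is a self-contained Python sketch. Everything in it is a stand-in: the `Unit` fields, `MarkerHandler`, `FallbackHandler`, and the `MANIFEST.txt` marker file are hypothetical, since the real `Unit` and `DatasetHandler` definitions are not part of the diff shown here.

```python
from __future__ import annotations

import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Unit:
    """Stand-in for the harness's Unit; the real fields are not shown in the diff."""
    path: Path
    label: str


class DatasetHandler(ABC):
    @abstractmethod
    def matches(self, root: Path) -> bool: ...

    @abstractmethod
    def discover(self, root: Path) -> list[Unit]: ...


class MarkerHandler(DatasetHandler):
    """Hypothetical dataset recognised by a MANIFEST.txt file listing unit names."""

    def matches(self, root: Path) -> bool:
        return (root / "MANIFEST.txt").is_file()

    def discover(self, root: Path) -> list[Unit]:
        names = (root / "MANIFEST.txt").read_text(encoding="utf-8").split()
        return [Unit(root / name, "malicious") for name in names]


class FallbackHandler(DatasetHandler):
    """Plays the default role: treats every subdirectory as a unit."""

    def matches(self, root: Path) -> bool:
        return True

    def discover(self, root: Path) -> list[Unit]:
        return [Unit(p, "unknown") for p in sorted(root.iterdir()) if p.is_dir()]


# Most specific first; the catch-all default goes last in this sketch.
HANDLERS: list[DatasetHandler] = [MarkerHandler(), FallbackHandler()]


def get_handler(root: Path) -> DatasetHandler:
    for handler in HANDLERS:
        if handler.matches(root):
            return handler
    return HANDLERS[-1]


with tempfile.TemporaryDirectory() as d:
    marked, plain = Path(d) / "marked", Path(d) / "plain"
    marked.mkdir()
    (plain / "skill_a").mkdir(parents=True)
    (marked / "MANIFEST.txt").write_text("u1 u2\n", encoding="utf-8")
    marked_handler = type(get_handler(marked)).__name__
    plain_handler = type(get_handler(plain)).__name__
    marked_units = [u.path.name for u in get_handler(marked).discover(marked)]
    plain_units = [u.path.name for u in get_handler(plain).discover(plain)]

print(marked_handler, marked_units, plain_handler, plain_units)
```

One difference from the diff: the sketch falls back to the last entry, whereas `get_handler` in the PR returns `_HANDLERS[0]`. The two agree while only one handler is registered, but once more specific handlers are listed ahead of MalSkillBench, `_HANDLERS[0]` would no longer be the MalSkillBench handler.
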
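If we support overriding the region at runtime, we should build `BEDROCK_BASE_URL` from the region value that is read. As written, `AWS_REGION` changes the region used to mint the token, but `OPENAI_BASE_URL` still points at the hard-coded `us-east-2` endpoint.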
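
One possible shape for that change, as a minimal sketch: `bedrock_base_url` and `resolve_region` are hypothetical helper names, and the sketch assumes endpoints in other regions follow the same host pattern as the `us-east-2` constant in the diff, which would need checking against the AWS documentation.

```python
import os

DEFAULT_REGION = "us-east-2"  # mirrors BEDROCK_REGION in the diff


def bedrock_base_url(region: str) -> str:
    """Build the OpenAI-compatible endpoint URL for a region.

    Assumption: every region uses the host pattern of the PR's constant.
    """
    return f"https://bedrock-mantle.{region}.api.aws/v1"


def resolve_region(env=None) -> str:
    """Pick the region the same way the diff does: AWS_REGION, else the default."""
    env = os.environ if env is None else env
    return env.get("AWS_REGION") or DEFAULT_REGION


default_url = bedrock_base_url(resolve_region({}))
override_url = bedrock_base_url(resolve_region({"AWS_REGION": "us-west-2"}))
print(default_url)
print(override_url)
```

With something like this, `configure_run` could set `OPENAI_BASE_URL` from the same `region` it passes to `token_manager`, so the token and the endpoint cannot drift apart.
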