Calibrate percentile sliders on positives only; drop slider poly-line #15
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 002228d355
direct = _safe_float(row.get("interface_meta_score"))
if direct is not None:
    return direct
Recompute the meta score with the new calibration
When the input CSV already contains an interface_meta_score column, this returns that value before recalculating from the new positives-only quantiles. That means aggregate/per-run reports generated from older or externally merged CSVs can still rank interfaces and draw the Meta score marker using the legacy all-rows calibration while the feature sliders use the new calibration, contradicting the report text that all percentiles are against interacting positive pairs. Prefer recomputing here (or validating the calibration version) so stale scores do not silently survive the recalibration.
Fixed in dbb818b. _row_meta_score now recomputes the meta score from the current positives-only calibration first, so the Meta marker is consistent with the recalibrated feature sliders. A precomputed interface_meta_score column (which may carry the legacy all-rows calibration on older/merged CSVs) is only used as a fallback when the raw feature columns are absent.
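The recompute-first ordering described in this reply can be sketched as below. This is a minimal illustration, not the code in report.py: the feature list `RAW_FEATURES`, the function name `row_meta_score`, and the `score_from_features` callback are assumptions made for the example, and the real `_safe_float` may behave differently.

```python
import math
from typing import Callable, Mapping, Optional, Sequence

# Hypothetical feature list; the columns the real meta score uses may differ.
RAW_FEATURES: Sequence[str] = ("interface_ipSAE", "interface_LIS")


def _safe_float(value: object) -> Optional[float]:
    """Return value as a finite float, or None if missing or unparsable."""
    try:
        result = float(value)  # type: ignore[arg-type]
    except (TypeError, ValueError):
        return None
    return result if math.isfinite(result) else None


def row_meta_score(
    row: Mapping[str, object],
    score_from_features: Callable[[Mapping[str, float]], float],
) -> Optional[float]:
    """Prefer a fresh score from raw features; fall back to the stored column."""
    features = {name: _safe_float(row.get(name)) for name in RAW_FEATURES}
    if all(value is not None for value in features.values()):
        # Raw features are present: recompute under the current calibration.
        return score_from_features(features)
    # Raw features are missing: the precomputed column is the only option,
    # even though it may carry an older calibration.
    return _safe_float(row.get("interface_meta_score"))
```

With this ordering, a stale `interface_meta_score` value is ignored whenever the raw feature columns are available, and a row with neither yields `None`.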
import numpy as np
import pandas as pd
Declare pandas for the quantile freezer
This new reproducibility script imports pandas, but the project dependencies in pyproject.toml do not declare pandas. In a normal AlphaJudge install that only has the packaged dependencies, running the advertised python scripts/freeze_metascore_quantiles.py ... path will fail immediately with ModuleNotFoundError before it can regenerate the frozen deciles; either add pandas as a dependency/extra or avoid requiring it here.
Fixed in dbb818b. The script was rewritten to use csv + numpy only (no pandas), matching the existing analyze_interface_meta_score.py convention, so the advertised command runs in a stock AlphaJudge install. Output is bit-identical to the pandas version (verified < 1e-12).
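A rough sketch of how deciles can be frozen with csv + numpy alone is shown below. It is not the contents of scripts/freeze_metascore_quantiles.py: the function name `freeze_deciles`, the `label` column with `"1"`/`"0"` values, and the choice of 11 evenly spaced quantile levels are all assumptions for illustration.

```python
import csv
import io
from typing import List

import numpy as np


def freeze_deciles(
    csv_text: str,
    feature: str,
    label_column: str = "label",      # hypothetical column name
    label_filter: str = "positive",   # "positive", "negative", or "all"
) -> List[float]:
    """Compute the 0%, 10%, ..., 100% quantiles of one feature column."""
    values = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        label = row.get(label_column)
        if label_filter == "positive" and label != "1":
            continue
        if label_filter == "negative" and label != "0":
            continue
        try:
            value = float(row[feature])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or unparsable value
        if np.isfinite(value):
            values.append(value)
    if not values:
        raise ValueError(f"no usable values for {feature!r}")
    levels = np.linspace(0.0, 1.0, 11)
    return np.quantile(np.asarray(values), levels).tolist()
```

Passing `label_filter="all"` keeps every row, which is the analogue of reproducing an all-rows scale; `"positive"` restricts the scale to interacting pairs.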
Rebased onto main 1.1.0 (which shipped the validation-report/meta-score feature). Applies three corrections on top of the current report:

- meta_score.py: BENCHMARK_QUANTILES recomputed on POSITIVE (interacting) benchmark pairs only (3,878 AF2/AF3 positive rows) instead of all 7,756 rows (half non-interacting database-negatives). A prediction is now ranked against the distribution of real interfaces, not a decoy-padded population: e.g. interface_ipSAE=0.5 maps to the 49th percentile (was 75th); interface_LIS=0.30 to 46th (was 72nd). Per-feature AUROC is unchanged (monotonic transform); production interface_meta_score AUROC on the balanced benchmark is 0.878.
- report.py: removed the black poly-line connecting per-group slider markers (each metric still shows its black percentile marker). _row_meta_score now recomputes the meta score from the current calibration first, falling back to a precomputed interface_meta_score column only when raw features are absent, so the Meta marker stays consistent with the recalibrated sliders. Wording updated to state the scale is built from interacting (positive) pairs.
- scripts/freeze_metascore_quantiles.py: reproduces the deciles from any benchmark CSV (--label-filter positive|negative|all; "all" reproduces the prior all-rows scale bit-for-bit). Uses csv+numpy only (no pandas, which is not an AlphaJudge dependency).

Tests: test_meta_score.py + test_report.py pass (9/9).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
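To illustrate why the reference population matters, the sketch below maps a raw value to a percentile by linear interpolation between frozen deciles, once against positives only and once against a pool padded with low-scoring negatives. The data are synthetic and the helper names (`deciles`, `percentile_of`) are hypothetical; the interpolation scheme is an assumption and may not match how meta_score.py uses BENCHMARK_QUANTILES.

```python
import numpy as np

LEVELS = np.linspace(0.0, 100.0, 11)  # 0th, 10th, ..., 100th percentile


def deciles(values: np.ndarray) -> np.ndarray:
    """Freeze the decile boundaries of a reference population."""
    return np.quantile(values, LEVELS / 100.0)


def percentile_of(value: float, frozen: np.ndarray) -> float:
    """Interpolate a percentile for value; clamps to 0 and 100 at the ends."""
    return float(np.interp(value, frozen, LEVELS))


# Synthetic stand-ins for a benchmark: positives score higher than negatives.
positives = np.linspace(0.2, 1.0, 101)
negatives = np.linspace(0.0, 0.4, 101)

pct_positive_only = percentile_of(0.5, deciles(positives))
pct_all_rows = percentile_of(0.5, deciles(np.concatenate([positives, negatives])))
print(f"positives only: {pct_positive_only:.1f}, all rows: {pct_all_rows:.1f}")
```

On this toy data the same raw value of 0.5 sits at about the 37.5th percentile of the positives but much higher in the padded pool, which is the kind of inflation the recalibration removes. Because the mapping is non-decreasing in the raw value, it does not reorder predictions, which is consistent with the per-feature AUROC being unchanged.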
dbb818b to db8697f
Summary
Three corrections to the AlphaJudge interface metascore / percentile slider report, rebased onto main after the 1.1.0 release (which shipped the validation-report + meta-score feature). The diff is just these changes on top of current main, with no paper/benchmark artifacts.

1. Percentile sliders calibrated on POSITIVES ONLY
BENCHMARK_QUANTILES (meta_score.py) was computed over the entire benchmark (7,756 rows, half of them non-interacting database-negatives), which inflates the percentile of a mediocre prediction. It is now recomputed over the 3,878 positive (interacting) AF2/AF3 rows only, so a prediction is ranked against the distribution of real interfaces.

interface_ipSAE = 0.5: 49th percentile (was 75th)
interface_LIS = 0.30: 46th percentile (was 72nd)

Per-feature AUROC is unchanged (monotonic transform); production interface_meta_score AUROC on the balanced benchmark is 0.878 (AP 0.910).

2. Removed the black poly-line connecting the sliders
Each metric still shows its black percentile marker; the connecting line is gone.
3. Meta marker consistency (addresses earlier Codex P2)
_row_meta_score now recomputes from the current calibration first, falling back to a precomputed interface_meta_score column only when the raw feature columns are absent, so the Meta marker can't silently use a stale all-rows score while the sliders use the new one.

Reproducibility
scripts/freeze_metascore_quantiles.py regenerates the deciles from any benchmark CSV (--label-filter positive|negative|all; all reproduces the prior scale bit-for-bit). Uses csv + numpy only, with no pandas (not an AlphaJudge dependency; addresses the other earlier Codex P2).

Report wording updated throughout to state the scale is built from interacting (positive) pairs.
Test plan
pytest test/test_meta_score.py test/test_report.py → 9/9 pass.
py_compile clean; freeze script runs in a stock env and its output is bit-identical to meta_score.py.

🤖 Generated with Claude Code