Calibrate percentile sliders on positives only; drop slider poly-line #15

Merged
DimaMolod merged 1 commit into main from percentile-positives-only
Jun 10, 2026
@DimaMolod DimaMolod commented Jun 10, 2026

Collaborator

Summary

Three corrections to the AlphaJudge interface metascore / percentile slider report, rebased onto main after the 1.1.0 release (which shipped the validation-report + meta-score feature). The diff is just these changes on top of current main — no paper/benchmark artifacts.

1. Percentile sliders calibrated on POSITIVES ONLY

BENCHMARK_QUANTILES (meta_score.py) was computed over the entire benchmark — 7,756 rows, half non-interacting database-negatives — which inflates the percentile of a mediocre prediction. Recomputed over the 3,878 positive (interacting) AF2/AF3 rows only, so a prediction is ranked against the distribution of real interfaces.

| metric | value | old percentile (all rows) | new percentile (positives) |
| --- | --- | --- | --- |
| interface_ipSAE | 0.50 | 74.5 | 48.9 |
| interface_LIS | 0.30 | 72.2 | 46.3 |
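
The direction of this shift can be reproduced with a minimal sketch. It assumes a percentile is read off by linear interpolation over frozen deciles (the interpolation actually used in meta_score.py is not shown in this PR), and it uses synthetic scores, not the benchmark. `freeze_deciles` and `percentile_of` are hypothetical names.

```python
import numpy as np

LEVELS = np.arange(0, 101, 10, dtype=float)  # 0th, 10th, ..., 100th percentile

def freeze_deciles(values):
    """Return the 11 decile cut points of `values`."""
    return np.percentile(np.asarray(values, dtype=float), LEVELS)

def percentile_of(value, deciles):
    """Map a raw value to a 0-100 percentile by interpolating over the deciles."""
    return float(np.interp(value, deciles, LEVELS))

# Synthetic stand-ins (assumption): positives tend to score high, negatives low.
rng = np.random.default_rng(0)
positives = rng.beta(5, 3, size=2000)
negatives = rng.beta(2, 6, size=2000)
all_rows = np.concatenate([positives, negatives])

p_all = percentile_of(0.5, freeze_deciles(all_rows))
p_pos = percentile_of(0.5, freeze_deciles(positives))
print(f"all rows: {p_all:.1f}   positives only: {p_pos:.1f}")
```

Because the low-scoring negatives sit below most positives, the same raw value lands at a higher percentile on the mixed scale than on the positives-only scale.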

Per-feature AUROC unchanged (monotonic transform); production interface_meta_score AUROC on the balanced benchmark is 0.878 (AP 0.910).
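
Why AUROC is insensitive to the recalibration can be checked with a short sketch. It assumes the percentile mapping is strictly increasing over the observed range; a mapping that clips at the ends could introduce ties and shift AUROC slightly. The `auroc` helper and the two toy calibrations are illustrative, not AlphaJudge code.

```python
import numpy as np

def auroc(scores, labels):
    """Rank-based AUROC (Mann-Whitney U); tied scores share their average rank."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(scores, kind="mergesort")
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    for v in np.unique(scores):          # average ranks within each tie group
        mask = scores == v
        ranks[mask] = ranks[mask].mean()
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2
    return float(u / (n_pos * n_neg))

# Synthetic data (assumption): 200 positives, 200 negatives, raw scores in [0, 1].
rng = np.random.default_rng(1)
labels = np.r_[np.ones(200, dtype=bool), np.zeros(200, dtype=bool)]
raw = np.r_[rng.beta(5, 3, 200), rng.beta(2, 6, 200)]

# Two different strictly increasing piecewise-linear "calibrations".
knots = np.linspace(0.0, 1.0, 11)
cal_a = np.interp(raw, knots, np.linspace(0.0, 100.0, 11))
cal_b = np.interp(raw, knots, 100.0 * knots**2)

auc_raw, auc_a, auc_b = (auroc(s, labels) for s in (raw, cal_a, cal_b))
print(auc_raw, auc_a, auc_b)
```

AUROC depends only on the ordering of scores, so any calibration that preserves that ordering leaves it unchanged.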

2. Removed the black poly-line connecting the sliders

Each metric still shows its black percentile marker; the connecting line is gone.

3. Meta marker consistency (addresses earlier Codex P2)

_row_meta_score now recomputes from the current calibration first, falling back to a precomputed interface_meta_score column only when the raw feature columns are absent — so the Meta marker can't silently use a stale all-rows score while the sliders use the new one.
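
The recompute-first order described above might look roughly like the following sketch. The feature list, the `compute_from_features` callable, and the rule "fall back when any feature is missing" are assumptions for illustration; `_safe_float` here is a stand-in, since the real helper in report.py is not shown.

```python
import math

# Hypothetical feature list; the real set of raw features is not shown in this PR.
FEATURES = ("interface_ipSAE", "interface_LIS")

def _safe_float(value):
    """Stand-in helper: return a finite float, or None if missing or unparseable."""
    try:
        out = float(value)
    except (TypeError, ValueError):
        return None
    return out if math.isfinite(out) else None

def row_meta_score(row, compute_from_features):
    """Prefer a fresh score from raw features; use the stored column only as fallback."""
    feats = {name: _safe_float(row.get(name)) for name in FEATURES}
    if all(v is not None for v in feats.values()):
        return compute_from_features(feats)  # current calibration wins
    return _safe_float(row.get("interface_meta_score"))  # possibly stale

# Toy scorer standing in for the calibrated meta score (assumption).
mean_score = lambda feats: sum(feats.values()) / len(feats)

fresh = row_meta_score(
    {"interface_ipSAE": "0.5", "interface_LIS": "0.3", "interface_meta_score": "0.99"},
    mean_score,
)
stale = row_meta_score({"interface_meta_score": "0.99"}, mean_score)
print(fresh, stale)
```

In the first call the stored 0.99 is ignored because the raw features are available; in the second call it is the only information left, so it is returned.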

Reproducibility

scripts/freeze_metascore_quantiles.py regenerates the deciles from any benchmark CSV (--label-filter positive|negative|all; all reproduces the prior scale bit-for-bit). Uses csv + numpy only — no pandas (not an AlphaJudge dependency; addresses the other earlier Codex P2).
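
A csv + numpy decile freezer with a label filter can be sketched as below. This is not the script itself: the `label` column name, the `"1"` encoding for positives, and the function name `freeze_deciles` are assumptions, and the demo data is made up.

```python
import csv
import io

import numpy as np

def freeze_deciles(csv_file, column, label_filter="positive", label_column="label"):
    """Read one numeric column via csv.DictReader and return its 11 deciles."""
    values = []
    for row in csv.DictReader(csv_file):
        # Assumption: positives are encoded as "1" in `label_column`.
        is_positive = (row.get(label_column) or "").strip() == "1"
        if label_filter == "positive" and not is_positive:
            continue
        if label_filter == "negative" and is_positive:
            continue
        try:
            values.append(float(row[column]))
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric value
    return np.percentile(np.array(values, dtype=float), np.arange(0, 101, 10))

# Made-up demo rows, not benchmark data.
text = "label,interface_ipSAE\n1,0.2\n1,0.6\n1,0.8\n0,0.0\n0,0.1\n"
pos = freeze_deciles(io.StringIO(text), "interface_ipSAE", "positive")
everything = freeze_deciles(io.StringIO(text), "interface_ipSAE", "all")
print(pos[0], pos[-1], everything[0])
```

Keeping the reader on the standard-library `csv` module means the only third-party import is numpy, which is the point of the dependency fix.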

Report wording updated throughout to state the scale is built from interacting (positive) pairs.

Test plan

  • pytest test/test_meta_score.py test/test_report.py → 9/9 pass.
  • py_compile clean; freeze script runs in a stock env and its output is bit-identical to meta_score.py.
  • Rendered the slider panel: poly-line absent, markers present, legend reads "Percentile vs interacting (positive) benchmark pairs".

🤖 Generated with Claude Code

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 002228d355

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/alphajudge/report.py Outdated
Comment on lines +177 to +179

    direct = _safe_float(row.get("interface_meta_score"))
    if direct is not None:
        return direct

[P2] Recompute the meta score with the new calibration

When the input CSV already contains an interface_meta_score column, this returns that value before recalculating from the new positives-only quantiles. That means aggregate/per-run reports generated from older or externally merged CSVs can still rank interfaces and draw the Meta score marker using the legacy all-rows calibration while the feature sliders use the new calibration, contradicting the report text that all percentiles are against interacting positive pairs. Prefer recomputing here (or validating the calibration version) so stale scores do not silently survive the recalibration.

Useful? React with 👍 / 👎.

Collaborator Author

Fixed in dbb818b. _row_meta_score now recomputes the meta score from the current positives-only calibration first, so the Meta marker is consistent with the recalibrated feature sliders. A precomputed interface_meta_score column (which may carry the legacy all-rows calibration on older/merged CSVs) is only used as a fallback when the raw feature columns are absent.

Comment thread scripts/freeze_metascore_quantiles.py Outdated
Comment on lines +30 to +31
import numpy as np
import pandas as pd

[P2] Declare pandas for the quantile freezer

This new reproducibility script imports pandas, but the project dependencies in pyproject.toml do not declare pandas. In a normal AlphaJudge install that only has the packaged dependencies, running the advertised python scripts/freeze_metascore_quantiles.py ... path will fail immediately with ModuleNotFoundError before it can regenerate the frozen deciles; either add pandas as a dependency/extra or avoid requiring it here.

Useful? React with 👍 / 👎.

Collaborator Author

Fixed in dbb818b. The script was rewritten to use csv + numpy only (no pandas), matching the existing analyze_interface_meta_score.py convention, so the advertised command runs in a stock AlphaJudge install. Output is bit-identical to the pandas version (verified < 1e-12).

Rebased onto main 1.1.0 (which shipped the validation-report/meta-score
feature). Applies three corrections on top of the current report:

- meta_score.py: BENCHMARK_QUANTILES recomputed on POSITIVE (interacting)
  benchmark pairs only (3,878 AF2/AF3 positive rows) instead of all 7,756 rows
  (half non-interacting database-negatives). A prediction is now ranked against
  the distribution of real interfaces, not a decoy-padded population: e.g.
  interface_ipSAE=0.5 maps to the 49th percentile (was 75th); interface_LIS=0.30
  to 46th (was 72nd). Per-feature AUROC is unchanged (monotonic transform);
  production interface_meta_score AUROC on the balanced benchmark is 0.878.

- report.py: removed the black poly-line connecting per-group slider markers
  (each metric still shows its black percentile marker). _row_meta_score now
  recomputes the meta score from the current calibration first, falling back to
  a precomputed interface_meta_score column only when raw features are absent,
  so the Meta marker stays consistent with the recalibrated sliders. Wording
  updated to state the scale is built from interacting (positive) pairs.

- scripts/freeze_metascore_quantiles.py: reproduces the deciles from any
  benchmark CSV (--label-filter positive|negative|all; "all" reproduces the
  prior all-rows scale bit-for-bit). Uses csv+numpy only (no pandas, which is
  not an AlphaJudge dependency).

Tests: test_meta_score.py + test_report.py pass (9/9).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@DimaMolod DimaMolod force-pushed the percentile-positives-only branch from dbb818b to db8697f Compare June 10, 2026 07:44
@DimaMolod DimaMolod changed the title from "Percentile sliders: calibrate on positives only; drop slider poly-line" to "Calibrate percentile sliders on positives only; drop slider poly-line" Jun 10, 2026
@DimaMolod DimaMolod merged commit 45dc059 into main Jun 10, 2026
8 checks passed
@DimaMolod DimaMolod deleted the percentile-positives-only branch June 10, 2026 07:52