
Read xz/gz-compressed AF3 confidences.json #20

Merged
DimaMolod merged 2 commits into main from read-compressed-confidences
Jun 16, 2026


@DimaMolod

Collaborator

Summary

AlphaPulldown's new --storage_mode slim/minimal (KosinskiLab/AlphaPulldown#627) may store the large AF3 per-sample confidences.json compressed (xz) to save disk space. AlphaJudge could only read plain JSON, so when confidences.json was compressed or absent it silently fell back to summary_confidences.json, whose PAE is only a coarse per-chain-pair minimum.

On a real complex the two are not interchangeable: the full token×token matrix has mean ≈29 / max 31.7, while the summary-derived matrix has mean ≈3.7 / max 7.08. So every PAE-derived score for that sample (average_interface_pae, interface_ipSAE, pDockQ2, LIS, …) was being computed from the wrong data, with no error or warning.
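To make the gap concrete, here is a toy sketch of what happens when a token×token PAE matrix is replaced by per-chain-pair minima. The numbers are invented for illustration, the helper name `chain_pair_min_matrix` is hypothetical, and the assumption that the summary values are broadcast back over each chain-pair block is mine; it is not taken from AlphaJudge's code.

```python
from statistics import mean


def chain_pair_min_matrix(pae, chain_ids):
    """Replace every entry with the minimum PAE of its (row chain, column chain) block.

    Hypothetical helper: it only mimics the kind of coarsening described above.
    """
    n = len(pae)
    block_min = {}
    for i in range(n):
        for j in range(n):
            key = (chain_ids[i], chain_ids[j])
            block_min[key] = min(block_min.get(key, float("inf")), pae[i][j])
    return [[block_min[(chain_ids[i], chain_ids[j])] for j in range(n)] for i in range(n)]


# Toy 4-token complex: two tokens in chain A, two in chain B (made-up values).
full_pae = [
    [0.5, 2.0, 20.0, 28.0],
    [2.0, 0.5, 25.0, 30.0],
    [22.0, 27.0, 0.5, 3.0],
    [29.0, 31.0, 3.0, 0.5],
]
chain_ids = ["A", "A", "B", "B"]

coarse_pae = chain_pair_min_matrix(full_pae, chain_ids)
full_mean = mean(v for row in full_pae for v in row)
coarse_mean = mean(v for row in coarse_pae for v in row)
print(f"full mean={full_mean:.2f}, chain-pair-min mean={coarse_mean:.2f}")
```

Because a minimum can never exceed any entry in its block, the coarse matrix is always at most the full one entry by entry, so scores computed from it will look systematically more confident.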

Changes

  • BaseParser._read_json detects xz/gz by magic bytes (not file extension) and, when the requested plain path is absent, transparently falls back to a .xz/.gz sibling.
  • AF3Parser._find_af3_json now offers .xz/.gz variants of each candidate, and no longer lets summary_confidences.json satisfy a "confidences" search via the *_confidences.json glob. This is a latent shadowing bug that exists independently of compression; it was masked only while the plain confidences.json happened to be candidate[0].
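The reading behaviour in the first bullet can be sketched roughly as follows. This is a minimal standalone illustration, not the merged implementation: `read_json_any` is a hypothetical stand-in for BaseParser._read_json, and the choices to read the whole file into memory and to try the .xz sibling before the .gz one are assumptions.

```python
import gzip
import json
import lzma
from pathlib import Path

XZ_MAGIC = b"\xfd7zXZ\x00"  # first six bytes of an xz stream
GZ_MAGIC = b"\x1f\x8b"      # first two bytes of a gzip stream


def read_json_any(path):
    """Read JSON that may be plain, xz-, or gzip-compressed (hypothetical helper).

    Compression is detected from the leading bytes, not the file name. If the
    requested path does not exist, a '.xz' or '.gz' sibling is tried instead.
    Returns {} when none of the candidates exist.
    """
    path = Path(path)
    candidates = [path, Path(str(path) + ".xz"), Path(str(path) + ".gz")]
    for candidate in candidates:
        if not candidate.is_file():
            continue
        raw = candidate.read_bytes()
        if raw.startswith(XZ_MAGIC):
            raw = lzma.decompress(raw)
        elif raw.startswith(GZ_MAGIC):
            raw = gzip.decompress(raw)
        return json.loads(raw.decode("utf-8"))
    return {}
```

Sniffing magic bytes means an xz stream saved under a plain .json name still decodes, which is the case the new tests exercise.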

Validation

End-to-end on a real plasmodium_hap2 AF3 run processed with --storage_mode slim (non-best confidences.json.xz):

  • Before: scoring with --models_to_analyse all read summary_confidences.json for non-best samples (PAE mean ≈3.5).
  • After: each sample reads its full PAE matrix from the .xz file (max_pae 31.7), and the full score CSV is byte-identical to the uncompressed/vanilla run in both best and all modes.
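A byte-identity check like the one described can be reproduced with the standard library. The sketch below is only an illustration; the function name and the CSV paths in the usage comment are hypothetical, not the ones used in this validation.

```python
import filecmp


def same_bytes(path_a, path_b):
    """Return True only if both files have identical contents.

    shallow=False forces a content comparison instead of trusting os.stat() metadata.
    """
    return filecmp.cmp(path_a, path_b, shallow=False)


# Hypothetical usage:
# same_bytes("scores_vanilla.csv", "scores_slim.csv")
```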

Tests

6 new tests in test/test_parsers_and_runner.py:

  • _read_json reads plain / .xz / .gz, detects compression by magic bytes (xz under a .json name), falls back to a compressed sibling, and returns {} for a truly-missing file.
  • _find_af3_json finds a compressed confidences.json.xz and does not shadow confidences with summary_confidences.json.

Full parser/runner suite passes (external-data tests skip as before).

🤖 Generated with Claude Code

DimaMolod and others added 2 commits June 15, 2026 18:57
AlphaPulldown's new --storage_mode slim/minimal may store the large AF3
per-sample confidences.json compressed (xz). AlphaJudge previously could
only read plain JSON, so for a compressed (or absent) confidences.json it
silently fell back to summary_confidences.json, whose PAE is a coarse
per-chain-pair minimum (full token x token matrix mean ~29 vs summary ~3.7
on a real complex) — degrading every PAE-derived score for that sample with
no error.

- _read_json now detects xz/gz by magic bytes (not extension) and, when the
  requested plain path is absent, transparently falls back to a .xz/.gz
  sibling.
- _find_af3_json now also offers .xz/.gz variants of each candidate, and no
  longer lets summary_confidences.json satisfy a "confidences" search via the
  *_confidences.json glob (a latent shadowing bug independent of compression).

Verified end to end: AlphaJudge scoring a slim-compressed AF3 run in
--models_to_analyse all mode now reads each sample's full PAE matrix and
produces byte-identical scores to the uncompressed (vanilla) run.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The summary-shadowing guard only skipped files literally starting with
"summary_", but the official AF3 / AlphaPulldown best-model layout names the
summary "<job>_summary_confidences.json", which does not start with "summary_".
Match by suffix instead so both "summary_confidences.json" and
"<job>_summary_confidences.json" (optionally .xz/.gz) are excluded from a
"confidences" search and cannot shadow the real confidences file.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
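
The suffix-based guard described in the second commit could look roughly like the sketch below. The function name `is_summary_confidences` is hypothetical, and the real check in AF3Parser._find_af3_json may be structured differently.

```python
def is_summary_confidences(filename):
    """True for 'summary_confidences.json' and '<job>_summary_confidences.json',
    with or without a trailing '.xz' or '.gz' (hypothetical helper)."""
    name = filename
    for ext in (".xz", ".gz"):
        if name.endswith(ext):
            name = name[: -len(ext)]  # strip one compression suffix
            break
    return name.endswith("summary_confidences.json")


# A "confidences" search would then drop summary files before picking a candidate.
names = [
    "job_summary_confidences.json",
    "job_confidences.json.xz",
    "summary_confidences.json.gz",
]
confidences_candidates = [n for n in names if not is_summary_confidences(n)]
```

Matching on the suffix rather than a "summary_" prefix is what lets the job-prefixed name be excluded as well.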
@chatgpt-codex-connector


You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@DimaMolod DimaMolod merged commit 19f962c into main Jun 16, 2026
12 checks passed
@DimaMolod DimaMolod deleted the read-compressed-confidences branch June 16, 2026 06:25
