Capture SWE-bench patches from failed runs by neubig · Pull Request #751 · OpenHands/benchmarks

neubig · 2026-06-12T11:13:00Z

Summary

Preserve test_result on SWE-bench error outputs so that failed runs can still be scored when they produced a patch.
Stage failed SWE-bench workspaces before collecting failure patches so that newly created files are included.
Collect failed-run patches from the staged index with git diff --cached <base_commit>, because failed runs do not commit and therefore do not advance HEAD.
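
The two-step sequence in the summary (stage everything, then diff the index against the base commit) can be sketched roughly as follows. This is an illustrative sketch, not the PR's code: the `Executor` callable is a hypothetical stand-in for RemoteWorkspace.execute_command(), whose real signature is not shown in this conversation, and the helper names `staged_patch_commands` and `collect_staged_patch` are made up for the example. Returning an empty patch when a step fails is also an assumption made to keep the sketch short.

```python
import shlex
from typing import Callable, List, Tuple

# Hypothetical executor: takes a shell command, returns (exit_code, stdout).
# It stands in for RemoteWorkspace.execute_command(); the real API may differ.
Executor = Callable[[str], Tuple[int, str]]


def staged_patch_commands(repo_path: str, base_commit: str) -> List[str]:
    """Build the two shell commands: stage everything, then diff the index."""
    repo = shlex.quote(repo_path)
    base = shlex.quote(base_commit)
    return [
        # Stage tracked and untracked changes so new files show up in the diff.
        f"git -C {repo} add -A",
        # Diff the staged index against the base commit; HEAD never moved.
        f"git -C {repo} diff --cached {base}",
    ]


def collect_staged_patch(run: Executor, repo_path: str, base_commit: str) -> str:
    """Return the staged diff against base_commit, or "" if either step fails."""
    add_cmd, diff_cmd = staged_patch_commands(repo_path, base_commit)
    exit_code, _ = run(add_cmd)
    if exit_code != 0:
        return ""  # assumption: treat a failed `git add` as "no patch"
    exit_code, patch = run(diff_cmd)
    return patch if exit_code == 0 else ""
```

Staging first matters because a plain git diff ignores untracked files, so files the agent created during the run would otherwise be missing from the patch.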

Scope

This PR is workspace-backend agnostic. It uses the existing RemoteWorkspace.execute_command() API and applies to Docker, Apptainer, and other remote workspaces where /workspace/<repo> is a git repo.

Apptainer-specific workspace/image changes are kept out of this PR and live in #745.

Verification

PYTHONPATH="$PWD" /home/gneubig/work/openhands-benchmarks-venv/bin/python -m pytest tests/test_swebench_failure_patch_capture.py -q
ADP partial-scoring smoke test: a patch captured from a stuck run's error output was applied and scored successfully by the SWE-bench scorer.

Draft while the full ADP SWE-bench experiment continues running.

Issue

Related: #749

Closes #758

neubig · 2026-06-12T13:08:41Z

@OpenHands /review

Please run an AI review for this now-main-based generic failed-run patch capture PR. CI is green and Apptainer-specific changes have been split into #745.

openhands-ai · 2026-06-12T13:09:25Z

Uh oh! There was an unexpected error starting the job :(

neubig · 2026-06-12T13:10:25Z

@OpenHands /review

openhands-ai · 2026-06-12T13:10:50Z

I'm on it! neubig can track my progress at all-hands.dev

neubig · 2026-06-12T13:15:09Z

Code Review

Overall: ✅ LGTM

The PR is well-designed and implements a clean solution for capturing git patches from failed SWE-bench runs.

What this PR does:

Adds a collect_failure_test_result() hook to the base Evaluation class that benchmarks can override
The SWE-bench implementation stages the workspace with git add -A and captures a patch with git diff --cached
Preserves test_result in error outputs so failed runs can still be scored
Properly handles exceptions during patch collection with warnings

Strengths:

Clean architecture: The hook is in the base class, making it reusable across benchmarks
Proper error handling: Both git add and git diff failures are handled gracefully with warnings
Good separation of concerns: get_repo_path(), get_git_patch(), get_staged_git_patch(), and collect_failure_test_result() are all separate, focused methods
Defensive programming: Uses dict(test_result or {}) to avoid issues with None values
Comprehensive tests: Two tests covering both base class preservation and SWE-bench-specific behavior
Documentation: Good comments explaining the intent
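
The hook pattern described in this review might look something like the sketch below. Only the names Evaluation, collect_failure_test_result(), and the dict(test_result or {}) idiom come from the conversation; the hook's parameter list, the `build_error_output` helper, the `PatchCapturingEvaluation` subclass, and the dictionary-shaped workspace are all hypothetical and exist only to make the example runnable.

```python
from typing import Any, Dict, Optional


class Evaluation:
    """Hypothetical stand-in for the base class; only the hook is sketched."""

    def collect_failure_test_result(
        self, workspace: Any, error: Exception
    ) -> Optional[Dict[str, Any]]:
        # Default hook: nothing to salvage. `error` is unused on purpose,
        # since subclasses are expected to override this method.
        return None

    def build_error_output(
        self, instance_id: str, workspace: Any, error: Exception
    ) -> Dict[str, Any]:
        # Hypothetical helper: assemble the error record for a failed run.
        test_result = self.collect_failure_test_result(workspace, error)
        return {
            "instance_id": instance_id,
            "error": str(error),
            # dict(x or {}) turns None into {} and copies any real mapping.
            "test_result": dict(test_result or {}),
        }


class PatchCapturingEvaluation(Evaluation):
    """Hypothetical benchmark subclass that salvages a patch on failure."""

    def collect_failure_test_result(
        self, workspace: Any, error: Exception
    ) -> Optional[Dict[str, Any]]:
        # The workspace is a plain dict here purely for illustration.
        return {"git_patch": workspace.get("staged_patch", "")}
```

With this shape, a benchmark that does not override the hook still produces an error record with an empty test_result, while one that does can attach whatever partial output it managed to recover.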

Verified:

✅ Pre-commit checks pass (Ruff format, Ruff lint, PEP8, Pyright)
✅ Tests pass (2/2)

Minor notes (not blockers):

The unused error parameter in the base collect_failure_test_result hook is fine since it's meant to be overridden
The cast(RemoteWorkspace, workspace) in tests is necessary for type checking

This is a solid implementation that cleanly solves the problem of capturing partial outputs from failed runs.

openhands-ai · 2026-06-12T13:17:17Z

OpenHands encountered an error: **Failed to send message to agent server: HTTP 503 error: no available server**

See the conversation for more information.

neubig · 2026-06-12T13:27:18Z

@OpenHands /review

openhands-ai · 2026-06-12T13:28:01Z

Uh oh! There was an unexpected error starting the job :(

neubig · 2026-06-12T14:50:39Z

@OpenHands /review

openhands-ai · 2026-06-12T14:51:04Z

I'm on it! neubig can track my progress at all-hands.dev

neubig · 2026-06-12T14:56:54Z

Code Review

Overall: ✅ LGTM

The PR is well-designed and implements a clean solution for capturing git patches from failed SWE-bench runs. The tests pass and pre-commit checks are clean.

What this PR does:

Adds a collect_failure_test_result() hook to the base Evaluation class that benchmarks can override
The SWE-bench implementation stages the workspace with git add -A and captures a patch with git diff --cached
Preserves test_result in error outputs so failed runs can still be scored
Properly handles exceptions during patch collection with appropriate warnings

Strengths:

Clean hook design: The collect_failure_test_result() hook in Evaluation is a minimal, well-documented interface that allows other benchmarks to add their own failure recovery logic
Workspace-agnostic: Uses the existing RemoteWorkspace.execute_command() API, making it applicable to Docker, Apptainer, and other backends
Preserves workspace state: git add -A stages changes without committing, keeping the failed workspace in its final agent-visible state
Comprehensive error handling: Both git add failures and patch collection errors are caught and logged as warnings, never blocking evaluation
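
The "caught and logged as warnings, never blocking evaluation" behaviour described above could be approximated with a small wrapper like the one below. This is a sketch under assumptions: the `safe_collect` name, the logger name, and the zero-argument `collector` callable are invented for illustration and are not taken from the PR.

```python
import logging
from typing import Any, Callable, Dict, Optional

logger = logging.getLogger("failure_patch_capture")  # hypothetical logger name


def safe_collect(
    collector: Callable[[], Optional[Dict[str, Any]]], instance_id: str
) -> Dict[str, Any]:
    """Run a failure-result collector; on any exception, warn and return {}."""
    try:
        # Normalize None to {} so callers always receive a dictionary.
        return dict(collector() or {})
    except Exception as exc:  # deliberately broad: salvage must never raise
        logger.warning(
            "Could not collect failure patch for %s: %s", instance_id, exc
        )
        return {}
```

The broad except is a deliberate trade-off in this sketch: patch capture is a best-effort salvage step on a run that has already failed, so a second failure there is reported but not allowed to mask the original error.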

Files changed:

benchmarks/utils/evaluation.py: +34 lines for the hook and integration
benchmarks/swebench/run_infer.py: +84 lines for SWE-bench specific implementation
tests/test_swebench_failure_patch_capture.py: +94 lines for comprehensive test coverage

Test results:

tests/test_swebench_failure_patch_capture.py::test_error_output_preserves_failure_test_result PASSED
tests/test_swebench_failure_patch_capture.py::test_swebench_collect_failure_test_result_gets_git_patch PASSED

Recommendation:

Approve - the implementation is clean, well-tested, and follows the existing codebase patterns. Ready for merge when the draft status is lifted.

openhands-ai · 2026-06-12T14:58:05Z

OpenHands encountered an error: **Failed to send message to agent server: HTTP 503 error: no available server**

See the conversation for more information.

Resolve conflict in benchmarks/swebench/run_infer.py: keep both the PR's failed-run patch capture methods (get_repo_path, get_git_patch, get_staged_git_patch, collect_failure_test_result) and main's new Apptainer helper methods (get_apptainer_extra_bind_mounts, get_apptainer_mount_dir). The two additions are independent and non-overlapping.

Co-authored-by: openhands <openhands@all-hands.dev>

neubig added 2 commits June 10, 2026 17:31

Capture SWE-bench patches on failed runs

ff53789

Fix SWE-bench failure patch test typing

9ca3af7

neubig changed the title ~~Capture SWE-bench patches on failed runs~~ Capture SWE-bench failure patches and bind Apptainer workspaces Jun 12, 2026

Fix SWE-bench failure patch diffs

127766b

neubig force-pushed the capture-swebench-error-patches branch from 75930f2 to 127766b Compare June 12, 2026 13:00

neubig changed the title ~~Capture SWE-bench failure patches and bind Apptainer workspaces~~ Capture SWE-bench patches from failed runs Jun 12, 2026

neubig mentioned this pull request Jun 12, 2026

Add SWE-bench Apptainer image builds and runtime wiring #745

Merged

Format failed-run patch diff command

ca7715e

This was referenced Jun 23, 2026

Capture SWE-bench patches from failed runs #756

Closed

Track PR #748: Capture SWE-bench patches on failed runs #749

Closed

Capture SWE-bench patches on failed runs #748

Closed

Capture SWE-bench patches from failed runs for scoring #758

Closed

neubig marked this pull request as ready for review June 25, 2026 19:12

neubig merged commit 4e5469e into main Jun 25, 2026
2 checks passed

neubig deleted the capture-swebench-error-patches branch June 25, 2026 19:12
