Capture SWE-bench patches from failed runs #751
@OpenHands /review Please run an AI review for this now-main-based generic failed-run patch capture PR. CI is green and Apptainer-specific changes have been split into #745.

Uh oh! There was an unexpected error starting the job :(

@OpenHands /review

I'm on it! neubig can track my progress at all-hands.dev
Code Review
Overall: ✅ LGTM
The PR is well-designed and implements a clean solution for capturing git patches from failed SWE-bench runs.
What this PR does:
Strengths:
Verified:
Minor notes (not blockers):
This is a solid implementation that cleanly solves the problem of capturing partial outputs from failed runs.

OpenHands encountered an error: Failed to send message to agent server: HTTP 503 error: no available server. See the conversation for more information.

@OpenHands /review

Uh oh! There was an unexpected error starting the job :(

@OpenHands /review

I'm on it! neubig can track my progress at all-hands.dev
Code Review
Overall: ✅ LGTM
The PR is well-designed and implements a clean solution for capturing git patches from failed SWE-bench runs. The tests pass and pre-commit checks are clean.
What this PR does:
Strengths:
Files changed:
Test results:
Recommendation: Approve - the implementation is clean, well-tested, and follows the existing codebase patterns. Ready for merge when the draft status is lifted.

OpenHands encountered an error: Failed to send message to agent server: HTTP 503 error: no available server. See the conversation for more information.
Resolve conflict in benchmarks/swebench/run_infer.py: keep both the PR's failed-run patch capture methods (get_repo_path, get_git_patch, get_staged_git_patch, collect_failure_test_result) and main's new Apptainer helper methods (get_apptainer_extra_bind_mounts, get_apptainer_mount_dir). The two additions are independent and non-overlapping. Co-authored-by: openhands <openhands@all-hands.dev>
Summary
- Populate `test_result` on SWE-bench error outputs so failed runs can still be scored when they produced a patch
- Use `git diff --cached <base_commit>` because failed runs do not commit and therefore do not advance `HEAD`
Scope
This PR is workspace-backend agnostic. It uses the existing `RemoteWorkspace.execute_command()` API and applies to Docker, Apptainer, and other remote workspaces where `/workspace/<repo>` is a git repo.
Apptainer-specific workspace/image changes are kept out of this PR and live in #745.
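For readers who want to picture the mechanism, here is a minimal sketch of backend-agnostic patch capture, not the PR's actual code. It assumes a workspace object whose `execute_command()` takes a shell string and returns something with `exit_code` and `stdout` attributes; that return shape, the `git_patch` key, and every helper name below (`staged_patch_command`, `capture_failure_test_result`, `FakeWorkspace`) are hypothetical and chosen only for illustration.

```python
import shlex
from dataclasses import dataclass, field
from typing import Dict, List, Protocol


class CommandResult(Protocol):
    # Assumed result shape; the real RemoteWorkspace result may differ.
    exit_code: int
    stdout: str


class Workspace(Protocol):
    def execute_command(self, command: str) -> CommandResult: ...


def staged_patch_command(repo_path: str, base_commit: str) -> str:
    """Build a shell command that stages everything and diffs the index
    against the base commit. A failed run never commits, so HEAD has not
    advanced and the work only shows up in the working tree / index."""
    repo = shlex.quote(repo_path)
    base = shlex.quote(base_commit)
    return f"cd {repo} && git add -A && git diff --no-color --cached {base}"


def capture_failure_test_result(
    workspace: Workspace, repo_name: str, base_commit: str
) -> Dict[str, str]:
    """Return a test_result-style dict holding whatever patch the failed
    run left behind. The "git_patch" key is an assumption."""
    repo_path = f"/workspace/{repo_name}"
    result = workspace.execute_command(staged_patch_command(repo_path, base_commit))
    if result.exit_code != 0:
        # Capture is best-effort: a broken repo yields an empty patch.
        return {"git_patch": ""}
    return {"git_patch": result.stdout}


# --- Stand-in workspace so the sketch runs without Docker or Apptainer ---
@dataclass
class FakeResult:
    exit_code: int
    stdout: str


@dataclass
class FakeWorkspace:
    stdout: str
    exit_code: int = 0
    commands: List[str] = field(default_factory=list)

    def execute_command(self, command: str) -> FakeResult:
        self.commands.append(command)  # record what would have been run
        return FakeResult(self.exit_code, self.stdout)


if __name__ == "__main__":
    ws = FakeWorkspace(stdout="diff --git a/x.py b/x.py\n")
    print(capture_failure_test_result(ws, "django", "abc123"))
    print(ws.commands[0])
```

Because the function only depends on `execute_command()`, the same call would work for any backend that exposes a git checkout under `/workspace/<repo>`, which is the backend-agnostic property described in the Scope section.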
Verification
`PYTHONPATH="$PWD" /home/gneubig/work/openhands-benchmarks-venv/bin/python -m pytest tests/test_swebench_failure_patch_capture.py -q`
Draft while the full ADP SWE-bench experiment continues running.
Issue
Related: #749
Closes #758