Ci/asan validation#1209
Conversation
|
|
||
| asan: | ||
| runs-on: ubuntu-latest | ||
| container: ghcr.io/${{ github.repository }}/aws-ofi-nccl-ubuntu2404:cuda-gcc-13-none-efalattest |
There was a problem hiding this comment.
sigh; this is so fragile
There was a problem hiding this comment.
Removed the separate job entirely. ASAN is now enabled on all existing debug builds across all platforms (AL2023, U2204, U2404) and all compilers (gcc and clang). Right. We have another SIM to have just one docker container per OS. That should make the github-actions PR-CI tests more cleaner.
| --enable-picky-compiler=yes \ | ||
| --enable-platform-aws \ | ||
| --with-cuda=/usr/local/cuda \ | ||
| --enable-asan \ |
There was a problem hiding this comment.
is there a reason we're only running this for CUDA? Is there a reason that we're only running this for a specific test?
There was a problem hiding this comment.
No reason. I fixed this. I made the ASAN run on every debug build in the matrix: both CUDA and Neuron, all compilers, all tracing configs.
d54657c to
ef75d7c
Compare
|
bot:aws:retest |
2 similar comments
|
bot:aws:retest |
|
bot:aws:retest |
ef75d7c to
cccee90
Compare
Each test function (no_change_check, addition_check, late_check, replace_check) allocated env arrays and strings via malloc/strdup but never freed them. Add cleanup at the end of each function. This allows running unit tests under ASAN with leak detection enabled and no suppression file. Signed-off-by: Arun Karthik <akkart@amazon.com>
Enable --enable-asan on all debug builds across AL2023, Ubuntu 22.04, and Ubuntu 24.04 for both gcc and clang. ASAN leak detection env vars (detect_leaks=1, halt_on_error=1, exitcode=1) are set on make check so any memory leak in unit tests fails CI. Fix nccl_ofi_memcheck_asan.h to support clang's ASAN detection (__has_feature(address_sanitizer)) in addition to gcc's __SANITIZE_ADDRESS__ macro. Add libasan to the AL2023 container Dockerfile since it is not bundled with gcc on Amazon Linux. Signed-off-by: Arun Karthik <akkart@amazon.com>
The Dockerfile.al2023 change adding libasan only takes effect after update-containers.yaml runs (scheduled weekly or workflow_dispatch). Until then, PR CI pulls the existing image which lacks libasan, and debug builds linked with -fsanitize=address fail at test runtime with 'libasan.so.6: cannot open shared object file'. Install libasan inline for debug builds so this PR's CI can validate ASAN leak detection. The step becomes a no-op once the container image is rebuilt and can be removed in a follow-up. Signed-off-by: Anshuman Goswami <anshumgo@amazon.com>
cccee90 to
c3bfa00
Compare
Description of changes:
Add a new asan job to the PR CI workflow that builds the plugin with --enable-asan --enable-debug on Ubuntu 24.04 (gcc-13) and runs unit tests with ASAN leak detection enabled
(detect_leaks=1, halt_on_error=1, exitcode=1).
Any memory leak in unit test code that is not freed by test exit will cause the CI job to fail. On failure, the test-suite.log is uploaded as an artifact for debugging.
Also fixes memory leaks in the environ unit test — each test function (no_change_check, addition_check, late_check, replace_check) allocated env arrays and strings via malloc/
strdup but never freed them. With these fixes, all 18 unit tests pass under ASAN with zero suppressions.
Changes:
Testing:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.