Skip to content

Ci/asan validation#1209

Open
akkart-aws wants to merge 3 commits intoaws:masterfrom
akkart-aws:ci/asan_validation
Open

Ci/asan validation#1209
akkart-aws wants to merge 3 commits intoaws:masterfrom
akkart-aws:ci/asan_validation

Conversation

@akkart-aws
Copy link
Copy Markdown
Contributor

Description of changes:

Add a new asan job to the PR CI workflow that builds the plugin with --enable-asan --enable-debug on Ubuntu 24.04 (gcc-13) and runs unit tests with ASAN leak detection enabled
(detect_leaks=1, halt_on_error=1, exitcode=1).

Any memory leak in unit test code that is not freed by test exit will cause the CI job to fail. On failure, the test-suite.log is uploaded as an artifact for debugging.

Also fixes memory leaks in the environ unit test — each test function (no_change_check, addition_check, late_check, replace_check) allocated env arrays and strings via malloc/
strdup but never freed them. With these fixes, all 18 unit tests pass under ASAN with zero suppressions.

Changes:

  • .github/workflows/distcheck.yaml: new asan job
  • tests/unit/environ.cpp: add cleanup to all test functions

Testing:

  • Built with --enable-asan --enable-debug, ran make check with leak detection — 17 pass, 1 skip (gdrcopy), 0 fail, no suppressions needed.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@akkart-aws akkart-aws requested a review from a team as a code owner April 27, 2026 20:48
Comment thread .github/workflows/distcheck.yaml Outdated

asan:
runs-on: ubuntu-latest
container: ghcr.io/${{ github.repository }}/aws-ofi-nccl-ubuntu2404:cuda-gcc-13-none-efalattest
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sigh; this is so fragile

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the separate job entirely. ASAN is now enabled on all existing debug builds across all platforms (AL2023, U2204, U2404) and all compilers (gcc and clang). Right. We have another SIM to have just one docker container per OS. That should make the github-actions PR-CI tests more cleaner.

Comment thread .github/workflows/distcheck.yaml Outdated
--enable-picky-compiler=yes \
--enable-platform-aws \
--with-cuda=/usr/local/cuda \
--enable-asan \
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason we're only running this for CUDA? Is there a reason that we're only running this for a specific test?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reason. I fixed this. I made the ASAN run on every debug build in the matrix: both CUDA and Neuron, all compilers, all tracing configs.

@akkart-aws akkart-aws force-pushed the ci/asan_validation branch 5 times, most recently from d54657c to ef75d7c Compare April 28, 2026 23:01
@akkart-aws
Copy link
Copy Markdown
Contributor Author

bot:aws:retest

2 similar comments
@akkart-aws
Copy link
Copy Markdown
Contributor Author

bot:aws:retest

@akkart-aws
Copy link
Copy Markdown
Contributor Author

bot:aws:retest

@anshumang anshumang force-pushed the ci/asan_validation branch from ef75d7c to cccee90 Compare May 9, 2026 04:06
akkart-aws and others added 3 commits May 8, 2026 21:12
Each test function (no_change_check, addition_check, late_check,
replace_check) allocated env arrays and strings via malloc/strdup
but never freed them. Add cleanup at the end of each function.

This allows running unit tests under ASAN with leak detection
enabled and no suppression file.

Signed-off-by: Arun Karthik <akkart@amazon.com>
Enable --enable-asan on all debug builds across AL2023, Ubuntu 22.04,
and Ubuntu 24.04 for both gcc and clang. ASAN leak detection env vars
(detect_leaks=1, halt_on_error=1, exitcode=1) are set on make check
so any memory leak in unit tests fails CI.

Fix nccl_ofi_memcheck_asan.h to support clang's ASAN detection
(__has_feature(address_sanitizer)) in addition to gcc's
__SANITIZE_ADDRESS__ macro.

Add libasan to the AL2023 container Dockerfile since it is not
bundled with gcc on Amazon Linux.

Signed-off-by: Arun Karthik <akkart@amazon.com>
The Dockerfile.al2023 change adding libasan only takes effect after
update-containers.yaml runs (scheduled weekly or workflow_dispatch).
Until then, PR CI pulls the existing image which lacks libasan, and
debug builds linked with -fsanitize=address fail at test runtime with
'libasan.so.6: cannot open shared object file'.

Install libasan inline for debug builds so this PR's CI can validate
ASAN leak detection. The step becomes a no-op once the container
image is rebuilt and can be removed in a follow-up.

Signed-off-by: Anshuman Goswami <anshumgo@amazon.com>
@anshumang anshumang force-pushed the ci/asan_validation branch from cccee90 to c3bfa00 Compare May 9, 2026 04:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants