bvar: fix bad_weak_ptr crash in AgentCombiner::get_or_create_tls_agent #3306

Open
Gmymymy wants to merge 1 commit into apache:master from Gmymymy:fix/bad-weak-ptr-in-agent-combiner

Gmymymy commented May 22, 2026

Problem

When a bvar object (Reducer, IntRecorder, Percentile) is destroyed
while another thread is concurrently writing to it via operator<<, the
following crash occurs:

terminate called after throwing an instance of 'std::bad_weak_ptr'
  what():  bad_weak_ptr
Aborted (core dumped)

This has been observed in high-concurrency RDMA performance testing
(see #3288).

Root Cause

AgentCombiner inherits from std::enable_shared_from_this.
In get_or_create_tls_agent(), when an agent's combiner weak_ptr has
expired (the previous combiner was destroyed and the slot was reused), the
method calls this->shared_from_this() to re-bind the agent to the
current combiner.

shared_from_this() throws std::bad_weak_ptr if the object is not
currently managed by any shared_ptr. This can happen in a race where:

  1. The AgentCombiner is the last object keeping a bvar alive
  2. Another thread releases the last shared_ptr to it (e.g., the owning
    Reducer goes out of scope or is destroyed during program shutdown)
  3. A third thread is simultaneously inside get_or_create_tls_agent(),
    past the agent->combiner.expired() check, and calls shared_from_this()
    on the now-unmanaged object → bad_weak_ptr → terminate()
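
The throwing behavior described above can be reproduced in isolation. The following is a minimal single-threaded sketch, not brpc code: DemoCombiner and both function names are hypothetical, and a no-op deleter is used (an assumption made only for the demo) so that the object outlives its last shared_ptr without any use-after-free. Since C++17, shared_from_this() is specified to throw std::bad_weak_ptr when no shared_ptr owns the object.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for AgentCombiner; not the brpc class.
struct DemoCombiner : std::enable_shared_from_this<DemoCombiner> {
    // Returns true if shared_from_this() throws std::bad_weak_ptr.
    bool rebind_throws() {
        try {
            std::shared_ptr<DemoCombiner> self = shared_from_this();
            (void)self;
            return false;
        } catch (const std::bad_weak_ptr&) {
            return true;
        }
    }
};

// Simulates "the last shared_ptr was released while the object is still
// being used". The no-op deleter keeps the stack object alive, so only
// the ownership bookkeeping expires.
inline bool throws_after_last_owner_released() {
    DemoCombiner combiner;
    std::shared_ptr<DemoCombiner> owner(&combiner, [](DemoCombiner*) {});
    assert(!combiner.rebind_throws());  // still owned: the call succeeds
    owner.reset();                      // the last owner goes away
    return combiner.rebind_throws();    // the internal weak_ptr has expired
}
```

Calling throws_after_last_owner_released() returns true. In the real race the exception is not caught anywhere, which is why the process ends in terminate().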

Fix

Wrap shared_from_this() in a try/catch(std::bad_weak_ptr). When caught,
return NULL and silently skip the recording. This is safe: the metric is
being torn down, so dropping a write in flight is acceptable and far
preferable to crashing.
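
The shape of this change can be sketched roughly as follows. This is a simplified illustration rather than the actual patch: SketchCombiner, SketchAgent, and get_or_create_agent() are hypothetical names, the thread-local agent slot is reduced to a plain member, and the sketch is single-threaded, so it shows only the control flow.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical, simplified types; the real brpc classes differ.
struct SketchCombiner;

struct SketchAgent {
    std::weak_ptr<SketchCombiner> combiner;
};

struct SketchCombiner : std::enable_shared_from_this<SketchCombiner> {
    SketchAgent agent;  // stand-in for the thread-local agent slot

    // Returns the agent, or NULL if no shared_ptr owns this combiner.
    SketchAgent* get_or_create_agent() {
        if (agent.combiner.expired()) {
            try {
                agent.combiner = shared_from_this();  // re-bind to this combiner
            } catch (const std::bad_weak_ptr&) {
                return NULL;  // combiner is being torn down: skip this write
            }
        }
        return &agent;
    }
};

// An owned combiner hands out a usable agent.
inline bool owned_combiner_returns_agent() {
    std::shared_ptr<SketchCombiner> c = std::make_shared<SketchCombiner>();
    return c->get_or_create_agent() != NULL;
}

// After the last owner is released (no-op deleter keeps the object alive
// for the demo), the call returns NULL instead of throwing.
inline bool released_combiner_returns_null() {
    SketchCombiner c;
    std::shared_ptr<SketchCombiner> owner(&c, [](SketchCombiner*) {});
    owner.reset();
    return c.get_or_create_agent() == NULL;
}
```

Both helper functions return true: the owned combiner binds its agent normally, and the released one reports NULL to the caller.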

Also remove the LOG(FATAL) in the three operator<< callers that fire
when get_or_create_tls_agent() returns NULL:

  • For allocation failure, get_or_create_tls_agent() already calls
    LOG(FATAL) internally (which aborts); the outer LOG(FATAL) was
    unreachable in that case.
  • For the new combiner-expired path, calling LOG(FATAL) would
    incorrectly abort the process for a benign race during teardown.
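
On the caller side, the intended behavior is that a NULL agent turns the write into a no-op. A minimal sketch of that pattern, assuming a hypothetical MiniAdder in place of the real Reducer, IntRecorder, and Percentile operators:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical miniature of a bvar-style adder; not the brpc API.
struct MiniAgent {
    long value = 0;
};

class MiniAdder {
public:
    // A NULL agent models get_or_create_tls_agent() returning NULL.
    explicit MiniAdder(MiniAgent* agent) : _agent(agent) {}

    MiniAdder& operator<<(long v) {
        MiniAgent* agent = _agent;
        if (agent == NULL) {
            return *this;  // no agent available: drop the sample, do not abort
        }
        agent->value += v;
        return *this;
    }

private:
    MiniAgent* _agent;
};

// With an agent, writes accumulate as usual.
inline long sum_with_agent() {
    MiniAgent a;
    MiniAdder adder(&a);
    adder << 2 << 3;
    return a.value;
}

// Without an agent, writes are skipped and the call chain still works.
inline bool write_without_agent_is_noop() {
    MiniAdder adder(NULL);
    adder << 2 << 3;
    return true;
}
```

Here sum_with_agent() returns 5, and write_without_agent_is_noop() completes without touching any agent.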

Affected files

File                             Change
src/bvar/detail/combiner.h       catch bad_weak_ptr in get_or_create_tls_agent()
src/bvar/reducer.h               remove outer LOG(FATAL) from operator<<
src/bvar/recorder.h              remove outer LOG(FATAL) from operator<<
src/bvar/detail/percentile.cpp   remove outer LOG(FATAL) from operator<<

Fixes #3288

When a bvar (Reducer/IntRecorder/Percentile) is destroyed while another
thread is concurrently writing to it, the AgentCombiner may no longer be
managed by any shared_ptr by the time get_or_create_tls_agent() calls
shared_from_this(). This causes std::bad_weak_ptr to be thrown and the
process to terminate.

Fix this by wrapping shared_from_this() in a try-catch: if the combiner
is no longer alive, silently return NULL and skip the recording. This is
safe because the metric is being torn down anyway.

Also remove the now-incorrect LOG(FATAL) in the three operator<< callers.
For allocation failures, get_or_create_tls_agent() already calls
LOG(FATAL) internally (and aborts), so the outer LOG(FATAL) was
unreachable in that case; for the combiner-expired case it would
incorrectly abort the process.

Fixes apache#3288

Successfully merging this pull request may close these issues.

example/rdma_performance performance test occasionally reports bad_weak_ptr, causing a core dump
