Skip to content

based refactors: Improve init/cleanup and split Corosync-related functions into new file#4135

Open
nrwahl2 wants to merge 28 commits into
ClusterLabs:mainfrom
nrwahl2:nrwahl2-based_next
Open

based refactors: Improve init/cleanup and split Corosync-related functions into new file#4135
nrwahl2 wants to merge 28 commits into
ClusterLabs:mainfrom
nrwahl2:nrwahl2-based_next

Conversation

@nrwahl2

@nrwahl2 nrwahl2 commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Next batch from #4011

@nrwahl2 nrwahl2 marked this pull request as draft June 22, 2026 20:31
nrwahl2 added 28 commits June 24, 2026 13:52
Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
The NULL checks will get de-duplicated in an upcoming commit.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
Also move the ipcs_{ro,rw,shm} extern declarations from
based_callbacks.h to based_ipc.h, now that there is a more appropriate
home for them. And make ipc_{ro,rw}_callbacks static to based_ipc.c,
since we no longer use them in pacemaker-based.c.

The goal is to look more like attrd, the fencer, and the scheduler.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
The goal is to look more like attrd, the fencer, and the scheduler.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
This seems like a more correct location for it. based_terminate() is
where we free data structures.

This also makes the calls to pcmk__stop_based_ipc() redundant, and
allows making the qb_ipcs_service_t declarations static.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
Also make -1 the "uninitialized" value. 0 is a valid file descriptor,
even though it's much more likely to be used for stdin.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
To look more like attrd and fenced. More change are coming.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
It doesn't make sense to have this handful of unrelated initializations
in a separate function, when most of main() is for initializing things.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
Place the call right before the first thing that needs it
(based_activate_cib()). The rest of the initialization helpers are
called in this region, so let's collect them. We can't make them all
contiguous though -- based_remote_init() requires that the CIB be read
and activated first.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
To mirror attrd and fenced.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
pcmk__corosync_connect() doesn't use the caches directly.
pcmk__get_node() initializes the caches when it gets called.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
Using include-what-you-use.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
It shouldn't matter whether this is a Corosync cluster or not.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
To mirror attrd and fenced.

Notes:
* Previously, we weren't freeing the cluster object on exit in
  based_terminate(). Now we are.
* pcmk_cluster_free() calls pcmk__cluster_destroy_node_caches(), which
  is why we drop the call to that function.
* I'm fairly certain that the reason the pcmk_cluster_disconnect() call
  previously occurred before the done section, is that prior to a recent
  commit, we weren't NULL-checking the cluster argument before
  disconnect. We should be able to call based_cluster_disconnect()
  regardless of how based_terminate() wherever we want to free the
  cluster object.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
And NULL-check the attrd_cluster variable.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
pcmk_cluster_disconnect() now returns EINVAL if passed a NULL argument.
Previously, if given a NULL argument in a Corosync cluster,
pcmk_cluster_disconnect() would call down to pcmk__cpg_disconnect(),
which would dereference the NULL pointer.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
To mirror attrd and the fencer.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
The CIB manager started sending a shutdown request with commit 1f61358
(2006). There is not much detail about the issue this aimed to fix.
However, the commit message says the following:

"""
Fix for a timing issue where a remote CCM registered the DC exited
before the last CIB update(s) from the DC were applied.
...
As long as one node has the update, the join process will make sure it
ends up on the new DC. If for some reason *none* of our peers reply, we
still exit in at most N seconds and we're no worse off than we were
before.
"""

It very much sounds like the problem was tied to legacy mode and however
we handled CIB updates back then. As of Pacemaker 1.1.12, all modifying
operations get forwarded and processed on all nodes. Commit 15c4d2a:

"""
Feature: cib: Send all r/w operations via the cluster connection and
have all nodes process them

The idea of a master cib instance goes away, and all instances apply all
raw (and now ordered thanks to corosync) updates (not diffs)

This ensures updates are never lost, even if there is no DC
"""

That was the replacement for legacy mode behavior. Given that modifying
ops get forwarded to all nodes, I don't see how "...a remote CCM
registered the DC exited before the last CIB update(s) from the DC were
applied" would present a problem. If the DC's CIB manager receives a
modifying request, it will forward it via the cluster layer to all nodes
for processing.

A request could always be dropped if a node abruptly leaves the cluster
before it's able to forward a request that a client sent, but there's
nothing we can do about that. At least the CIB would remain in a
consistent state.

Note that this also makes the peer status callback pointless. So we drop
it, and we also drop the associated "-1" logic from based_terminate().

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
And make it static. It seems logical to have it in the same file as the
main() function.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
This gets rid of the last use of based_cluster outside of
based_corosync.c.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
And rename to cluster. We don't typically care much about making
variable names distinct, although we try to make function names
distinct.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
To replace cib_shutdown_flag.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
To replace global stand_alone variable.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
To drop a global.

Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
Signed-off-by: Reid Wahl <nrwahl@protonmail.com>
@nrwahl2 nrwahl2 force-pushed the nrwahl2-based_next branch from add6809 to 93dc367 Compare June 24, 2026 20:54
@nrwahl2 nrwahl2 marked this pull request as ready for review June 24, 2026 20:54
@nrwahl2 nrwahl2 requested a review from clumens June 24, 2026 20:54
@nrwahl2

nrwahl2 commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

I haven't run cts-lab on this in a while (assuming I ever did so), so that should be done before merging.

@nrwahl2

nrwahl2 commented Jun 26, 2026

Copy link
Copy Markdown
Contributor Author

I haven't run cts-lab on this in a while (assuming I ever did so), so that should be done before merging.

Running now. 12 tests successful so far.

@nrwahl2

nrwahl2 commented Jun 26, 2026

Copy link
Copy Markdown
Contributor Author

Running now. 12 tests successful so far.

Did a 35-test run, all good.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant