Skip to content

remove nccl pd mode.#1342

Merged
hiworldwzj merged 17 commits into
mainfrom
wzj_del_pd_new
Jun 10, 2026
Merged

remove nccl pd mode.#1342
hiworldwzj merged 17 commits into
mainfrom
wzj_del_pd_new

Conversation

@hiworldwzj

Copy link
Copy Markdown
Collaborator

No description provided.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the Prefill-Decode (PD) disaggregation mode by consolidating transport backends, renaming NIXL-specific components to generic PD components, and removing global synchronization locks (such as g_infer_state_lock and g_router_lock) to simplify the concurrency model. It also introduces support for an alternative NCCL-based KV transporter backend. Feedback on these changes highlights a critical GPU race condition in decode_trans_process.py when using the NCCL backend due to a lack of stream synchronization. Additionally, the reviewer noted that hardcoding self.is_pd_decode_mode = False in the detokenization manager disables decode mode checks, and suggested using .strip().lower() when parsing the LIGHTLLM_PD_KV_TRANSPORT_BACKEND environment variable for safer input handling.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines 385 to 386
copy_start_event.record(self.copy_cuda_stream)
cur_mem = self.mem_managers[self.device_id]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

When using the NCCL backend, the NCCL receive operation (comm.recv) is enqueued asynchronously on the peer's stream (_NcclPeer._get_stream()). However, the actual copying of the KV cache pages to memory (read_page_kv_move_buffer_to_mem) is executed on self.copy_cuda_stream without any stream synchronization. This creates a severe GPU race condition (data hazard) where the copy stream can read from kv_move_buffer before the NCCL receive has finished writing to it, leading to corrupted KV cache values. Adding stream synchronization via wait_stream resolves this hazard safely.

                if hasattr(self.transporter, "_get_peer"):
                    peer = self.transporter._get_peer(trans_task.prefill_agent_name)
                    self.copy_cuda_stream.wait_stream(peer._get_stream())
                copy_start_event.record(self.copy_cuda_stream)
                cur_mem = self.mem_managers[self.device_id]

self.eos_id = args.eos_id
self._init_get_token_id_to_token_str()
self.is_pd_decode_mode = self.args.run_mode == "decode"
self.is_pd_decode_mode = False

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Hardcoding self.is_pd_decode_mode = False disables the PD decode mode check in the detokenization manager, even when running in "decode" mode. If detokenization should be bypassed or handled specially on decode nodes, this should be set to self.args.run_mode == "decode" as it was before.

Suggested change
self.is_pd_decode_mode = False
self.is_pd_decode_mode = self.args.run_mode == "decode"



def create_kv_transporter(args: StartArgs, node_id: int, tp_idx: int, kv_move_buffer: Tensor):
backend = os.getenv("LIGHTLLM_PD_KV_TRANSPORT_BACKEND", "nixl").lower()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

It is safer to use .strip().lower() when retrieving the LIGHTLLM_PD_KV_TRANSPORT_BACKEND environment variable to handle any leading/trailing whitespaces gracefully.

Suggested change
backend = os.getenv("LIGHTLLM_PD_KV_TRANSPORT_BACKEND", "nixl").lower()
backend = os.getenv("LIGHTLLM_PD_KV_TRANSPORT_BACKEND", "nixl").strip().lower()

@hiworldwzj hiworldwzj merged commit 8196f35 into main Jun 10, 2026
1 check passed
@hiworldwzj hiworldwzj deleted the wzj_del_pd_new branch June 10, 2026 12:26
@hiworldwzj hiworldwzj restored the wzj_del_pd_new branch June 10, 2026 12:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant