remove nccl pd mode.#1342
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the Prefill-Decode (PD) disaggregation mode by consolidating transport backends, renaming NIXL-specific components to generic PD components, and removing global synchronization locks (such as g_infer_state_lock and g_router_lock) to simplify the concurrency model. It also introduces support for an alternative NCCL-based KV transporter backend. Feedback on these changes highlights a critical GPU race condition in decode_trans_process.py when using the NCCL backend due to a lack of stream synchronization. Additionally, the reviewer noted that hardcoding self.is_pd_decode_mode = False in the detokenization manager disables decode mode checks, and suggested using .strip().lower() when parsing the LIGHTLLM_PD_KV_TRANSPORT_BACKEND environment variable for safer input handling.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| copy_start_event.record(self.copy_cuda_stream) | ||
| cur_mem = self.mem_managers[self.device_id] |
There was a problem hiding this comment.
When using the NCCL backend, the NCCL receive operation (comm.recv) is enqueued asynchronously on the peer's stream (_NcclPeer._get_stream()). However, the actual copying of the KV cache pages to memory (read_page_kv_move_buffer_to_mem) is executed on self.copy_cuda_stream without any stream synchronization. This creates a severe GPU race condition (data hazard) where the copy stream can read from kv_move_buffer before the NCCL receive has finished writing to it, leading to corrupted KV cache values. Adding stream synchronization via wait_stream resolves this hazard safely.
if hasattr(self.transporter, "_get_peer"):
peer = self.transporter._get_peer(trans_task.prefill_agent_name)
self.copy_cuda_stream.wait_stream(peer._get_stream())
copy_start_event.record(self.copy_cuda_stream)
cur_mem = self.mem_managers[self.device_id]| self.eos_id = args.eos_id | ||
| self._init_get_token_id_to_token_str() | ||
| self.is_pd_decode_mode = self.args.run_mode == "decode" | ||
| self.is_pd_decode_mode = False |
There was a problem hiding this comment.
Hardcoding self.is_pd_decode_mode = False disables the PD decode mode check in the detokenization manager, even when running in "decode" mode. If detokenization should be bypassed or handled specially on decode nodes, this should be set to self.args.run_mode == "decode" as it was before.
| self.is_pd_decode_mode = False | |
| self.is_pd_decode_mode = self.args.run_mode == "decode" |
|
|
||
|
|
||
| def create_kv_transporter(args: StartArgs, node_id: int, tp_idx: int, kv_move_buffer: Tensor): | ||
| backend = os.getenv("LIGHTLLM_PD_KV_TRANSPORT_BACKEND", "nixl").lower() |
There was a problem hiding this comment.
It is safer to use .strip().lower() when retrieving the LIGHTLLM_PD_KV_TRANSPORT_BACKEND environment variable to handle any leading/trailing whitespaces gracefully.
| backend = os.getenv("LIGHTLLM_PD_KV_TRANSPORT_BACKEND", "nixl").lower() | |
| backend = os.getenv("LIGHTLLM_PD_KV_TRANSPORT_BACKEND", "nixl").strip().lower() |
No description provided.