remove nccl pd mode. by hiworldwzj · Pull Request #1342 · ModelTC/LightLLM

hiworldwzj · 2026-06-10T09:09:24Z

No description provided.

gemini-code-assist

Code Review

This pull request refactors the Prefill-Decode (PD) disaggregation mode by consolidating transport backends, renaming NIXL-specific components to generic PD components, and removing global synchronization locks (such as g_infer_state_lock and g_router_lock) to simplify the concurrency model. It also introduces support for an alternative NCCL-based KV transporter backend. Feedback on these changes highlights a critical GPU race condition in decode_trans_process.py when using the NCCL backend due to a lack of stream synchronization. Additionally, the reviewer noted that hardcoding self.is_pd_decode_mode = False in the detokenization manager disables decode mode checks, and suggested using .strip().lower() when parsing the LIGHTLLM_PD_KV_TRANSPORT_BACKEND environment variable for safer input handling.


gemini-code-assist · 2026-06-10T09:12:33Z

                copy_start_event.record(self.copy_cuda_stream)
                cur_mem = self.mem_managers[self.device_id]


When using the NCCL backend, the NCCL receive operation (comm.recv) is enqueued asynchronously on the peer's stream (_NcclPeer._get_stream()). However, the actual copying of the KV cache pages to memory (read_page_kv_move_buffer_to_mem) is executed on self.copy_cuda_stream without any stream synchronization. This creates a severe GPU race condition (data hazard) where the copy stream can read from kv_move_buffer before the NCCL receive has finished writing to it, leading to corrupted KV cache values. Adding stream synchronization via wait_stream resolves this hazard safely.

                if hasattr(self.transporter, "_get_peer"):
                    peer = self.transporter._get_peer(trans_task.prefill_agent_name)
                    self.copy_cuda_stream.wait_stream(peer._get_stream())
                copy_start_event.record(self.copy_cuda_stream)
                cur_mem = self.mem_managers[self.device_id]
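
The ordering fix described in this comment can be sketched in isolation. The helper below is a hypothetical illustration, not code from the PR: `copy_after_producer`, `demo`, and the `enqueue_copy` callback are names assumed here. It relies only on `wait_stream`, which `torch.cuda.Stream` provides, and `demo` uses a `fill_` kernel as a stand-in for the asynchronous NCCL receive.

```python
def copy_after_producer(copy_stream, producer_stream, enqueue_copy):
    """Order copy_stream behind producer_stream, then enqueue the copy.

    Both streams are expected to behave like torch.cuda.Stream; only
    wait_stream() is called on them here.
    """
    # Work queued on copy_stream after this call will not start until
    # everything already queued on producer_stream has completed.
    copy_stream.wait_stream(producer_stream)
    return enqueue_copy()


def demo():
    """Run the pattern on a real GPU, if PyTorch and CUDA are available."""
    import torch  # imported lazily so the helper above has no hard dependency

    if not torch.cuda.is_available():
        return None

    producer = torch.cuda.Stream()  # stands in for the peer's receive stream
    copier = torch.cuda.Stream()    # stands in for copy_cuda_stream
    buf = torch.zeros(1 << 20, device="cuda")
    dst = torch.empty_like(buf)

    # The buffers were created on the current stream, so both side streams
    # wait for it before touching them.
    producer.wait_stream(torch.cuda.current_stream())
    copier.wait_stream(torch.cuda.current_stream())

    with torch.cuda.stream(producer):
        buf.fill_(1.0)  # assumption: placeholder for the asynchronous receive

    def enqueue_copy():
        with torch.cuda.stream(copier):
            dst.copy_(buf, non_blocking=True)

    copy_after_producer(copier, producer, enqueue_copy)
    copier.synchronize()
    return bool((dst == 1.0).all())
```

Calling `demo()` on a CUDA machine exercises the real streams. Without the `wait_stream` call, nothing orders the copy after the write to `buf`, which is the hazard the comment describes.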

gemini-code-assist · 2026-06-10T09:12:33Z

        self.eos_id = args.eos_id
        self._init_get_token_id_to_token_str()
-        self.is_pd_decode_mode = self.args.run_mode == "decode"
+        self.is_pd_decode_mode = False


Hardcoding self.is_pd_decode_mode = False disables the PD decode mode check in the detokenization manager, even when running in "decode" mode. If detokenization should be bypassed or handled specially on decode nodes, this should be set to self.args.run_mode == "decode" as it was before.

Suggested change

-        self.is_pd_decode_mode = False
+        self.is_pd_decode_mode = self.args.run_mode == "decode"

gemini-code-assist · 2026-06-10T09:12:33Z

+
+
+def create_kv_transporter(args: StartArgs, node_id: int, tp_idx: int, kv_move_buffer: Tensor):
+    backend = os.getenv("LIGHTLLM_PD_KV_TRANSPORT_BACKEND", "nixl").lower()


It is safer to use .strip().lower() when reading the LIGHTLLM_PD_KV_TRANSPORT_BACKEND environment variable, so that leading or trailing whitespace is handled gracefully.

Suggested change

-    backend = os.getenv("LIGHTLLM_PD_KV_TRANSPORT_BACKEND", "nixl").lower()
+    backend = os.getenv("LIGHTLLM_PD_KV_TRANSPORT_BACKEND", "nixl").strip().lower()
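
A small sketch of why the extra `.strip()` matters. The function name `read_backend` and its `env` parameter are hypothetical and exist only to make the behaviour testable; the environment variable name and the `"nixl"` default are taken from the diff above.

```python
import os


def read_backend(env=None, default="nixl"):
    """Return the normalized KV transport backend name.

    env is any mapping of environment variables; it defaults to os.environ.
    """
    if env is None:
        env = os.environ
    raw = env.get("LIGHTLLM_PD_KV_TRANSPORT_BACKEND", default)
    # strip() removes stray spaces or newlines (for example from a shell
    # script or a YAML value) before the case is normalized.
    return raw.strip().lower()
```

With only `.lower()`, a value such as `"  NCCL\n"` would become `"  nccl\n"` and fail an equality check against `"nccl"`. With `.strip().lower()` it normalizes cleanly.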

hiworldwzj added 10 commits June 10, 2026 08:58

remove nccl pd

71626d7

gifix

6aa40a2

fix

37e9121

fix

1f7f265

fix

ab811d5

fix

3b6767c

fix

29a17eb

fix

d6b7cd2

fix

2d5d56a

fix

f48a91c

gemini-code-assist Bot reviewed Jun 10, 2026


hiworldwzj added 7 commits June 10, 2026 09:12

fix

c606e43

fix

ca0496b

fix

7e900ac

fix

b852923

fix

854c3b5

fix

128c8c8

fix

6464428

hiworldwzj merged commit 8196f35 into main Jun 10, 2026
1 check passed

hiworldwzj deleted the wzj_del_pd_new branch June 10, 2026 12:26

hiworldwzj restored the wzj_del_pd_new branch June 10, 2026 12:34


hiworldwzj merged 17 commits into main from wzj_del_pd_new


1 participant
