When configuring PD disaggregation with 2p2d (two prefillers, two decoders), the second prefiller fails to start properly.


## My scenario is as follows:
* There are 5 pods in total: two prefillers, two decoders, and one router.
* Each prefiller and decoder pod has one L20 GPU.
* The environment has no RDMA support.

## The startup command is as follows:
* prefiller-1

python -m sglang.launch_server --model-path /root/.cache/huggingface/Qwen3-4B --disaggregation-mode prefill --host prefiller1-ip --port 30000 --trust-remote-code --dist-init-addr prefiller1-ip:5000 --nnodes 2 --node-rank 0 --tp-size 2 --dp-size 1 --enable-dp-attention --mem-fraction-static 0.8 --log-level debug

* prefiller-2

python -m sglang.launch_server --model-path /root/.cache/huggingface/Qwen3-4B --disaggregation-mode prefill --host prefiller2-ip --port 30000 --trust-remote-code --dist-init-addr prefiller1-ip:5000 --nnodes 2 --node-rank 1 --tp-size 2 --dp-size 1 --enable-dp-attention --mem-fraction-static 0.8 --log-level debug

* decoder-1

python -m sglang.launch_server --model-path /root/.cache/huggingface/Qwen3-4B --disaggregation-mode decode --host decoder1-ip --port 30001 --trust-remote-code --dist-init-addr decoder1-ip:5000 --nnodes 2 --node-rank 0 --tp-size 2 --dp-size 1 --enable-dp-attention --mem-fraction-static 0.8 --max-running-requests 128

* decoder-2

python -m sglang.launch_server --model-path /root/.cache/huggingface/Qwen3-4B --disaggregation-mode decode --host decoder2-ip --port 30001 --trust-remote-code --dist-init-addr decoder1-ip:5000 --nnodes 2 --node-rank 1 --tp-size 2 --dp-size 1 --enable-dp-attention --mem-fraction-static 0.8 --max-running-requests 128

* router

python -m sglang_router.launch_router --pd-disaggregation --prefill http://prefiller1-ip:30000 --prefill http://prefiller2-ip:30000 --decode http://decoder1-ip:30001 --decode http://decoder2-ip:30001 --host 0.0.0.0 --port 8000
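As a sanity check on the flags above, here is a minimal sketch of the rank layout they would imply. It assumes that the `--tp-size` ranks are split evenly across `--nnodes` and that `--node-rank` selects each pod's slice. `ranks_on_node` is a hypothetical helper written for this issue, not an SGLang function.

```python
def ranks_on_node(tp_size: int, nnodes: int, node_rank: int) -> list[int]:
    """Return the TP ranks hosted by one node, assuming an even split (assumption)."""
    if tp_size % nnodes != 0:
        raise ValueError("tp_size must be divisible by nnodes")
    per_node = tp_size // nnodes
    start = node_rank * per_node
    return list(range(start, start + per_node))


# Flags used for both prefiller pods: --nnodes 2 --tp-size 2
for name, node_rank in [("prefiller-1", 0), ("prefiller-2", 1)]:
    print(name, ranks_on_node(tp_size=2, nnodes=2, node_rank=node_rank))
```

Under that assumption prefiller-1 hosts TP rank 0 and prefiller-2 hosts TP rank 1, which matches the `TP0` and `TP1` prefixes in the logs further down. If so, the two prefill commands may describe one tensor-parallel group spanning two pods rather than two independent prefill servers.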

## The log is as follows:

* prefiller1 (normal)
```
[2025-12-11 07:31:28 TP0] kv manager bind to 10.64.3.56:45509
[2025-12-11 07:31:28 TP0] Starting new HTTP connection (1): 10.64.3.56:8998
[2025-12-11 07:31:28] Register prefill bootstrap: DP0 TP0 PP0 with rank_ip: 10.64.3.56 and rank_port: 45509
[2025-12-11 07:31:28] 10.64.3.56 [11/Dec/2025:07:31:28 +0000] "PUT /route HTTP/1.1" 200 154 "-" "python-requests/2.32.5"
[2025-12-11 07:31:28 TP0] http://10.64.3.56:8998 "PUT /route HTTP/1.1" 200 2
[2025-12-11 07:31:28 TP0] Prefill successfully registered to bootstrap server.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1211 07:31:28.056981  6314 transfer_engine.cpp:486] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)
I1211 07:31:28.057013  6314 transfer_engine.cpp:91] Transfer Engine parseHostNameWithPort. server_name: 10.64.3.56 port: 12001
I1211 07:31:28.057044  6314 transfer_engine.cpp:146] Transfer Engine RPC using P2P handshake, listening on 10.64.3.56:15863
I1211 07:31:28.057222  6314 transfer_engine.cpp:185] Auto-discovering topology...
W1211 07:31:28.057345  6314 topology.cpp:55] No RDMA devices found, check your device installation
I1211 07:31:28.057384  6314 transfer_engine.cpp:200] Topology discovery complete. Found 0 HCAs.
I1211 07:31:28.057410  6314 tcp_transport.cpp:299] TcpTransport: listen on port 15980
[2025-12-11 07:31:28] INFO:     Started server process [6184]
[2025-12-11 07:31:28] INFO:     Waiting for application startup.
[2025-12-11 07:31:28] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
[2025-12-11 07:31:28] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
[2025-12-11 07:31:28] INFO:     Application startup complete.
[2025-12-11 07:31:28] INFO:     Uvicorn running on http://10.64.3.56:30000 (Press CTRL+C to quit)
[2025-12-11 07:31:29] Starting new HTTP connection (1): 10.64.3.56:30000
[2025-12-11 07:31:29] Endpoint '/get_model_info' is deprecated and will be removed in a future version. Please use '/model_info' instead.
[2025-12-11 07:31:29] INFO:     10.64.3.56:41116 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-12-11 07:31:29] http://10.64.3.56:30000 "GET /get_model_info HTTP/1.1" 200 306
[2025-12-11 07:31:29] Start of pd disaggregation warmup ...
[2025-12-11 07:31:29] Starting new HTTP connection (1): 10.64.3.56:30000
[2025-12-11 07:31:29] Starting batch tokenization for 1 text requests
[2025-12-11 07:31:29 TP0] Processing batch generate request with 1 requests
[2025-12-11 07:31:29 TP0] FakeKVSender init with kv_indices: 4, aux_index: 0
[2025-12-11 07:31:29 TP0] Prefill batch, #new-seq: 1, #new-token: 4, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, #prealloc-req: 0, #inflight-req: 0, input throughput (token/s): 0.00,
[2025-12-11 07:31:29 TP0] Attempting to acquire lock 140245255167712 on /root/.cache/flashinfer/0.5.3/89/cached_ops/tmp/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False.lock
[2025-12-11 07:31:29 TP0] Lock 140245255167712 acquired on /root/.cache/flashinfer/0.5.3/89/cached_ops/tmp/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False.lock
[2025-12-11 07:31:29 TP0] Attempting to release lock 140245255167712 on /root/.cache/flashinfer/0.5.3/89/cached_ops/tmp/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False.lock
[2025-12-11 07:31:29 TP0] Lock 140245255167712 released on /root/.cache/flashinfer/0.5.3/89/cached_ops/tmp/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False.lock
[2025-12-11 07:31:30 TP0] FakeKVSender send with kv_indices: [1 2 3 4], state_indices: None
[2025-12-11 07:31:30 TP0] FakeKVSender poll success
[2025-12-11 07:31:30] INFO:     10.64.3.56:41132 - "POST /generate HTTP/1.1" 200 OK
[2025-12-11 07:31:30] http://10.64.3.56:30000 "POST /generate HTTP/1.1" 200 318
[2025-12-11 07:31:30] End of prefill disaggregation mode warmup with status 200, resp: [{'text': '%', 'output_ids': [4], 'meta_info': {'id': '1ef6e2124cda44819b54ea4c9723a54a', 'finish_reason': {'type': 'length', 'length': 0}, 'prompt_tokens': 4, 'weight_version': 'default', 'total_retractions': 0, 'completion_tokens': 1, 'cached_tokens': 0, 'e2e_latency': 1.7358613014221191, 'response_sent_to_client_ts': 1765438290.8528178}}]
[2025-12-11 07:31:30] The server is fired up and ready to roll!
```

* prefiller2 (abnormal)

It hung without the server ever finishing startup, and it did not crash either.
```
[2025-12-11 07:31:28 TP1] http://10.64.3.56:8998 "PUT /route HTTP/1.1" 200 2
[2025-12-11 07:31:28 TP1] Prefill successfully registered to bootstrap server.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I1211 07:31:28.023576  2175 transfer_engine.cpp:486] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)
I1211 07:31:28.023595  2175 transfer_engine.cpp:91] Transfer Engine parseHostNameWithPort. server_name: 10.64.6.201 port: 12001
I1211 07:31:28.023624  2175 transfer_engine.cpp:146] Transfer Engine RPC using P2P handshake, listening on 10.64.6.201:16275
I1211 07:31:28.023764  2175 transfer_engine.cpp:185] Auto-discovering topology...
W1211 07:31:28.023859  2175 topology.cpp:55] No RDMA devices found, check your device installation
I1211 07:31:28.023886  2175 transfer_engine.cpp:200] Topology discovery complete. Found 0 HCAs.
I1211 07:31:28.023903  2175 tcp_transport.cpp:299] TcpTransport: listen on port 15901
[2025-12-11 07:31:28] Dummy health check server started in background thread at 10.64.6.201:30000
[2025-12-11 07:31:29 TP1] Processing batch generate request with 1 requests
[2025-12-11 07:31:29 TP1] FakeKVSender init with kv_indices: 4, aux_index: 0
[2025-12-11 07:31:30 TP1] Attempting to acquire lock 139800112822416 on /root/.cache/flashinfer/0.5.3/89/cached_ops/tmp/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False.lock
[2025-12-11 07:31:30 TP1] Lock 139800112822416 acquired on /root/.cache/flashinfer/0.5.3/89/cached_ops/tmp/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False.lock
[2025-12-11 07:31:30 TP1] Attempting to release lock 139800112822416 on /root/.cache/flashinfer/0.5.3/89/cached_ops/tmp/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False.lock
[2025-12-11 07:31:30 TP1] Lock 139800112822416 released on /root/.cache/flashinfer/0.5.3/89/cached_ops/tmp/batch_prefill_with_kv_cache_dtype_q_bf16_dtype_kv_bf16_dtype_o_bf16_dtype_idx_i32_head_dim_qk_128_head_dim_vo_128_posenc_0_use_swa_False_use_logits_cap_False_f16qk_False.lock
[2025-12-11 07:31:30 TP1] FakeKVSender send with kv_indices: [1 2 3 4], state_indices: None
[2025-12-11 07:31:30 TP1] FakeKVSender poll success
```

* decoder1 and decoder2

10.64.6.201 is prefiller2's IP.
```
[2025-12-11 07:54:27 TP0] Error fetching prefill parallel info from bootstrap: HTTPConnectionPool(host='10.64.6.201', port=8998): Max retries exceeded with url: /route?engine_rank=-1&target_dp_group=-1&target_pp_rank=-1 (Caused by NewConnectionError("HTTPConnection(host='10.64.6.201', port=8998): Failed to establish a new connection: [Errno 111] Connection refused"))
[2025-12-11 07:54:27 TP0] Decode transfer failed for request rank=0 decode_req.req.rid='25e54d719d01468bbac125e73db43b02' decode_req.req.bootstrap_room=8865415883299881072 with exception KVTransferError(bootstrap_room=8865415883299881072): Could not fetch prefill parallel info from bootstrap_addr: 10.64.6.201:8998
```
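The "Connection refused" error can be checked independently of SGLang by probing the bootstrap port from a decoder pod. This is a minimal sketch: the IPs and port 8998 are copied from the logs in this issue, and `port_is_open` is a hypothetical helper, not part of SGLang.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # IPs and bootstrap port are taken from the logs in this issue.
    for name, ip in [("prefiller1", "10.64.3.56"), ("prefiller2", "10.64.6.201")]:
        print(name, ip, "bootstrap port open:", port_is_open(ip, 8998))
```

If the probe reports the port open on prefiller1 but closed on prefiller2, that would be consistent with the bootstrap server listening only on 10.64.3.56, as the `PUT /route` lines in both prefiller logs suggest.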

* router

The router log looks normal.
```
2025-12-11 08:07:35  INFO sgl_model_gateway::policies::tree: /sgl-workspace/sglang/sgl-model-gateway/src/policies/tree.rs:434: Tenant: http://10.64.3.56:30000, Size: 0
2025-12-11 08:07:35  INFO sgl_model_gateway::policies::tree: /sgl-workspace/sglang/sgl-model-gateway/src/policies/tree.rs:434: Tenant: http://10.64.6.246:30001, Size: 0
2025-12-11 08:07:35  INFO sgl_model_gateway::policies::tree: /sgl-workspace/sglang/sgl-model-gateway/src/policies/tree.rs:484: After eviction - Used size per tenant:
2025-12-11 08:07:35  INFO sgl_model_gateway::policies::tree: /sgl-workspace/sglang/sgl-model-gateway/src/policies/tree.rs:486: Tenant: http://10.64.3.56:30000, Size: 0
2025-12-11 08:07:35  INFO sgl_model_gateway::policies::tree: /sgl-workspace/sglang/sgl-model-gateway/src/policies/tree.rs:486: Tenant: http://10.64.6.246:30001, Size: 0
2025-12-11 08:09:35  INFO sgl_model_gateway::policies::tree: /sgl-workspace/sglang/sgl-model-gateway/src/policies/tree.rs:432: Before eviction - Used size per tenant:
2025-12-11 08:09:35  INFO sgl_model_gateway::policies::tree: /sgl-workspace/sglang/sgl-model-gateway/src/policies/tree.rs:434: Tenant: http://10.64.6.201:30000, Size: 0
2025-12-11 08:09:35  INFO sgl_model_gateway::policies::tree: /sgl-workspace/sglang/sgl-model-gateway/src/policies/tree.rs:434: Tenant: http://10.64.6.193:30001, Size: 0
2025-12-11 08:09:35  INFO sgl_model_gateway::policies::tree: /sgl-workspace/sglang/sgl-model-gateway/src/policies/tree.rs:484: After eviction - Used size per tenant:
2025-12-11 08:09:35  INFO sgl_model_gateway::policies::tree: /sgl-workspace/sglang/sgl-model-gateway/src/policies/tree.rs:486: Tenant: http://10.64.6.201:30000, Size: 0
2025-12-11 08:09:35  INFO sgl_model_gateway::policies::tree: /sgl-workspace/sglang/sgl-model-gateway/src/policies/tree.rs:486: Tenant: http://10.64.6.193:30001, Size: 0
2025-12-11 08:09:35  INFO sgl_model_gateway::policies::tree: /sgl-workspace/sglang/sgl-model-gateway/src/policies/tree.rs:432: Before eviction - Used size per tenant:
2025-12-11 08:09:35  INFO sgl_model_gateway::policies::tree: /sgl-workspace/sglang/sgl-model-gateway/src/policies/tree.rs:434: Tenant: http://10.64.3.56:30000, Size: 0
2025-12-11 08:09:35  INFO sgl_model_gateway::policies::tree: /sgl-workspace/sglang/sgl-model-gateway/src/policies/tree.rs:434: Tenant: http://10.64.6.246:30001, Size: 0
2025-12-11 08:09:35  INFO sgl_model_gateway::policies::tree: /sgl-workspace/sglang/sgl-model-gateway/src/policies/tree.rs:484: After eviction - Used size per tenant:
2025-12-11 08:09:35  INFO sgl_model_gateway::policies::tree: /sgl-workspace/sglang/sgl-model-gateway/src/policies/tree.rs:486: Tenant: http://10.64.3.56:30000, Size: 0
2025-12-11 08:09:35  INFO sgl_model_gateway::policies::tree: /sgl-workspace/sglang/sgl-model-gateway/src/policies/tree.rs:486: Tenant: http://10.64.6.246:30001, Size: 0
2025-12-11 08:11:35  INFO sgl_model_gateway::policies::tree: /sgl-workspace/sglang/sgl-model-gateway/src/policies/tree.rs:432: Before eviction - Used size per tenant:
2025-12-11 08:11:35  INFO sgl_model_gateway::policies::tree: /sgl-workspace/sglang/sgl-model-gateway/src/policies/tree.rs:434: Tenant: http://10.64.6.201:30000, Size: 0
2025-12-11 08:11:35  INFO sgl_model_gateway::policies::tree: /sgl-workspace/sglang/sgl-model-gateway/src/policies/tree.rs:434: Tenant: http://10.64.6.193:30001, Size: 0
2025-12-11 08:11:35  INFO sgl_model_gateway::policies::tree: /sgl-workspace/sglang/sgl-model-gateway/src/policies/tree.rs:484: After eviction - Used size per tenant:
2025-12-11 08:11:35  INFO sgl_model_gateway::policies::tree: /sgl-workspace/sglang/sgl-model-gateway/src/policies/tree.rs:486: Tenant: http://10.64.6.201:30000, Size: 0
2025-12-11 08:11:35  INFO sgl_model_gateway::policies::tree: /sgl-workspace/sglang/sgl-model-gateway/src/policies/tree.rs:486: Tenant: http://10.64.6.193:30001, Size: 0
2025-12-11 08:11:35  INFO sgl_model_gateway::policies::tree: /sgl-workspace/sglang/sgl-model-gateway/src/policies/tree.rs:432: Before eviction - Used size per tenant:
2025-12-11 08:11:35  INFO sgl_model_gateway::policies::tree: /sgl-workspace/sglang/sgl-model-gateway/src/policies/tree.rs:434: Tenant: http://10.64.3.56:30000, Size: 0
2025-12-11 08:11:35  INFO sgl_model_gateway::policies::tree: /sgl-workspace/sglang/sgl-model-gateway/src/policies/tree.rs:434: Tenant: http://10.64.6.246:30001, Size: 0
2025-12-11 08:11:35  INFO sgl_model_gateway::policies::tree: /sgl-workspace/sglang/sgl-model-gateway/src/policies/tree.rs:484: After eviction - Used size per tenant:
2025-12-11 08:11:35  INFO sgl_model_gateway::policies::tree: /sgl-workspace/sglang/sgl-model-gateway/src/policies/tree.rs:486: Tenant: http://10.64.3.56:30000, Size: 0
2025-12-11 08:11:35  INFO sgl_model_gateway::policies::tree: /sgl-workspace/sglang/sgl-model-gateway/src/policies/tree.rs:486: Tenant: http://10.64.6.246:30001, Size: 0
```
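To summarise which workers the router is tracking, the tenant URLs can be pulled out of the log. This is a minimal sketch, assuming the `Tenant: <url>, Size: <n>` line format shown above; `tenants_from_log` is a hypothetical helper and the sample holds two lines copied from the excerpt.

```python
import re

# Matches the "Tenant: <url>, Size: <n>" fragment of a router log line.
TENANT_RE = re.compile(r"Tenant: (http://[^,\s]+), Size: (\d+)")


def tenants_from_log(text: str) -> dict[str, int]:
    """Map each tenant URL to the last size reported for it."""
    sizes: dict[str, int] = {}
    for url, size in TENANT_RE.findall(text):
        sizes[url] = int(size)
    return sizes


sample = (
    "2025-12-11 08:07:35  INFO sgl_model_gateway::policies::tree: "
    "/sgl-workspace/sglang/sgl-model-gateway/src/policies/tree.rs:434: "
    "Tenant: http://10.64.3.56:30000, Size: 0\n"
    "2025-12-11 08:09:35  INFO sgl_model_gateway::policies::tree: "
    "/sgl-workspace/sglang/sgl-model-gateway/src/policies/tree.rs:434: "
    "Tenant: http://10.64.6.201:30000, Size: 0\n"
)
print(sorted(tenants_from_log(sample)))
```

The full excerpt above contains four distinct tenants (10.64.3.56:30000, 10.64.6.201:30000, 10.64.6.246:30001, 10.64.6.193:30001), so the router still lists prefiller2 (10.64.6.201) as a worker even though its server never finished starting.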

## Reference documents

https://docs.sglang.io/advanced_features/pd_disaggregation.html
