[Bug] Unstable dispatch time consumption of DeepSeek-V3.1 with PD disaggregation #14813

@Yi-sir

Description

Checklist

  • I searched related issues but found no solution.
  • The bug persists in the latest version.
  • Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
  • If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
  • Please use English. Otherwise, it will be closed.

Describe the bug

Hello sglang team.

When testing the performance of the DeepSeek model on an 8×H200 machine using the PD disaggregation architecture (1P1D), I observed the following issue:

During the prefill stage, the DeepEP dispatch of individual decoder layers occasionally takes an unusually long time.

[Profiler trace screenshot showing the abnormally long dispatch]

This issue is not observed when the model is launched on a single node without PD disaggregation.
It also does not appear when running DeepEP's test_intranode.py:

root@sophgo5:/sgl-workspace/DeepEP/tests# python3 ./test_intranode.py
[config] num_tokens=4096, hidden=7168, num_topk=8
[layout] Kernel performance: 0.041 ms

[testing] Running with BF16, without top-k (async=False, previous=False) ... passed
[testing] Running with BF16, with top-k (async=False, previous=False) ... passed
[testing] Running with BF16, without top-k (async=False, previous=False) ... passed
[testing] Running with BF16, with top-k (async=False, previous=False) ... passed
[testing] Running with FP8, without top-k (async=False, previous=False) ... passed
[testing] Running with FP8, with top-k (async=False, previous=False) ... passed
[testing] Running with BF16, without top-k (async=True, previous=False) ... passed
[testing] Running with BF16, with top-k (async=True, previous=False) ... passed
[testing] Running with BF16, without top-k (async=True, previous=False) ... passed
[testing] Running with BF16, with top-k (async=True, previous=False) ... passed
[testing] Running with FP8, without top-k (async=True, previous=False) ... passed
[testing] Running with FP8, with top-k (async=True, previous=False) ... passed
[testing] Running with BF16, without top-k (async=False, previous=True) ... passed
[testing] Running with BF16, with top-k (async=False, previous=True) ... passed
[testing] Running with BF16, without top-k (async=False, previous=True) ... passed
[testing] Running with BF16, with top-k (async=False, previous=True) ... passed
[testing] Running with FP8, without top-k (async=False, previous=True) ... passed
[testing] Running with FP8, with top-k (async=False, previous=True) ... passed
[testing] Running with BF16, without top-k (async=True, previous=True) ... passed
[testing] Running with BF16, with top-k (async=True, previous=True) ... passed
[testing] Running with BF16, without top-k (async=True, previous=True) ... passed
[testing] Running with BF16, with top-k (async=True, previous=True) ... passed
[testing] Running with FP8, without top-k (async=True, previous=True) ... passed
[testing] Running with FP8, with top-k (async=True, previous=True) ... passed

[tuning] SMs 24, NVL chunk 4: 287.57 GB/s (NVL), avg_t: 557.11 us
[tuning] SMs 24, NVL chunk 6: 316.77 GB/s (NVL), avg_t: 505.75 us
[tuning] SMs 24, NVL chunk 8: 317.85 GB/s (NVL), avg_t: 504.04 us
[tuning] SMs 24, NVL chunk 10: 317.84 GB/s (NVL), avg_t: 504.05 us
[tuning] SMs 24, NVL chunk 12: 317.64 GB/s (NVL), avg_t: 504.37 us
[tuning] SMs 24, NVL chunk 14: 308.84 GB/s (NVL), avg_t: 518.73 us
[tuning] SMs 24, NVL chunk 16: 303.24 GB/s (NVL), avg_t: 528.32 us
[tuning] SMs 24, NVL chunk 18: 301.04 GB/s (NVL), avg_t: 532.17 us
[tuning] SMs 24, NVL chunk 20: 296.39 GB/s (NVL), avg_t: 540.53 us
[tuning] SMs 24, NVL chunk 22: 296.14 GB/s (NVL), avg_t: 540.98 us
[tuning] SMs 24, NVL chunk 24: 296.08 GB/s (NVL), avg_t: 541.09 us
[tuning] SMs 24, NVL chunk 26: 293.67 GB/s (NVL), avg_t: 545.53 us
[tuning] SMs 24, NVL chunk 28: 293.71 GB/s (NVL), avg_t: 545.46 us
[tuning] SMs 24, NVL chunk 30: 291.43 GB/s (NVL), avg_t: 549.73 us
[tuning] SMs 24, NVL chunk 32: 291.52 GB/s (NVL), avg_t: 549.56 us
[tuning] SMs 24, NVL chunk default: 318.41 GB/s (NVL), avg_t: 503.15 us
[tuning] Best dispatch (FP8): SMs 24, NVL chunk 8, 317.85 GB/s (NVL), t: 504.04 us

Could you please advise on what might be causing this? Thank you.
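
For reference, one way to double-check the per-layer dispatch latency outside the profiler is to bracket the dispatch call with CUDA events. This is only a minimal sketch (the CudaTimer helper and the call site are hypothetical, not sglang code):

import torch

class CudaTimer:
    # Bracket a GPU region with CUDA events and print its duration.
    # Debug-only: the synchronize() in __exit__ blocks the host.
    def __init__(self, label: str):
        self.label = label
        self.start = torch.cuda.Event(enable_timing=True)
        self.end = torch.cuda.Event(enable_timing=True)

    def __enter__(self):
        self.start.record()
        return self

    def __exit__(self, *exc):
        self.end.record()
        self.end.synchronize()
        print(f"{self.label}: {self.start.elapsed_time(self.end):.3f} ms")

# Hypothetical call site inside the MoE layer:
# with CudaTimer(f"layer{layer_id}.dispatch"):
#     recv_x = buffer.dispatch(...)  # DeepEP dispatch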

Reproduction

# P
python3 -m sglang.launch_server \
    --host  0.0.0.0 \
    --port 30000 \
    --watchdog-timeout 1000000 \
    --disaggregation-ib-device mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7 \
    --disaggregation-mode prefill \
    --attention-backend=fa3 \
    --served-model-name deepseek-v3.1 \
    --model /models/DeepSeek-V3.1-Terminus \
    --trust-remote-code \
    --disable-cuda-graph \
    --chunked-prefill-size=32768 \
    --tp 8 --page-size 64 --dp 8 --enable-dp-attention \
    --ep 8 --moe-a2a-backend deepep --deepep-mode normal \
    --max-running-requests=512 \
    --mem-fraction-static=0.7
 
# D
python3 -m sglang.launch_server \
    --host  0.0.0.0 \
    --port 30000 \
    --watchdog-timeout 1000000 \
    --disaggregation-ib-device mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7 \
    --disaggregation-mode decode \
    --served-model-name deepseek-v3.1 \
    --model /models/DeepSeek-V3.1-Terminus \
    --trust-remote-code \
    --tool-call-parser deepseekv3 \
    --chat-template  /sgl-workspace/sglang/examples/chat_template/tool_chat_template_deepseekv3.jinja \
    --tp 8 --page-size 64 \
    --speculative-algorithm EAGLE --speculative-num-steps=2 --speculative-eagle-topk=1 --speculative-num-draft-tokens=3 \
    --max-running-requests=128 \
    --cuda-graph-max-bs 128 \
    --mem-fraction-static=0.8
 
# router
python -m sglang_router.launch_router --pd-disaggregation \
    --prefill http://$prefill:30000 \
    --decode http://$decode:30000 \
    --host 0.0.0.0 \
    --port $router_port

# profile on P node
curl -X POST http://localhost:30000/start_profile

evalscope perf \
    --parallel 1500 \
    --model deepseek-ai/DeepSeek-V3.1-Terminus \
    --url http://$router_ip:$router_port/v1/chat/completions \
    --api openai \
    --dataset random \
    --min-tokens 60 \
    --max-tokens 60 \
    --min-prompt-length 400 \
    --max-prompt-length 400 \
    --number 100 \
    --rate 100 \
    --tokenizer-path deepseek-ai/DeepSeek-V3.1-Terminus

curl -X POST http://localhost:30000/stop_profile
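
The dispatch durations can then be summarized from the exported trace. A minimal sketch, assuming the profiler writes a Chrome trace JSON (torch.profiler's export format) and that the DeepEP dispatch kernels contain "dispatch" in their names:

import json
import sys

# Usage: python3 summarize_dispatch.py <trace.json>
with open(sys.argv[1]) as f:
    trace = json.load(f)

# Chrome trace "X" (complete) events carry durations in microseconds.
durs_ms = sorted(
    e["dur"] / 1000.0
    for e in trace.get("traceEvents", [])
    if e.get("ph") == "X" and "dispatch" in e.get("name", "").lower()
)
if durs_ms:
    print(f"n={len(durs_ms)}  min={durs_ms[0]:.3f} ms  "
          f"p50={durs_ms[len(durs_ms) // 2]:.3f} ms  max={durs_ms[-1]:.3f} ms")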

Environment

root@sophgo5:/sgl-workspace/DeepEP/tests# python3 -m sglang.check_env
Python: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H200
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.9, V12.9.86
CUDA Driver Version: 570.148.08
PyTorch: 2.8.0+cu129
sglang: 0.5.4.post3
sgl_kernel: 0.3.16.post4
flashinfer_python: 0.5.0
flashinfer_cubin: 0.5.0
flashinfer_jit_cache: Module Not Found
triton: 3.4.0
transformers: 4.57.1
torchao: 0.9.0
numpy: 2.3.4
aiohttp: 3.13.2
fastapi: 0.121.0
hf_transfer: 0.1.9
huggingface_hub: 0.36.0
interegular: 0.3.3
modelscope: 1.31.0
orjson: 3.11.4
outlines: 0.1.11
packaging: 25.0
psutil: 7.1.3
pydantic: 2.12.3
python-multipart: 0.0.20
pyzmq: 27.1.0
uvicorn: 0.38.0
uvloop: 0.22.1
vllm: Module Not Found
xgrammar: 0.1.25
openai: 2.6.1
tiktoken: 0.12.0
anthropic: 0.72.0
litellm: Module Not Found
decord2: 2.0.0
NVIDIA Topology:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    PIX     NODE    NODE    SYS     SYS     SYS     SYS     SYS     0-23,96-119     0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    NODE    PIX     NODE    SYS     SYS     SYS     SYS     SYS     0-23,96-119     0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    NODE    NODE    PIX     SYS     SYS     SYS     SYS     SYS     0-23,96-119     0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS     24-47,120-143   1               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    SYS     SYS     SYS     SYS     PIX     NODE    NODE    SYS     48-71,144-167   2               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    SYS     SYS     SYS     SYS     NODE    PIX     NODE    SYS     48-71,144-167   2               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    SYS     SYS     SYS     SYS     NODE    NODE    PIX     SYS     48-71,144-167   2               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     72-95,168-191   3               N/A
NIC0    PIX     NODE    NODE    SYS     SYS     SYS     SYS     SYS      X      NODE    NODE    SYS     SYS     SYS     SYS     SYS
NIC1    NODE    PIX     NODE    SYS     SYS     SYS     SYS     SYS     NODE     X      NODE    SYS     SYS     SYS     SYS     SYS
NIC2    NODE    NODE    PIX     SYS     SYS     SYS     SYS     SYS     NODE    NODE     X      SYS     SYS     SYS     SYS     SYS
NIC3    SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS
NIC4    SYS     SYS     SYS     SYS     PIX     NODE    NODE    SYS     SYS     SYS     SYS     SYS      X      NODE    NODE    SYS
NIC5    SYS     SYS     SYS     SYS     NODE    PIX     NODE    SYS     SYS     SYS     SYS     SYS     NODE     X      NODE    SYS
NIC6    SYS     SYS     SYS     SYS     NODE    NODE    PIX     SYS     SYS     SYS     SYS     SYS     NODE    NODE     X      SYS
NIC7    SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7


ulimit soft: 1048576
