Description
Checklist
- I searched related issues but found no solution.
- The bug persists in the latest version.
- Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
- If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
- Please use English. Otherwise, it will be closed.
Describe the bug
I launched GLM-4.6 with this command:
python3 -m sglang.launch_server --host 0.0.0.0 --port 8080 \
--enable-metrics \
--served-model-name zai-org/GLM-4.6 \
--model-path /models/zai-org/GLM-4.6-FP8 \
--tp-size 8 --dtype bfloat16 \
--tool-call-parser glm \
--reasoning-parser glm45 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--mem-fraction-static 0.9 \
--max-running-requests 32 \
--model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
Then I tried to generate schema-constrained output with the following two requests, which differ only in chat_template_kwargs.enable_thinking (a minimal client sketch follows the first request body).
First request, with reasoning disabled (enable_thinking: false):
{
"model": "zai-org/GLM-4.6",
"messages": [
{
"role": "system",
"content": ""
},
{
"role": "user",
"content": "Task:Generate name, age, and student status.\n Please do not reasoning!"
}
],
"temperature": 0.0,
"max_tokens": 32000,
"timeout": 300.0,
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "MultipleBasicFields",
"schema": {
"description": "Test 4: Multiple basic type fields",
"properties": {
"name": {
"title": "Name",
"type": "string"
},
"age": {
"title": "Age",
"type": "integer"
},
"is_student": {
"title": "Is Student",
"type": "boolean"
}
},
"required": [
"name",
"age",
"is_student"
],
"title": "MultipleBasicFields",
"type": "object"
}
},
"strict": true
},
"chat_template_kwargs": {
"enable_thinking": false
}
}
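For reference, a minimal client sketch for these calls. It assumes the server is reachable at http://localhost:8080 from the same machine (the launch command binds 0.0.0.0:8080) and uses sglang's OpenAI-compatible /v1/chat/completions endpoint; the request timeout is passed to requests instead of the body's timeout field:

import json
import requests

URL = "http://localhost:8080/v1/chat/completions"  # matches --host/--port above

SCHEMA = {
    "description": "Test 4: Multiple basic type fields",
    "title": "MultipleBasicFields",
    "type": "object",
    "properties": {
        "name": {"title": "Name", "type": "string"},
        "age": {"title": "Age", "type": "integer"},
        "is_student": {"title": "Is Student", "type": "boolean"},
    },
    "required": ["name", "age", "is_student"],
}

def send(enable_thinking: bool) -> dict:
    # Same body as the raw requests shown in this report; only
    # chat_template_kwargs.enable_thinking changes between the two calls.
    payload = {
        "model": "zai-org/GLM-4.6",
        "messages": [
            {"role": "system", "content": ""},
            {"role": "user", "content": "Task:Generate name, age, and student status.\n Please do not reasoning!"},
        ],
        "temperature": 0.0,
        "max_tokens": 32000,
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "MultipleBasicFields", "schema": SCHEMA},
            "strict": True,
        },
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }
    return requests.post(URL, json=payload, timeout=300).json()

print(json.dumps(send(False), indent=2))  # content is NOT schema-constrained (bug)
print(json.dumps(send(True), indent=2))   # content conforms to the schema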
First response:
{
"id": "5da2edf8ca1f4a60a6bec62fd6d8ceec",
"object": "chat.completion",
"created": 1765365409,
"model": "zai-org/GLM-4.6",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "\nName: Alex Chen\nAge: 21\nStudent Status: Undergraduate",
"reasoning_content": null,
"tool_calls": null
},
"logprobs": null,
"finish_reason": "stop",
"matched_stop": 151336
}
],
"usage": {
"prompt_tokens": 27,
"total_tokens": 44,
"completion_tokens": 17,
"prompt_tokens_details": null,
"reasoning_tokens": 0
},
"metadata": {
"weight_version": "default"
}
}
Second request, with reasoning enabled (enable_thinking: true):
{
"model": "zai-org/GLM-4.6",
"messages": [
{
"role": "system",
"content": ""
},
{
"role": "user",
"content": "Task:Generate name, age, and student status.\n Please do not reasoning!"
}
],
"temperature": 0.0,
"max_tokens": 32000,
"timeout": 300.0,
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "MultipleBasicFields",
"schema": {
"description": "Test 4: Multiple basic type fields",
"properties": {
"name": {
"title": "Name",
"type": "string"
},
"age": {
"title": "Age",
"type": "integer"
},
"is_student": {
"title": "Is Student",
"type": "boolean"
}
},
"required": [
"name",
"age",
"is_student"
],
"title": "MultipleBasicFields",
"type": "object"
}
},
"strict": true
},
"chat_template_kwargs": {
"enable_thinking": true
}
}
Second response:
{
"id": "852fcdf0fa9a42dc9184e580fce51c1e",
"object": "chat.completion",
"created": 1765365462,
"model": "zai-org/GLM-4.6",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "{\n \"is_student\": true,\n \"name\": \"Alex Johnson\",\n \"age\": 21\n}",
"reasoning_content": "I'm being asked to generate a name, age, and student status. The instruction specifically says \"Please do not reasoning!\" which means I should provide this information without explaining my thought process.\n\nI'll generate:\n1. A name (I'll pick a common name)\n2. An age (I'll pick a reasonable age for a student)\n3. A student status (like \"current student\", \"graduate\", \"not a student\", etc.)\n\nSince I'm not supposed to reason, I'll just provide these three pieces of information directly without any explanation.",
"tool_calls": null
},
"logprobs": null,
"finish_reason": "stop",
"matched_stop": 151336
}
],
"usage": {
"prompt_tokens": 23,
"total_tokens": 163,
"completion_tokens": 140,
"prompt_tokens_details": null,
"reasoning_tokens": 0
},
"metadata": {
"weight_version": "default"
}
}
As you can see, the JSON schema grammar is not applied when enable_thinking is false: the returned content is free-form text instead of JSON matching the schema. With enable_thinking set to true, the content conforms to the schema as expected. A quick validation sketch follows.
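To make the failure concrete, this small sketch checks each response's message.content against the schema (assuming the jsonschema package; titles and description are omitted since they do not affect validation):

import json
from jsonschema import ValidationError, validate

SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "is_student": {"type": "boolean"},
    },
    "required": ["name", "age", "is_student"],
}

# message.content from the two responses above
responses = {
    "enable_thinking=false": "\nName: Alex Chen\nAge: 21\nStudent Status: Undergraduate",
    "enable_thinking=true": '{\n  "is_student": true,\n  "name": "Alex Johnson",\n  "age": 21\n}',
}

for label, content in responses.items():
    try:
        validate(json.loads(content), SCHEMA)
        print(f"{label}: content conforms to the schema")
    except (json.JSONDecodeError, ValidationError) as exc:
        # The enable_thinking=false content is not even parseable JSON.
        print(f"{label}: constraint violated ({type(exc).__name__})")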
Reproduction
python3 -m sglang.launch_server --host 0.0.0.0 --port 8080 \
--enable-metrics \
--served-model-name zai-org/GLM-4.6 \
--model-path /models/zai-org/GLM-4.6-FP8 \
--tp-size 8 --dtype bfloat16 \
--tool-call-parser glm \
--reasoning-parser glm45 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--mem-fraction-static 0.9 \
--max-running-requests 32 \
--model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
Environment
Python: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100 80GB HBM3
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.9, V12.9.86
CUDA Driver Version: 565.57.01
PyTorch: 2.9.1+cu129
sglang: 0.5.6
sgl_kernel: 0.3.18.post2
flashinfer_python: 0.5.3
flashinfer_cubin: 0.5.3
flashinfer_jit_cache: Module Not Found
triton: 3.5.1
transformers: 4.57.1
torchao: 0.9.0
numpy: 2.3.5
aiohttp: 3.13.2
fastapi: 0.123.5
hf_transfer: 0.1.9
huggingface_hub: 0.36.0
interegular: 0.3.3
modelscope: 1.32.0
orjson: 3.11.4
outlines: 0.1.11
packaging: 25.0
psutil: 7.1.3
pydantic: 2.12.5
python-multipart: 0.0.20
pyzmq: 27.1.0
uvicorn: 0.38.0
uvloop: 0.22.1
vllm: Module Not Found
xgrammar: 0.1.27
openai: 2.6.1
tiktoken: 0.12.0
anthropic: 0.75.0
litellm: Module Not Found
decord2: 2.0.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 0-15,64-79 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 0-15,64-79 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 16-31,80-95 1 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 0-15,64-79 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 32-47,96-111 2 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 32-47,96-111 2 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 48-63,112-127 3 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X 32-47,96-111 2 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
ulimit soft: 65535