Description
Almost all benchmark configurations set max_model_len (or, for TensorRT, --max_seq_len), which controls the maximum supported request length (prompt plus any generated output). It is typically set to ISL + OSL + tiny_margin (where tiny_margin may be 20 or 200 tokens). This is also done in generate_sweep_configs.py. These options directly affect memory allocation and therefore the maximum achievable batch size.
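For illustration, here is a minimal sketch of the pattern being described. The function name, flag spellings, and margin default are hypothetical and are not taken from generate_sweep_configs.py; it only shows how a config can pin the engine's sequence limit to the exact benchmark shape.

```python
# Hypothetical sketch, not the actual generate_sweep_configs.py logic:
# the engine's max sequence length is derived directly from the
# benchmark's input/output shape plus a small margin.

def make_server_args(isl: int, osl: int, backend: str, margin: int = 200) -> list[str]:
    """Build engine flags with max sequence length = ISL + OSL + margin."""
    max_len = isl + osl + margin
    if backend == "trtllm":
        return [f"--max_seq_len={max_len}"]
    # vLLM-style flag
    return [f"--max-model-len={max_len}"]

# A 1000/1000 ISL/OSL sweep point yields a ~2.2k limit, far below
# what a general-purpose deployment would typically allow.
print(make_server_args(1000, 1000, "vllm"))    # ['--max-model-len=2200']
print(make_server_args(1000, 1000, "trtllm"))  # ['--max_seq_len=2200']
```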
It's understandable for a benchmark to show something approaching peak obtainable results, but tuning the inference engine to the precise benchmark workload in this way seems to run against the idea that "We want server configs to reflect real world deployments as much as possible" and the stated goal "to provide benchmarks that both emulate real world applications as much as possible and reflect the continuous pace of software innovation". I struggle to think of a real-world application of DeepSeek that could run with a ~2k maximum sequence length, for instance.