Description
Almost all benchmark configurations set max_model_len (or, for TensorRT, --max_seq_len), which controls the maximum supported request length (prompt plus any generated output). It is typically set to ISL + OSL + tiny_margin (where tiny_margin may be 20 or 200 tokens). This is also done in generate_sweep_configs.py. These options directly affect memory allocation and therefore the maximum achievable batch size.
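For illustration, here is a minimal sketch of the pattern being described. The function name, flag spellings, and margin default are hypothetical and are not taken from generate_sweep_configs.py; it only shows how a config can pin the engine's sequence limit to the exact benchmark shape.

```python
# Hypothetical sketch, not the actual generate_sweep_configs.py logic:
# the engine's max sequence length is derived directly from the
# benchmark's input/output shape plus a small margin.

def make_server_args(isl: int, osl: int, backend: str, margin: int = 200) -> list[str]:
    """Build engine flags with max sequence length = ISL + OSL + margin."""
    max_len = isl + osl + margin
    if backend == "trtllm":
        return [f"--max_seq_len={max_len}"]
    # vLLM-style flag
    return [f"--max-model-len={max_len}"]

# A 1000/1000 ISL/OSL sweep point yields a ~2.2k limit, far below
# what a general-purpose deployment would typically allow.
print(make_server_args(1000, 1000, "vllm"))    # ['--max-model-len=2200']
print(make_server_args(1000, 1000, "trtllm"))  # ['--max_seq_len=2200']
```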
It's understandable for a benchmark to show something approaching peak obtainable results, but tuning the inference engine to the precise benchmark workload in this way seems to run against the idea that "We want server configs to reflect real world deployments as much as possible" and the stated goal "to provide benchmarks that both emulate real world applications as much as possible and reflect the continuous pace of software innovation". I struggle to think of a real-world application of DeepSeek that could run with a ~2k maximum sequence length, for instance.