[feature suggestion] instead of using random datasets, it should use real datasets #359

@asb

Description

Almost all benchmark configurations set max_model_len (or, for TensorRT, --max_seq_len), which controls the maximum supported length of a request (inclusive of the prompt and any generated output). It is typically set to ISL + OSL + tiny_margin (where ISL and OSL are the input and output sequence lengths, and tiny_margin may be 20 or 200 tokens). This is also done in generate_sweep_configs.py. These options naturally impact memory allocation and therefore the maximum achievable batch size.
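
For concreteness, here is a minimal sketch of the kind of calculation being described; the function and parameter names (max_len_for_workload, isl, osl, margin) are illustrative assumptions, not the actual code in generate_sweep_configs.py:

```python
# Illustrative sketch only: names below are hypothetical, not taken from
# generate_sweep_configs.py.

def max_len_for_workload(isl: int, osl: int, margin: int = 200) -> int:
    """Derive the engine's maximum sequence length the way the sweep
    configs appear to: input length + output length + a small margin."""
    return isl + osl + margin

# Example: a 1000-token prompt with 1000 generated tokens yields a ~2k limit,
# which would then be passed as vLLM's max_model_len or TensorRT-LLM's
# --max_seq_len when building/serving the engine.
if __name__ == "__main__":
    print(max_len_for_workload(isl=1000, osl=1000, margin=20))  # 2020
```

Sizing the limit this tightly is what lets the engine allocate less KV-cache per request and push the batch size higher than a general-purpose deployment could.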

It's understandable for a benchmark to show something approaching the peak obtainable results, but tuning the inference engine to the precise benchmark workload in this way seems to go against the idea that "We want server configs to reflect real world deployments as much as possible" and the goal "to provide benchmarks that both emulate real world applications as much as possible and reflect the continuous pace of software innovation." I struggle to think of a real-world application of DeepSeek that would be able to run with a ~2k maximum sequence length, for instance.
