-
Notifications
You must be signed in to change notification settings - Fork 470
feat(vllm): add vLLM integration #14732
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
|
Bootstrap import analysisComparison of import times between this PR and base. SummaryThe average import time from this PR is: 248 ± 3 ms. The average import time from base is: 250 ± 2 ms. The import time difference between this PR and base is: -2.2 ± 0.1 ms. Import time breakdownThe following import paths have shrunk:
|
Performance SLOsComparing candidate alex/feat/vllm (58b6893) with baseline main (68a6181) 📈 Performance Regressions (3 suites)📈 iastaspects - 118/118✅ add_aspectTime: ✅ 0.400µs (SLO: <10.000µs 📉 -96.0%) vs baseline: ~same Memory: ✅ 40.280MB (SLO: <41.500MB -2.9%) vs baseline: +4.2% ✅ add_inplace_aspectTime: ✅ 0.408µs (SLO: <10.000µs 📉 -95.9%) vs baseline: -0.5% Memory: ✅ 40.441MB (SLO: <41.500MB -2.6%) vs baseline: +5.4% ✅ add_inplace_noaspectTime: ✅ 0.314µs (SLO: <10.000µs 📉 -96.9%) vs baseline: -1.9% Memory: ✅ 40.285MB (SLO: <41.500MB -2.9%) vs baseline: +4.9% ✅ add_noaspectTime: ✅ 0.277µs (SLO: <10.000µs 📉 -97.2%) vs baseline: +0.4% Memory: ✅ 40.383MB (SLO: <41.500MB -2.7%) vs baseline: +5.2% ✅ bytearray_aspectTime: ✅ 1.341µs (SLO: <10.000µs 📉 -86.6%) vs baseline: -0.3% Memory: ✅ 40.187MB (SLO: <41.500MB -3.2%) vs baseline: +5.1% ✅ bytearray_extend_aspectTime: ✅ 1.492µs (SLO: <10.000µs 📉 -85.1%) vs baseline: -0.7% Memory: ✅ 40.088MB (SLO: <41.500MB -3.4%) vs baseline: +4.1% ✅ bytearray_extend_noaspectTime: ✅ 0.608µs (SLO: <10.000µs 📉 -93.9%) vs baseline: -0.7% Memory: ✅ 40.344MB (SLO: <41.500MB -2.8%) vs baseline: +5.0% ✅ bytearray_noaspectTime: ✅ 0.482µs (SLO: <10.000µs 📉 -95.2%) vs baseline: +0.6% Memory: ✅ 40.128MB (SLO: <41.500MB -3.3%) vs baseline: +4.6% ✅ bytes_aspectTime: ✅ 1.285µs (SLO: <10.000µs 📉 -87.2%) vs baseline: -0.2% Memory: ✅ 40.036MB (SLO: <41.500MB -3.5%) vs baseline: +3.4% ✅ bytes_noaspectTime: ✅ 0.492µs (SLO: <10.000µs 📉 -95.1%) vs baseline: +0.1% Memory: ✅ 40.324MB (SLO: <41.500MB -2.8%) vs baseline: +4.8% ✅ bytesio_aspectTime: ✅ 1.330µs (SLO: <10.000µs 📉 -86.7%) vs baseline: +0.5% Memory: ✅ 40.265MB (SLO: <41.500MB -3.0%) vs baseline: +4.3% ✅ bytesio_noaspectTime: ✅ 0.495µs (SLO: <10.000µs 📉 -95.0%) vs baseline: ~same Memory: ✅ 40.108MB (SLO: <41.500MB -3.4%) vs baseline: +4.3% ✅ capitalize_aspectTime: ✅ 0.730µs (SLO: <10.000µs 📉 -92.7%) vs baseline: -0.4% Memory: ✅ 40.226MB (SLO: <41.500MB -3.1%) vs baseline: +4.9% ✅ capitalize_noaspectTime: ✅ 0.432µs (SLO: <10.000µs 📉 -95.7%) vs baseline: -1.3% Memory: ✅ 40.128MB (SLO: <41.500MB -3.3%) vs baseline: +4.6% ✅ casefold_aspectTime: ✅ 0.733µs (SLO: <10.000µs 📉 -92.7%) vs baseline: -0.1% Memory: ✅ 40.206MB (SLO: <41.500MB -3.1%) vs baseline: +4.8% ✅ casefold_noaspectTime: ✅ 0.370µs (SLO: <10.000µs 📉 -96.3%) vs baseline: -0.2% Memory: ✅ 40.187MB (SLO: <41.500MB -3.2%) vs baseline: +4.3% ✅ decode_aspectTime: ✅ 0.726µs (SLO: <10.000µs 📉 -92.7%) vs baseline: +0.3% Memory: ✅ 40.442MB (SLO: <41.500MB -2.5%) vs baseline: +5.4% ✅ decode_noaspectTime: ✅ 0.416µs (SLO: <10.000µs 📉 -95.8%) vs baseline: -1.1% Memory: ✅ 40.226MB (SLO: <41.500MB -3.1%) vs baseline: +4.8% ✅ encode_aspectTime: ✅ 0.704µs (SLO: <10.000µs 📉 -93.0%) vs baseline: ~same Memory: ✅ 40.167MB (SLO: <41.500MB -3.2%) vs baseline: +4.1% ✅ encode_noaspectTime: ✅ 0.402µs (SLO: <10.000µs 📉 -96.0%) vs baseline: +0.8% Memory: ✅ 40.246MB (SLO: <41.500MB -3.0%) vs baseline: +4.9% ✅ format_aspectTime: ✅ 3.345µs (SLO: <10.000µs 📉 -66.5%) vs baseline: -1.4% Memory: ✅ 40.226MB (SLO: <41.500MB -3.1%) vs baseline: +4.6% ✅ format_map_aspectTime: ✅ 3.501µs (SLO: <10.000µs 📉 -65.0%) vs baseline: -2.2% Memory: ✅ 40.246MB (SLO: <41.500MB -3.0%) vs baseline: +4.7% ✅ format_map_noaspectTime: ✅ 0.774µs (SLO: <10.000µs 📉 -92.3%) vs baseline: +0.5% Memory: ✅ 40.403MB (SLO: <41.500MB -2.6%) vs baseline: +5.1% ✅ format_noaspectTime: ✅ 0.592µs (SLO: <10.000µs 📉 -94.1%) vs baseline: ~same Memory: ✅ 40.108MB (SLO: <41.500MB -3.4%) vs baseline: +4.7% ✅ index_aspectTime: ✅ 0.355µs (SLO: <10.000µs 📉 -96.5%) vs baseline: +0.1% Memory: ✅ 40.338MB (SLO: <41.500MB -2.8%) vs baseline: +4.2% ✅ index_noaspectTime: ✅ 0.277µs (SLO: <10.000µs 📉 -97.2%) vs baseline: +0.7% Memory: ✅ 40.364MB (SLO: <41.500MB -2.7%) vs baseline: +5.2% ✅ join_aspectTime: ✅ 1.340µs (SLO: <10.000µs 📉 -86.6%) vs baseline: +1.9% Memory: ✅ 40.080MB (SLO: <41.500MB -3.4%) vs baseline: +3.4% ✅ join_noaspectTime: ✅ 0.487µs (SLO: <10.000µs 📉 -95.1%) vs baseline: -1.8% Memory: ✅ 40.324MB (SLO: <41.500MB -2.8%) vs baseline: +5.1% ✅ ljust_aspectTime: ✅ 2.904µs (SLO: <20.000µs 📉 -85.5%) vs baseline: 📈 +13.8% Memory: ✅ 40.285MB (SLO: <41.500MB -2.9%) vs baseline: +4.8% ✅ ljust_noaspectTime: ✅ 0.400µs (SLO: <10.000µs 📉 -96.0%) vs baseline: -0.3% Memory: ✅ 40.226MB (SLO: <41.500MB -3.1%) vs baseline: +4.4% ✅ lower_aspectTime: ✅ 2.274µs (SLO: <10.000µs 📉 -77.3%) vs baseline: +4.2% Memory: ✅ 40.204MB (SLO: <41.500MB -3.1%) vs baseline: +4.5% ✅ lower_noaspectTime: ✅ 0.369µs (SLO: <10.000µs 📉 -96.3%) vs baseline: +0.3% Memory: ✅ 40.344MB (SLO: <41.500MB -2.8%) vs baseline: +5.3% ✅ lstrip_aspectTime: ✅ 2.248µs (SLO: <20.000µs 📉 -88.8%) vs baseline: +0.7% Memory: ✅ 40.187MB (SLO: <41.500MB -3.2%) vs baseline: +4.6% ✅ lstrip_noaspectTime: ✅ 0.382µs (SLO: <10.000µs 📉 -96.2%) vs baseline: +0.5% Memory: ✅ 40.324MB (SLO: <41.500MB -2.8%) vs baseline: +5.2% ✅ modulo_aspectTime: ✅ 1.037µs (SLO: <10.000µs 📉 -89.6%) vs baseline: +3.6% Memory: ✅ 40.305MB (SLO: <41.500MB -2.9%) vs baseline: +4.1% ✅ modulo_aspect_for_bytearray_bytearrayTime: ✅ 1.542µs (SLO: <10.000µs 📉 -84.6%) vs baseline: ~same Memory: ✅ 40.226MB (SLO: <41.500MB -3.1%) vs baseline: +4.7% ✅ modulo_aspect_for_bytesTime: ✅ 0.976µs (SLO: <10.000µs 📉 -90.2%) vs baseline: +0.6% Memory: ✅ 40.128MB (SLO: <41.500MB -3.3%) vs baseline: +4.4% ✅ modulo_aspect_for_bytes_bytearrayTime: ✅ 1.244µs (SLO: <10.000µs 📉 -87.6%) vs baseline: +2.6% Memory: ✅ 40.265MB (SLO: <41.500MB -3.0%) vs baseline: +5.0% ✅ modulo_noaspectTime: ✅ 0.626µs (SLO: <10.000µs 📉 -93.7%) vs baseline: -0.1% Memory: ✅ 40.226MB (SLO: <41.500MB -3.1%) vs baseline: +4.4% ✅ replace_aspectTime: ✅ 4.821µs (SLO: <10.000µs 📉 -51.8%) vs baseline: -0.9% Memory: ✅ 40.206MB (SLO: <41.500MB -3.1%) vs baseline: +4.3% ✅ replace_noaspectTime: ✅ 0.459µs (SLO: <10.000µs 📉 -95.4%) vs baseline: -0.5% Memory: ✅ 40.383MB (SLO: <41.500MB -2.7%) vs baseline: +4.9% ✅ repr_aspectTime: ✅ 0.908µs (SLO: <10.000µs 📉 -90.9%) vs baseline: +0.6% Memory: ✅ 40.179MB (SLO: <41.500MB -3.2%) vs baseline: +3.7% ✅ repr_noaspectTime: ✅ 0.417µs (SLO: <10.000µs 📉 -95.8%) vs baseline: -0.3% Memory: ✅ 40.482MB (SLO: <41.500MB -2.5%) vs baseline: +5.4% ✅ rstrip_aspectTime: ✅ 1.931µs (SLO: <20.000µs 📉 -90.3%) vs baseline: +1.1% Memory: ✅ 40.246MB (SLO: <41.500MB -3.0%) vs baseline: +4.8% ✅ rstrip_noaspectTime: ✅ 0.380µs (SLO: <10.000µs 📉 -96.2%) vs baseline: -0.7% Memory: ✅ 40.226MB (SLO: <41.500MB -3.1%) vs baseline: +4.7% ✅ slice_aspectTime: ✅ 0.489µs (SLO: <10.000µs 📉 -95.1%) vs baseline: -0.2% Memory: ✅ 40.240MB (SLO: <41.500MB -3.0%) vs baseline: +3.9% ✅ slice_noaspectTime: ✅ 0.447µs (SLO: <10.000µs 📉 -95.5%) vs baseline: +0.4% Memory: ✅ 40.265MB (SLO: <41.500MB -3.0%) vs baseline: +5.0% ✅ stringio_aspectTime: ✅ 1.769µs (SLO: <10.000µs 📉 -82.3%) vs baseline: 📈 +15.4% Memory: ✅ 40.226MB (SLO: <41.500MB -3.1%) vs baseline: +5.3% ✅ stringio_noaspectTime: ✅ 0.713µs (SLO: <10.000µs 📉 -92.9%) vs baseline: -0.2% Memory: ✅ 40.128MB (SLO: <41.500MB -3.3%) vs baseline: +4.7% ✅ strip_aspectTime: ✅ 2.213µs (SLO: <20.000µs 📉 -88.9%) vs baseline: -0.3% Memory: ✅ 40.266MB (SLO: <41.500MB -3.0%) vs baseline: +4.9% ✅ strip_noaspectTime: ✅ 0.387µs (SLO: <10.000µs 📉 -96.1%) vs baseline: +1.3% Memory: ✅ 40.403MB (SLO: <41.500MB -2.6%) vs baseline: +4.9% ✅ swapcase_aspectTime: ✅ 2.486µs (SLO: <10.000µs 📉 -75.1%) vs baseline: +2.7% Memory: ✅ 40.147MB (SLO: <41.500MB -3.3%) vs baseline: +4.7% ✅ swapcase_noaspectTime: ✅ 0.536µs (SLO: <10.000µs 📉 -94.6%) vs baseline: +0.2% Memory: ✅ 40.324MB (SLO: <41.500MB -2.8%) vs baseline: +5.1% ✅ title_aspectTime: ✅ 2.406µs (SLO: <10.000µs 📉 -75.9%) vs baseline: +2.5% Memory: ✅ 40.344MB (SLO: <41.500MB -2.8%) vs baseline: +5.2% ✅ title_noaspectTime: ✅ 0.502µs (SLO: <10.000µs 📉 -95.0%) vs baseline: +0.7% Memory: ✅ 40.226MB (SLO: <41.500MB -3.1%) vs baseline: +4.3% ✅ translate_aspectTime: ✅ 3.221µs (SLO: <10.000µs 📉 -67.8%) vs baseline: +0.2% Memory: ✅ 40.246MB (SLO: <41.500MB -3.0%) vs baseline: +4.7% ✅ translate_noaspectTime: ✅ 1.043µs (SLO: <10.000µs 📉 -89.6%) vs baseline: ~same Memory: ✅ 40.187MB (SLO: <41.500MB -3.2%) vs baseline: +4.9% ✅ upper_aspectTime: ✅ 2.281µs (SLO: <10.000µs 📉 -77.2%) vs baseline: +3.3% Memory: ✅ 40.202MB (SLO: <41.500MB -3.1%) vs baseline: +4.8% ✅ upper_noaspectTime: ✅ 0.367µs (SLO: <10.000µs 📉 -96.3%) vs baseline: -0.7% Memory: ✅ 40.187MB (SLO: <41.500MB -3.2%) vs baseline: +4.2% 📈 iastaspectsospath - 24/24✅ ospathbasename_aspectTime: ✅ 5.008µs (SLO: <10.000µs 📉 -49.9%) vs baseline: 📈 +27.8% Memory: ✅ 40.344MB (SLO: <41.000MB 🟡 -1.6%) vs baseline: +5.0% ✅ ospathbasename_noaspectTime: ✅ 1.074µs (SLO: <10.000µs 📉 -89.3%) vs baseline: -0.5% Memory: ✅ 40.206MB (SLO: <41.000MB 🟡 -1.9%) vs baseline: +4.7% ✅ ospathjoin_aspectTime: ✅ 5.949µs (SLO: <10.000µs 📉 -40.5%) vs baseline: -0.4% Memory: ✅ 40.167MB (SLO: <41.000MB -2.0%) vs baseline: +4.9% ✅ ospathjoin_noaspectTime: ✅ 2.281µs (SLO: <10.000µs 📉 -77.2%) vs baseline: ~same Memory: ✅ 40.167MB (SLO: <41.000MB -2.0%) vs baseline: +4.5% ✅ ospathnormcase_aspectTime: ✅ 3.245µs (SLO: <10.000µs 📉 -67.6%) vs baseline: -0.1% Memory: ✅ 40.324MB (SLO: <41.000MB 🟡 -1.6%) vs baseline: +4.7% ✅ ospathnormcase_noaspectTime: ✅ 0.564µs (SLO: <10.000µs 📉 -94.4%) vs baseline: -0.1% Memory: ✅ 40.226MB (SLO: <41.000MB 🟡 -1.9%) vs baseline: +5.0% ✅ ospathsplit_aspectTime: ✅ 4.473µs (SLO: <10.000µs 📉 -55.3%) vs baseline: -1.1% Memory: ✅ 40.383MB (SLO: <41.000MB 🟡 -1.5%) vs baseline: +5.2% ✅ ospathsplit_noaspectTime: ✅ 1.576µs (SLO: <10.000µs 📉 -84.2%) vs baseline: -0.4% Memory: ✅ 40.226MB (SLO: <41.000MB 🟡 -1.9%) vs baseline: +4.5% ✅ ospathsplitdrive_aspectTime: ✅ 3.385µs (SLO: <10.000µs 📉 -66.1%) vs baseline: -0.5% Memory: ✅ 40.147MB (SLO: <41.000MB -2.1%) vs baseline: +4.7% ✅ ospathsplitdrive_noaspectTime: ✅ 0.689µs (SLO: <10.000µs 📉 -93.1%) vs baseline: -0.5% Memory: ✅ 40.383MB (SLO: <41.000MB 🟡 -1.5%) vs baseline: +5.1% ✅ ospathsplitext_aspectTime: ✅ 4.317µs (SLO: <10.000µs 📉 -56.8%) vs baseline: +1.5% Memory: ✅ 40.226MB (SLO: <41.000MB 🟡 -1.9%) vs baseline: +4.5% ✅ ospathsplitext_noaspectTime: ✅ 1.380µs (SLO: <10.000µs 📉 -86.2%) vs baseline: +0.3% Memory: ✅ 40.246MB (SLO: <41.000MB 🟡 -1.8%) vs baseline: +4.6% 📈 telemetryaddmetric - 30/30✅ 1-count-metric-1-timesTime: ✅ 3.407µs (SLO: <20.000µs 📉 -83.0%) vs baseline: 📈 +16.0% Memory: ✅ 34.701MB (SLO: <35.500MB -2.2%) vs baseline: +4.7% ✅ 1-count-metrics-100-timesTime: ✅ 203.395µs (SLO: <220.000µs -7.5%) vs baseline: -0.1% Memory: ✅ 34.878MB (SLO: <35.500MB 🟡 -1.8%) vs baseline: +5.0% ✅ 1-distribution-metric-1-timesTime: ✅ 3.321µs (SLO: <20.000µs 📉 -83.4%) vs baseline: +1.2% Memory: ✅ 34.741MB (SLO: <35.500MB -2.1%) vs baseline: +4.9% ✅ 1-distribution-metrics-100-timesTime: ✅ 219.755µs (SLO: <230.000µs -4.5%) vs baseline: +0.7% Memory: ✅ 34.760MB (SLO: <35.500MB -2.1%) vs baseline: +4.9% ✅ 1-gauge-metric-1-timesTime: ✅ 2.179µs (SLO: <20.000µs 📉 -89.1%) vs baseline: +0.2% Memory: ✅ 34.839MB (SLO: <35.500MB 🟡 -1.9%) vs baseline: +4.9% ✅ 1-gauge-metrics-100-timesTime: ✅ 137.086µs (SLO: <150.000µs -8.6%) vs baseline: ~same Memory: ✅ 34.741MB (SLO: <35.500MB -2.1%) vs baseline: +4.7% ✅ 1-rate-metric-1-timesTime: ✅ 3.109µs (SLO: <20.000µs 📉 -84.5%) vs baseline: ~same Memory: ✅ 34.819MB (SLO: <35.500MB 🟡 -1.9%) vs baseline: +4.9% ✅ 1-rate-metrics-100-timesTime: ✅ 218.372µs (SLO: <250.000µs 📉 -12.7%) vs baseline: +1.7% Memory: ✅ 34.800MB (SLO: <35.500MB 🟡 -2.0%) vs baseline: +4.6% ✅ 100-count-metrics-100-timesTime: ✅ 20.481ms (SLO: <22.000ms -6.9%) vs baseline: +0.2% Memory: ✅ 34.780MB (SLO: <35.500MB -2.0%) vs baseline: +4.8% ✅ 100-distribution-metrics-100-timesTime: ✅ 2.270ms (SLO: <2.550ms 📉 -11.0%) vs baseline: -1.9% Memory: ✅ 34.819MB (SLO: <35.500MB 🟡 -1.9%) vs baseline: +4.7% ✅ 100-gauge-metrics-100-timesTime: ✅ 1.419ms (SLO: <1.550ms -8.5%) vs baseline: +1.0% Memory: ✅ 34.839MB (SLO: <35.500MB 🟡 -1.9%) vs baseline: +5.0% ✅ 100-rate-metrics-100-timesTime: ✅ 2.212ms (SLO: <2.550ms 📉 -13.3%) vs baseline: -0.2% Memory: ✅ 34.898MB (SLO: <35.500MB 🟡 -1.7%) vs baseline: +5.4% ✅ flush-1-metricTime: ✅ 4.595µs (SLO: <20.000µs 📉 -77.0%) vs baseline: -0.6% Memory: ✅ 34.819MB (SLO: <35.500MB 🟡 -1.9%) vs baseline: +4.3% ✅ flush-100-metricsTime: ✅ 173.994µs (SLO: <250.000µs 📉 -30.4%) vs baseline: -0.6% Memory: ✅ 35.134MB (SLO: <35.500MB 🟡 -1.0%) vs baseline: +4.6% ✅ flush-1000-metricsTime: ✅ 2.178ms (SLO: <2.500ms 📉 -12.9%) vs baseline: -0.8% Memory: ✅ 36.019MB (SLO: <36.500MB 🟡 -1.3%) vs baseline: +5.0% 🟡 Near SLO Breach (16 suites)🟡 coreapiscenario - 10/10 (1 unstable)
|
bf30414 to
0af046e
Compare
5627244 to
494f936
Compare
d970650 to
2c22b68
Compare
|
@PROFeNoM probably worth updating the codeowners file as well to make llmobs the owner of this integration, will help require less people to review it (after the codeowners change is merged) |
23026f8 to
e64073f
Compare
|
This pull request has been automatically closed after a period of inactivity. |
# Conflicts: # .gitlab/testrunner.yml # scripts/ddtest # tests/llmobs/suitespec.yml
# Conflicts: # tests/llmobs/suitespec.yml
# Conflicts: # ddtrace/llmobs/_constants.py
…d improve span creation logic - Updated `traced_output_processor_process_outputs` to capture `req_state` data for all requests, not just those marked as finished. - Improved span creation logic to ensure spans are only created for requests that have actually finished processing. - Added handling for `iteration_stats` to provide additional context in spans. - Cleaned up comments for clarity and accuracy regarding request state handling. # Conflicts: # ddtrace/llmobs/_constants.py
…h wrapt proxies - Introduced `_register_wrapt_pickle_reducers` to register custom pickle reducers for wrapt proxy types. - This enables serialization of ddtrace-wrapped objects in frameworks like Ray that utilize cloudpickle. - The new reducer unwraps proxies to their underlying objects, allowing for re-patching on deserialization. - Called the new function in `_patch_all` to ensure the reducers are registered during the patching process.
- Simplified the `_register_wrapt_pickle_reducers` function to prevent multiple registrations by using a global flag. - Removed redundant comments and improved code clarity while maintaining functionality for serializing ddtrace-wrapped objects. - Ensured that the registration of pickle reducers occurs only once to enhance performance and avoid unnecessary overhead.
- Introduced `parse_prompt_to_messages` to convert formatted prompts into structured messages, supporting various chat templates. - Added role extraction patterns for common chat formats to improve message handling. - Updated `VLLMIntegration` to utilize the new message parsing function for input messages. - Refactored tests to align with the new message structure, ensuring consistency in input and output message formats.
- Updated role extraction patterns to support additional chat templates, including Llama 4, Granite, Gemma, and others. - Improved the `parse_prompt_to_messages` function to utilize quick checks for markers, enhancing performance and accuracy in message parsing. - Added comprehensive tests for various prompt formats to ensure robust handling of different message structures and roles.
# Conflicts: # ddtrace/llmobs/_constants.py
…y metrics - Refactored GPU test configurations in `.gitlab/testrunner.yml` and `.gitlab/tests.yml` to utilize shared templates for improved maintainability. - Removed redundant GPU variant definitions and consolidated before scripts. - Enhanced latency metrics tracking in `vllm` integration by adding `set_latency_metrics` to capture detailed performance data. - Updated test snapshots to reflect changes in latency metrics and ensure consistency across tests.
5c940b1 to
0beb49a
Compare
|
@codex review |
|
Codex Review: Didn't find any major issues. What shall we delve into next? ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
Description
MLOB-4847
This PR adds Datadog tracing integration for vLLM V1 engine exclusively. V0 is deprecated and being removed (vLLM Q3 2025 Roadmap), so we're building for the future.
Request Flow and Instrumentation Points
The integration traces at the engine level rather than wrapping high-level APIs. This gives us a single integration point for all operations (completion, chat, embedding, classification) with complete access to internal metadata.
1. Engine Initialization (once per engine)
2. Request Submission (per request)
3. Output Processing (when request finishes)
The key insight:
OutputProcessor.process_outputshas everything in one place: request metadata, output data, and parent context. We wrap three specific points because each serves a distinct purpose:__init__for setup,process_inputsfor context injection,process_outputsfor span creation.Version Support
Requires vLLM >= 0.10.2 for V1 support. Version 0.10.2 includes vLLM PR #20372 which added
trace_headersfor context propagation.No V0 support. It's deprecated and being removed. The integration includes a version check that gracefully skips instrumentation on older versions with a warning.
Metadata Captured
For chat requests where vLLM only stores token IDs, we decode back to text using the tokenizer to ensure
input_messagesare captured correctly.Chat Template Parsing
For chat completions, vLLM applies Jinja2 templates to format messages. We parse the formatted prompt back into structured
input_messagesfor LLMObs.Supported formats: Llama 3/4, ChatML/Qwen, Phi, DeepSeek, Gemma, Granite, MiniMax, TeleFLM, Inkbot, Alpaca, Falcon. Chosen because they're visible as examples in vLLM repos. Fallback: raw prompt.
Parser uses quick marker detection before regex patterns, avoiding unnecessary regex execution. Prompts decoded with
skip_special_tokens=Falseto preserve chat template markers (vLLM defaults strip them).Not perfect, but simple enough that adding new templates isn't painful.
FastAPI Pickle Fix for Ray Serve Compatibility
Problem
vLLM's distributed inference (via Ray Serve) serializes FastAPI app components using pickle. When dd-trace-py instruments FastAPI with
wrapt.FunctionWrapper, these wrapped objects become unpicklable because wrapt doesn't implement__reduce_ex__()by default.Solution
We register custom pickle reducers for wrapt proxy types in
fastapi/patch.py:_reduce_wrapt_proxy()unwraps the object_identity()returns the unwrapped objectThis is acceptable because distributed vLLM workers independently instrument their FastAPI instances when dd-trace-py is imported. The registration is guarded by
_WRAPT_REDUCERS_REGISTEREDflag (only runs once globally).Why This Works
@serve.ingress(app)decorator pickles the FastAPI appcloudpickleencounterswrapt.FunctionWrapperobjects (ddtrace wrappers)wraptraisesNotImplementedErrorfor__reduce_ex__()copyregintercepts via dispatch table and uses our reducerReproducer
Without the fix, this crashes with ddtrace-run:
Run with
ddtrace-run python repro.py→ crashes without fix, works with fix.Testing
Tests run on GPU hardware using
gpu:a10-amd64runner tag in GitLab CI (GPU Runners docs). Cannot be run locally on Macs. Requires actual GPU hardware. During dev, I used ag6.8xlargeEC2 instance.Coverage:
Tests converge on same instrumentation points (as shown in request flow), so current coverage should be solid for first release.
Infrastructure notes:
Risks
V1 maturity: V1 is production-ready but still evolving toward vLLM 1.0. Our instrumentation points (
process_inputs,process_outputs) are core to V1's design and unlikely to change significantly.No V0 support: Customers on V0 won't get tracing. However, V0 is deprecated and most production deployments have migrated (V0 doesn't support pooling models anymore).
Version requirement: Requiring 0.10.2+ may exclude some users, but trace header propagation is essential to a maintainable design.
High span burst in RAG scenarios: RAG apps indexing large document collections generate significant span volumes (e.g., 1000 docs = 1000 embedding spans). This is expected behavior but may impact trace readability and ingestion costs. Could add
DD_VLLM_TRACE_EMBEDDINGS=falseconfig later if needed, but let's monitor customer feedback first rather than over-engineer.Additional Notes
Main Files
patch.py: Wraps vLLM engine methodsextractors.py: Extracts request/response data from vLLM structuresutils.py: Span creation, context injection, metrics utilitiesllmobs/_integrations/vllm.py: LLMObs-specific tagging and event building