Clarification on Runtime Structure (vLLM/llama.cpp, CUDA Images) #2002
Unanswered
HannesDampft asked this question in Q&A
Replies: 1 comment
-
Hello! I'm not sure I understood the question correctly, but I'll try to answer anyway; maybe I'll be lucky. In the Git repo I found this: https://github.com/containers/ramalama/blob/main/container-images/cuda-vllm/Containerfile
-
Hello everyone,
we are currently using RamaLama to host LLM workloads in Kubernetes, where each instance is isolated in its own pod. At the moment we only perform single-GPU inference and exclusively use the llama.cpp runtime.
As we now plan to move to multi-GPU deployments, our goal is to adopt vLLM as the backend, given its superior performance with tensor parallelism. However, I'm uncertain about how RamaLama structures its container images.
Specifically, could someone clarify how the runtime selection and the container image layering are intended to work? I'd be happy to help improve the documentation around this area and can prepare a PR for review if that would be helpful.
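For context, the multi-GPU behaviour we are after is vLLM's tensor parallelism. At the plain vLLM API level it looks roughly like the sketch below (model name and GPU count are just placeholders, and this is independent of how RamaLama actually launches the runtime):

    from vllm import LLM, SamplingParams

    # Rough sketch only: tensor_parallel_size shards the model across GPUs.
    # The model name and GPU count are placeholders.
    llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", tensor_parallel_size=2)
    params = SamplingParams(max_tokens=64)
    outputs = llm.generate(["Hello from a multi-GPU pod!"], params)
    print(outputs[0].outputs[0].text)

What is unclear to me is how RamaLama's cuda-vllm image and its runtime selection map onto this kind of setup.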
Thanks in advance for any insights!
Hans