Clarification on Runtime Structure (vLLM/llama.cpp, CUDA Images) #2002
Unanswered
HannesDampft asked this question in Q&A
Replies: 1 comment
-
Hello! I'm not sure I understood the question correctly, but I'll try to answer anyway; maybe I'll be lucky. In the Git repo I found this: https://github.com/containers/ramalama/blob/main/container-images/cuda-vllm/Containerfile
-
Hello everyone,
we are currently using RamaLama to host LLM workloads in Kubernetes, where each instance is isolated in its own pod. At the moment we only perform single-GPU inference and exclusively use the llama.cpp runtime.
As we now plan to move to multi-GPU deployments, our goal is to adopt vLLM as the backend, given its superior performance with tensor parallelism. However, I'm uncertain about how RamaLama structures its container images.
Specifically, could someone clarify how the runtime selection and the container image layering are intended to work? I'd be happy to help improve the documentation around this area and can prepare a PR for review if that would be helpful.
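For context, the multi-GPU behaviour we are after is vLLM's tensor parallelism. At the plain vLLM API level it looks roughly like the sketch below (model name and GPU count are just placeholders, and this is independent of how RamaLama actually launches the runtime):

    from vllm import LLM, SamplingParams

    # Rough sketch only: tensor_parallel_size shards the model across GPUs.
    # The model name and GPU count are placeholders.
    llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", tensor_parallel_size=2)
    params = SamplingParams(max_tokens=64)
    outputs = llm.generate(["Hello from a multi-GPU pod!"], params)
    print(outputs[0].outputs[0].text)

What is unclear to me is how RamaLama's cuda-vllm image and its runtime selection map onto this kind of setup.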
Thanks in advance for any insights!
Hans