What happened:
I have an EKS cluster with the LWS operator deployed. I deployed a single replica with a resource requirement of 8 GPUs. When I changed the YAML to 2 replicas and reduced the GPU requirement to 4 per replica, the existing replica and its pods were not changed.
As a result, the second replica was not deployed, even though the total GPU requirement (2 replicas × 4 GPUs = 8) could have been satisfied.
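To make the GPU accounting concrete, here is a minimal sketch of the arithmetic. It assumes the cluster exposes exactly 8 allocatable GPUs, which is not stated above, and all names in it are hypothetical. It only models the bookkeeping and is not LWS or scheduler code.

```python
# Assumption: the cluster has exactly 8 allocatable GPUs (hypothetical figure).
CLUSTER_GPUS = 8
NEW_REQUEST_PER_REPLICA = 4  # GPU request per replica after the change


def free_gpus(total: int, allocations: list[int]) -> int:
    """Return the GPUs left after subtracting what running replicas hold."""
    return total - sum(allocations)


# Observed: the existing replica keeps its original 8-GPU allocation.
observed_allocations = [8]
can_schedule_observed = (
    free_gpus(CLUSTER_GPUS, observed_allocations) >= NEW_REQUEST_PER_REPLICA
)

# Expected: the existing replica is recreated with the new 4-GPU request.
expected_allocations = [4]
can_schedule_expected = (
    free_gpus(CLUSTER_GPUS, expected_allocations) >= NEW_REQUEST_PER_REPLICA
)

print("second replica schedulable (observed):", can_schedule_observed)
print("second replica schedulable (expected):", can_schedule_expected)
```

Under that assumption, the second replica can only be scheduled once the first replica releases 4 of its 8 GPUs, which requires its pods to be recreated.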
What you expected to happen:
The existing replica and its pods should be recreated with the new GPU requirement, and both replicas and their pods should be deployed successfully.
How to reproduce it (as minimally and precisely as possible):
- Should be easy to reproduce: create a model deployment with a single replica that requests 8 GPUs, then update it to 2 replicas and reduce the GPU request from 8 to 4 for each replica.
Anything else we need to know?:
No
Environment:
- Kubernetes version (use `kubectl version`): 1.32
- LWS version (use `git describe --tags --dirty --always`): v0.7.0
- Cloud provider or hardware configuration: AWS/EKS + GPU
- OS (e.g. `cat /etc/os-release`): AL2023