Changing the GPU requirement together with the replica count is not reflected in the existing replica, and the deployment does not succeed #717

@bhks

What happened:

I have an EKS cluster with the LWS operator deployed. I deployed a single replica with a resource requirement of 8 GPUs. When I changed the YAML to 2 replicas and reduced the GPU requirement to 4, the existing replica and its pods did not change.

As a result, the second replica was not deployed, even though the total GPU requirement could have been satisfied.

What you expected to happen:
The existing replica and its pods should be recreated with the new GPU requirement, and both replicas and their pods should be deployed successfully.
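
The difference between the observed and the expected outcome can be sketched as simple GPU accounting. This is only an illustration, not how the LWS operator or the Kubernetes scheduler is implemented: the total of 8 GPUs in the cluster is an assumption (the report only says the total requirement was satisfiable), and `schedulable` is a hypothetical helper.

```python
# Hypothetical GPU accounting for the scenario in this issue.
# ASSUMPTION: the cluster exposes 8 GPUs in total; the report does not state this.
CLUSTER_GPUS = 8


def schedulable(requests, capacity=CLUSTER_GPUS):
    """Place per-replica GPU requests in order; return one bool per replica."""
    free = capacity
    placed = []
    for gpus in requests:
        fits = gpus <= free
        if fits:
            free -= gpus
        placed.append(fits)
    return placed


# Observed: replica 0 keeps its old 8-GPU request, replica 1 asks for 4.
observed = schedulable([8, 4])

# Expected: replica 0 is recreated with the new 4-GPU request.
expected = schedulable([4, 4])

print("observed:", observed)
print("expected:", expected)
```

Under that assumption, the unchanged first replica leaves no GPUs free for the second one, while recreating it with the reduced request lets both replicas fit.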

How to reproduce it (as minimally and precisely as possible):

  • Should be easy to reproduce: first create a model deployment with a single replica, then change the replica count to 2 and reduce the GPU resource from 8 to 4 per replica.

Anything else we need to know?:
No

Environment:

  • Kubernetes version (use kubectl version): 1.32
  • LWS version (use git describe --tags --dirty --always): v0.7.0
  • Cloud provider or hardware configuration: AWS/EKS + GPU
  • OS (e.g: cat /etc/os-release): AL2023

Labels: kind/bug (Categorizes issue or PR as related to a bug.)