Conversation


rahulait (Contributor) commented Dec 25, 2025

Dependencies

Depends on: NVIDIA/gpu-driver-container#529
Depends on: NVIDIA/k8s-device-plugin#1550
Depends on: NVIDIA/k8s-driver-manager#147

Description

Problem

GPU Operator supports deploying multiple driver versions within a single Kubernetes cluster through the use of multiple NVIDIADriver custom resources (CRs). However, despite supporting multiple driver instances, the GPU Operator currently deploys only a single, cluster-wide NVIDIA Container Toolkit DaemonSet and a single NVIDIA Device Plugin DaemonSet.

This architecture introduces a limitation when different NVIDIADriver CRs enable different driver-dependent features, such as GPUDirect Storage (GDS), GDRCopy, or other optional components. Because the Container Toolkit and Device Plugin are deployed once per cluster and configured uniformly, they cannot be tailored to account for feature differences across driver instances. As a result, nodes running drivers with different enabled features cannot be supported correctly or independently.

Proposed solution

During reconciliation, the GPU Operator will inject additional driver-enablement environment variables into the nvidia-driver container based on the ClusterPolicy or NVIDIADriver CR selected for the node. The driver container will then persist these variables to the filesystem of the host it runs on.

With this mechanism, each node will record a node-local view of the additional drivers enabled for it, accurately reflecting the features configured for that node via its ClusterPolicy or NVIDIADriver CR.
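As a rough sketch of the mechanism (the environment variable names here are illustrative assumptions, not necessarily the exact names used in the implementation; the file path and format match the output shown in the test section below), the driver container would persist the injected flags roughly like this:

# Sketch only: persist the injected enablement flags to the host-mounted run
# directory so other components on the node can read a node-local view.
mkdir -p /run/nvidia/driver
cat > /run/nvidia/driver/.additional-drivers-flags <<EOF
GDRCOPY_ENABLED: ${GDRCOPY_ENABLED:-false}
GDS_ENABLED: ${GDS_ENABLED:-false}
GPU_DIRECT_RDMA_ENABLED: ${GPU_DIRECT_RDMA_ENABLED:-false}
EOF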

We are also updating the GPU Operator's driver validation logic so that it waits for all enabled drivers to be installed before proceeding, as sketched below.
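Conceptually, the validation can read the per-node flags file and wait for the corresponding kernel modules (nvidia_fs for GDS, gdrdrv for GDRCopy) to be loaded before signalling readiness. The following is only an illustration of that idea, not the actual validator code:

# Sketch only: block until every feature flagged as enabled on this node has
# its kernel module loaded, then let validation proceed.
FLAGS=/run/nvidia/driver/.additional-drivers-flags
if grep -q 'GDS_ENABLED: true' "$FLAGS"; then
  until lsmod | grep -q '^nvidia_fs'; do sleep 5; done
fi
if grep -q 'GDRCOPY_ENABLED: true' "$FLAGS"; then
  until lsmod | grep -q '^gdrdrv'; do sleep 5; done
fi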

The NVIDIA device plugin is already resilient to missing devices or drivers and does not crash if a particular device is not present on a node. We are now updating it to always attempt discovery of all supported devices and driver features.

Checklist

  • No secrets, sensitive information, or unrelated changes
  • Lint checks passing (make lint)
  • Generated assets in-sync (make validate-generated-assets)
  • Go mod artifacts in-sync (make validate-modules)

Testing

  • Unit tests (make coverage)
  • Manual cluster testing (describe below)
  • N/A or Other (docs, CI config, etc.)

Test details:
Manual testing was done to validate the changes.

To test with ClusterPolicy, the following values.yaml was used:

driver:
  enabled: true
  nvidiaDriverCRD:
    enabled: false
    deployDefaultCR: false
  kernelModuleType: open
  repository: rahulsharm810
  image: driver
  version: 580.105.08
  imagePullPolicy: Always
  rdma:
    enabled: false
    useHostMofed: false
gds:
  enabled: true
  repository: nvcr.io/nvidia/cloud-native
  image: nvidia-fs
  version: "2.26.6"
  imagePullPolicy: IfNotPresent
gdrcopy:
  enabled: true
  repository: nvcr.io/nvidia/cloud-native
  image: gdrdrv
  version: "v2.5.1"
  imagePullPolicy: Always
operator:
  repository: rahulsharm810
  image: gpu-operator
  version: nvd1
  imagePullPolicy: Always
devicePlugin:
  repository: docker.io/rahulsharm810
  image: k8s-device-plugin
  version: nvd1
  imagePullPolicy: Always
cdi:
  enabled: false
validator:
  repository: rahulsharm810
  image: gpu-operator
  version: nvd1
  imagePullPolicy: Always
manager:
  repository: rahulsharm810
  image: k8s-driver-manager
  version: nvd1
  imagePullPolicy: Always

Pods after install:

root@test:~# kgpo
NAME                                                       READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-rdcrd                                1/1     Running     0          8m54s
gpu-operator-6457b8f76d-ldm8g                              1/1     Running     0          9m20s
nvidia-container-toolkit-daemonset-v72cb                   1/1     Running     0          8m54s
nvidia-cuda-validator-6hgln                                0/1     Completed   0          6m36s
nvidia-dcgm-exporter-6f86g                                 1/1     Running     0          8m54s
nvidia-device-plugin-daemonset-7pslg                       1/1     Running     0          8m54s
nvidia-driver-daemonset-kltm9                              3/3     Running     0          9m1s
nvidia-mig-manager-62vnq                                   1/1     Running     0          8m54s
nvidia-operator-validator-7fscv                            1/1     Running     0          8m54s
nvidiagpu-node-feature-discovery-gc-6d484cd547-sfgd5       1/1     Running     0          9m20s
nvidiagpu-node-feature-discovery-master-7d466cdd75-mg6nq   1/1     Running     0          9m20s
nvidiagpu-node-feature-discovery-worker-ltv95              1/1     Running     0          9m20s
root@test:~# cat /run/nvidia/driver/.additional-drivers-flags
GDRCOPY_ENABLED: true
GDS_ENABLED: true
GPU_DIRECT_RDMA_ENABLED: false
root@test:~#

Testing with the NVIDIADriver CR:

values.yaml file:

driver:
  enabled: true
  nvidiaDriverCRD:
    enabled: false
    deployDefaultCR: false
operator:
  repository: rahulsharm810
  image: gpu-operator
  version: nvd1
  imagePullPolicy: Always
devicePlugin:
  repository: docker.io/rahulsharm810
  image: k8s-device-plugin
  version: nvd1
  imagePullPolicy: Always
cdi:
  enabled: false
validator:
  repository: rahulsharm810
  image: gpu-operator
  version: nvd1
  imagePullPolicy: Always

The NVIDIADriver CR was installed using:

apiVersion: nvidia.com/v1alpha1
kind: NVIDIADriver
metadata:
  name: demo-test
spec:
  driverType: gpu
  gdrcopy:
    enabled: true
    repository: nvcr.io/nvidia/cloud-native
    image: gdrdrv
    version: v2.5.1
    imagePullPolicy: IfNotPresent
    imagePullSecrets: []
    env: []
    args: []
  kernelModuleType: open
  rdma:
    enabled: false
    useHostMofed: false
  gds:
    enabled: false
    repository: nvcr.io/nvidia/cloud-native
    image: nvidia-fs
    version: "2.26.6"
    imagePullPolicy: IfNotPresent
  startupProbe:
    failureThreshold: 120
    initialDelaySeconds: 60
    periodSeconds: 10
    timeoutSeconds: 60
  image: driver
  repository: rahulsharm810
  imagePullPolicy: Always
  version: 580.105.08
  usePrecompiled: false
  manager:
    repository: rahulsharm810
    image: k8s-driver-manager
    version: nvd1
    imagePullPolicy: Always

Status after install:

root@test:~# kgpo
NAME                                                       READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-j8nlg                                1/1     Running     0          34m
gpu-operator-6457b8f76d-vtzbj                              1/1     Running     0          36m
nvidia-container-toolkit-daemonset-9hzvt                   1/1     Running     0          34m
nvidia-cuda-validator-8h769                                0/1     Completed   0          33m
nvidia-dcgm-exporter-2rzzf                                 1/1     Running     0          34m
nvidia-device-plugin-daemonset-v7fzj                       1/1     Running     0          34m
nvidia-gpu-driver-ubuntu24.04-6585477fb6-c4pm2             2/2     Running     0          35m
nvidia-mig-manager-s7m5q                                   1/1     Running     0          32m
nvidia-operator-validator-4sr4t                            1/1     Running     0          34m
nvidiagpu-node-feature-discovery-gc-6d484cd547-bc8k6       1/1     Running     0          36m
nvidiagpu-node-feature-discovery-master-7d466cdd75-vfqg2   1/1     Running     0          36m
nvidiagpu-node-feature-discovery-worker-cx8r9              1/1     Running     0          36m
root@test:~# cat /run/nvidia/driver/.additional-drivers-flags
GDRCOPY_ENABLED: true
GDS_ENABLED: false
GPU_DIRECT_RDMA_ENABLED: false
root@test:~#

CDI was enabled and disabled in both tests to confirm the changes work both with and without CDI.


copy-pr-bot bot commented Dec 25, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


rahulait commented Jan 7, 2026

/ok to test 4457fca
