Use NVIDIA GPUs

This guide describes how to use NVIDIA GPU resources in your Charmed Kubeflow (CKF) deployment.

Requirements

  • A CKF deployment and access to the Kubeflow dashboard. See Get started for more details.
  • An NVIDIA GPU accessible from the Kubernetes cluster that CKF is deployed on. Refer to the guide that matches your environment for more details.

Scheduling Patterns

Scheduling Kubeflow workload Pods to nodes with specialised hardware requires specific configuration, which can vary depending on the use case and the environment.

The most common patterns for GPU scheduling can be broken down into:

  1. Requesting GPUs in the Kubeflow workloads
  2. Scheduling to a specific node pool
  3. Scheduling on tainted nodes

This section gives a high-level overview of these use cases; the rest of this guide shows how to configure the different Kubeflow workloads accordingly.

Requesting GPUs

For Kubeflow workloads to be scheduled to nodes with specialised hardware, such as GPUs, the underlying workload Pods need to request the hardware via their resources field.

For this, the corresponding device plugin must also be installed in the K8s cluster. It ensures that Nodes with the specialised hardware have the correct drivers installed and that the underlying container runtime can utilise the hardware.
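As an illustrative sketch, a GPU request lives in the container's resources.limits; the manifest below is modelled as a plain Python dict mirroring a Kubernetes Pod spec, with hypothetical Pod, container, and image names:

```python
# Illustrative sketch of a Pod manifest requesting one NVIDIA GPU via
# resources.limits, modelled as a plain Python dict; the Pod name, container
# name and image are hypothetical.
pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "gpu-workload"},
    "spec": {
        "containers": [
            {
                "name": "training",
                "image": "kubeflownotebookswg/jupyter-tensorflow-cuda:v1.9.0",
                # The NVIDIA device plugin advertises this resource name on GPU
                # nodes; requesting it steers the scheduler to such a node.
                "resources": {"limits": {"nvidia.com/gpu": 1}},
            }
        ]
    },
}

print(pod_manifest["spec"]["containers"][0]["resources"]["limits"])
```

The nvidia.com/gpu resource name only exists on Nodes where the NVIDIA device plugin is running, which is what ties the request to the device plugin requirement above.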

Schedule to Specific Node Pool

While setting the resources in the workload Pod lets K8s know it should schedule the Pod to a node with the requested hardware, it is quite common for the Pod to also need to be scheduled onto a specialised subset of Nodes.

For example, a workload may need both a GPU and a node that is intended for development rather than production, or one in a specific availability zone or data center.

This is achieved by setting the underlying workload Pod’s nodeSelector or node affinities, to further narrow down the list of Nodes the Pod should be scheduled to.
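As a sketch in the same dict style, a nodeSelector sits alongside the GPU request in the Pod spec; the label key and value here are hypothetical and should match your cluster's own node labels:

```python
# Sketch of a Pod spec combining a GPU request with a nodeSelector, narrowing
# scheduling to a labelled node pool; the "pool: gpu-dev" label is hypothetical.
pod_spec = {
    "nodeSelector": {"pool": "gpu-dev"},
    "containers": [
        {
            "name": "training",
            # Still required: the GPU request itself.
            "resources": {"limits": {"nvidia.com/gpu": 1}},
        }
    ],
}

print(pod_spec["nodeSelector"])
```

Node affinities express the same intent with richer match rules (sets of values, preferred rather than required placement) when a single label match is not enough.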

Schedule on Tainted Nodes

Nodes with specialised hardware, like GPUs, are very expensive. Because of this, there is a common pattern of using autoscaling node pools with such nodes, so that unused nodes are scaled down.

To achieve the above, admins also set Taints on those Nodes, to ensure that only Pods configured with Tolerations can be scheduled to them. You can read more about this in the upstream K8s documentation.

Thus, in this scenario, Kubeflow workload Pods also need Tolerations configured to be able to get scheduled to the specialised Nodes.
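Sketching this in the same style, a toleration matching a NoSchedule taint is added to the Pod spec; the taint key and value are hypothetical and should match whatever taint your admins applied:

```python
# Sketch of a Pod spec tolerating a hypothetical "sku=gpu:NoSchedule" taint,
# which admins may have placed on an autoscaling GPU node pool.
pod_spec = {
    "tolerations": [
        {"key": "sku", "operator": "Equal", "value": "gpu", "effect": "NoSchedule"}
    ],
    "containers": [
        {
            "name": "training",
            "resources": {"limits": {"nvidia.com/gpu": 1}},
        }
    ],
}

print(pod_spec["tolerations"][0])
```

A toleration only permits scheduling onto the tainted Nodes; it is usually combined with a nodeSelector or affinity to also require them.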

Spin up a Notebook on a GPU

Kubeflow Notebooks can use any GPU resource available in the Kubernetes cluster. This is configurable during the Notebook’s creation.

When creating a Notebook, under GPUs, select the number of GPUs and NVIDIA as the GPU vendor. The number of GPUs you can request depends both on the cluster setup and on your code’s demands.

If your Notebook uses a TensorFlow-based image with CUDA, use the following code to confirm the notebook has access to a GPU:

import tensorflow as tf
gpus = tf.config.list_physical_devices("GPU")
print(f"Congrats! The following GPUs are available to the notebook: {gpus}" if gpus else "There's no GPU available to the notebook")

In case your cluster setup uses Taints, see Leverage PodDefaults for more details.

In case you need to schedule the Notebook to a more specific Node pool, you can configure the Notebook’s Affinities. See how to configure the Notebooks web app to have some default options for Affinities.

Run Pipeline steps on a GPU

Kubeflow Pipelines steps can use GPU resources available in your Kubernetes cluster. You can enable this by adding the nvidia.com/gpu: 1 limit to a step during the Pipeline’s definition. See the detailed steps below.

A GPU can be used by one Pod at a time. Thus, a Pipeline can schedule Pods on a GPU only when one is available. For advanced GPU sharing practices on Kubernetes, see NVIDIA Multi-Instance GPU.

  1. Open a notebook with your Pipeline. If you don’t have one, use the following code as an example. It creates a Pipeline with a single component that checks GPU access:
# Import required objects
from kfp import dsl

@dsl.component(base_image="kubeflownotebookswg/jupyter-tensorflow-cuda:v1.9.0")
def gpu_check() -> str:
    """Get the list of GPUs and print it. If empty, raise a RuntimeError."""
    import tensorflow as tf
    gpus = tf.config.list_physical_devices("GPU")
    print("GPU list:", gpus)
    if not gpus:
        raise RuntimeError("No GPU has been detected.")
    return str(len(gpus) > 0)

@dsl.pipeline
def gpu_check_pipeline() -> str:
    """Create a pipeline that runs code to check access to a GPU."""
    gpu_check_object = gpu_check()
    return gpu_check_object.output

Make sure the KFP SDK is installed in the Notebook’s environment:

!pip install "kfp>=2.4,<3.0"
  2. Ensure the step of the Pipeline’s component gpu_check runs on a GPU by creating a function add_gpu_request(task) that uses the SDK’s add_node_selector_constraint() and set_accelerator_limit(). This sets the required limit for the step’s Pod:
def add_gpu_request(task: dsl.PipelineTask) -> dsl.PipelineTask:
    """Add a request field for a GPU to the container created by the PipelineTask object."""
    return task.add_node_selector_constraint(accelerator="nvidia.com/gpu").set_accelerator_limit(
        limit=1
    )

To schedule the Task’s Pod to a Node with more specialised requirements, you can modify the nodeSelector of the Task’s Pod.

You can do this with the kfp.kubernetes.add_node_selector method, adding the labels of the node pool that the task’s Pod should be scheduled to.

To schedule the Task’s Pod in a Node pool that has Taints, you can set Tolerations in the Task’s Pod with the kfp.kubernetes.add_toleration method.
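As a rough plain-Python illustration (not the kfp API itself), these two helpers effectively contribute fields like the following to the task’s underlying Pod spec; the label and taint values here are hypothetical:

```python
# Plain-Python sketch (not the kfp API) of what the two helpers effectively
# contribute to the task's Pod spec; label and taint values are hypothetical.
def add_node_selector(pod_spec: dict, label_key: str, label_value: str) -> dict:
    """Mimic kfp.kubernetes.add_node_selector at the Pod-spec level."""
    pod_spec.setdefault("nodeSelector", {})[label_key] = label_value
    return pod_spec

def add_toleration(pod_spec: dict, key: str, value: str, effect: str = "NoSchedule") -> dict:
    """Mimic kfp.kubernetes.add_toleration at the Pod-spec level."""
    pod_spec.setdefault("tolerations", []).append(
        {"key": key, "operator": "Equal", "value": value, "effect": effect}
    )
    return pod_spec

spec = {"containers": [{"name": "main"}]}
add_node_selector(spec, "pool", "gpu-pool")
add_toleration(spec, "sku", "gpu-pool")
print(spec["nodeSelector"], spec["tolerations"])
```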

  3. Modify the Pipeline definition by calling add_gpu_request() on the component:
@dsl.pipeline
def gpu_check_pipeline() -> str:
    """Create a pipeline that runs code to check access to a GPU."""
    gpu_check_object = add_gpu_request(gpu_check())
    return gpu_check_object.output
  4. Submit and run the Pipeline:
# Submit the pipeline and create a run
from kfp.client import Client
client = Client()
run = client.create_run_from_pipeline_func(
    gpu_check_pipeline,
    experiment_name="Check access to GPU",
    enable_caching=False,
)
  5. Navigate to the output Run details. In its logs, you can see the available GPU devices the step has access to.

Distributed Training with GPUs

Distributed training in Kubeflow is achieved via the Katib and Training Operator components.

Katib Trials can be implemented with different Job types, which may have defaults via Trial Templates, and can spawn K8s Jobs or distributed training Jobs via the Training Operator.

All Trial definitions, though, end up configuring a PodSpec for the Trial’s Pods.

To accommodate the above use cases for scheduling with GPUs, you’ll need to:

  1. Set the resources.limits for the GPU that should be used
  2. Potentially, configure the PodSpec’s nodeSelector
  3. Potentially, configure the PodSpec’s tolerations

The following is an example TFJob that can be used in a Trial definition and meets all the above criteria:

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  generateName: tfjob
  namespace: your-user-namespace
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          nodeSelector:
            pool: pool1
          tolerations:
            - effect: NoSchedule
              key: sku
              operator: Equal
              value: pool1
          containers:
            - name: tensorflow
              image: gcr.io/your-project/your-image
              command:
                - python
                - -m
                - trainer.task
                - --batch_size=32
                - --training_steps=1000
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          nodeSelector:
            pool: pool1
          tolerations:
            - effect: NoSchedule
              key: sku
              operator: Equal
              value: pool1
          containers:
            - name: tensorflow
              image: gcr.io/your-project/your-image
              resources:
                limits:
                  nvidia.com/gpu: 1
              command:
                - python
                - -m
                - trainer.task
                - --batch_size=32
                - --training_steps=1000

Inference with a KServe ISVC on a GPU

KServe inference services (ISVC) can schedule their Pods on a GPU. To ensure the ISVC Pod is using a GPU, add the nvidia.com/gpu: 1 limit to the ISVC’s definition.

You can do so by using the kubectl Command Line Interface (CLI) or within a notebook.

Using kubectl CLI

Using the kubectl CLI, you can enable GPU usage in your InferenceService Pod by directly modifying its configuration YAML file. For example, the inference service YAML file from this example would be modified to:

apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
      resources:
        limits:
          nvidia.com/gpu: 1

Within a notebook

A GPU can be used by one Pod at a time. Thus, an ISVC Pod can be scheduled on a GPU only when one is available. For advanced GPU sharing practices on Kubernetes, see NVIDIA Multi-Instance GPU.

  1. Open a notebook with your InferenceService. If you don’t have one, use this one as an example.

Make sure the Kserve SDK is installed in the Notebook’s environment:

!pip install kserve
  2. Import V1ResourceRequirements from the kubernetes.client package and add a resources field to the workload you want to run on a GPU. See the example for reference:
from kubernetes.client import V1ObjectMeta, V1ResourceRequirements
from kserve import (
    constants,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1PredictorSpec,
    V1beta1SKLearnSpec,
)

ISVC_NAME = "sklearn-iris"
isvc = V1beta1InferenceService(
    api_version=constants.KSERVE_V1BETA1,
    kind=constants.KSERVE_KIND,
    metadata=V1ObjectMeta(
        name=ISVC_NAME,
        annotations={"sidecar.istio.io/inject": "false"},
    ),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            sklearn=V1beta1SKLearnSpec(
                resources=V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}
                ),
                storage_uri="gs://kfserving-examples/models/sklearn/1.0/model"
            )
        )
    ),
)
