How to set up Charmed Kubeflow on NVIDIA DGX

NVIDIA DGX systems are purpose-built hardware for enterprise AI use cases. These platforms feature NVIDIA Tensor Core GPUs, which vastly outperform traditional CPUs for machine learning workloads, alongside advanced networking and storage capabilities.

This guide contains setup instructions for running Charmed Kubeflow on NVIDIA DGX-enabled hardware. It covers both single-node and multi-node environments and includes examples of how to use two components: Jupyter Notebooks and Kubeflow Pipelines.

Requirements:

  • NVIDIA DGX-enabled hardware with correctly configured and up-to-date BIOS settings, bootloader, OS, drivers, and packages (sample setup instructions are provided below).
  • Familiarity with Python, Docker, and Jupyter notebooks.
  • Tools: juju, kubectl

Sample Ubuntu and Grub setup

NOTE: The following setup instructions are given only as an example. There is no guarantee that they will be sufficient for all environments. Contact your hardware distributor for more details on your specific system setup.

Ubuntu Setup

These instructions were tested on a vanilla Ubuntu 20.04 installation.

Hint: make sure you do not have any NVIDIA drivers preinstalled. You can verify this with the following steps.

Check for installed apt packages (if the output is empty, you are OK):

$ sudo apt list --installed | grep nvidia

If any packages are listed, you can remove them with:

$ sudo apt remove <package-name>
$ sudo apt autoremove

Check for loaded kernel modules (if the output is empty, you are OK):

$ lsmod | grep nvidia

If any modules are listed, you can unload them with:

$ sudo modprobe -r <module-name>

Reboot

$ sudo reboot

Grub Setup

Edit /etc/default/grub and add the following options to

GRUB_CMDLINE_LINUX_DEFAULT: modprobe.blacklist=nouveau nouveau.modeset=
$ sudo reboot
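For reference, after the edit the relevant line in /etc/default/grub might look like this (a minimal sketch; keep any options already present, such as quiet splash):

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash modprobe.blacklist=nouveau nouveau.modeset=0"

After the reboot, you can confirm that the options were applied with:

$ cat /proc/cmdline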

Install Kubernetes (MicroK8s)

Install MicroK8s and enable the required add-ons. Note that the DNS and MetalLB addresses below are examples from the test environment and should be adjusted for your network; likewise, replace ubuntu with your username:

$ sudo snap install microk8s --classic --channel 1.22
 
$ sudo microk8s enable dns:10.229.32.21 storage ingress registry rbac helm3 metallb:10.64.140.43-10.64.140.49,192.168.0.105-192.168.0.111
 
$ sudo usermod -a -G microk8s ubuntu
$ sudo chown -f -R ubuntu ~/.kube
$ newgrp microk8s
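Before continuing, you can wait for MicroK8s and its add-ons to become ready (an optional sanity check):

$ microk8s status --wait-ready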

Edit /var/snap/microk8s/current/args/containerd-template.toml and add your Docker Hub credentials, which helps avoid anonymous pull rate limits:

[plugins."io.containerd.grpc.v1.cri".registry.configs]

[plugins."io.containerd.grpc.v1.cri".registry.configs."registry-1.docker.io".auth]
username = "<docker-hub-username>"
password = "<docker-hub-password>"

Then restart MicroK8s:

$ microk8s.stop; microk8s.start
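To confirm that containerd picked up the credentials, you can check that the rendered configuration contains the auth section (assuming the default MicroK8s paths; the rendered file name may vary between MicroK8s versions):

$ grep -A 2 'registry-1.docker.io' /var/snap/microk8s/current/args/containerd.toml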

Enable GPU add-on and configure MIG

Install the GPU operator:

$ sudo microk8s.enable gpu
$ mkdir -p ~/.kube
$ microk8s config > ~/.kube/config
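The GPU operator can take a few minutes to deploy. As a sanity check, you can list its pods and wait until they are all Running or Completed (exact pod names depend on the operator version):

$ kubectl get pods -A | grep -i nvidia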

Check the GPU count visible to Kubernetes:

$ kubectl get nodes --show-labels | grep gpu.count

Configure MIG (Multi-Instance GPU) devices, replacing blanka with your node name:

$ kubectl label nodes blanka nvidia.com/mig.config=all-1g.5gb --overwrite

Recheck the GPU count (it should have increased):

$ kubectl get nodes --show-labels | grep gpu.count
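You can also inspect the node's allocatable GPU resources directly, which should reflect the configured MIG slices (again, replace blanka with your node name):

$ kubectl describe node blanka | grep nvidia.com/gpu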

Troubleshooting: If no nodes appear in the get nodes command, make sure to uninstall all GPU drivers from the Kubernetes nodes and reinstall MicroK8s.

Deploy Charmed Kubeflow

Follow the instructions from How to install Charmed Kubeflow to install Charmed Kubeflow.

Once installed, you can log in to Charmed Kubeflow by following this guide.

When using MicroK8s, applications may sometimes go into an error state because of "error": "too many open files" or similar. If you see this, refer to these suggested fixes.

Wait until all components are available

To ensure Charmed Kubeflow is ready to be used, run this command:

$ nice -n 16 watch -n 1 -c juju status --relations --color

Expected state:

App                        Version                    Status   Scale  Charm                    Channel         Rev  Address         Exposed  Message
admission-webhook          res:oci-image@84a4d7d      active       1  admission-webhook        1.6/stable       50  10.152.183.98   no       
argo-controller            res:oci-image@669ebd5      active       1  argo-controller          3.3/stable       99                  no       
argo-server                res:oci-image@576d038      active       1  argo-server              3.3/stable       45                  no       
dex-auth                                              active       1  dex-auth                 2.31/stable     129  10.152.183.147  no       
istio-ingressgateway                                  active       1  istio-gateway            1.11/stable     114  10.152.183.29   no       
istio-pilot                                           active       1  istio-pilot              1.11/stable     131  10.152.183.104  no       
jupyter-controller         res:oci-image@8f4ec33      active       1  jupyter-controller       1.6/stable      138                  no       
jupyter-ui                 res:oci-image@cde6632      active       1  jupyter-ui               1.6/stable       99  10.152.183.222  no       
katib-controller           res:oci-image@03d47fb      active       1  katib-controller         0.14/stable      92  10.152.183.220  no       
katib-db                   mariadb/server:10.3        active       1  charmed-osm-mariadb-k8s  latest/stable    35  10.152.183.23   no       ready
katib-db-manager           res:oci-image@16b33a5      active       1  katib-db-manager         0.14/stable      66  10.152.183.25   no       
katib-ui                   res:oci-image@c7dc04a      active       1  katib-ui                 0.14/stable      90  10.152.183.197  no       
kfp-api                    res:oci-image@1b44753      active       1  kfp-api                  2.0/stable       81  10.152.183.8    no       
kfp-db                     mariadb/server:10.3        active       1  charmed-osm-mariadb-k8s  latest/stable    35  10.152.183.148  no       ready
kfp-persistence            res:oci-image@31f08ad      active       1  kfp-persistence          2.0/stable       76                  no       
kfp-profile-controller     res:oci-image@d86ecff      active       1  kfp-profile-controller   2.0/stable       61  10.152.183.173  no       
kfp-schedwf                res:oci-image@51ffc60      active       1  kfp-schedwf              2.0/stable       80                  no       
kfp-ui                     res:oci-image@55148fd      active       1  kfp-ui                   2.0/stable       80  10.152.183.146  no       
kfp-viewer                 res:oci-image@7190aa3      active       1  kfp-viewer               2.0/stable       79                  no       
kfp-viz                    res:oci-image@67e8b09      active       1  kfp-viz                  2.0/stable       74  10.152.183.16   no       
kubeflow-dashboard         res:oci-image@6fe6eec      active       1  kubeflow-dashboard       1.6/stable      154  10.152.183.61   no       
kubeflow-profiles          res:profile-image@0a46ffc  active       1  kubeflow-profiles        1.6/stable       82  10.152.183.204  no       
kubeflow-roles                                        active       1  kubeflow-roles           1.6/stable       31  10.152.183.141  no       
kubeflow-volumes           res:oci-image@cc5177a      active       1  kubeflow-volumes         1.6/stable       64  10.152.183.100  no       
metacontroller-operator                               active       1  metacontroller-operator  2.0/stable       48  10.152.183.169  no       
minio                      res:oci-image@1755999      active       1  minio                    ckf-1.6/stable   99  10.152.183.193  no              
oidc-gatekeeper            res:oci-image@32de216      active       1  oidc-gatekeeper          ckf-1.6/stable   76  10.152.183.194  no       
seldon-controller-manager  res:oci-image@eb811b6      active       1  seldon-core              1.14/stable      92  10.152.183.221  no       
tensorboard-controller     res:oci-image@667e455      active       1  tensorboard-controller   1.6/stable       56  10.152.183.232  no       
tensorboards-web-app       res:oci-image@914a8ab      active       1  tensorboards-web-app     1.6/stable       57  10.152.183.174  no       
training-operator                                     active       1  training-operator        1.5/stable       65  10.152.183.14   no 

Unit                          Workload  Agent  Address      Ports              Message
admission-webhook/0*          active    idle   10.1.19.32   4443/TCP           
argo-controller/0*            active    idle   10.1.19.82                      
argo-server/0*                active    idle   10.1.19.53   2746/TCP           
dex-auth/0*                   active    idle   10.1.19.56                      
istio-ingressgateway/0*       active    idle   10.1.19.28                      
istio-pilot/0*                active    idle   10.1.19.62                      
jupyter-controller/0*         active    idle   10.1.19.60                      
jupyter-ui/0*                 active    idle   10.1.19.55   5000/TCP           
katib-controller/0*           active    idle   10.1.19.46   443/TCP,8080/TCP   
katib-db-manager/0*           active    idle   10.1.19.2    6789/TCP           
katib-db/0*                   active    idle   10.1.19.57   3306/TCP           ready
katib-ui/0*                   active    idle   10.1.19.36   8080/TCP           
kfp-api/0*                    active    idle   10.1.19.85   8888/TCP,8887/TCP  
kfp-db/0*                     active    idle   10.1.19.23   3306/TCP           ready
kfp-persistence/0*            active    idle   10.1.19.80                      
kfp-profile-controller/0*     active    idle   10.1.19.83   80/TCP             
kfp-schedwf/0*                active    idle   10.1.19.22                      
kfp-ui/0*                     active    idle   10.1.19.87   3000/TCP           
kfp-viewer/0*                 active    idle   10.1.19.51                      
kfp-viz/0*                    active    idle   10.1.19.78   8888/TCP           
kubeflow-dashboard/0*         active    idle   10.1.19.86   8082/TCP           
kubeflow-profiles/0*          active    idle   10.1.19.49   8080/TCP,8081/TCP  
kubeflow-roles/0*             active    idle   10.1.19.31                      
kubeflow-volumes/0*           active    idle   10.1.19.63   5000/TCP           
metacontroller-operator/0*    active    idle   10.1.19.1                       
minio/0*                      active    idle   10.1.19.47   9000/TCP,9001/TCP  
oidc-gatekeeper/0*            active    idle   10.1.19.89   8080/TCP           
seldon-controller-manager/0*  active    idle   10.1.19.24   8080/TCP,4443/TCP  
tensorboard-controller/0*     active    idle   10.1.19.90   9443/TCP           
tensorboards-web-app/0*       active    idle   10.1.19.88   5000/TCP           
training-operator/0*          active    idle   10.1.19.38                      

Access the dashboard at http://10.64.140.43.nip.io. If you are running it in the cloud and want to access it from your local machine, you can create a SOCKS proxy as follows.

SSH to the instance (the one from which you ran the commands above):

$ ssh -D 9999 <user>@<IP>

In your computer's browser, go to Settings > Network > Network Proxy, and enable a SOCKS proxy pointing to 127.0.0.1:9999.

In a new browser window, access the dashboard IP address with .nip.io appended, for example: http://10.64.140.43.nip.io
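To verify connectivity through the proxy from the command line, you can also test with curl (the IP below is the example MetalLB address used above); a redirect to the login page typically indicates the dashboard is reachable:

$ curl -I --socks5-hostname 127.0.0.1:9999 http://10.64.140.43.nip.io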

Try Kubeflow examples

Charmed Kubeflow can be run on single-node and multi-node DGX hardware. Depending on the environment, different requirements apply. There are multiple examples that can be tried out.

Single-node DGX with Charmed Kubeflow examples

There is a GitHub repository that includes all the details about the Single-node DGX with Charmed Kubeflow.

The following examples can be found and tested:

  • Jupyter Notebook example on a single-node DGX, in the file gpu-notebook.ipynb from the repository. It also uses a multi-GPU setup.
  • Kubeflow Pipeline example on a single-node DGX that uses the same classifier as the Notebook. It is available in the file gpu-pipeline.ipynb. A quick GPU visibility check you can run before trying these examples is shown below.
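Before running the notebook or pipeline examples, it can be useful to confirm that the GPUs (or MIG slices) are visible from within a pod. This is a minimal sketch, not part of the linked repository; the pod name and image tag are illustrative:

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-check
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:11.8.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
$ kubectl logs gpu-check
$ kubectl delete pod gpu-check

Once the pod has completed, the nvidia-smi output in the logs should list the expected GPU or MIG devices.
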
Multi-node DGX with Charmed Kubeflow examples

There is a GitHub repository that includes all the details about the Multi-node DGX with Charmed Kubeflow.

The following examples can be found and tested:

  • Training TensorFlow models with multiple GPUs in a Jupyter Notebook using Charmed Kubeflow, in the folder multi-gpu-in-notebook, where the Jupyter Notebook file gpu-notebook.ipynb is available.
  • Training TensorFlow models with GPUs in a Kubeflow Pipeline, in the folder multi-gpu-in-pipeline.
  • A simulated example of multi-node training in TensorFlow, using just a single node, in the folder multi-node-gpu-simulated. The folder contains multiple files describing the workload distribution and how to run it.
  • Multi-node training in TensorFlow using the Kubeflow Training Operator's TFJob, in the folder multi-node-gpu-tfjob. A minimal TFJob sketch is shown below.
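For context on the last example, a TFJob is a Kubernetes custom resource managed by the Training Operator that runs a distributed TensorFlow job across several worker pods. The snippet below is a minimal sketch only; the resource name, replica count, and training image are illustrative, and the complete working manifests are in the repository above:

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: multi-node-example
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow  # the Training Operator expects the container to be named "tensorflow"
              image: <your-training-image>
              resources:
                limits:
                  nvidia.com/gpu: 1

Apply the manifest with kubectl apply -f and monitor progress with kubectl get tfjobs.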
