How to integrate Charmed Kubeflow with the Canonical Observability Stack (COS)

Charmed Kubeflow (CKF) and the Canonical Observability Stack (COS) can be integrated using Juju. The integration lets you monitor Kubeflow metrics, alerts and dashboards from COS.

Contents:

Requirements

Instructions

After deploying Kubeflow and COS, there will be a kubeflow model and a cos model. This guide assumes the two models are managed by controllers named kf-controller and cos-controller, respectively; your controller names may differ. If Kubeflow and COS were both deployed to the same cluster, both models share the same controller.

Integration with COS involves adding relations to Prometheus to have access to metrics and to Grafana to provide dashboards. To avoid cross model relations and ensure COS is accessible even from another cluster, Kubeflow components will be related to COS through the Grafana Agent charm. Data will flow from CKF charms through the Grafana agent and then to COS.

The following components provide built-in sample Grafana dashboards:

  • Argo Controller
  • Jupyter Notebook Controller
  • Seldon Controller Manager

Deploy Grafana Agent

In the kubeflow model, deploy the Grafana agent:

juju switch kubeflow
juju deploy -m kubeflow grafana-agent-k8s --channel=edge

Get COS URLs

Get Traefik URLs:

juju switch cos
juju run-action --wait --quiet traefik/leader show-proxied-endpoints --format json | jq -r '.[] | .results | ."proxied-endpoints"' | jq .

Alternatively you can try this:

juju show-unit catalogue/0 | grep url

Note the URLs for later. See the Browse dashboards section of the COS tutorial for more info on getting URLs for COS.
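The output of either command above can be long. As a minimal sketch, the hypothetical helper below (extract_urls is not part of Juju) reads any text on standard input and prints only the unique http(s) URLs it finds. It assumes the URLs appear as plain strings in the JSON or YAML output:

```shell
# Hypothetical helper (not part of Juju): print every unique http(s) URL
# found in the text read from standard input.
extract_urls() {
  grep -oE 'https?://[^"[:space:]]+' | sort -u
}
```

For example, piping juju show-unit catalogue/0 into extract_urls should leave just the URLs to note down.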

Check connectivity

Check connectivity from the Grafana agent to COS. Try to access any of the URLs noted earlier (e.g. “catalogue”) from within the Grafana agent unit:

juju switch kubeflow
juju ssh grafana-agent-k8s/0
curl -I <URL>

Before continuing with any more steps, make sure you get an OK response. This confirms that the Grafana agent can connect to COS.
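If you want to script this check, one possible approach is to test the status line that curl -I prints. The helper below is a hypothetical sketch (http_status_ok is an illustrative name): it reads response headers on standard input and succeeds only when the first line reports a 2xx status.

```shell
# Hypothetical helper: read HTTP response headers (such as `curl -sI` output)
# on standard input and succeed only if the status line reports a 2xx code.
http_status_ok() {
  head -n 1 | grep -qE '^HTTP/[0-9.]+ 2[0-9][0-9]'
}
```

A possible use is curl -sI <URL> | http_status_ok && echo reachable, run from wherever the curl command above is run.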

Make offers from COS

Switch to the cos model and make offers for Prometheus and Grafana from COS:

juju switch cos
juju offer -c cos-controller admin/cos.prometheus:receive-remote-write prometheus-receive-remote-write
juju offer -c cos-controller admin/cos.grafana:grafana-dashboard grafana-dashboards

Consume Offers in Kubeflow

Switch to the kubeflow model and consume the offers from COS for Prometheus and Grafana:

juju switch kubeflow
juju consume -m kf-controller:admin/kubeflow cos-controller:admin/cos.prometheus-receive-remote-write
juju consume -m kf-controller:admin/kubeflow cos-controller:admin/cos.grafana-dashboards
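The argument passed to juju consume is an offer URL of the form <controller>:<owner>/<model>.<offer-name>. Because controller names vary between deployments, it may help to build that URL from variables. This is a sketch only: offer_url and the variable names are hypothetical, and the defaults are simply the names used in this guide.

```shell
# Placeholder names taken from this guide; replace them with your own.
COS_CONTROLLER="${COS_CONTROLLER:-cos-controller}"
COS_OWNER="${COS_OWNER:-admin}"
COS_MODEL="${COS_MODEL:-cos}"

# Hypothetical helper: print the offer URL for a given offer name,
# in the form <controller>:<owner>/<model>.<offer-name>.
offer_url() {
  printf '%s:%s/%s.%s\n' "$COS_CONTROLLER" "$COS_OWNER" "$COS_MODEL" "$1"
}
```

With the defaults, offer_url grafana-dashboards prints cos-controller:admin/cos.grafana-dashboards, which matches the second consume command above.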

Connect Grafana Agent to endpoints

Relate the Grafana Agent to the two endpoints created by consuming those offers:

juju switch kubeflow
juju add-relation -m kubeflow grafana-agent-k8s prometheus-receive-remote-write
juju add-relation -m kubeflow grafana-agent-k8s:grafana-dashboards-provider grafana-dashboards

Verify that the relations for both offers are in place:

juju status -m cos

You should see 1/1 in the Connected column for each offer under Offers.

Prometheus integration

Relate the Kubeflow charms to the metrics-endpoint of the Grafana agent, which forwards their metrics to Prometheus in COS:

juju switch kubeflow
juju add-relation argo-controller:metrics-endpoint grafana-agent-k8s:metrics-endpoint
juju add-relation dex-auth:metrics-endpoint grafana-agent-k8s:metrics-endpoint
juju add-relation katib-controller:metrics-endpoint grafana-agent-k8s:metrics-endpoint
juju add-relation kfp-api:metrics-endpoint grafana-agent-k8s:metrics-endpoint
juju add-relation metacontroller-operator:metrics-endpoint grafana-agent-k8s:metrics-endpoint
juju add-relation minio:metrics-endpoint grafana-agent-k8s:metrics-endpoint
juju add-relation seldon-controller-manager:metrics-endpoint grafana-agent-k8s:metrics-endpoint
juju add-relation training-operator:metrics-endpoint grafana-agent-k8s:metrics-endpoint
juju add-relation jupyter-controller:metrics-endpoint grafana-agent-k8s:metrics-endpoint

Verify the relations were added with juju status --relations.

Grafana integration

Relate the Kubeflow charms to the grafana-dashboards-consumer endpoint of the Grafana agent, which forwards their dashboards to Grafana in COS:

juju switch kubeflow
juju add-relation argo-controller:grafana-dashboard grafana-agent-k8s:grafana-dashboards-consumer
juju add-relation seldon-controller-manager:grafana-dashboard grafana-agent-k8s:grafana-dashboards-consumer
juju add-relation jupyter-controller:grafana-dashboard grafana-agent-k8s:grafana-dashboards-consumer

Verify the relations were added with juju status --relations.
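The metrics and dashboard relations above follow the same pattern, so they could also be added in a loop. The helper below is a hypothetical sketch, not an official script. It assumes the application and endpoint names shown in this guide, and by default it only prints the commands (a dry run) so you can review them first.

```shell
# Dry run by default: commands are printed, not executed.
# Set JUJU=juju to run them against the current model.
JUJU="${JUJU:-echo juju}"

# Hypothetical helper: relate each listed application to the Grafana agent.
#   $1 = endpoint on the Kubeflow charm
#   $2 = endpoint on grafana-agent-k8s
#   remaining arguments = application names
relate_to_agent() {
  charm_endpoint="$1"
  agent_endpoint="$2"
  shift 2
  for app in "$@"; do
    # $JUJU is left unquoted on purpose so "echo juju" splits into two words.
    $JUJU add-relation "${app}:${charm_endpoint}" "grafana-agent-k8s:${agent_endpoint}"
  done
}
```

For instance, relate_to_agent grafana-dashboard grafana-dashboards-consumer argo-controller seldon-controller-manager jupyter-controller prints the three dashboard relations, and relate_to_agent metrics-endpoint metrics-endpoint followed by the application names prints the metrics relations.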

Access Prometheus metrics

Navigate to the Prometheus metrics URL. From here you can query various Prometheus metrics for Kubeflow. For example, if you type “argo” you should get various metrics suggested like argo_workflows_count.
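The same metrics can also be fetched from the command line through the Prometheus HTTP API, whose instant-query endpoint is /api/v1/query. The helper below is a minimal sketch: prom_query_url is a hypothetical name, and it assumes the API is reachable under the Prometheus URL noted earlier and that the query needs no URL encoding (true for a bare metric name).

```shell
# Hypothetical helper: build a Prometheus HTTP API instant-query URL.
#   $1 = Prometheus base URL (a trailing slash is tolerated)
#   $2 = metric name, or a PromQL expression needing no URL encoding
prom_query_url() {
  printf '%s/api/v1/query?query=%s\n' "${1%/}" "$2"
}
```

Passing the result to curl -s, for example with the metric argo_workflows_count, should return a JSON document with the current samples.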

Access Prometheus alerts

To see the alerts available in Prometheus, open the Prometheus metrics URL and navigate to “Alerts”.

Access Grafana dashboards

Navigate to the Grafana dashboard URL. Get the admin password:

juju run-action --wait grafana/0 get-admin-password

Log in as admin with the password returned.

Browse available dashboards by navigating to Dashboards > Browse. The following dashboards should be available:

  • ArgoWorkflow Metrics
  • Jupyter Notebook Controller
  • Seldon Core

View metrics in ArgoWorkflow Metrics dashboard

Navigate to the “ArgoWorkflow Metrics” dashboard. If no metrics are shown, locate the Grafana filters at the top of the dashboard and select the following:

  • Juju model = kubeflow
  • Juju application = argo-controller

The following metrics are provided (scroll down to see more graphs):

  • Number of workflows in error state
  • Number of workflows in failed state
  • Number of workflows in pending state
  • Number of workflows in running state
  • Number of workflows in succeeded state
  • Graph of number of workflows currently accessible by controller
  • Graph of number of workflow errors alerting
  • Histogram of workflow operation duration (not execution time)
  • Graph of number of workflow queue adds
  • Graph of depth of work queue
  • Graph of time objects spent in the queue
  • Graph of number of log messages

View metrics in Jupyter Notebook Controller dashboard

Navigate to the “Jupyter Notebook Controller” dashboard. If no metrics are shown, locate the Grafana filters at the top of the dashboard and select the following:

  • Juju model = kubeflow
  • Juju application = jupyter-controller

The following metrics are provided:

  • Graph of number of notebooks created on the controller since last restart.
  • Graph of number of notebooks present on the controller.

View metrics in Seldon Controller Manager dashboard

Navigate to the “Seldon Core” dashboard. If no metrics are shown, locate the Grafana filters at the top of the dashboard and select the following:

  • Juju model = kubeflow
  • Juju application = seldon-controller-manager

The following metrics are provided and describe models deployed to Seldon Core:

  • Graph of number of models rejected and accepted.

Last updated 3 months ago.