Grafana dashboards for Charmed Kubeflow charms

The following Grafana dashboards are provided by Charmed Kubeflow. For more information on Grafana dashboards and how they are defined, see the corresponding Grafana documentation.

Generic dashboards

CKF Charms state

This generic dashboard shows the state (up or down) of CKF charms. It includes only charms that provide metrics; see the Prometheus metrics page for the list of those charms.
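Grafana builds this panel from Prometheus data. As a rough illustration only, the following Python sketch turns an instant-query response from the Prometheus HTTP API into a per-charm up/down summary. It assumes two things: the panel is driven by Prometheus's standard `up` metric (1 when a scrape target responds, 0 otherwise), and each series carries a Juju topology label such as `juju_application`. The label name, the function name, and the sample data are assumptions, not taken from the dashboard definition.

```python
from typing import Any

# Assumptions (not from the dashboard definition): the response follows the
# Prometheus /api/v1/query instant-vector format, and every series carries a
# `juju_application` label. Sample data below is illustrative.


def charm_states(response: dict[str, Any]) -> dict[str, str]:
    """Map each application to "up" if all its targets report 1, else "down"."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    states: dict[str, str] = {}
    for series in response["data"]["result"]:
        app = series["metric"].get("juju_application", "unknown")
        # Instant-vector samples are [timestamp, "value-as-string"].
        is_up = float(series["value"][1]) == 1.0
        if not is_up:
            # A single unreachable unit marks the whole application as down.
            states[app] = "down"
        else:
            # Keep an earlier "down" verdict; otherwise record "up".
            states.setdefault(app, "up")
    return states


sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "juju_application": "argo-controller"},
             "value": [1700000000.0, "1"]},
            {"metric": {"__name__": "up", "juju_application": "minio"},
             "value": [1700000000.0, "0"]},
        ],
    },
}
print(charm_states(sample))  # {'argo-controller': 'up', 'minio': 'down'}
```

Reporting an application as down when any one of its units fails to respond is a conservative choice made in this sketch; a dashboard could equally show per-unit state.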

Pipelines

The following dashboards provide visualisations related to Kubeflow Pipelines (KFP).

ArgoWorkflow Metrics

The metrics from the Argo Workflows controller expose the status of Argo Workflow CustomResources. Key metrics cover:

  1. The number of workflows in the Failed or Error state.
  2. The time workflows spend in the queue before being run.
  3. The total size of captured logs that are pushed into S3 from the workflow pods.
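The dashboard reads these values from the controller's metrics. To cross-check the first item by hand, a minimal Python sketch can count Workflow resources whose `status.phase` is `Failed` or `Error`. It assumes the input mirrors the `items` list returned by `kubectl get workflows -A -o json`; the function name and the sample workflows are hypothetical.

```python
from collections import Counter
from typing import Any

# Assumption: `workflows` mirrors the `items` list of
# `kubectl get workflows -A -o json`; only metadata.namespace and
# status.phase are read. Sample data below is hypothetical.
UNHEALTHY_PHASES = {"Failed", "Error"}


def unhealthy_by_namespace(workflows: list[dict[str, Any]]) -> Counter:
    """Count workflows in the Failed or Error phase, per namespace."""
    counts: Counter = Counter()
    for wf in workflows:
        # A workflow not yet processed by the controller may lack a status.
        phase = wf.get("status", {}).get("phase", "")
        if phase in UNHEALTHY_PHASES:
            counts[wf["metadata"]["namespace"]] += 1
    return counts


sample = [
    {"metadata": {"name": "train-1", "namespace": "alice"}, "status": {"phase": "Succeeded"}},
    {"metadata": {"name": "train-2", "namespace": "alice"}, "status": {"phase": "Failed"}},
    {"metadata": {"name": "etl-1", "namespace": "bob"}, "status": {"phase": "Error"}},
    {"metadata": {"name": "etl-2", "namespace": "bob"}},
]
print(dict(unhealthy_by_namespace(sample)))  # {'alice': 1, 'bob': 1}
```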

Envoy Service

The metrics from the Envoy proxy expose the history of requests proxied from the KFP UI to the MLMD (ML Metadata) application. Key metrics cover:

  1. The total number of connections and requests.
  2. The request success rate (requests with a non-5xx status code), as well as the number of requests with a 4xx response, either upstream or downstream.
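The success rate above is the share of requests that did not return a 5xx status code. A minimal Python sketch of that arithmetic follows, assuming request counts are already grouped by response code class (similar in spirit to Envoy's per-class `upstream_rq_xx` statistics). The function name and the counts are hypothetical.

```python
# Assumption: counts are keyed by response code class ("2", "4", "5", ...).
# The numbers below are made up for illustration.


def success_rate(counts_by_class: dict[str, int]) -> float:
    """Fraction of requests that did not end with a 5xx status code."""
    total = sum(counts_by_class.values())
    if total == 0:
        # No traffic: report 1.0 rather than dividing by zero.
        return 1.0
    return (total - counts_by_class.get("5", 0)) / total


counts = {"2": 950, "4": 30, "5": 20}
print(f"success rate: {success_rate(counts):.1%}")  # success rate: 98.0%
print(f"4xx responses: {counts['4']}")  # 4xx responses: 30
```

Treating zero traffic as a 100% success rate is a design choice made here to avoid dividing by zero; a dashboard may instead show no data in that case.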

MinIO Dashboard

The metrics exposed by MinIO report the status of the S3 storage instance used by KFP. Key metrics cover:

  1. S3 available space and storage capacity.
  2. S3 traffic.
  3. S3 API request errors and data transferred.
  4. Node CPU, memory, file descriptor, and I/O usage.
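Available space and storage capacity are typically combined into a usage figure. A minimal Python sketch of that calculation, assuming both values are byte counts for the whole MinIO deployment; the function name and the figures are hypothetical:

```python
# Assumption: both inputs are byte counts for the whole storage backend.
# The figures below are illustrative.


def storage_usage(capacity_bytes: int, free_bytes: int) -> tuple[int, float]:
    """Return (used bytes, used fraction) for a storage backend."""
    if capacity_bytes <= 0:
        raise ValueError("capacity must be positive")
    if not 0 <= free_bytes <= capacity_bytes:
        raise ValueError("free space must be between 0 and capacity")
    used = capacity_bytes - free_bytes
    return used, used / capacity_bytes


GiB = 1024**3
used, fraction = storage_usage(capacity_bytes=100 * GiB, free_bytes=35 * GiB)
print(f"used: {used / GiB:.0f} GiB ({fraction:.0%})")  # used: 65 GiB (65%)
```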

Notebooks

The following dashboards provide visualisations related to Kubeflow Notebooks.

Jupyter Notebook Controller

The metrics exposed by the Jupyter Notebook controller report the status of Notebook CustomResources. Key metrics cover the notebooks created by the controller and the notebooks that currently exist.

Experiments

The following dashboards provide visualisations related to Katib experiments.

Katib Status

The metrics from the Katib controller report the status of Experiment and Trial CustomResources. Key metrics cover the Experiments and Trials created by the controller, as well as those that currently exist.

Serving models

The following dashboards provide visualisations related to serving ML models.

Seldon Core

The metrics from the Seldon Core controller report the status of SeldonDeployment CustomResources (also called models). Key metrics cover the Seldon deployments currently available on the controller, split into accepted and rejected.
