Charmed Kubeflow charms Prometheus alerts

The following Charmed Kubeflow charms provide default alerts to facilitate their monitoring. For more information on alert rules and how they are defined, see the corresponding Prometheus documentation.

Understanding alert table columns

The alert tables below use the following columns:

  • Alert: The name of the alert as shown in the Prometheus dashboard.
  • Description: The condition under which the alert enters the Firing state.
  • Severity: The severity of each alert.
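
To relate these columns to what Prometheus reports, the following minimal sketch groups firing alerts by severity. It assumes a payload shaped like the response of the Prometheus HTTP API `/api/v1/alerts` endpoint and assumes that each rule carries a lowercase `severity` label; the sample payload is hypothetical and only reuses alert names from the tables on this page.

```python
from collections import defaultdict


def firing_alerts_by_severity(payload):
    """Group the names of firing alerts by their severity label."""
    grouped = defaultdict(list)
    for alert in payload.get("data", {}).get("alerts", []):
        if alert.get("state") != "firing":
            continue  # skip alerts that are still pending
        labels = alert.get("labels", {})
        # Assumption: rules expose a "severity" label; fall back if missing.
        severity = labels.get("severity", "unknown")
        grouped[severity].append(labels.get("alertname", "<unnamed>"))
    return {severity: sorted(names) for severity, names in grouped.items()}


# Hypothetical payload for illustration only, not real cluster output.
sample_payload = {
    "status": "success",
    "data": {
        "alerts": [
            {"labels": {"alertname": "MinioUnitIsUnavailable",
                        "severity": "critical"}, "state": "firing"},
            {"labels": {"alertname": "ArgoWorkflowsPending",
                        "severity": "warning"}, "state": "pending"},
            {"labels": {"alertname": "DexAuthUnitIsUnavailable",
                        "severity": "critical"}, "state": "firing"},
        ]
    },
}

print(firing_alerts_by_severity(sample_payload))
```

In a real deployment the payload would come from the Prometheus instance that scrapes the charms; how to reach that instance depends on your setup and is not covered here.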

Argo controller

| Alert | Description | Severity |
| --- | --- | --- |
| ArgoWorkflowErrorLoglines | There have been more than 10 new Error log lines every minute for at least the past 4 minutes. | Critical |
| ArgoWorkflowWarningLoglines | There have been more than 40 new Warning log lines every minute for at least the past 4 minutes. | Warning |
| ArgoUnitIsUnavailable | The argo-controller unit has been down for the past 5 minutes. | Critical |
| ArgoWorkflowsErroring | At least one more Argo workflow has entered the Error status every minute for at least the past 10 minutes. | Warning |
| ArgoWorkflowsFailed | At least one more Argo workflow has entered the Failed status every minute for at least the past 10 minutes. | Warning |
| ArgoWorkflowsPending | At least one more Argo workflow has entered the Pending status every minute for at least the past 10 minutes. | Warning |
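
Descriptions of the form "more than N every minute for at least the past M minutes" combine a threshold with a duration. The sketch below is a simplified model of that idea, not the actual rule shipped by the charm: it assumes one sample per minute and ignores Prometheus evaluation intervals and the Pending state. The function name and sample values are hypothetical.

```python
def holds_for_duration(samples, threshold, for_minutes):
    """Return True if each of the last `for_minutes` samples exceeds `threshold`.

    `samples` is a list of per-minute counts, oldest first.
    """
    if for_minutes <= 0 or len(samples) < for_minutes:
        return False  # not enough history to satisfy the duration
    return all(value > threshold for value in samples[-for_minutes:])


# Hypothetical counts of new Error log lines per minute.
sustained = [3, 12, 15, 11, 14]
interrupted = [12, 15, 2, 14, 11]

print(holds_for_duration(sustained, threshold=10, for_minutes=4))
print(holds_for_duration(interrupted, threshold=10, for_minutes=4))
```

With the thresholds from ArgoWorkflowErrorLoglines (10 lines, 4 minutes), the first series would satisfy the condition and the second would not, because one of the last four minutes dropped below the threshold.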

Dex Auth

| Alert | Description | Severity |
| --- | --- | --- |
| DexAuthUnitIsUnavailable | The dex-auth unit has been down for at least the past 5 minutes. | Critical |

Envoy

| Alert | Description | Severity |
| --- | --- | --- |
| EnvoyUnitIsUnavailable | The envoy unit has been down during the last minute. | Critical |

Jupyter controller

| Alert | Description | Severity |
| --- | --- | --- |
| UnfinishedWorkQueueAlert | The amount of unfinished work in the workqueue has increased significantly during the past 5 minutes. | Critical |
| FileDescriptorsExhausted | The number of open file descriptors has reached 98% of the maximum available. | Critical |
| FileDescriptorsSoonToBeExhausted | The file descriptors are predicted to be exhausted in 1 hour. | High |
| JupyterControllerRuntimeReconciliationErrorsExceedThreshold | The total number of controller runtime reconciliation errors has increased during the past 5 minutes. | Critical |
| JupyterControllerUnitIsUnavailable | The jupyter-controller unit has been down for at least the past 5 minutes. | Critical |
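
The two file descriptor alerts differ in kind: one compares current usage against a fixed ratio, the other extrapolates a trend. The sketch below illustrates both under stated assumptions. It fits a least-squares line to hypothetical samples and extrapolates one hour past the last sample; this is an approximation of the idea, not the expression used by the charm's rule, and the limit of 1024 descriptors is an assumed value.

```python
def extrapolate_linear(samples, seconds_ahead):
    """Fit a least-squares line to (timestamp, value) pairs and
    return the predicted value `seconds_ahead` after the last sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("timestamps must not all be equal")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept


def is_nearly_exhausted(open_fds, max_fds, ratio=0.98):
    """Return True if current usage has reached `ratio` of the maximum."""
    return open_fds / max_fds >= ratio


# Hypothetical samples: (seconds, open file descriptors), one per minute.
fd_samples = [(0, 700), (60, 710), (120, 720), (180, 730), (240, 740)]
MAX_FDS = 1024  # assumed limit, for illustration only

predicted = extrapolate_linear(fd_samples, seconds_ahead=3600)
print(predicted >= MAX_FDS)                 # trend-based check
print(is_nearly_exhausted(740, MAX_FDS))    # ratio-based check
```

In this example the trend-based check would trigger while the ratio-based check would not, which matches the intent of having an early warning ahead of the exhaustion alert.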

Katib controller

| Alert | Description | Severity |
| --- | --- | --- |
| KatibControllerUnitIsUnavailable | The katib-controller unit has been down during the last minute. | Critical |

KFP api

| Alert | Description | Severity |
| --- | --- | --- |
| KfpApiUnitIsUnavailable | The kfp-api unit has been down during the last minute. | Critical |

Metacontroller operator

| Alert | Description | Severity |
| --- | --- | --- |
| MetacontrollerUnitIsUnavailable | The metacontroller-operator unit has been down for at least the past 5 minutes. | Critical |

MinIO

| Alert | Description | Severity |
| --- | --- | --- |
| MinioUnitIsUnavailable | The minio unit has been down for at least the past 5 minutes. | Critical |

Seldon controller manager

| Alert | Description | Severity |
| --- | --- | --- |
| SeldonWorkqueueTooManyRetries | The total number of retries handled by the workqueue has increased during the past 10 minutes. | Critical |
| SeldonHTTPError | The number of HTTP requests with status code 4XX has increased during the past 10 minutes. | Critical |
| SeldonReconcileError | The total number of controller runtime reconciliations that resulted in an error has increased during the past 10 minutes. | Critical |
| SeldonUnfinishedWorkIncrease | The amount of unfinished work in the workqueue has increased during the past 10 minutes. | Critical |
| SeldonWebhookError | The total number of admission HTTP requests with status code 5XX has increased during the past 10 minutes. | Critical |
| SeldonUnitIsUnavailable | The seldon-controller-manager unit has been down for at least the past 5 minutes. | Critical |

Training operator

| Alert | Description | Severity |
| --- | --- | --- |
| TrainingOperatorUnitIsUnavailable | The training-operator unit has been down for at least the past 5 minutes. | Critical |
