Charmed Kubeflow charms Prometheus alerts
The following Charmed Kubeflow charms provide default alerts to facilitate their monitoring. For more information on alert rules and how they are defined, see the corresponding Prometheus documentation.
Contents:
- Understanding alert table columns
- Argo controller
- Dex auth
- Envoy
- Jupyter controller
- Katib controller
- KFP api
- Metacontroller operator
- MinIO
- Seldon controller manager
- Training operator
Understanding alert table columns
The alert tables below use the following columns:
- Alert: The name of the alert as shown in the Prometheus dashboard.
- Description: The condition under which the alert goes into the `Firing` state.
- Severity: The severity of the alert.
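Many descriptions below state that a condition has held "for at least the past N minutes". In Prometheus, an alert whose condition has just become true is first pending, and moves to firing only once the condition has held continuously for the rule's configured duration. The following is a minimal Python sketch of that behavior, not the charms' actual rule definitions. The names `AlertState` and `alert_state`, and the sample format, are hypothetical and used only for illustration.

```python
from enum import Enum


class AlertState(Enum):
    INACTIVE = "inactive"
    PENDING = "pending"
    FIRING = "firing"


def alert_state(samples, for_seconds):
    """Return the alert state after the most recent rule evaluation.

    samples: list of (timestamp_seconds, condition_is_true), oldest first.
             This format is an assumption made for this sketch.
    for_seconds: how long the condition must hold continuously before
                 the alert fires.
    """
    active_since = None
    for timestamp, condition_is_true in samples:
        if condition_is_true:
            # Remember when the current unbroken run of "true" started.
            if active_since is None:
                active_since = timestamp
        else:
            # Any false evaluation resets the run.
            active_since = None

    if active_since is None:
        return AlertState.INACTIVE

    last_timestamp = samples[-1][0]
    if last_timestamp - active_since >= for_seconds:
        return AlertState.FIRING
    return AlertState.PENDING
```

For example, with one evaluation per minute and a 5-minute duration, a unit that has been reported down at every evaluation from 0 to 300 seconds yields `AlertState.FIRING`, while a unit that recovered at any point in between starts the count again.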
Argo controller
Alert | Description | Severity |
---|---|---|
ArgoWorkflowErrorLoglines | There are more than 10 new Error log lines every minute for at least the past 4 minutes. | Critical |
ArgoWorkflowWarningLoglines | There are more than 40 new Warning log lines every minute for at least the past 4 minutes. | Warning |
ArgoUnitIsUnavailable | The argo-controller unit has been down for at least the past 5 minutes. | Critical |
ArgoWorkflowsErroring | At least one additional Argo workflow has entered the Error status every minute for at least the past 10 minutes. | Warning |
ArgoWorkflowsFailed | At least one additional Argo workflow has entered the Failed status every minute for at least the past 10 minutes. | Warning |
ArgoWorkflowsPending | At least one additional Argo workflow has entered the Pending status every minute for at least the past 10 minutes. | Warning |
Dex auth
Alert | Description | Severity |
---|---|---|
DexAuthUnitIsUnavailable | The dex-auth unit is down for at least the past 5 minutes. | Critical |
Envoy
Alert | Description | Severity |
---|---|---|
EnvoyUnitIsUnavailable | The envoy unit is down during the last 1 minute. | Critical |
Jupyter controller
Alert | Description | Severity |
---|---|---|
UnfinishedWorkQueueAlert | The amount of unfinished work in the workqueue has increased significantly during the past 5 minutes. | Critical |
FileDescriptorsExhausted | The file descriptors have reached 98% of the maximum available. | Critical |
FileDescriptorsSoonToBeExhausted | File descriptors are predicted to be exhausted within the next hour. | High |
JupyterControllerRuntimeReconciliationErrorsExceedThreshold | Total number of controller runtime reconciliation errors has increased during the past 5 minutes. | Critical |
JupyterControllerUnitIsUnavailable | The jupyter-controller unit is down for at least the past 5 minutes. | Critical |
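The FileDescriptorsSoonToBeExhausted alert relies on a prediction rather than on a current value. One common way to make such a prediction is to fit a straight line to recent samples and extrapolate it, which is what the PromQL function `predict_linear` does. Whether the charm's rule uses that function is an assumption here. The sketch below illustrates the idea in Python, with hypothetical helper names (`extrapolate_linear`, `will_exhaust`) and a made-up sample format.

```python
def extrapolate_linear(samples, seconds_ahead):
    """Least-squares linear extrapolation of a metric.

    samples: list of (timestamp_seconds, value), oldest first, with at
             least two distinct timestamps (assumed format for this sketch).
    seconds_ahead: how far past the last sample to predict.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n

    # Slope and intercept of the best-fit line through the samples.
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t

    last_t = samples[-1][0]
    return intercept + slope * (last_t + seconds_ahead)


def will_exhaust(samples, limit, seconds_ahead=3600):
    """True if the extrapolated value reaches the limit within the horizon."""
    return extrapolate_linear(samples, seconds_ahead) >= limit
```

With open file descriptors growing steadily from 100 to 300 over 20 minutes, the fitted line reaches roughly 900 one hour after the last sample, so a limit of 1000 would not trigger the condition while a limit of 800 would.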
Katib controller
Alert | Description | Severity |
---|---|---|
KatibControllerUnitIsUnavailable | The katib-controller unit is down during the last 1 minute. | Critical |
KFP api
Alert | Description | Severity |
---|---|---|
KfpApiUnitIsUnavailable | The kfp-api unit is down during the last 1 minute. | Critical |
Metacontroller operator
Alert | Description | Severity |
---|---|---|
MetacontrollerUnitIsUnavailable | The metacontroller-operator unit is down for at least the past 5 minutes. | Critical |
MinIO
Alert | Description | Severity |
---|---|---|
MinioUnitIsUnavailable | The minio unit is down for at least the past 5 minutes. | Critical |
Seldon controller manager
Alert | Description | Severity |
---|---|---|
SeldonWorkqueueTooManyRetries | Total number of retries handled by workqueue has increased during the past 10 minutes. | Critical |
SeldonHTTPError | Number of HTTP requests with status code 4XX has increased during the past 10 minutes. | Critical |
SeldonReconcileError | Total number of controller runtime reconciliations that resulted in error has increased during the past 10 minutes. | Critical |
SeldonUnfinishedWorkIncrease | The amount of unfinished work in the workqueue has increased during the past 10 minutes. | Critical |
SeldonWebhookError | Total number of admission HTTP requests with status code 5XX has increased during the past 10 minutes. | Critical |
SeldonUnitIsUnavailable | The seldon-controller-manager unit is down for at least the past 5 minutes. | Critical |
Training operator
Alert | Description | Severity |
---|---|---|
TrainingOperatorUnitIsUnavailable | The training-operator unit is down for at least the past 5 minutes. | Critical |