The following Charmed Kubeflow charms provide default alerts to facilitate their monitoring. For more information on alert rules and how they are defined, see the corresponding Prometheus documentation.
The alert tables below use the following columns:
- Alert: The name of the alert in the Prometheus dashboard.
- Description: The condition under which the alert enters the `Firing` state.
- Severity: The severity of each alert.
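
As an illustration of how these columns relate to a Prometheus alert rule, here is a minimal sketch. It is not copied from any charm: the group name and the `up < 1` expression are assumptions made for the example, and the rules shipped with each charm may use different expressions and labels.

```yaml
groups:
  - name: ExampleUnitAvailability        # hypothetical group name
    rules:
      - alert: DexAuthUnitIsUnavailable  # "Alert" column
        expr: up < 1                     # assumed expression: the scrape target is not reachable
        for: 5m                          # condition must hold this long before the alert fires
        labels:
          severity: critical             # "Severity" column
        annotations:
          # "Description" column
          summary: The dex-auth unit is down for at least the past 5 minutes.
```

In a rule of this shape, the alert stays in the `Pending` state while the expression is true but the `for` duration has not yet elapsed, and moves to `Firing` once it has.
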

| Alert | Description | Severity |
| --- | --- | --- |
| ArgoWorkflowErrorLoglines | There are more than 10 new Error log lines every minute for at least the past 4 minutes. | Critical |
| ArgoWorkflowWarningLoglines | There are more than 40 new Warning log lines every minute for at least the past 4 minutes. | Warning |
| ArgoUnitIsUnavailable | The argo-controller unit is down for the past 5 minutes. | Critical |
| ArgoWorkflowsErroring | At least one more Argo workflow entered the Error status every minute for at least the past 10 minutes. | Warning |
| ArgoWorkflowsFailed | At least one more Argo workflow entered the Failed status every minute for at least the past 10 minutes. | Warning |
| ArgoWorkflowsPending | At least one more Argo workflow entered the Pending status every minute for at least the past 10 minutes. | Warning |

| Alert | Description | Severity |
| --- | --- | --- |
| DexAuthUnitIsUnavailable | The dex-auth unit is down for at least the past 5 minutes. | Critical |

| Alert | Description | Severity |
| --- | --- | --- |
| EnvoyUnitIsUnavailable | The envoy unit is down during the last 1 minute. | Critical |

| Alert | Description | Severity |
| --- | --- | --- |
| UnfinishedWorkQueueAlert | The amount of unfinished work in the workqueue has increased significantly during the past 5 minutes. | Critical |
| FileDescriptorsExhausted | The file descriptors have reached 98% of the maximum available. | Critical |
| FileDescriptorsSoonToBeExhausted | The file descriptors are predicted to be exhausted within 1 hour. | High |
| JupyterControllerRuntimeReconciliationErrorsExceedThreshold | The total number of controller runtime reconciliation errors has increased during the past 5 minutes. | Critical |
| JupyterControllerUnitIsUnavailable | The jupyter-controller unit is down for at least the past 5 minutes. | Critical |

| Alert | Description | Severity |
| --- | --- | --- |
| KatibControllerUnitIsUnavailable | The katib-controller unit is down during the last 1 minute. | Critical |

| Alert | Description | Severity |
| --- | --- | --- |
| KfpApiUnitIsUnavailable | The kfp-api unit is down during the last 1 minute. | Critical |

| Alert | Description | Severity |
| --- | --- | --- |
| MetacontrollerUnitIsUnavailable | The metacontroller-operator unit is down for at least the past 5 minutes. | Critical |

| Alert | Description | Severity |
| --- | --- | --- |
| MinioUnitIsUnavailable | The minio unit is down for at least the past 5 minutes. | Critical |

| Alert | Description | Severity |
| --- | --- | --- |
| SeldonWorkqueueTooManyRetries | Total number of retries handled by workqueue has increased during the past 10 minutes. | Critical |
| SeldonHTTPError | Number of HTTP requests with status code 4XX has increased during the past 10 minutes. | Critical |
| SeldonReconcileError | Total number of controller runtime reconciliations that resulted in error has increased during the past 10 minutes. | Critical |
| SeldonUnfinishedWorkIncrease | The amount of unfinished work in the workqueue has increased during the past 10 minutes. | Critical |
| SeldonWebhookError | Total number of admission HTTP requests with status code 5XX has increased during the past 10 minutes. | Critical |
| SeldonUnitIsUnavailable | The seldon-controller-manager unit is down for at least the past 5 minutes. | Critical |

| Alert | Description | Severity |
| --- | --- | --- |
| TrainingOperatorUnitIsUnavailable | The training-operator unit is down for at least the past 5 minutes. | Critical |

Last updated 2 months ago.