Restore Charmed Kubeflow

The following instructions will allow you to restore the Charmed Kubeflow (CKF) control plane data from a compatible S3 storage.

:warning: These steps are expected to be followed all at once to restore the CKF control plane, that is, restoring all databases, the pipelines MinIO bucket, and the ML Metadata database at the same time. Failing to do so may result in data loss.

:warning: Running Kubeflow pipelines and Katib experiments can affect the outcome of the restore. Make sure all pipelines and experiments are stopped and that no other processes (e.g. Jupyter Notebooks) are calling them.

:warning: User workloads in user namespaces will not be restored.

Pre-requisites

  1. Access to an S3 storage - only AWS S3 and S3 RadosGW are supported

    This S3 storage holds the backup data of the CKF control plane that will be restored.
    
  2. Admin access to the Kubernetes cluster where CKF is deployed

  3. Juju admin access to the kubeflow model

  4. yq binary

  5. Enough local storage to hold a copy of the backup data
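
As a quick pre-flight check, the sketch below reports which of the command-line tools used in this guide are not on the PATH. The helper name missing_tools is hypothetical, and the list of tools (kubectl, juju, yq) is taken from the commands used later on this page; adjust it to your environment.

```shell
# Print the name of every listed tool that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Report anything missing before starting the restore.
missing=$(missing_tools kubectl juju yq)
if [ -n "$missing" ]; then
    echo "Missing tools: $missing" >&2
fi
```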

Configure rclone

rclone is a tool for managing files in cloud storage. It is used throughout this guide to copy backup files, and it can be installed as a snap:

sudo snap install rclone

Connect to a shared S3 storage

1. Configure rclone to connect to the shared S3 storage. The following can be used as a reference:

[remote-s3]
type = s3
provider = AWS
env_auth = true
access_key_id = ...
secret_access_key = ...
region = eu-central-1
acl = private
server_side_encryption = AES256

You can check where this configuration file is located by running rclone config file.

2. Save the name of the S3 remote in an environment variable.

RCLONE_S3_REMOTE=remote-s3
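
To confirm that the remote was saved under the expected name, a small check such as the following can help. This is a sketch: remote_configured is a hypothetical helper, and it relies on rclone listremotes printing one remote per line followed by a colon.

```shell
# Succeed if the named remote appears in rclone's configuration.
# `rclone listremotes` prints one remote per line, each ending with ":".
remote_configured() {
    rclone listremotes | grep -qxF "$1:"
}

# Usage, with RCLONE_S3_REMOTE set as in step 2:
# remote_configured "$RCLONE_S3_REMOTE" || echo "remote not configured" >&2
```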

Connect to CKF MinIO

1. The following steps require an accessible MinIO endpoint, which can be exposed by port-forwarding the minio Service:

kubectl port-forward -n kubeflow svc/minio 9000:9000

2. Get minio’s secret-key value:

juju show-unit kfp-ui/0 \
        | yq '.kfp-ui/0.relation-info.[] | select (.endpoint == "object-storage") | .application-data.data' \
        | yq '.secret-key'

3. Get minio’s access-key:

juju config minio access-key

4. Configure rclone to connect to CKF MinIO. The following can be used as a reference, with access_key_id and secret_access_key set to the values obtained in the previous steps:

[minio-ckf]
type = s3
provider = Minio
access_key_id = minio
secret_access_key = ...
endpoint = http://localhost:9000
acl = private

5. Save the name of the MinIO remote in an environment variable.

RCLONE_MINIO_CKF_REMOTE=minio-ckf
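
With the port-forward from step 1 still running, the connection can be sanity-checked by looking for the mlpipeline bucket on the MinIO remote. The helper below is a hypothetical sketch; it assumes rclone lsf lists top-level buckets one per line with a trailing slash.

```shell
# Succeed if the remote exposes a top-level bucket with the given name.
# `rclone lsf` lists entries one per line; buckets and directories end in "/".
remote_has_bucket() {
    rclone lsf "$1:" | grep -qxF "$2/"
}

# Usage, with RCLONE_MINIO_CKF_REMOTE set as in step 5:
# remote_has_bucket "$RCLONE_MINIO_CKF_REMOTE" mlpipeline || echo "bucket not found" >&2
```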

Restore CKF databases from S3 storage

CKF uses katib-db and kfp-db as databases for Katib and Kubeflow pipelines respectively.

1. Deploy and configure the s3-integrator to connect to the shared S3 storage.

Follow the S3 AWS and S3 RadosGW configuration guides for this step.

2. Scale up kfp-db and katib-db.

This step prevents the primary database from becoming unavailable during the restore.

juju scale-application kfp-db 2
juju scale-application katib-db 2

3. Restore kfp-db and katib-db.

In the commands from that guide, replace mysql-k8s with the name of the database you intend to restore, e.g. katib-db instead of mysql-k8s.
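
To illustrate the name substitution, the sketch below prints (but does not run) the commands for each database. It assumes Juju 3.x syntax and that the database charm exposes list-backups and restore actions with a backup-id parameter, as the Charmed MySQL K8s documentation describes; verify the action names against the guide for your charm revision. print_restore_commands is a hypothetical helper, and the backup ID is left as a placeholder.

```shell
# Print (not run) the Juju 3.x commands for restoring one database application.
# $1 = application name (for example katib-db), $2 = ID of the backup to restore.
# The action names are an assumption based on the Charmed MySQL K8s docs.
print_restore_commands() {
    printf 'juju run %s/leader list-backups\n' "$1"
    printf 'juju run %s/leader restore backup-id=%s\n' "$1" "$2"
}

print_restore_commands katib-db '<backup-id>'
print_restore_commands kfp-db '<backup-id>'
```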

Restore ML Metadata using sqlite3

The mlmd charm uses a SQLite database to store ML metadata generated from Kubeflow pipelines.

1. Install the required tools inside the application container

This guide expects the mlmd application container to have internet access. If that is not the case, please check Restore ML Metadata using kubectl cp.

# Set only one of the following pairs, depending on the deployed version.

# MLMD > 1.14, CKF 1.9
MLMD_POD="mlmd-0"
MLMD_CONTAINER="mlmd"

# MLMD 1.14, CKF 1.8
MLMD_POD="mlmd-0"
MLMD_CONTAINER="mlmd-grpc-server"

kubectl exec -n kubeflow $MLMD_POD -c $MLMD_CONTAINER -- \
    /bin/bash -c "apt update && apt install sqlite3 -y"

2. Scale down kfp-metadata-writer

juju scale-application kfp-metadata-writer 0

3. Copy the backup file from the shared S3 storage to local storage. MLMD_BACKUP must be set to the name of the backup file to restore.

S3_BUCKET=backup-bucket-2024
RCLONE_S3_REMOTE=remote-s3

rclone --size-only copy \
	$RCLONE_S3_REMOTE:$S3_BUCKET/$MLMD_BACKUP .

4. Restore data from a backup file

Copy the local database file into the application container

kubectl cp -n kubeflow -c $MLMD_CONTAINER \
	$MLMD_BACKUP \
	$MLMD_POD:/tmp/$MLMD_BACKUP

Move the current database file to a temporary directory

kubectl exec -n kubeflow $MLMD_POD -c $MLMD_CONTAINER -- \
	/bin/bash -c "mv /data/mlmd.db /tmp/mlmd.current"

Restore the database from the backup file

kubectl exec -n kubeflow $MLMD_POD -c $MLMD_CONTAINER -- \
	/bin/bash -c "zcat /tmp/$MLMD_BACKUP | sqlite3 /data/mlmd.db"

5. Optionally remove the local backup file

rm -rf $MLMD_BACKUP

6. Scale up kfp-metadata-writer.

juju scale-application kfp-metadata-writer 1
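
Two optional checks can make this procedure safer: confirming that the downloaded backup is a readable gzip archive before step 4, and running SQLite's integrity check on the restored database afterwards. The following is a minimal sketch; both helper names are hypothetical, and the second one assumes sqlite3 was installed in the container as in step 1.

```shell
# Succeed if the file is a readable gzip archive
# (the restore step pipes the backup through zcat).
backup_is_valid_gzip() {
    gzip -t "$1" 2>/dev/null
}

# Run SQLite's built-in integrity check on the restored database in the pod.
# $1 = pod name, $2 = container name; SQLite prints "ok" when no problems are found.
mlmd_integrity_check() {
    kubectl exec -n kubeflow "$1" -c "$2" -- \
        sqlite3 /data/mlmd.db "PRAGMA integrity_check;"
}

# Usage:
# backup_is_valid_gzip "$MLMD_BACKUP" || echo "backup file is not valid gzip" >&2
# mlmd_integrity_check "$MLMD_POD" "$MLMD_CONTAINER"
```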

Restore mlpipeline MinIO bucket

Sync all files from the shared S3 storage to MinIO.

S3_BUCKET=backup-bucket-2024
RCLONE_S3_REMOTE=remote-s3
RCLONE_BWIDTH_LIMIT=20M

rclone --size-only sync \
	--bwlimit $RCLONE_BWIDTH_LIMIT \
	$RCLONE_S3_REMOTE:$S3_BUCKET/mlpipeline \
	$RCLONE_MINIO_CKF_REMOTE:mlpipeline
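
After the sync completes, rclone check can compare the two locations. The sketch below wraps it in a hypothetical helper and passes --size-only so the comparison uses the same criterion as the sync above; the remote and bucket names are passed as arguments.

```shell
# Compare source and destination by size only, mirroring the --size-only sync.
# $1 = S3 remote, $2 = S3 bucket, $3 = MinIO remote.
verify_mlpipeline_sync() {
    rclone check --size-only "$1:$2/mlpipeline" "$3:mlpipeline"
}

# Usage:
# verify_mlpipeline_sync "$RCLONE_S3_REMOTE" "$S3_BUCKET" minio-ckf
```

rclone check exits with a non-zero status when differences are found, so the helper can be used directly in a conditional.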

Alternative restore methods

Restore ML Metadata using kubectl cp

The mlmd charm uses a SQLite database to store ML metadata generated from Kubeflow pipelines.

1. Scale down kfp-metadata-writer

juju scale-application kfp-metadata-writer 0

2. Copy the backup file from the shared S3 storage to local storage. MLMD_BACKUP must be set to the name of the backup file to restore.

S3_BUCKET=backup-bucket-2024
RCLONE_S3_REMOTE=remote-s3

rclone --size-only copy \
	$RCLONE_S3_REMOTE:$S3_BUCKET/$MLMD_BACKUP .

3. Restore data from a backup file

Copy the local database file into the application container

kubectl cp -n kubeflow -c $MLMD_CONTAINER \
	$MLMD_BACKUP \
	$MLMD_POD:/tmp/$MLMD_BACKUP

Move the current database file to a temporary directory

kubectl exec -n kubeflow $MLMD_POD -c $MLMD_CONTAINER -- \
	/bin/bash -c "mv /data/mlmd.db /tmp/mlmd.current"

Place the backup file into the data path

kubectl exec -n kubeflow $MLMD_POD -c $MLMD_CONTAINER -- \
	/bin/bash -c "mv /tmp/$MLMD_BACKUP /data/mlmd.db"

4. Optionally remove the local backup file

rm -rf $MLMD_BACKUP

5. Scale up kfp-metadata-writer.

juju scale-application kfp-metadata-writer 1
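
Because this method places the backup file directly at /data/mlmd.db, it can be worth checking before step 3 that the downloaded file really is a SQLite database. A minimal sketch, assuming the backup is a raw database file: SQLite files begin with the string "SQLite format 3", and the hypothetical helper below tests for it.

```shell
# Succeed if the file starts with the 15-character SQLite header string.
looks_like_sqlite_db() {
    [ "$(head -c 15 "$1" 2>/dev/null)" = "SQLite format 3" ]
}

# Usage:
# looks_like_sqlite_db "$MLMD_BACKUP" || echo "not a SQLite database file" >&2
```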

Last updated 8 days ago.