Serve a BERT model using NVIDIA Triton Inference Server.
Prerequisites
An active Charmed Kubeflow deployment. For installation instructions, follow the Get started tutorial.
Refresh the knative-serving charm
Upgrade the knative-serving charm to the latest/edge channel:
juju refresh knative-serving --channel=latest/edge
Wait until the charm is in active status. You can watch the status with:
juju status --watch 5s
Create a Notebook
Create a Kubeflow Jupyter Notebook. The Notebook will be your workspace from which you run the commands. Running the commands in this guide requires in-cluster communication, and the instructions won’t work outside the Notebook environment. The image for the Notebook can be anything since you will only be using the CLI; you can leave it as the default.
Connect to the Notebook and start a new terminal from the Launcher.
Use this terminal session to run the commands in the next sections.
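Optionally, verify that the terminal can reach the cluster before proceeding. This is a minimal sanity check, assuming the Notebook’s ServiceAccount has the default Kubeflow permissions; <namespace> is your own namespace:
# List pods in your namespace using the Notebook pod's ServiceAccount
kubectl get pods -n <namespace>
# Confirm the ServiceAccount may create InferenceServices
kubectl auth can-i create inferenceservices.serving.kserve.io -n <namespace>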
Create the InferenceService
Define a new InferenceService YAML for the BERT model with the following content:
cat <<EOF > "./isvc.yaml"
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "bert-v2"
  annotations:
    "sidecar.istio.io/inject": "false"
spec:
  transformer:
    containers:
      - name: kserve-container
        image: kfserving/bert-transformer-v2:latest
        command:
          - "python"
          - "-m"
          - "bert_transformer_v2"
        env:
          - name: STORAGE_URI
            value: "gs://kfserving-examples/models/triton/bert-transformer"
  predictor:
    triton:
      runtimeVersion: 20.10-py3
      resources:
        limits:
          cpu: "1"
          memory: 8Gi
        requests:
          cpu: "1"
          memory: 8Gi
      storageUri: "gs://kfserving-examples/models/triton/bert"
EOF
Disable Istio sidecar
In the ISVC YAML, make sure to add the annotation "sidecar.istio.io/inject": "false", as done in the example above.
Due to issue GH 216, you will not be able to reach the ISVC without disabling Istio sidecar injection.
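After you apply the ISVC in a later step, you can confirm that no sidecar was injected. This sketch assumes KServe’s standard serving.kserve.io/inferenceservice pod label; istio-proxy should not appear in the output:
# Print the container names of the ISVC pods; istio-proxy should be absent
kubectl get pods -n <namespace> -l serving.kserve.io/inferenceservice=bert-v2 \
  -o jsonpath='{.items[*].spec.containers[*].name}'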
GPU Scheduling
To run on a GPU, specify the GPU resources in the ISVC YAML. For example, to run the predictor on an NVIDIA GPU:
cat <<EOF > "./isvc-gpu.yaml"
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "bert-v2"
spec:
transformer:
containers:
- name: kserve-container
image: kfserving/bert-transformer-v2:latest
command:
- "python"
- "-m"
- "bert_transformer_v2"
env:
- name: STORAGE_URI
value: "gs://kfserving-examples/models/triton/bert-transformer"
predictor:
triton:
runtimeVersion: 20.10-py3
resources: # specifiy gpu limits and vendor
limits:
nvidia.com/gpu: 1
requests:
nvidia.com/gpu: 1
storageUri: "gs://kfserving-examples/models/triton/bert"
EOF
See more: Kubernetes | Schedule GPUs
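Before applying a GPU-enabled ISVC, you can check that the cluster actually advertises NVIDIA GPUs. This assumes the NVIDIA device plugin is installed, which is what exposes the nvidia.com/gpu resource:
# Show allocatable NVIDIA GPUs per node (dots in the resource name must be escaped)
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'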
Modify the ISVC YAML to set the node selector, node affinity, or tolerations to match your GPU node.
Below is an ISVC YAML with node scheduling attributes:
cat <<EOF > "./isvc.yaml"
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "bert-v2"
spec:
transformer:
containers:
- name: kserve-container
image: kfserving/bert-transformer-v2:latest
command:
- "python"
- "-m"
- "bert_transformer_v2"
env:
- name: STORAGE_URI
value: "gs://kfserving-examples/models/triton/bert-transformer"
predictor:
nodeSelector:
myLabel1: "true"
tolerations:
- key: "myTaint1"
operator: "Equal"
value: "true"
effect: "NoSchedule"
triton:
runtimeVersion: 20.10-py3
resources: # specifiy gpu limits and vendor
limits:
nvidia.com/gpu: 1
requests:
nvidia.com/gpu: 1
storageUri: "gs://kfserving-examples/models/triton/bert"
EOF
This example sets nodeSelector and tolerations for the predictor. Similarly, you can set the affinity.
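For the scheduling attributes above to have any effect, the GPU node must carry the matching label and taint. As a sketch, assuming your GPU node is named <gpu-node> and using the placeholder names myLabel1 and myTaint1 from the YAML above:
# Label the node so the predictor's nodeSelector matches it
kubectl label node <gpu-node> myLabel1=true
# Taint the node so only pods tolerating myTaint1 are scheduled on it
kubectl taint node <gpu-node> myTaint1=true:NoSchedule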
Apply the ISVC to your namespace with kubectl
kubectl apply -f ./isvc.yaml -n <namespace>
Since you are using the CLI from inside a Notebook, kubectl uses the ServiceAccount credentials of the Notebook pod.
Wait until the InferenceService is in Ready state. It can take a few minutes to become Ready because the large Triton image has to be pulled. You can check on the state with:
kubectl get inferenceservice bert-v2 -n <namespace>
You should see output similar to this:
NAME      URL                                           READY   AGE
bert-v2   http://bert-v2.default.10.64.140.43.nip.io    True    71s
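Alternatively, instead of polling, you can block until the ISVC reports Ready. This is a convenience sketch; adjust the timeout to your image pull speed:
# Wait up to 10 minutes for the InferenceService to become Ready
kubectl wait --for=condition=Ready inferenceservice/bert-v2 -n <namespace> --timeout=600s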
Perform inference
Get the ISVC’s status.address.url
URL=$(kubectl get inferenceservice bert-v2 -n <namespace> -o jsonpath='{.status.address.url}')
Make a request to the ISVC’s URL
- Prepare the inference input
cat <<EOF > "./input.json"
{
"instances": [
"What President is credited with the original notion of putting Americans in space?"
]
}
EOF
- Make a prediction request
curl -v -H "Content-Type: application/json" ${URL}/v1/models/bert-v2:predict -d @./input.json
The response will contain the prediction output:
{"predictions": "John F. Kennedy", "prob": 77.91851169430718}