production-stack

Tutorial: Loading Model Weights from Persistent Volume

Introduction

In this tutorial, you will learn how to load a model from a Persistent Volume (PV) in Kubernetes so that the serving engine can reuse model files already stored on the volume when it is redeployed. The steps include creating a PV, matching it using pvcMatchLabels, and deploying the Helm chart to utilize the PV. You will also verify the setup by examining the volume's contents and observing the faster startup on a second deployment.

Table of Contents

- Prerequisites
- Step 1: Creating a Persistent Volume
- Step 2: Deploying with Helm Using the PV
- Step 3: Verifying the Deployment

Prerequisites

A running Kubernetes cluster with GPU support.
Completion of previous tutorials:
Basic understanding of Kubernetes PV and PVC concepts.

Step 1: Creating a Persistent Volume

Locate the Persistent Volume manifest file at tutorials/assets/pv-03.yaml, which has the following content:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: test-vllm-pv
  labels:
    model: "llama3-pv"
spec:
  capacity:
    storage: 50Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: standard
  hostPath:
    path: /data/llama3

Note: You can change the path specified in the hostPath field to any valid directory on your Kubernetes node.

Apply the manifest:

sudo kubectl apply -f tutorials/assets/pv-03.yaml

Verify the PV is created:

sudo kubectl get pv

Expected Output

NAME           CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS   AGE
test-vllm-pv   50Gi       RWO            Retain           Available           standard       2m
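
For scripted setups, the same check can be made without reading the table by eye. The sketch below is a minimal example, assuming `kubectl` is on the PATH and already configured for the cluster (this tutorial prefixes it with `sudo`; adapt as needed). The helper names `pv_phase` and `pv_is_available` are hypothetical and not part of the production-stack repository.

```shell
# pv_phase NAME: print the lifecycle phase of a PersistentVolume
# (for example Available, Bound, or Released).
# Hypothetical helper, shown for illustration only.
pv_phase() {
  kubectl get pv "$1" -o jsonpath='{.status.phase}'
}

# pv_is_available NAME: succeed only if the PV reports the Available phase,
# meaning no claim is bound to it yet.
pv_is_available() {
  [ "$(pv_phase "$1")" = "Available" ]
}
```

For example, `pv_is_available test-vllm-pv && echo "PV can be claimed"` prints the message only while the PV from this step is still unbound.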

Step 2: Deploying with Helm Using the PV

Locate the example values file at tutorials/assets/values-03-match-pv.yaml with the following content:

servingEngineSpec:
  modelSpec:
  - name: "llama3"
    repository: "vllm/vllm-openai"
    tag: "latest"
    modelURL: "meta-llama/Llama-3.1-8B-Instruct"
    replicaCount: 1

    requestCPU: 10
    requestMemory: "16Gi"
    requestGPU: 1

    pvcStorage: "50Gi"
    pvcMatchLabels:
      model: "llama3-pv"

    vllmConfig:
      maxModelLen: 4096

    env:
      - name: HF_TOKEN
        value: <YOUR_HF_TOKEN>

Explanation: The pvcMatchLabels field specifies the labels to match an existing Persistent Volume. In this example, it ensures that the deployment uses the PV with the label model: "llama3-pv". This provides a way to link a specific PV to your application.
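
To preview which PVs carry a given label before deploying, a label selector query can help. This is a small sketch, assuming `kubectl` is configured for the cluster; `pvs_with_label` is a hypothetical helper name, and the query only inspects PV labels, not how the chart builds its claim.

```shell
# pvs_with_label KEY VALUE: list PersistentVolumes labelled KEY=VALUE,
# one resource name per line. Hypothetical helper for illustration.
pvs_with_label() {
  kubectl get pv -l "$1=$2" -o name
}
```

With the PV from Step 1 applied, `pvs_with_label model llama3-pv` should list `persistentvolume/test-vllm-pv`; an empty result means no PV currently has that label.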

Note: Make sure to replace <YOUR_HF_TOKEN> with your actual Hugging Face token in the env section.
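
One way to avoid typing the token into the tracked values file is to substitute it at deploy time. The following is a sketch under stated assumptions: the token is available in an `HF_TOKEN` environment variable, it contains no `|`, `&`, or backslash characters, and `render_values` is a hypothetical helper name.

```shell
# render_values TEMPLATE: print TEMPLATE with the <YOUR_HF_TOKEN> placeholder
# replaced by the value of the HF_TOKEN environment variable.
# Assumes the token has no '|', '&', or backslash characters, which sed
# would treat specially in this substitution.
render_values() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set" >&2
    return 1
  fi
  sed "s|<YOUR_HF_TOKEN>|${HF_TOKEN}|" "$1"
}
```

For example, `render_values tutorials/assets/values-03-match-pv.yaml > values-03-rendered.yaml` writes a filled-in copy that can be passed to Helm with `-f`; keep the rendered file out of version control.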

Deploy the Helm chart:

sudo helm install llmstack llmstack-repo/vllm-stack -f tutorials/assets/values-03-match-pv.yaml

Verify the deployment:

sudo kubectl get pods

Expected Output

NAME                                             READY   STATUS    RESTARTS   AGE
llmstack-deployment-router-xxxx-xxxx             1/1     Running   0          1m
llmstack-llama3-deployment-vllm-xxxx-xxxx        1/1     Running   0          1m
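
Running pods do not by themselves show that the intended PV was used. The sketch below looks up which claim in the current namespace is bound to a given PV. It assumes `kubectl` is configured for the cluster and that the release was installed into the current namespace; `claim_for_pv` is a hypothetical helper name.

```shell
# claim_for_pv PV_NAME: print the name of each PersistentVolumeClaim in the
# current namespace whose bound volume is PV_NAME (prints nothing if none).
claim_for_pv() {
  kubectl get pvc \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.volumeName}{"\n"}{end}' |
    awk -F '\t' -v pv="$1" '$2 == pv { print $1 }'
}
```

If `claim_for_pv test-vllm-pv` prints a claim name, the label match worked; if it prints nothing, the claim is likely bound to a different volume or still pending.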

Step 3: Verifying the Deployment

Check the contents of the host directory:
- If using a standard Kubernetes node:
```
sudo ls /data/llama3/hub
```
- If using Minikube, access the Minikube VM and check the path:
```
sudo minikube ssh
ls /data/llama3/hub
```

Expected Output

You should see the model files loaded into the directory:

models--meta-llama--Llama-3.1-8B-Instruct  version.txt
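
The directory name above is derived from the model ID: the `/` in `meta-llama/Llama-3.1-8B-Instruct` becomes `--` and the result is prefixed with `models--`. A minimal sketch of that check as a script follows; both helper names are hypothetical, and the check only looks for the directory, not for complete weight files.

```shell
# hf_cache_dir_name REPO_ID: map a model ID such as "org/name" to the
# cache directory name "models--org--name" seen in the listing above.
hf_cache_dir_name() {
  printf 'models--%s\n' "$(printf '%s' "$1" | sed 's|/|--|g')"
}

# model_is_cached HUB_DIR REPO_ID: succeed if HUB_DIR contains a directory
# for REPO_ID. Does not verify that the download finished.
model_is_cached() {
  [ -d "$1/$(hf_cache_dir_name "$2")" ]
}
```

For example, on the node (or inside `minikube ssh`), `model_is_cached /data/llama3/hub meta-llama/Llama-3.1-8B-Instruct && echo cached` prints `cached` once the directory exists.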

Uninstall and reinstall the deployment to observe faster startup:

sudo helm uninstall llmstack
sudo kubectl delete -f tutorials/assets/pv-03.yaml && sudo kubectl apply -f tutorials/assets/pv-03.yaml
sudo helm install llmstack llmstack-repo/vllm-stack -f tutorials/assets/values-03-match-pv.yaml

Explanation

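During the second installation, the serving engine starts faster because the model files are already stored on the Persistent Volume and do not need to be downloaded again. The PV is deleted and re-applied between the two installs because, with the Retain reclaim policy, a PV whose claim has been deleted stays in the Released state and cannot be bound by a new claim. Recreating the PV object makes it Available again, while the files under the hostPath directory remain on the node.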
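
The startup difference can be measured rather than estimated by timing how long the pods take to become Ready after each install. This is a rough sketch, assuming `kubectl` is configured for the cluster and that the current namespace contains only this release's pods; `seconds_until_ready` is a hypothetical helper name. Note that `kubectl wait` reports an error if no pods exist yet, so a short pause after `helm install` may be needed.

```shell
# seconds_until_ready [TIMEOUT]: block until every pod in the current
# namespace reports the Ready condition, then print the elapsed whole seconds.
# Returns non-zero if the wait fails or times out (default timeout: 900s).
seconds_until_ready() {
  start=$(date +%s)
  kubectl wait --for=condition=Ready pod --all --timeout="${1:-900s}" >/dev/null || return 1
  end=$(date +%s)
  echo $((end - start))
}
```

Calling `seconds_until_ready` right after each `helm install` and comparing the two numbers gives a simple measure of the improvement; the timer starts when the function is called, not when Helm started.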

Conclusion

In this tutorial, you learned how to utilize a Persistent Volume to store model weights for a vLLM serving engine. Because the weights stay on the volume between deployments, later deployments start faster. Continue exploring advanced configurations in future tutorials.
