Tutorial: Basic vLLM Configurations

Introduction

This tutorial guides you through the basic configurations required to deploy a vLLM serving engine in a Kubernetes environment with GPU support. You will learn how to specify the model details, set up necessary environment variables (like HF_TOKEN), and launch the vLLM serving engine.

Table of Contents

  1. Prerequisites
  2. Step 1: Preparing the Configuration File
  3. Step 2: Applying the Configuration
  4. Step 3: Verifying the Deployment

Prerequisites

  • A Kubernetes environment with GPU support.
  • Helm installed on the machine you deploy from.
  • A valid Hugging Face token (HF_TOKEN); gated models such as meta-llama/Llama-3.1-8B-Instruct require accepting the model license on Hugging Face first.
  • Familiarity with the 01-minimal-helm-installation tutorial is helpful but not required.

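A quick way to sanity-check these prerequisites (optional; the grep assumes your cluster uses the NVIDIA device plugin, which exposes GPUs as the nvidia.com/gpu resource):

sudo kubectl get nodes
sudo kubectl describe nodes | grep -i "nvidia.com/gpu"
sudo helm version
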
Step 1: Preparing the Configuration File

  1. Locate the example configuration file tutorials/assets/values-02-basic-config.yaml.
  2. Open the file and update the following fields:
    • Replace <USERS SHOULD PUT THEIR HF_TOKEN HERE> with your actual Hugging Face token; this value is passed to the serving engine as the HF_TOKEN environment variable (see the optional snippet after this list).

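If you prefer not to edit the file by hand, a substitution along these lines also works (a sketch; it assumes your token is stored in the HF_TOKEN environment variable and that the placeholder text matches your copy of the file exactly):

export HF_TOKEN=hf_xxxxxxxxxxxx   # placeholder; use your real token
sed -i "s|<USERS SHOULD PUT THEIR HF_TOKEN HERE>|${HF_TOKEN}|g" tutorials/assets/values-02-basic-config.yaml
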
Explanation of Key Items in values-02-basic-config.yaml

  • name: an identifier for this model deployment; it is used in the names of the generated Kubernetes resources (for example, llmstack-llama3-deployment-vllm).
  • repository and tag: the Docker image of the serving engine, here vllm/vllm-openai:latest.
  • modelURL: the Hugging Face model to serve, here meta-llama/Llama-3.1-8B-Instruct.
  • replicaCount: the number of serving engine replicas.
  • requestCPU, requestMemory, requestGPU: the CPU, memory, and GPU resources requested for each replica.
  • pvcStorage: the size of the persistent volume claim used to store the downloaded model.
  • vllmConfig: options passed to the vLLM engine, such as chunked prefill, prefix caching, maximum model length, data type, and extra command-line arguments.
  • env: environment variables for the serving engine container; this is where HF_TOKEN is set.

Example Snippet

servingEngineSpec:
  modelSpec:
  - name: "llama3"
    repository: "vllm/vllm-openai"
    tag: "latest"
    modelURL: "meta-llama/Llama-3.1-8B-Instruct"
    replicaCount: 1

    requestCPU: 10
    requestMemory: "16Gi"
    requestGPU: 1

    pvcStorage: "50Gi"

    vllmConfig:
      enableChunkedPrefill: false
      enablePrefixCaching: false
      maxModelLen: 16384
      dtype: "bfloat16"
      extraArgs: ["--disable-log-requests", "--gpu-memory-utilization", "0.8"]

    env:
      - name: HF_TOKEN
        value: <YOUR_HF_TOKEN>

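For intuition, the vllmConfig block above corresponds roughly to the vLLM command line sketched below; the exact flag mapping is handled by the Helm chart, so treat this only as an illustration.

# Rough equivalent of the vllmConfig above (sketch only; the chart builds the real command).
# enableChunkedPrefill and enablePrefixCaching are false, so no corresponding flags are passed.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 16384 \
  --dtype bfloat16 \
  --disable-log-requests \
  --gpu-memory-utilization 0.8
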
Step 2: Applying the Configuration

  1. Deploy the configuration using Helm:
sudo helm repo add llmstack-repo https://lmcache.github.io/helm/
sudo helm install llmstack llmstack-repo/vllm-stack -f tutorials/assets/values-02-basic-config.yaml

Expected Output

You should see output indicating the successful deployment of the Helm chart:

Release "llmstack" has been deployed. Happy Helming!
NAME: llmstack
LAST DEPLOYED: <timestamp>
NAMESPACE: default
STATUS: deployed
REVISION: 1
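
You can re-check the release at any time with the usual Helm commands:

sudo helm status llmstack
sudo helm list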

Step 3: Verifying the Deployment

  1. Check the status of the pods:
sudo kubectl get pods

Expected Output

You should see the following pods:

NAME                                            READY   STATUS    RESTARTS   AGE
llmstack-deployment-router-xxxx-xxxx            1/1     Running   0          3m23s
llmstack-llama3-deployment-vllm-xxxx-xxxx       1/1     Running   0          3m23s
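
If a pod stays in Pending or keeps restarting, the standard kubectl inspection commands help narrow down the cause (use the pod name from your own output):

sudo kubectl describe pod llmstack-llama3-deployment-vllm-xxxx-xxxx
sudo kubectl logs llmstack-llama3-deployment-vllm-xxxx-xxxx
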
  2. Verify that the services are exposed correctly:
sudo kubectl get services

Expected Output

Ensure there are services for both the serving engine and the router:

NAME                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
llmstack-engine-service   ClusterIP   10.103.98.170    <none>        80/TCP    4m
llmstack-router-service   ClusterIP   10.103.110.107   <none>        80/TCP    4m

  3. Test the health endpoint:
curl http://<SERVICE_IP>/health

Replace <SERVICE_IP> with the CLUSTER-IP of llmstack-router-service from the previous step. Both services are of type ClusterIP, so there is no external IP; run the command from a machine that can reach cluster IPs, or use the port-forward shown below. If everything is configured correctly, you will get:

{"status":"healthy"}

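If you are working from a machine that cannot reach cluster IPs directly, port-forwarding the router service is a convenient alternative (the local port 30080 below is an arbitrary choice):

sudo kubectl port-forward svc/llmstack-router-service 30080:80
curl http://localhost:30080/health
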
Please refer to Step 3 in the 01-minimal-helm-installation tutorial for querying the deployed vLLM service.
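
As a quick preview, listing the models served through the router looks roughly like this (assumes the port-forward from the previous snippet is still running; the router exposes an OpenAI-compatible API):

curl http://localhost:30080/v1/models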

Conclusion

In this tutorial, you configured and deployed a vLLM serving engine with GPU support in a Kubernetes environment. You also learned how to verify its deployment and ensure it is running as expected. For further customization, refer to the values.yaml file and Helm chart documentation.