This tutorial guides you through the basic configurations required to deploy a vLLM serving engine in a Kubernetes environment with GPU support. You will learn how to specify the model details, set up necessary environment variables (like `HF_TOKEN`), and launch the vLLM serving engine.
Locate the example configuration file `tutorials/assets/values-02-basic-config.yaml` and replace `<USERS SHOULD PUT THEIR HF_TOKEN HERE>` with your actual Hugging Face token (`HF_TOKEN`).

Key fields in `values-02-basic-config.yaml`:
- `name`: The unique identifier for your model deployment.
- `repository`: The Docker repository containing the model's serving engine image.
- `tag`: Specifies the version of the model image to use.
- `modelURL`: The URL pointing to the model on Hugging Face or another hosting service.
- `replicaCount`: The number of replicas for the deployment, allowing scaling for load.
- `requestCPU`: The amount of CPU resources requested per replica.
- `requestMemory`: Memory allocation for the deployment; sufficient memory is required to load the model.
- `requestGPU`: Specifies the number of GPUs to allocate for the deployment.
- `pvcStorage`: Defines the Persistent Volume Claim size for model storage.
- `vllmConfig`: Contains model-specific configurations:
  - `enableChunkedPrefill`: Splits long prompt prefills into smaller chunks that can be batched alongside decode requests, smoothing latency under load.
  - `enablePrefixCaching`: Speeds up response times for queries that share a common prefix by reusing cached KV blocks.
  - `maxModelLen`: The maximum sequence length the model can handle.
  - `dtype`: Data type for computations, e.g., `bfloat16` for faster performance on modern GPUs.
  - `extraArgs`: Additional arguments passed to the vLLM engine for fine-tuning behavior.
- `env`: Environment variables such as `HF_TOKEN` for authentication with Hugging Face.

Example snippet from `values-02-basic-config.yaml`:

```yaml
servingEngineSpec:
  modelSpec:
  - name: "llama3"
    repository: "vllm/vllm-openai"
    tag: "latest"
    modelURL: "meta-llama/Llama-3.1-8B-Instruct"
    replicaCount: 1
    requestCPU: 10
    requestMemory: "16Gi"
    requestGPU: 1
    pvcStorage: "50Gi"
    vllmConfig:
      enableChunkedPrefill: false
      enablePrefixCaching: false
      maxModelLen: 16384
      dtype: "bfloat16"
      extraArgs: ["--disable-log-requests", "--gpu-memory-utilization", "0.8"]
    env:
      - name: HF_TOKEN
        value: <YOUR_HF_TOKEN>
```
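To avoid committing a real token to the values file, one option is to keep the token in an environment variable and substitute it into a working copy before installing; a minimal sketch using `sed` (the placeholder string should match the one in your copy of the file):

```bash
# Keep the token in the shell environment instead of the tracked YAML file
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx

# Produce a working copy of the values file with the placeholder filled in
sed "s|<YOUR_HF_TOKEN>|${HF_TOKEN}|" tutorials/assets/values-02-basic-config.yaml \
  > /tmp/values-02-basic-config.yaml
```

If you use this approach, point the `helm install` command in the next step at the generated file instead.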
Deploy the configuration using Helm:

```bash
sudo helm repo add llmstack-repo https://lmcache.github.io/helm/
sudo helm install llmstack llmstack-repo/vllm-stack -f tutorials/assets/values-02-basic-config.yaml
```
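If you later edit the values file, the release can be updated in place rather than reinstalled; a short sketch using standard Helm commands (same release name and file path as above):

```bash
# Re-apply the chart with updated values
sudo helm upgrade llmstack llmstack-repo/vllm-stack -f tutorials/assets/values-02-basic-config.yaml

# Roll back to the previous revision if the new configuration misbehaves
sudo helm rollback llmstack
```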
After the install command completes, you should see output indicating the successful deployment of the Helm chart:

```
Release "llmstack" has been deployed. Happy Helming!
NAME: llmstack
LAST DEPLOYED: <timestamp>
NAMESPACE: default
STATUS: deployed
REVISION: 1
```
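The release can be inspected again at any point with standard Helm commands (release name as above); for example:

```bash
# List Helm releases in the current namespace with their status and revision
sudo helm list

# Show detailed status for this release
sudo helm status llmstack
```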
Check the status of the pods:

```bash
sudo kubectl get pods
```

You should see the following pods:

```
NAME                                        READY   STATUS    RESTARTS   AGE
llmstack-deployment-router-xxxx-xxxx        1/1     Running   0          3m23s
llmstack-llama3-deployment-vllm-xxxx-xxxx   1/1     Running   0          3m23s
```
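If either pod stays in `Pending` or `CrashLoopBackOff` (for example, because no GPU can be scheduled or the Hugging Face token is rejected), standard kubectl inspection commands show the reason; the pod name below is a placeholder copied from the output above:

```bash
# Show scheduling events and resource requests for the serving engine pod
sudo kubectl describe pod llmstack-llama3-deployment-vllm-xxxx-xxxx

# Follow the serving engine's logs to watch the model download and startup
sudo kubectl logs -f llmstack-llama3-deployment-vllm-xxxx-xxxx
```

Once both pods report `Running`, their roles are as follows: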
- The `llmstack-deployment-router` pod acts as the router, managing requests and routing them to the appropriate model-serving pod.
- The `llmstack-llama3-deployment-vllm` pod serves the actual model for inference.

Verify that the services are up and running:

```bash
sudo kubectl get services
```
Ensure there are services for both the serving engine and the router:

```
NAME                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
llmstack-engine-service   ClusterIP   10.103.98.170    <none>        80/TCP    4m
llmstack-router-service   ClusterIP   10.103.110.107   <none>        80/TCP    4m
```
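If you prefer not to copy the IP by hand, the router service's cluster IP can be captured with a jsonpath query; a convenience sketch assuming the service names shown above:

```bash
# Store the router service's cluster IP for use in the commands below
SERVICE_IP=$(sudo kubectl get svc llmstack-router-service -o jsonpath='{.spec.clusterIP}')
echo "${SERVICE_IP}"
```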
- `llmstack-engine-service` exposes the serving engine.
- `llmstack-router-service` handles routing and load balancing across model-serving pods.

Test the health endpoint:

```bash
curl http://<SERVICE_IP>/health
```
Replace `<SERVICE_IP>` with the `CLUSTER-IP` of the service you want to check from the output above; both services are `ClusterIP`, so the request must come from inside the cluster or through a port-forward (see the sketch below). If everything is configured correctly, you will get:

```
{"status":"healthy"}
```
Please refer to Step 3 in the 01-minimal-helm-installation tutorial for querying the deployed vLLM service.
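As a quick reference, a request in the spirit of that step looks like the sketch below; the path follows the OpenAI-compatible API that vLLM serves, and the model name matches the `modelURL` configured above (replace `<SERVICE_IP>` as before):

```bash
curl -X POST http://<SERVICE_IP>/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": "Once upon a time,",
        "max_tokens": 10
      }'
```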
In this tutorial, you configured and deployed a vLLM serving engine with GPU support in a Kubernetes environment. You also learned how to verify its deployment and ensure it is running as expected. For further customization, refer to the `values.yaml` file and the Helm chart documentation.