Deploy with vLLM
Once you've compressed your model using LLM Compressor, you can deploy it for efficient inference using vLLM. This guide walks you through the deployment process, using the output from the Compress Your Model guide. If you haven't completed that step, change the model arguments in the code snippets below to point to your desired model.
vLLM is a high-performance inference engine designed for large language models, providing support for various quantization formats and optimized for both single and multi-GPU setups. It also offers an OpenAI-compatible API for easy integration with existing applications.
Prerequisites
Before deploying your model, ensure you have the following prerequisites:

- Operating System: Linux (recommended for GPU support)
- Python Version: 3.9 or newer
- Available GPU: For optimal performance, a GPU is recommended. vLLM supports a range of accelerators, including NVIDIA GPUs, AMD GPUs, and TPUs.
- vLLM Installed: Ensure you have vLLM installed. You can install it using pip:
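```bash
pip install vllm
```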
Python API
vLLM provides a Python API for easy integration with your applications, enabling you to load and use your compressed model directly in your Python code. To test the compressed model, use the following code:
```python
from vllm import LLM, SamplingParams

# Load the compressed model produced in the Compress Your Model guide
model = LLM("./TinyLlama-1.1B-Chat-v1.0-INT8")

# Generate up to 256 new tokens for a single prompt and print the text
outputs = model.generate("What is machine learning?", SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)
```
After running the above code, you should see the generated output from your compressed model. This confirms that the model is loaded and ready for inference.
HTTP Server
vLLM also provides an HTTP server for serving your model via a RESTful API that is compatible with OpenAI's API definitions. This allows you to easily integrate your model into existing applications or services. To start the HTTP server, use the following command:
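A minimal invocation, assuming the compressed model directory from the Python example above, uses the `vllm serve` entrypoint; the `--served-model-name` flag sets the name clients reference in their requests, matching the `curl` example below:

```bash
vllm serve ./TinyLlama-1.1B-Chat-v1.0-INT8 \
    --served-model-name TinyLlama-1.1B-Chat-v1.0-INT8
```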
By default, the server will run on `localhost:8000`. You can change the host and port by using the `--host` and `--port` flags, as shown below.
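For example, a sketch of a setup that binds the server to all interfaces on port 8080 might look like this (the rest of this guide assumes the default `localhost:8000`):

```bash
vllm serve ./TinyLlama-1.1B-Chat-v1.0-INT8 \
    --served-model-name TinyLlama-1.1B-Chat-v1.0-INT8 \
    --host 0.0.0.0 \
    --port 8080
```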
Now that the server is running, you can send requests to it using any HTTP client. For example, you can use `curl` to send a request:
```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TinyLlama-1.1B-Chat-v1.0-INT8",
    "messages": [{"role": "user", "content": "What is machine learning?"}],
    "max_tokens": 256
  }'
```
This will return a JSON response with the generated text from your model. You can also use any HTTP client library in your programming language of choice to send requests to the server.
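As a sketch of that approach, the same request can be made with the official `openai` Python client pointed at the local server. This assumes the `openai` package is installed; the API key is a placeholder, since the server only enforces authentication if it is started with `--api-key`:

```python
from openai import OpenAI

# Point the client at the local vLLM server; the key is a placeholder
# because the server does not check it unless --api-key is configured.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="TinyLlama-1.1B-Chat-v1.0-INT8",
    messages=[{"role": "user", "content": "What is machine learning?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```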