vLLM’s Open Governance and Performance Roadmap
We would like to share two updates to the vLLM community.
Future of vLLM is Open
We are excited to see vLLM is becoming the standard for LLM inference and serving. In the recent Meta Llama 3.1 announcement, 8 out of 10 official partners for real time inference run vLLM as the serving engine for the Llama 3.1 models. We have also heard anecdotally that vLLM is being used in many of the AI features in our daily life.
We believe vLLM’s success comes from the power of the strong open source community. vLLM is actively maintai ned by a consortium of groups such as UC Berkeley, Anyscale, AWS, CentML, Databricks, IBM, Neural Magic, Roblox, Snowflake, and others. To this extent, we want to ensure the ownership and governance of the project is open an d transparent as well.
We are excited to announce that vLLM has started the incubation process into LF AI & Data Foundation. This means no one party will have exclusive control over the future of vLLM. The license and trademark will be irrevocably open. You can trust vLLM is here to stay and will be actively maintained and improved going forward.
Performance is top priority
The vLLM contributors are doubling down to ensure vLLM is a fastest and easiest-to-use LLM inference and serving engine.
To recall our roadmap, we focus vLLM on six objectives: wide model coverage, broad hardware support, top performance, production-ready, thriving open source community, and extensible architecture.
In our objective for performance optimization, we have made the following progress to date:
- Publication of benchmarks
- Published per-commit performance tracker at perf.vllm.ai on our public benchmarks. The goal of this is to track performance enhancement and regressions.
- Published reproducible benchmark (docs) of vLLM compared to LMDeploy, TGI, and TensorRT-LLM. The goal is to identify gaps in performance and close them.
- Development and integration of highly optimized kernels
- Integrated FlashAttention2 with PagedAttention, and FlashInfer. We plan to integrate FlashAttention3.
- Integrating Flux which overlaps computation and collective communication.
- Developed state of the art kernels for quantized inference, including INT8 and FP8 activation quantization (via cutlass) and INT4, INT8, and FP8 weight-only quantization for GPTQ and AWQ (via marlin).
- Started several work streams to lower critical overhead
- We identified vLLM’s synchronous and blocking scheduler is a key bottleneck for models running on fast GPUs (H100s). We are working on making the schedule asynchronous and plan steps ahead of time.
- We identified vLLM’s OpenAI-compatible API frontend has higher than desired overhead. We are working on isolating it from the critical path of scheduler and model inference.
- We identified vLLM’s input preparation, and output processing scale suboptimally with the data size. Many of the operations can be vectorized and enhanced by moving them off the critical path.
We will continue to update the community in vLLM’s progress in closing the performance gap. You can track our overall progress here. Please continue to suggest new ideas and contribute with your improvements!
More Resources
We would like to highlight the following RPCs being actively developed
- Single Program Multiple Data (SPMD) Worker Control Plane reduces complexity and enhances performance of tensor parallel performance.
- A Graph Optimization System in vLLM using torch.compile brings in PyTorch native compilation workflow for kernel fusion and compilation.
- Implement disaggregated prefilling via KV cache transfer is critical for workload with long input and lowers variance in inter-token latency.
There is a thriving research community building their research projects on top of vLLM. We are deeply humbled by the impressive works and would love to collaborate and integrate. The list of papers includes but is not limited to:
- Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
- Llumnix: Dynamic Scheduling for Large Language Model Serving
- CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving
- vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention
- Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services
- SGLang: Efficient Execution of Structured Language Model Programs