Notes on vLLM vs. DeepSpeed-FastGen
TL;DR:
- vLLM matches DeepSpeed-FastGen’s speed in common scenarios and surpasses it when handling longer outputs.
- DeepSpeed-FastGen only outperforms vLLM in scenarios with long prompts and short outputs, due to its Dynamic SplitFuse optimization. This optimization is on vLLM’s roadmap.
- vLLM’s mission is to build the fastest and easiest-to-use open-source LLM inference and serving engine. It is Apache 2.0 and community-owned, offering extensive model and optimization support.
The DeepSpeed team recently published a blog post claiming a 2x throughput improvement over vLLM, achieved by leveraging the Dynamic SplitFuse technique. We are happy to see technological advances from the open-source community. In this blog, we show the specific scenarios where the Dynamic SplitFuse technique is advantageous, noting that these cases are relatively limited. For the majority of workloads, vLLM is faster than (or performs comparably to) DeepSpeed-FastGen.
Performance Benchmark
We’ve identified two key differences between vLLM and DeepSpeed-FastGen in terms of performance optimization:
- DeepSpeed-FastGen adopts a conservative/suboptimal memory allocation scheme, which wastes memory when outputs are long (see the back-of-the-envelope calculation after this list).
- DeepSpeed-FastGen’s Dynamic SplitFuse scheduling gives a speedup only when prompt lengths are much greater than output lengths (the toy scheduler sketch further below illustrates why).
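To make the first point concrete, here is a rough back-of-the-envelope calculation of how much KV-cache memory a conservative allocator can strand per sequence for LLaMA-7B in fp16. The 2,048-token reservation below is a hypothetical example for illustration, not DeepSpeed-FastGen's actual setting.

```python
# Illustrative KV-cache arithmetic for LLaMA-7B (fp16); the reservation size
# is a hypothetical example, not DeepSpeed-FastGen's actual configuration.
num_layers = 32
hidden_size = 4096       # 32 heads x head_dim 128
bytes_per_value = 2      # fp16

# K and V per token, summed over all layers: ~512 KiB per token.
kv_bytes_per_token = 2 * num_layers * hidden_size * bytes_per_value

reserved_tokens = 2048   # output slots pre-reserved up front (hypothetical)
actual_output = 128      # tokens actually generated

wasted = (reserved_tokens - actual_output) * kv_bytes_per_token
print(f"KV cache stranded per sequence: {wasted / 2**30:.2f} GiB")  # ~0.94 GiB
```

Memory stranded this way cannot hold other sequences' KV cache, which directly shrinks the achievable batch size; vLLM's block-based PagedAttention allocation avoids most of this waste.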
As a result, DeepSpeed-FastGen only outperforms vLLM when the workload consists of consistently long prompts and short outputs. In other scenarios, vLLM shows superior performance.
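For readers unfamiliar with Dynamic SplitFuse, the core idea (as DeepSpeed describes it) is to split long prompts into chunks and fuse them with ongoing decode tokens so that every forward pass processes a roughly constant number of tokens. The toy scheduler below is our own illustrative sketch of that idea, not code from either system; the token budget and data structures are made up for the example.

```python
# Toy sketch of Dynamic SplitFuse-style scheduling: fill a fixed per-step
# token budget with one decode token per running sequence, then with chunks
# of waiting prompts. Illustrative only; not DeepSpeed-FastGen's code.
from collections import deque
from dataclasses import dataclass

TOKEN_BUDGET = 512  # hypothetical per-forward-pass token budget


@dataclass
class Sequence:
    prompt_len: int     # total prompt tokens
    prefilled: int = 0  # prompt tokens already processed


def schedule_step(waiting: deque, running: list) -> list:
    """Pick (sequence, num_tokens) pairs whose token counts sum to <= TOKEN_BUDGET."""
    batch, budget = [], TOKEN_BUDGET

    # Each decoding sequence contributes exactly one token ("fuse").
    for seq in running:
        if budget == 0:
            break
        batch.append((seq, 1))
        budget -= 1

    # Fill the remaining budget with chunks of waiting prompts ("split").
    while waiting and budget > 0:
        seq = waiting[0]
        chunk = min(seq.prompt_len - seq.prefilled, budget)
        batch.append((seq, chunk))
        seq.prefilled += chunk
        budget -= chunk
        if seq.prefilled == seq.prompt_len:
            running.append(waiting.popleft())

    return batch
```

With this policy, a workload of long prompts and short outputs keeps every step close to the token budget, which is where the technique pays off; once outputs dominate, steps consist mostly of single decode tokens and the chunking adds little, consistent with the behavior we observe below.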
We benchmarked the two systems on an NVIDIA A100-80GB GPU with the LLaMA-7B model in the following scenarios:
Scenario 1: Long Prompt Length, Short Output
Here, DeepSpeed-FastGen’s Dynamic SplitFuse scheduling is expected to shine. However, the performance gain we observe is not as significant as the claimed 2x.
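As a rough, reproducible stand-in for this scenario, the sketch below drives vLLM's offline Python API directly. The model name, prompt construction, and request count are illustrative placeholders rather than our exact benchmark configuration; the scripts we actually used are linked at the end of this post.

```python
# A minimal throughput probe for the long-prompt / short-output scenario,
# using vLLM's offline Python API. Model name, prompt shape, and request
# count are illustrative placeholders, not the exact benchmark setup.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="huggyllama/llama-7b")  # any LLaMA-7B checkpoint works here

# Roughly 1,500-token prompts paired with short, fixed-length outputs.
prompts = ["Tell me a story. " * 300 for _ in range(256)]
params = SamplingParams(temperature=0.0, max_tokens=60, ignore_eos=True)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"Output throughput: {generated / elapsed:.1f} tokens/s")
```

The same harness covers the other scenarios by varying the prompt length and `max_tokens`.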
Scenario 2: Other cases
In these cases, vLLM is up to 1.8x faster than DeepSpeed-FastGen.
vLLM’s Future: A True Community Project
We are committed to making vLLM the best open-source project, incorporating the community’s best models, optimizations, and hardware support. Coming out of the UC Berkeley Sky Computing Lab, we are building vLLM truly in the open under the Apache 2.0 license.
The vLLM team prioritizes collaborations, and we strive to keep the codebase high-quality and easy to contribute to. We are actively working on system performance, as well as new features like LoRA, speculative decoding, and better quantization support. Additionally, we are collaborating with hardware vendors like AMD, AWS Inferentia, and Intel Habana to bring LLM inference to the broadest community.
Specifically for the Dynamic SplitFuse optimization, we are actively investigating the proper way to integrate it. If you have any questions or suggestions, please feel free to contact us on GitHub. We also published the benchmark code here.
Appendix: Feature Comparison
DeepSpeed-FastGen currently offers basic functionality, supporting only three model types and lacking popular features like stop strings and parallel sampling (e.g., beam search). We expect DeepSpeed-FastGen to catch up quickly, and we welcome creative innovation in the field!
|  | vLLM | DeepSpeed-FastGen |
|---|---|---|
| Runtime | Python/PyTorch | Python/PyTorch |
| Model implementation | HuggingFace Transformers | Custom implementation + converter for HF models |
| Server frontend | Simple FastAPI server for demo purposes | Custom gRPC-based server |
| Scheduling | Continuous batching | Dynamic SplitFuse |
| Attention kernel | PagedAttention & FlashAttention | PagedAttention & FlashAttention |
| Custom kernels (for LLaMA) | Attention, RoPE, RMS, SILU | Attention, RoPE, RMS, SILU, Embedding |
| KV cache allocation | Near-optimal | Suboptimal/conservative |
| Supported models | 16 different architectures | LLaMA, Mistral, OPT |
| Sampling methods | Random, parallel, beam search | Random |
| Stop criterion | Stop strings, stop tokens, EOS | EOS |