Speculative decoding in vLLM is a powerful technique that accelerates token generation by leveraging both small and large models in tandem. In this blog, we’ll break down speculative decoding in vLLM, how it works, and the performance improvements it brings.

This content is based on a session from our bi-weekly vLLM Office Hours, where we discuss techniques and updates to optimize vLLM performance. You can view the session slides here. If you prefer watching, you can view the full recording on YouTube. We’d love to see you attend future sessions - please register!

An Introduction to Speculative Decoding

Speculative decoding is a key technique in reducing latency during token generation in large language models (LLMs). This approach leverages smaller models to handle simpler token predictions while utilizing larger models to verify or adjust those predictions. By doing this, speculative decoding accelerates generation without sacrificing accuracy, making it a lossless yet highly efficient method for optimizing LLM performance.

Traditionally, LLMs generate tokens one at a time in an autoregressive manner. For example, given a prompt, the model generates T1, then T2, then T3, and so on, each requiring a separate forward pass. Speculative decoding transforms this process by allowing multiple tokens to be proposed and verified in one forward pass.

Here’s how the process works:

  1. Draft Model: A smaller, more efficient model proposes candidate tokens, such as T1', T2', and T3'.
  2. Target Model Verification: The larger model verifies these proposals in a single forward pass. It accepts the correct ones (T1' and T2' become T1 and T2) and corrects any incorrect ones (replacing T3' with T3).
  3. Multiple Tokens in One Pass: Instead of generating one token per pass, this method processes multiple tokens simultaneously, reducing latency.

By using this approach, speculative decoding speeds up token generation, making it an effective method for both small-scale and large-scale language model deployments.
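
To make the flow concrete, here is a minimal, framework-agnostic sketch of the propose-and-verify loop with greedy acceptance. It is purely illustrative, not vLLM's implementation: `draft_next` and `target_next` are hypothetical stand-ins that map a token sequence to the next token each model would pick:

# Illustrative sketch only, not vLLM code. `draft_next` and `target_next` are
# toy stand-ins for the two models.
def speculative_generate(prompt, draft_next, target_next, k=3, max_new_tokens=12):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The draft model proposes k tokens autoregressively (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))

        # 2. The target model checks every proposed position. In a real engine this
        #    is a single batched forward pass; here it is a plain loop for clarity.
        accepted = []
        for i in range(k):
            expected = target_next(tokens + proposal[:i])
            if expected == proposal[i]:
                accepted.append(proposal[i])   # draft was right: keep the token
            else:
                accepted.append(expected)      # correct the first mismatch...
                break                          # ...and discard the rest
        tokens.extend(accepted)
    return tokens

# Toy usage: the "models" just count upward, so the draft is right most of the time.
draft = lambda seq: seq[-1] + 1
target = lambda seq: seq[-1] + 1 if len(seq) % 5 else seq[-1] + 2
print(speculative_generate([0], draft, target, k=3))

Whenever the draft agrees with the target, several tokens land in a single verification step; whenever it disagrees, only the corrected token is kept, which is why the result matches what the target model would have produced on its own.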

How Speculative Decoding Works in vLLM

In vLLM, speculative decoding is integrated with the system’s continuous batching architecture, where different requests are processed together in a single batch, enabling higher throughput. vLLM uses two key components to implement this:

  • Draft Runner: This runner is responsible for executing the smaller model to propose candidate tokens.
  • Target Runner: The target runner verifies the tokens by running the larger model.

vLLM’s system is optimized to handle this process efficiently, allowing speculative decoding to work seamlessly with continuous batching, which increases the overall system performance.


Diagram illustrating how the draft and target runners interact within the vLLM batching system.

To implement speculative decoding in vLLM, two crucial components had to be modified:

  1. Scheduler: The scheduler was adjusted to handle multiple token slots within a single forward pass, enabling the simultaneous generation and verification of several tokens.
  2. Memory Manager: The memory manager now handles the KV cache for both the draft and target models, ensuring smooth processing during speculative decoding.

Types of Speculative Decoding Supported in vLLM

vLLM supports three types of speculative decoding, each tailored to different workloads and performance needs:

Draft Model-Based Speculative Decoding

This is the most commonly used form of speculative decoding, where a smaller model predicts the next tokens, and a larger model verifies them. A common example would be using a Llama 68M model to predict tokens for a Llama 2 70B model. This approach requires careful selection of the draft model to balance accuracy and overhead.

Choosing the correct draft model is essential for maximizing the efficiency of speculative decoding. The draft model needs to be small enough to avoid creating significant overhead but still accurate enough to provide a meaningful performance boost.

However, selecting the right draft model can be challenging. For example, in models like Llama 3, finding a suitable draft model is difficult due to differences in vocabulary size. Speculative decoding requires that the draft and target models share the same vocabulary, and in some cases, this can limit the use of speculative decoding.
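
A quick way to check compatibility before committing to a draft model is to compare the two tokenizers' vocabularies. The snippet below is a hedged example: it assumes the Hugging Face transformers package, the model names are illustrative, and gated repositories may require authentication:

# Illustrative check that a candidate draft model shares the target's vocabulary.
# Requires `pip install transformers`; model names are examples only, and gated
# models (e.g. Llama 2) may require a Hugging Face access token.
from transformers import AutoTokenizer

target_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")
draft_tok = AutoTokenizer.from_pretrained("JackFram/llama-68m")

same_vocab = target_tok.get_vocab() == draft_tok.get_vocab()
print(f"target vocab: {len(target_tok)}, draft vocab: {len(draft_tok)}, compatible: {same_vocab}")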

Prompt Lookup Decoding

Also known as n-gram matching, this approach is effective for use cases like summarization and question-answering, where there is significant overlap between the prompt and the answer. Instead of using a small model to propose tokens, the system speculates based on the information already available in the prompt. This works particularly well when the large model repeats parts of the prompt in its answers.
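
Conceptually, the proposer looks for the most recent earlier occurrence of the context's trailing n-gram and copies the tokens that followed it. Below is a minimal illustrative sketch of that idea, not vLLM's actual lookup implementation:

# Minimal sketch of prompt-lookup (n-gram) proposal: find the last `n` tokens
# earlier in the context and propose the tokens that followed that match.
# Illustrative only; vLLM's implementation operates on token IDs and is more general.
def ngram_propose(context, n=3, num_speculative_tokens=5):
    if len(context) < n:
        return []
    pattern = context[-n:]
    # Search backwards for the most recent earlier occurrence of the pattern.
    for start in range(len(context) - n - 1, -1, -1):
        if context[start:start + n] == pattern:
            return context[start + n : start + n + num_speculative_tokens]
    return []  # no match: fall back to normal decoding for this step

# Toy usage on words instead of token IDs: the answer repeats part of the prompt.
ctx = "the quick brown fox jumps over the lazy dog . the quick brown".split()
print(ngram_propose(ctx, n=3))  # -> ['fox', 'jumps', 'over', 'the', 'lazy']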

Medusa/Eagle/MLPSpeculator


Picture from https://github.com/FasterDecoding/Medusa

In this method, additional layers (or heads) are added to the large model itself, allowing it to predict multiple tokens in a single forward pass. This reduces the need for a separate draft model, instead leveraging the large model’s own capacity for parallel token generation. Though preliminary, this method shows promise for improving efficiency as more optimized kernels are developed.
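
As a rough illustration of the idea (a toy PyTorch sketch, not the actual Medusa, EAGLE, or MLPSpeculator architectures), extra heads read the same final hidden state and each one predicts a token further ahead, so one forward pass yields several proposals:

# Toy sketch of "extra heads" speculation; sizes are illustrative, not real models.
import torch
import torch.nn as nn

hidden_size, vocab_size, num_heads = 64, 1000, 3

# Stand-in for the large model's final hidden state at the last position.
hidden_state = torch.randn(1, hidden_size)

# The model's usual LM head predicts the next token...
lm_head = nn.Linear(hidden_size, vocab_size)
# ...and each extra speculative head predicts the token one position further out.
spec_heads = nn.ModuleList(
    nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.SiLU(), nn.Linear(hidden_size, vocab_size))
    for _ in range(num_heads)
)

next_token = lm_head(hidden_state).argmax(dim=-1)
speculated = [head(hidden_state).argmax(dim=-1) for head in spec_heads]
print("proposed in one pass:", [next_token.item()] + [t.item() for t in speculated])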

Rejection Sampler and Speculative Decoding Worker

vLLM implements a rejection sampler as part of its speculative decoding framework. The sampler helps finalize which tokens are accepted and which are rejected, refining the overall accuracy of the process. Additionally, vLLM uses a speculative decoding worker to manage both the draft model and the target model’s token proposals and verifications, ensuring smooth operations during speculative decoding.
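
The standard acceptance rule from the speculative sampling literature, which is what keeps the process lossless, accepts a drafted token x with probability min(1, p_target(x) / p_draft(x)) and otherwise resamples from the residual distribution max(0, p_target - p_draft). Here is a minimal NumPy sketch for a single position (illustrative only, not vLLM's batched rejection sampler):

# Illustrative single-position rejection rule; not vLLM's batched implementation.
# p_target and p_draft are the two models' probability vectors over the vocabulary.
import numpy as np

def accept_or_resample(p_target, p_draft, drafted_token, rng=None):
    rng = rng or np.random.default_rng()
    # Accept the drafted token with probability min(1, p_target / p_draft).
    accept_prob = min(1.0, p_target[drafted_token] / p_draft[drafted_token])
    if rng.random() < accept_prob:
        return drafted_token, True
    # Otherwise resample from the residual distribution max(0, p_target - p_draft),
    # which keeps the overall output distribution identical to the target model's.
    residual = np.clip(p_target - p_draft, 0.0, None)
    residual /= residual.sum()
    return rng.choice(len(p_target), p=residual), False

# Toy usage with a 4-token vocabulary.
p_t = np.array([0.1, 0.6, 0.2, 0.1])
p_d = np.array([0.25, 0.25, 0.25, 0.25])
print(accept_or_resample(p_t, p_d, drafted_token=1))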

Speculative Decoding Performance Insights: Speedups and Trade-offs

Speculative decoding offers significant performance benefits in low-QPS (queries per second) environments. For example, in testing on the ShareGPT dataset, vLLM demonstrated up to a 1.5x speedup in token generation when using draft model-based speculative decoding. Similarly, prompt lookup decoding has shown speedups of up to 2.8x when applied to summarization datasets, such as CNN/DailyMail.

Performance comparison: at QPS=1 on 4xH100, draft-model speculative decoding delivers up to a 1.5x speedup for Llama3-70B on ShareGPT, and n-gram speculation delivers up to a 2.8x speedup for Llama3-70B on CNN/DailyMail.

However, in high-QPS environments, speculative decoding may introduce performance trade-offs. The extra compute required to propose and verify tokens can sometimes slow down the system when it is already compute-bound, as seen when the number of requests per second increases. In such cases, the overhead of speculative decoding can outweigh its benefits, leading to reduced performance.


At high QPS, we see a 1.4x slowdown for Llama3-70B on ShareGPT with 4xH100 and a 1.8x slowdown for Llama3-70B on CNN/DailyMail with 4xH100.

On the Roadmap: Dynamic Adjustments for Better Performance

To overcome the limitations of speculative decoding in high-QPS settings, vLLM is working on implementing dynamic speculative decoding, which is also one of the project's active research directions; see the paper for more detail. This feature will allow vLLM to adjust the number of speculative tokens based on system load and the accuracy of the draft model.

In the future, the system will be able to automatically modify the degree of speculation at each step, ensuring speculative decoding is always beneficial, regardless of the workload. This will allow users to activate speculative decoding without worrying about whether it will slow down their system.

How to Use Speculative Decoding in vLLM

Setting up speculative decoding in vLLM is straightforward. Whether you launch the vLLM server or use the offline LLM API, you simply specify the speculative model, the number of speculative tokens, and, if needed, the tensor parallel sizes.

The following code configures vLLM in an offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time:

from vllm import LLM

llm = LLM(
    model="facebook/opt-6.7b",
    speculative_model="facebook/opt-125m",
    num_speculative_tokens=5,
)
outputs = llm.generate("The future of AI is")

for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")

The following code configures vLLM to use speculative decoding where proposals are generated by matching n-grams in the prompt:

from vllm import LLM

llm = LLM(
    model="facebook/opt-6.7b",
    speculative_model="[ngram]",
    num_speculative_tokens=5,
    ngram_prompt_lookup_max=4,
    ngram_prompt_lookup_min=1,
)
outputs = llm.generate("The future of AI is")

for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")

At times, you may want the draft model to operate with a different tensor parallel size than the target model to improve efficiency. This lets the draft model use fewer resources with less communication overhead, leaving the more resource-intensive computation to the target model. In vLLM, you can configure the draft model to use a tensor parallel size of 1 while the target model uses a size of 4, as demonstrated in the example below.

from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    speculative_model="ibm-fms/llama3-70b-accelerator",
    speculative_draft_tensor_parallel_size=1,
)
outputs = llm.generate("The future of AI is")

for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")

For draft-model-based decoding, users specify the draft model and the number of tokens to speculate. vLLM also supports n-gram speculative decoding, where users only need to specify the number of tokens to speculate.

Future updates (paper, RFC) will allow vLLM to automatically choose the number of speculative tokens, removing the need for manual configuration and simplifying the process even further.

Follow our docs on Speculative Decoding in vLLM to get started. Join our bi-weekly office hours to ask questions and give feedback.

Conclusion: The Future of Speculative Decoding in vLLM

Speculative decoding in vLLM delivers substantial performance improvements, especially in low-QPS environments. As dynamic adjustments are introduced, it will become a highly effective tool even in high-QPS settings, making it a versatile and essential feature for reducing latency and increasing efficiency in LLM inference.

Interested in more advanced techniques for optimizing vLLM? Join us for our next vLLM Office Hours, where we explore new updates and features. Register here.