vLLM Now Supports gpt-oss
We’re thrilled to announce that vLLM now supports gpt-oss on NVIDIA Blackwell and Hopper GPUs, as well as AMD MI300x and MI355x GPUs. In this blog post, we’ll explore the efficient model architecture of gpt-oss and how vLLM supports it.
To quickly get started with gpt-oss, you can try our container:
docker run --gpus all \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:gptoss \
--model openai/gpt-oss-20b
or install it in your virtual environment:
uv pip install --pre vllm==0.10.1+gptoss \
--extra-index-url https://wheels.vllm.ai/gpt-oss/ \
--extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
--index-strategy unsafe-best-match
vllm serve openai/gpt-oss-120b
See the vLLM User Guide for more details.
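Once the server is up, you can query it through vLLM's OpenAI-compatible API. Below is a minimal sketch using the official openai Python client, assuming the server is running locally on the default port 8000; the API key is a placeholder unless you configured one.

from openai import OpenAI

# Point the client at the local vLLM server (default port 8000).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Explain MXFP4 quantization in one sentence."}],
)
print(response.choices[0].message.content)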
MXFP4 MoE
gpt-oss is a sparse MoE model with 128 experts (120B) or 32 experts (20B), where each token is routed to 4 experts (with no shared expert). For the MoE weights, it uses MXFP4, a novel group-quantized floating-point format, while it uses standard bfloat16 for attention and other layers. Since the MoE weights account for the majority of the model parameters, using MXFP4 for them alone reduces the model sizes to 63 GB (120B) and 14 GB (20B), making them runnable on a single GPU (though a single GPU is often not the best choice for peak performance)!
In MXFP4, each weight is represented as a 4-bit floating-point value (fp4 e2m1). Additionally, MXFP4 introduces a power-of-two scaling factor for each group of 32 consecutive fp4 values to represent a wide numerical range. On hardware, two fp4 values are packed into a single 8-bit unit in memory and unpacked on the fly within the matmul kernel for computation.
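As a rough illustration of the format (not the actual kernel code), the sketch below packs two fp4 e2m1 codes into one byte and dequantizes a group of 32 values with a shared power-of-two scale. The lookup table follows the e2m1 encoding; the packing order and helper names are ours, for illustration only.

import numpy as np

# fp4 e2m1 can represent 16 values: sign x {0, 0.5, 1, 1.5, 2, 3, 4, 6}.
E2M1_VALUES = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0], dtype=np.float32)

def pack_fp4(codes: np.ndarray) -> np.ndarray:
    # Two 4-bit codes share one byte (nibble order is illustrative).
    return (codes[0::2] | (codes[1::2] << 4)).astype(np.uint8)

def dequantize_group(packed: np.ndarray, scale_exponent: int) -> np.ndarray:
    # Unpack 16 bytes back into 32 fp4 codes, then apply the group's
    # power-of-two scale (one scale per 32 consecutive values).
    codes = np.empty(packed.size * 2, dtype=np.uint8)
    codes[0::2] = packed & 0x0F
    codes[1::2] = packed >> 4
    return E2M1_VALUES[codes] * (2.0 ** scale_exponent)

# Example: a group of 32 fp4 codes with a scale of 2**-2.
codes = np.random.randint(0, 16, size=32).astype(np.uint8)
weights = dequantize_group(pack_fp4(codes), scale_exponent=-2)
print(weights.shape)  # (32,)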
To efficiently run MXFP4 MoE, vLLM has integrated two specialized GPU kernels via collaboration with OpenAI and NVIDIA:
- Blackwell GPUs (e.g., B200): A new MoE kernel from FlashInfer. This kernel is implemented by NVIDIA and uses Blackwell’s native MXFP4 tensor cores for maximum performance.
- Hopper GPUs (e.g., H100, H200): The Triton matmul_ogs kernel, officially implemented by the OpenAI Triton team. This kernel is optimized specifically for the Hopper architecture and includes swizzling optimizations and built-in heuristics, removing the need for manual tuning.
Efficient Attention
gpt-oss has a highly efficient attention design. It uses GQA with 64 query heads and 8 KV heads. Importantly, the model interleaves full attention and sliding window attention (with window size 128) in a 1:1 ratio. Furthermore, the head size of the model is 64, half of the standard 128. Finally, each query head has a trained “attention sink” vector.
To efficiently support this attention, vLLM has integrated special GPU kernels from FlashInfer (Blackwell) and FlashAttention 3 (Hopper). Also, we enhanced our Triton attention kernel to support this on AMD GPUs.
Furthermore, to efficiently manage the KV cache across the different attention types (i.e., full and sliding window), vLLM has integrated the hybrid KV cache manager, a novel technique proposed by the vLLM team. With the hybrid KV cache manager, vLLM can dynamically share the KV cache space between the full attention layers and sliding window attention layers, eliminating potential memory fragmentation.
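To get a feel for why this design is memory-friendly, here is a back-of-the-envelope sketch (ours, not vLLM internals) of per-layer KV cache usage for one sequence: full attention layers grow with context length, while sliding window layers are capped at 128 tokens, and the 8 KV heads with head size 64 keep the per-token footprint small.

# Rough KV cache size per layer for one sequence (bfloat16 = 2 bytes).
NUM_KV_HEADS = 8
HEAD_SIZE = 64
BYTES_PER_ELEM = 2       # bfloat16
WINDOW = 128             # sliding window size

def kv_bytes_per_layer(seq_len: int, sliding_window: bool) -> int:
    # K and V each store (cached_tokens, kv_heads, head_size) elements.
    cached_tokens = min(seq_len, WINDOW) if sliding_window else seq_len
    return 2 * cached_tokens * NUM_KV_HEADS * HEAD_SIZE * BYTES_PER_ELEM

seq_len = 32_768
full = kv_bytes_per_layer(seq_len, sliding_window=False)
sliding = kv_bytes_per_layer(seq_len, sliding_window=True)
print(f"full attention layer:  {full / 2**20:.1f} MiB")     # 64.0 MiB
print(f"sliding window layer:  {sliding / 2**20:.2f} MiB")  # 0.25 MiB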
Built-in Tool Support: Agent Loop & Tool Server via MCP
gpt-oss includes built-in support for powerful tools, such as web browsing and a Python code interpreter. When enabled, the model autonomously decides when and how to invoke these tools and interprets the results seamlessly.
vLLM natively supports these capabilities by integrating the OpenAI Responses API and the gpt-oss toolkit. Through this integration, vLLM implements an agent loop that parses the model’s tool calls, actually invokes the browsing and code interpreter tools, parses their outputs, and sends the results back to the model.
Alternatively, users can launch an MCP-compliant external tool server and let vLLM use it instead of directly leveraging the gpt-oss toolkit. This modular architecture simplifies the creation of scalable tool-calling libraries and services, requiring no internal changes to vLLM.
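For illustration, the snippet below sends a request through the Responses API using the openai Python client pointed at a local vLLM server. Whether the built-in browsing and code interpreter tools are active depends on how the server was configured (see the vLLM User Guide), so treat this as a minimal sketch rather than a complete tool-calling setup.

from openai import OpenAI

# Talk to the vLLM server's OpenAI-compatible Responses API endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.responses.create(
    model="openai/gpt-oss-120b",
    input="Summarize the latest vLLM release notes.",
)
# When built-in tools are enabled, vLLM runs the tool-calling loop
# server-side and returns the final answer here.
print(response.output_text)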
Looking Ahead
This announcement is just the beginning of vLLM’s continued optimization for gpt-oss. Our ongoing roadmap includes:
- Hardening the Responses API
- Further enhancing attention DP and MoE EP support
- Reducing CPU overhead to maximize throughput
Acknowledgement
vLLM team members who contributed to this effort are: Yongye Zhu, Woosuk Kwon, Chen Zhang, Simon Mo, Kaichao You.
Jay Shah from Colfax International implemented the necessary changes to adapt to attention sinks and uncovered optimizations in the FA3 algorithm for gpt-oss.
We want to thank OpenAI for the amazing partnership: Zhuohan Li, Xiaoxuan Liu, Philippe Tillet, Mario Lezcano-Casado, Dominik Kundel, Casey Dvorak, Vol Kyrylov.
NVIDIA and vLLM worked closely to develop and verify both performance and accuracy on NVIDIA Blackwell architecture: Duncan Moss, Grace Ho, Julien Demouth, Minseok Lee, Siyuan Fu, Zihao Ye, Pen Chung Li.
The AMD team contributed significantly to the integration of the model on their devices: Hongxia Yang, Ali Zaidy, with great support from Peng Sun, Vinayak Gokhale, and Andy Luo.
The Hugging Face team continues to be amazing at building an open source ecosystem: Lysandre, Hugo, Marc, vb, Arthur, Mohamed, Andrien.
Finally, we want to thank all the partners that leveraged vLLM in some way and delivered valuable feedback and improvements to this effort: AWS, Cloudflare, Snowflake, Databricks, Together, Fireworks, Cerebras.