vLLM Blog
  • Apr 23, 2025

    Accelerating RLHF with vLLM, Best Practice from OpenRLHF

  • Apr 11, 2025

    Transformers backend integration in vLLM

  • Apr 5, 2025

    Llama 4 in vLLM

  • Feb 24, 2025

    PTPC-FP8: Boosting vLLM Performance on AMD ROCm

  • Feb 21, 2025

    Introducing AIBrix: A Scalable, Cost-Effective Control Plane for vLLM

  • Feb 17, 2025

    Distributed Inference with vLLM

  • Jan 27, 2025

    vLLM V1: A Major Upgrade to vLLM's Core Architecture

  • Jan 27, 2025

    Introducing vLLM Inference Provider in Llama Stack

  • Jan 21, 2025

    High Performance and Easy Deployment of vLLM in K8S with “vLLM production-stack”

  • Jan 14, 2025

    Structured Decoding in vLLM: a gentle introduction

  • Jan 10, 2025

    vLLM 2024 Retrospective and 2025 Vision

  • Jan 10, 2025

    Installing and Developing vLLM with Ease

  • Oct 23, 2024

    Serving LLMs on AMD MI300X: Best Practices

  • Oct 17, 2024

    How Speculative Decoding Boosts vLLM Performance by up to 2.8x

  • Sep 5, 2024

    vLLM v0.6.0: 2.7x Throughput Improvement and 5x Latency Reduction

  • Jul 25, 2024

    vLLM’s Open Governance and Performance Roadmap

  • Jul 23, 2024

    Announcing Llama 3.1 Support in vLLM

  • Nov 14, 2023

    Notes on vLLM vs. DeepSpeed-FastGen

  • Jun 20, 2023

    vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention

© 2025. vLLM Team. All rights reserved.

vLLM is a fast and easy-to-use library for LLM inference and serving.
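As a quick illustration of that tagline, here is a minimal offline-inference sketch using vLLM's Python API (LLM and SamplingParams); the prompt and model name are arbitrary examples, and any Hugging Face model supported by vLLM can be substituted.

    # Minimal vLLM offline inference sketch.
    # Assumes vLLM is installed (pip install vllm); the model below is an
    # illustrative choice, not a recommendation.
    from vllm import LLM, SamplingParams

    prompts = ["The capital of France is"]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

    llm = LLM(model="facebook/opt-125m")  # any supported model name works here
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        print(output.prompt, output.outputs[0].text)

For online serving, the same models can instead be exposed through vLLM's OpenAI-compatible server (e.g. `vllm serve <model>`), which several of the posts above cover in more depth.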