vLLM Semantic Router is the System Level Intelligence for Mixture-of-Models (MoM), bringing Collective Intelligence into LLM systems. It lives between users and models, capturing signals from requests, responses, and context to make intelligent routing decisions—including model selection, safety filtering (jailbreak, PII), semantic caching, and hallucination detection. For more background, see our initial announcement blog post.

We are thrilled to announce the release of vLLM Semantic Router v0.1, codename Iris—our first major release that marks a transformative milestone for intelligent LLM routing. Since our experimental launch in September 2025, we’ve witnessed extraordinary community growth: over 600 Pull Requests merged, 300+ Issues addressed, and contributions from more than 50 outstanding engineers worldwide. As we kick off 2026, we’re excited to deliver a production-ready semantic routing platform that has evolved dramatically from its origins.

Why Iris?

In Greek mythology, Iris (Ἶρις) served as the divine messenger who bridged the realms of gods and mortals, traveling on the arc of the rainbow to deliver messages across vast distances. This symbolism perfectly captures what vLLM Semantic Router v0.1 achieves: a bridge between users and diverse AI models, intelligently routing requests across different LLM providers and architectures.

What’s New in v0.1 Iris?

1. Architecture Overhaul: Signal-Decision Plugin Chain Architecture

Before: The early Semantic Router relied on a single-dimensional approach—classifying queries into one of 14 MMLU domain categories, with jailbreak detection, PII protection, and semantic caching statically wired around it.

Now: We’ve introduced the Signal-Decision Driven Plugin Chain Architecture, a complete reimagining of semantic routing that scales from 14 fixed categories to unlimited intelligent routing decisions.

The new architecture extracts six types of signals from user queries:

  • Domain Signals: MMLU-trained classification with LoRA extensibility
  • Keyword Signals: Fast, interpretable regex-based pattern matching
  • Embedding Signals: Scalable semantic similarity using neural embeddings
  • Factual Signals: Fact-check classification for hallucination detection
  • Feedback Signals: User satisfaction/dissatisfaction indicators
  • Preference Signals: Personalization based on user-defined preferences

These signals serve as inputs to a flexible decision engine that combines them using AND/OR logic with priority-based selection. Previously static features like jailbreak detection, PII protection, and semantic caching are now configurable plugins that users can enable per-decision:

Plugin            Purpose
semantic-cache    Cache similar queries for cost optimization
jailbreak         Detect prompt injection attacks
pii               Protect sensitive information
hallucination     Real-time hallucination detection
system_prompt     Inject custom instructions
header_mutation   Modify HTTP headers for metadata propagation
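
As an illustration, a decision might combine a domain signal and a keyword signal with AND logic, attach a couple of plugins, and route to a specific model. The YAML below is a hedged sketch only; the field names are illustrative assumptions, so consult the configuration documentation for the exact schema:

# Hypothetical decision schema, for illustration only
decisions:
  - name: "math-tutoring"            # illustrative decision name
    priority: 10                     # higher priority wins when multiple decisions match
    signals:
      operator: "AND"                # combine conditions with AND/OR logic
      conditions:
        - type: "domain"             # Domain Signal: MMLU-trained classifier
          value: "math"
        - type: "keyword"            # Keyword Signal: regex-based matching
          value: "(?i)solve|prove|integral"
    plugins:
      - type: "semantic-cache"       # cache similar queries
      - type: "jailbreak"            # screen for prompt injection
    model: "openai/gpt-oss-120b"     # model selected when this decision fires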

This modular design enables unlimited extensibility—new signals, plugins, and model selection algorithms can be added without architectural changes. Learn more in our Signal-Decision Architecture blog post.

2. Performance Optimization: Modular LoRA Architecture

In collaboration with the Hugging Face Candle team, we’ve completely refactored the router’s inference kernel. The previous implementation required loading and running multiple fine-tuned models independently—computational cost grew linearly with the number of classification tasks.

The breakthrough: By adopting Low-Rank Adaptation (LoRA), we now share base model computation across all classification tasks:

Approach   Workload                                          Scalability
Before     N full model forward passes                       O(n)
After      1 base model pass + N lightweight LoRA adapters   O(1) + O(n×ε)

Note: Here ε represents the relative cost of a LoRA adapter forward pass compared to the full base model—typically ε ≪ 1, making the additional overhead negligible.
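
As a back-of-the-envelope check, write C_base for the cost of one base-model forward pass over the input; the total cost of n classification tasks then drops roughly as:

\[
\underbrace{n \cdot C_{\text{base}}}_{\text{before}}
\quad\longrightarrow\quad
\underbrace{C_{\text{base}} + n \cdot \varepsilon \cdot C_{\text{base}}}_{\text{after}}
= (1 + n\varepsilon)\, C_{\text{base}}
\]

For example, with n = 10 tasks and an illustrative ε ≈ 0.05, the new design costs roughly 1.5 base passes instead of 10.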

This architecture delivers significant latency reduction while enabling multi-task classification on the same input. See the full technical details in our Modular LoRA blog post.

3. Safety Enhancement: HaluGate Hallucination Detection

Beyond request-time safety (jailbreak, PII), v0.1 introduces HaluGate—a three-stage hallucination detection pipeline for LLM responses:

Stage 1: HaluGate Sentinel – Binary classification determining if a query warrants factual verification (creative writing and code don’t need fact-checking).

Stage 2: HaluGate Detector – Token-level detection identifying exactly which tokens in the response are unsupported by the provided context.

Stage 3: HaluGate Explainer – NLI-based classification explaining why each flagged span is problematic (CONTRADICTION vs NEUTRAL).

HaluGate integrates seamlessly with function-calling workflows—tool results serve as ground truth for verification. Detection results are propagated via HTTP headers, enabling downstream systems to implement custom policies. Dive deeper in our HaluGate blog post.
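
For instance, a downstream proxy might key custom policies off response headers shaped like the following; the header names here are hypothetical placeholders rather than the shipped contract:

# Hypothetical header names, for illustration only:
x-vsr-hallucination-detected: "true"         # Sentinel: query warranted fact-checking
x-vsr-hallucination-spans: "in 1999"         # Detector: unsupported tokens in the response
x-vsr-hallucination-label: "CONTRADICTION"   # Explainer: NLI label for the flagged span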

4. UX Improvements: One-Command Installation

Local Development:

pip install vllm-sr

Get started in seconds with a single pip command. The package includes all core dependencies needed for a quick start.

Configuration: After installation, run vllm-sr init to generate the default config.yaml. Then configure your LLM backends in the providers section:

providers:
  models:
    - name: "openai/gpt-oss-120b"       # Local vLLM endpoint
      endpoints:
        - endpoint: "localhost:8000"
          protocol: "http"
      access_key: "your-vllm-api-key"
    - name: "openai/gpt-4"              # External provider
      endpoints:
        - endpoint: "api.openai.com"
          protocol: "https"
      access_key: "sk-xxxxxx"
  default_model: "openai/gpt-oss-120b"

See the configuration documentation for full details.

Kubernetes Deployment:

helm install semantic-router oci://ghcr.io/vllm-project/charts/semantic-router

Production-ready Helm charts with sensible defaults and extensive customization options make it easy to deploy vLLM Semantic Router on Kubernetes.

Dashboard: A comprehensive web console for managing intelligent routing policies and model configurations, plus an interactive chat playground for testing routing decisions in real time. Visualize routing flows, monitor latency distributions, and fine-tune classification thresholds, all from an intuitive browser-based interface.

5. Ecosystem Integration

vLLM Semantic Router v0.1 integrates seamlessly with the broader AI infrastructure ecosystem:

Inference Frameworks:

  • vLLM Production Stack – Reference stack for production vLLM deployment with Helm charts, request routing, and KV cache offloading
  • NVIDIA Dynamo – Datacenter-scale distributed inference framework for multi-GPU, multi-node serving with disaggregated prefill/decode
  • llm-d – Kubernetes-native distributed inference stack for achieving SOTA performance across accelerators (NVIDIA, AMD, Google TPU, Intel XPU)
  • vLLM AIBrix – Open-source GenAI infrastructure building blocks for scalable LLM serving

API Gateways:

  • Envoy AI Gateway – Unified access to generative AI services built on Envoy Gateway with multi-provider support
  • Istio – Open-source service mesh for enterprise deployments with traffic management, security, and observability

6. MoM (Mixture of Models) Family

We’re proud to introduce the MoM Family—a comprehensive suite of specialized models purpose-built for semantic routing:

Model                      Purpose
mom-domain-classifier      MMLU-based domain classification
mom-pii-classifier         PII detection and protection
mom-jailbreak-classifier   Prompt injection detection
mom-halugate-sentinel      Fact-check classification
mom-halugate-detector      Token-level hallucination detection
mom-halugate-explainer     NLI-based explanation
mom-toolcall-sentinel      Tool selection classification
mom-toolcall-verifier      Tool call verification
mom-feedback-detector      User feedback analysis
mom-embedding-x            Semantic embedding extraction

All MoM models are specifically trained and optimized for vLLM Semantic Router, providing consistent performance across routing scenarios.

7. Responses API Support

We now support the OpenAI Responses API (/v1/responses) with in-memory conversation state management:

  • Stateful Conversations: Built-in state management with previous_response_id chaining
  • Multi-turn Context: Automatic context preservation across conversation turns
  • Routing Continuity: Intent classification history maintained across the conversation

This enables intelligent routing for modern agent frameworks and multi-turn applications.
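
As a minimal sketch of chaining turns, assuming the router listens on localhost:8000 and that resp_abc123 is the ID returned by a previous call (the request shape follows the OpenAI Responses API):

curl http://localhost:8000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
        "model": "openai/gpt-oss-120b",
        "input": "Summarize what we discussed so far.",
        "previous_response_id": "resp_abc123"
      }'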

8. Tool Selection

Intelligent tool management for agentic workflows:

  • Semantic Tool Filtering: Automatically filter irrelevant tools before sending the request to the LLM (sketched below)
  • Context-Aware Selection: Consider conversation history and task requirements
  • Reduced Token Usage: Smaller tool catalogs mean faster inference and lower costs
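
As a hedged sketch, assuming the router listens on localhost:8000 and using the standard OpenAI tool schema, a request carrying a larger tool catalog might look like this; with semantic tool filtering enabled, the router would forward only the relevant tool (here, get_weather) upstream:

# Illustrative request: the router prunes tools unrelated to the weather question
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "openai/gpt-oss-120b",
        "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
        "tools": [
          {"type": "function", "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {"type": "object",
                           "properties": {"city": {"type": "string"}},
                           "required": ["city"]}}},
          {"type": "function", "function": {
            "name": "query_database",
            "description": "Run a read-only SQL query",
            "parameters": {"type": "object",
                           "properties": {"sql": {"type": "string"}},
                           "required": ["sql"]}}}
        ]
      }'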

Looking Ahead: v0.2 Roadmap

While v0.1 Iris establishes a solid foundation, we’re already planning significant enhancements for v0.2:

Signal-Decision Architecture Enhancements

  • More Signal Types: Extract additional valuable signals from user queries
  • Improved Accuracy: Enhance existing signal computation precision
  • Signal Composer: Design a signal composition layer for complex signal extraction and improved performance

Model Selection Algorithms

Building on the Signal-Decision foundation, we’re researching intelligent model selection algorithms:

  • ML-based Techniques: KNN, KMeans, MLP, SVM, Matrix Factorization
  • Advanced Methods: Elo rating, RouterDC, AutoMix, Hybrid approaches
  • Graph-based Selection: Leverage model relationship graphs
  • Size-aware Routing: Optimize based on model size vs. task complexity

Out-of-Box Plugins

  • Memory Plugin: Persistent conversation memory management
  • Router Replay: Debug and replay routing decisions and feedback

Multi-turn Algorithm Exploration

  • Responses API Enhancement: Extended stateful conversation support with extensible backends such as Redis, Milvus, and Memcached
  • Context Engineering: Context compression and memory management
  • RL-driven Selection: Reinforcement learning for user preference-driven model selection

MoM Enhancements

  • Pre-train Base Model: Longer context window for signal extraction
  • Post-train SLM: Human preference signal extraction
  • Model Migration: Replace existing models with self-trained alternatives

Safety Enhancements

  • Tool Calling Jailbreak Detection: Protect against malicious tool invocations
  • Multi-turn Guardrails: Safety across conversation sessions
  • Improved Hallucination Accuracy: Higher precision hallucination detection

Intelligent Tool Management

  • Tool Completion: Auto-complete tool definitions and tool calls based on intent
  • Advanced Tool Filtering: More sophisticated relevance filtering

UX & Operations

  • Dashboard Enhancements: Improved visualization and management capabilities
  • Helm Chart Improvements: More configuration options and deployment patterns

Evaluation

  • Collaborating with the RouterArena team on comprehensive router evaluation frameworks

Acknowledgments

vLLM Semantic Router v0.1 Iris represents a truly global collaboration. We gratefully acknowledge the contributions from organizations including Red Hat, IBM Research, AMD, Hugging Face, and many others.

We’re proud to welcome our growing committer community:

Senan Zedan, samzong, Liav Weiss, Asaad Balum, Yehudit, Noa Limoy, JaredforReal, Abdallah Samara, Hen Schwartz, Srinivas A, carlory, Yossi Ovadia, Jintao Zhang, yuluo-yx, cryo-zd, OneZero-Y, aeft

And to the 50+ contributors who helped make this release possible—thank you!


Get Started

Ready to try vLLM Semantic Router v0.1 Iris?

pip install vllm-sr
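
Then generate a default configuration, as described in the installation section above:

vllm-sr init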

Join the Community

We believe the future of intelligent routing is built together. Whether you’re a company looking to integrate intelligent routing into your AI infrastructure, a researcher exploring new frontiers in semantic understanding, or an individual developer passionate about open-source AI—we welcome your participation.

Ways to contribute:

  • Organizations: Partner with us on integrations, sponsor development, or contribute engineering resources
  • Researchers: Collaborate on papers, propose new algorithms, or help benchmark performance
  • Developers: Submit PRs, report issues, improve documentation, or build community plugins
  • Community: Share use cases, write tutorials, translate docs, or help answer questions

Every contribution matters—from fixing a typo to architecting a new feature. Join us in shaping the next generation of semantic routing infrastructure.

The rainbow bridge is now open. Welcome to Iris. 🌈