AMD × vLLM Semantic Router: Building the System Intelligence Together
Introduction
Over the past several months, AMD and the vLLM SR Team have been collaborating to bring vLLM Semantic Router (VSR) to AMD GPUs—not just as a performance optimization, but as a fundamental shift in how we think about AI system architecture.
AMD has been a long-term technology partner for the vLLM community, from accelerating the vLLM inference engine on AMD GPUs and ROCm™ Software to now co-building the next layer of the AI stack: intelligent routing and governance for Mixture-of-Models (MoM) systems.
As AI moves from single models to multi-model architectures, the challenge is no longer “how big is your model” but how intelligently and safely you orchestrate many models together. VSR is designed to be the intelligent control plane for this new era—making routing decisions based on semantic understanding, enforcing safety policies, and maintaining trust as systems scale toward AGI-level capabilities.

This collaboration focuses on three strategic pillars:
- Signal-Based Routing: Intelligent request routing using keyword matching, domain classification, semantic similarity, and fact-checking for Multi-LoRA and multi-model deployments
- Cross-Instance Intelligence: Shared state and optimization across vLLM instances through centralized response storage and semantic caching
- Guardrails & Governance: Enterprise-grade security from PII detection and jailbreak prevention to hallucination detection and alignment enforcement
Together with AMD, we’re building VSR to run efficiently on AMD GPUs while establishing a new standard for trustworthy, governable AI infrastructure.
The Shift: From Single Models to Mixture-of-Models
In a Mixture-of-Models world, an enterprise AI stack typically includes:
- Router SLMs (small language models) that classify, route, and enforce policy
- Multiple LLMs and domain-specific models (e.g., code, finance, healthcare, legal)
- Tools, RAG pipelines, vector search, and business systems
Without a robust routing layer, this becomes an opaque and fragile mesh. The AMD × VSR collaboration aims to make routing a first-class, GPU-accelerated infrastructure component—not an ad-hoc script glued between services.
VSR Core Capabilities
1. Signal-Based Routing for Multi-LoRA Deployments
VSR provides multiple routing strategies to match different use cases:
- Keyword-based routing: Simple pattern matching for fast, deterministic routing
- Domain classification: Intent-aware adapter selection using trained classifiers
- Embedding-based semantic similarity: Nuanced routing based on semantic understanding
- Fact-checking and verification routing: High-stakes queries routed to specialized verification pipelines
2. Cross-Instance Intelligence
VSR enables shared state and optimization across all vLLM instances:
- Response API: Centralized response storage enabling stateful multi-turn conversations
- Semantic Cache: Significant token reduction through cross-instance vector similarity matching
3. Enterprise-Grade Guardrails
From single-turn to multi-turn conversations, VSR provides:
- PII Detection: Prevent sensitive information leakage
- Jailbreak Prevention: Block malicious prompt injection attempts
- Hallucination Detection: Verify response reliability for critical domains
- Super Alignment: Ensuring AI systems remain aligned with human values and intentions as they scale toward AGI capabilities
Running VSR on AMD GPUs: Two Deployment Paths
Our near-term objective is execution-oriented: deliver a production-grade VSR solution that runs efficiently on AMD GPUs. We’re building two complementary deployment paths:

Path 1: vLLM-Based Inference on AMD GPUs
Using the vLLM engine on AMD GPUs, we run:
Router SLMs for:
- Task and intent classification
- Risk scoring and safety gating
- Tool and workflow selection
LLMs and specialized models for:
- General assistance
- Domain-specific tasks (finance, legal, code, healthcare)
VSR sits above as the decision fabric, consuming semantic similarity, business metadata, latency constraints, and compliance requirements to perform dynamic routing across models and endpoints.
AMD GPUs provide the throughput and memory footprint needed to run router SLMs + multiple LLMs in the same cluster, supporting high-QPS workloads with stable latency—not just one-off demos.
Path 2: Lightweight ONNX-Based Routing
Not all routing needs a full inference stack. For ultra-high-frequency, latency-sensitive stages at the “front door” of the system, we’re enabling:
- Exporting router SLMs to ONNX
- Running them on AMD GPUs through ONNX Runtime
- Forwarding complex generative work to vLLM or other back-end LLMs
This lightweight path is designed for:
- Front-of-funnel traffic classification and triage
- Large-scale policy evaluation and offline experiments
- Enterprises that want to standardize on AMD GPUs while keeping model providers flexible
Moving to the Next Stage of Semantic Router
When we first built vLLM Semantic Router, the goal was clear and practical: intelligent model selection—routing requests to the right model based on task type, cost constraints, and performance requirements.

vLLM Engine delivers the foundation—running large models stably and efficiently. vLLM Semantic Router provides the scheduler—dispatching requests to the right capabilities.
But as AI systems move toward AGI-level capabilities, this framing feels incomplete. It’s like discussing engine efficiency without addressing brakes, traffic laws, or safety systems.
The real challenge isn’t making models more powerful—it’s maintaining control as they become more powerful.
From Models Director to Intelligence Judger
Working with AMD, we’ve come to see Semantic Router’s evolution differently. Its potential lies not just in “routing,” but in governance—transforming from a traffic director into an Intelligence Control Plane for the AGI era.
This shift changes how we think about the collaboration. We’re not just optimizing for throughput and latency on AMD hardware. We’re building a constitutional layer for AI systems—one defined by responsibilities, not just features.
Three Control Lifelines That Must Be Secured
As we architect VSR on AMD’s infrastructure, we’re designing around three critical control points that determine whether AI systems remain trustworthy at scale:

1. World Output (Actions)
The most dangerous capability of powerful models isn’t reasoning—it’s execution. Every action that changes the world (tool calls, database writes, API invocations, configuration changes) must pass through an external checkpoint before execution.
With AMD GPUs, we can run these checkpoints inline at production scale—evaluating risk, enforcing policies, and logging decisions without becoming a bottleneck.
2. World Input (Inputs)
External inputs are untrusted by default. Web pages, retrieval results, uploaded files, and plugin returns can all carry prompt injection, data poisoning, or privilege escalation attempts.
VSR on AMD infrastructure provides border inspection before data reaches the model—running classifiers, sanitizers, and verification checks as a first line of defense, not an afterthought.
3. Long-Term State (Memory/State)
The hardest failures to fix aren’t wrong answers—they’re wrong answers that get written into long-term memory, system state, or automated workflows.
Our collaboration focuses on making state management a first-class concern: who can write, what can be written, how to undo, and how to isolate contamination. AMD’s GPU infrastructure enables us to run continuous verification and rollback mechanisms that keep state trustworthy over time.
The Ultimate Question
When these three lifelines are secured, Semantic Router stops being just a model selector. It becomes the answer to a fundamental question:
How do we transform alignment from a training-time aspiration into a runtime institution?
This is what the AMD × vLLM Semantic Router collaboration is really about: building not just faster routing, but trustworthy, governable AI infrastructure that can scale safely toward AGI-level capabilities.
Long-Term Vision and Ongoing Work
Our collaboration with AMD extends beyond near-term deployment to building the foundation for next-generation AI infrastructure. We’re working on several long-term initiatives:
Training a Next-Generation Router Model on AMD GPUs
As a longer-term goal, we aim to explore training a next-generation router model based on encoder-only on AMD GPUs, optimized for semantic routing, retrieval-augmented generation (RAG), and safety classification.
While recent encoder models (e.g., ModernBERT) show strong performance, they remain limited in context length, multilingual coverage, and alignment with emerging long-context attention techniques. This effort focuses on advancing encoder capabilities using AMD hardware, particularly for long-context, high-throughput representation learning.
The outcome will be an open encoder model designed to integrate with vLLM Semantic Router and modern AI pipelines, strengthening the retrieval and routing layers of AI systems while expanding hardware-diverse training and deployment options for the community and industry.
Community Public Beta on AMD Infrastructure
As part of this collaboration, each major release of vLLM Semantic Router will be accompanied by a public beta environment hosted on AMD-sponsored infrastructure, available free of charge to the community.
These public betas will allow users to:
- Validate new routing, caching, and safety features
- Gain hands-on experience with Semantic Router running on AMD GPUs
- Provide early feedback that helps improve performance, usability, and system design
By lowering the barrier to experimentation and validation, this initiative aims to strengthen the vLLM ecosystem, accelerate real-world adoption, and ensure that new Semantic Router capabilities are shaped by community input before broader production deployment.
AMD GPU-Powered CI/CD and End-to-End Testbed
In the long run, we aim to use AMD GPUs to underpin how VSR as an open-source project is built, validated, and shipped, ensuring VSR works consistently well with AMD GPUs as the project grows.
We are designing a GPU-backed CI/CD and end-to-end testbed where:
- Router SLMs, LLMs, domain models, retrieval, and tools run together on AMD GPU clusters
- Multi-domain, multi-risk-level datasets are replayed as traffic
- Each VSR change runs through an automated evaluation pipeline, including:
- Routing and policy regression tests
- A/B comparisons of new vs. previous strategies
- Stress tests on latency, cost, and scalability
- Focused suites for hallucination mitigation and compliance behavior
The target state is clear:
Every VSR release comes with a reproducible, GPU-driven evaluation report, not just a changelog.
AMD GPUs, in this model, are not only for serving models; they are the verification engine for the routing infrastructure itself.
An AMD-Backed Mixture-of-Models Playground
In parallel, we are planning an online Mixture-of-Models playground powered by AMD GPUs, open to the community and partners.
This playground will allow users to:
- Experiment with different routing strategies and model topologies under real workloads
- Observe, in a visual way, how VSR decides which model to call, when to retrieve, and when to apply additional checks or fallbacks
- Compare quality, latency, and cost trade-offs across configurations
For model vendors, tool builders, and platform providers, this becomes a neutral, AMD GPU-backed test environment to:
- Integrate their components into a MoM stack
- Benchmark under realistic routing and governance constraints
- Showcase capabilities within a transparent, observable system
Why This Collaboration Matters
Through the AMD × vLLM Semantic Router collaboration, we are aiming beyond “does this model run on this GPU”.
The joint ambitions are:
- To define a reference architecture for intelligent, GPU-accelerated routing on AMD platforms, including:
- vLLM-based inference paths,
- ONNX-based lightweight router paths,
- multi-model coordination and safety enforcement.
- To treat routing as trusted infrastructure, supported by:
- GPU-powered CI/CD and end-to-end evaluation,
- hallucination-aware and risk-aware policies,
- online learning and adaptive strategies.
- To provide the ecosystem with a long-lived, AMD GPU–backed MoM playground where ideas, models, and routing policies can be tested and evolved in the open.
In short, this is about co-building trustworthy, evolvable multi-model AI infrastructure—with AMD GPUs as a core execution and validation layer, and vLLM Semantic Router as the intelligent control plane that makes the entire system understandable, governable, and ready for real workloads.
The technical roadmap—hallucination detection, online learning, multi-model orchestration—serves this larger mission. AMD’s hardware provides the execution layer. VSR provides the control plane. Together, we’re building the foundation for AI systems that remain aligned not through hope, but through architecture.
Acknowledgements
We would like to thank the many talented people who have contributed to this collaboration:
- AMD: Andy Luo, Haichen Zhang, and the AMD AIG Teams.
- vLLM SR: Xunzhuo Liu, Huamin Chen, Chen Wang, Yue Zhu, and the vLLM Semantic Router OSS team.
We’re excited to keep refining and expanding our optimizations to unlock even greater capabilities in the weeks and months ahead!
Join Us
Looking for Collaborations! Calling all passionate community developers and researchers: join us in training the next-generation router model on AMD GPUs and building the future of trustworthy AI infrastructure.
Interested? Reach out to us:
- Haichen Zhang: haichzha@amd.com
- Xunzhuo Liu: xunzhuo@vllm-semantic-router.ai
Resources:
Join the discussion: Share your use cases and feedback in #semantic-router channel on vLLM Slack