Plugin Architecture
The Spyre plugin extends or replaces three main components in vLLM:
- Scheduler
- Model worker and model runner
- Modeling code
To better understand these modifications, it helps to first consider the architecture of native vLLM on GPU.
The API server, the engine core, and the workers live in different processes. All three consult the platform API for backend-specific concerns.
In vLLM-Spyre, we implement a platform API that is loaded at vLLM startup and bootstraps all other components.
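As a rough illustration of this bootstrapping role, the sketch below shows a platform object that names the component classes to load and a loader that resolves those names at startup. `ToyPlatform`, `resolve`, and `bootstrap` are hypothetical names, and the standard-library classes only stand in for real scheduler and worker classes; this is not the actual vLLM platform interface.

```python
import importlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyPlatform:
    """Hypothetical stand-in for a platform object (not the real vLLM class)."""

    scheduler_cls: str  # dotted path to the scheduler class to load
    worker_cls: str     # dotted path to the worker class to load


def resolve(qualname: str):
    """Import the module part of a dotted path and return the named attribute."""
    module_name, _, attr = qualname.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)


def bootstrap(platform: ToyPlatform) -> dict:
    """Resolve every component class that the platform names."""
    return {
        "scheduler": resolve(platform.scheduler_cls),
        "worker": resolve(platform.worker_cls),
    }


# Placeholder classes from the standard library keep the sketch runnable.
toy_platform = ToyPlatform(
    scheduler_cls="collections.OrderedDict",
    worker_cls="queue.Queue",
)
components = bootstrap(toy_platform)
```

The point of the indirection is that the rest of the system never hard-codes a backend class: swapping the platform object is enough to swap every component it names.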
As the diagram shows, the plugin mainly modifies the engine core and worker processes. The platform API also includes request validation hooks that the API server invokes to ensure that the backend can handle each request.
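A validation hook of this kind might look like the minimal sketch below. The `ToyRequest` type, the `validate_request` name, and the specific length check are all assumptions made for illustration; the text above does not state which checks the plugin actually performs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyRequest:
    """Hypothetical request summary; the real request type is richer."""

    prompt_token_count: int
    max_new_tokens: int


def validate_request(request: ToyRequest, max_model_len: int) -> None:
    """Raise ValueError if the (assumed) backend limit cannot fit the request."""
    if request.prompt_token_count <= 0:
        raise ValueError("prompt must contain at least one token")
    total = request.prompt_token_count + request.max_new_tokens
    if total > max_model_len:
        raise ValueError(
            f"request needs {total} tokens but the backend supports "
            f"at most {max_model_len}"
        )
```

Rejecting a request in the API server process means that it fails fast with a clear error, before it reaches the engine core or the workers.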
In the engine core, we customize the scheduler to handle the constraints of static batching and continuous batching.
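To make the difference between the two batching modes concrete, here is a simplified sketch of the admission decision a scheduler has to make. It reflects the general meaning of static and continuous batching rather than the plugin's actual scheduler; the function names and the single `max_batch_size` limit are assumptions.

```python
from collections import deque


def admit_static(waiting: deque, running: list, max_batch_size: int) -> list:
    """Static batching: form a new batch only once the current one has finished."""
    if running:
        return []
    admitted = []
    while waiting and len(admitted) < max_batch_size:
        admitted.append(waiting.popleft())
    return admitted


def admit_continuous(waiting: deque, running: list, max_batch_size: int) -> list:
    """Continuous batching: fill whatever slots are free in the running batch."""
    free_slots = max_batch_size - len(running)
    admitted = []
    while waiting and len(admitted) < free_slots:
        admitted.append(waiting.popleft())
    return admitted
```

With one request still running and a batch size of two, `admit_static` admits nothing while `admit_continuous` admits one waiting request.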
The changes are broader in the worker process. Most of the main classes have Spyre-specific implementations. From the vLLM code, we mainly reuse the sampling code (including logits processing) and the pooling code for non-generative use cases.
We provide model runners for three cases: static batching, continuous batching, and pooling. The pooling model runner is very similar to the static batching one, except that it does pooling instead of sampling and uses the transformers modeling code instead of the foundation-model-stack code.
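The choice among the three runners can be pictured as a small dispatch function. This is only a sketch: `RunnerKind`, `select_runner`, and the two boolean flags are hypothetical, and the rule that a pooling model takes precedence over the batching mode is an assumption, not something stated above.

```python
from enum import Enum


class RunnerKind(Enum):
    """The three model runner cases described above."""

    STATIC_BATCHING = "static_batching"
    CONTINUOUS_BATCHING = "continuous_batching"
    POOLING = "pooling"


def select_runner(is_pooling_model: bool, use_continuous_batching: bool) -> RunnerKind:
    """Pick a runner kind from two hypothetical configuration flags."""
    # Assumption: non-generative (pooling) models always use the pooling runner.
    if is_pooling_model:
        return RunnerKind.POOLING
    if use_continuous_batching:
        return RunnerKind.CONTINUOUS_BATCHING
    return RunnerKind.STATIC_BATCHING
```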