# Contributing to vLLM Spyre
Thank you for your interest in contributing to the Spyre plugin for vLLM! There are several ways you can contribute:
- Identify and report any issues or bugs.
- Suggest or implement new features.
- Improve documentation or contribute a how-to guide.
## Issues
If you encounter a bug or have a feature request, please search existing issues first to see if it has already been reported. If not, please create a new issue, providing as much relevant information as possible.
You can also reach out for support in the #sig-spyre channel in the vLLM Slack workspace.
## Docs
### Building the docs with MkDocs
#### Install MkDocs and Plugins
Install MkDocs along with the plugins used in the vLLM Spyre documentation.
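For example, with pip (an illustrative sketch: `mkdocs-awesome-nav` is the plugin named in the note below; consult the repository's docs requirements for the complete plugin list):

```bash
pip install mkdocs mkdocs-awesome-nav
```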
Note
Ensure that your Python version is compatible with the plugins (e.g., mkdocs-awesome-nav requires Python 3.10+)
#### Start the Development Server
MkDocs comes with a built-in dev-server that lets you preview your documentation as you work on it.
Make sure you're in the same directory as the `mkdocs.yaml` configuration file in the vllm-spyre repository, and then start the server by running the `mkdocs serve` command:
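```bash
mkdocs serve
```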
Example output:
```
INFO - Documentation built in 106.83 seconds
INFO - [22:02:02] Watching paths for changes: 'docs', 'mkdocs.yaml'
INFO - [22:02:02] Serving on http://127.0.0.1:8000/
```
#### View in Your Browser
Open http://127.0.0.1:8000/ in your browser to see a live preview.
#### Learn More
For additional features and advanced configurations, refer to the official MkDocs Documentation.
## Testing
Tip
If errors occur when running tests, they can be analyzed and debugged by setting `DISABLE_ASSERTS = True` in `spyre_util.py` and rerunning the test with `pytest --capture=no tests/spyre/test_spyre_basic.py`. After debugging, `DISABLE_ASSERTS` should be reset to `False`.
### Testing Locally on CPU (No Spyre card)
Optionally, download the `ibm-ai-platform/micro-g3.3-8b-instruct-1b` model:

```bash
python -c "from transformers import pipeline; pipeline('text-generation', model='ibm-ai-platform/micro-g3.3-8b-instruct-1b')"
```
Caution
The Hugging Face API download does not work on arm64.
By default, the model is saved to `.cache/huggingface/hub/models--ibm-ai-platform--micro-g3.3-8b-instruct-1b`.
Then, source the environment variables:
Optionally, install development dependencies:
Now, you can run the tests:
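For example (a sketch; point pytest at whichever test path you need):

```bash
python -m pytest -v tests/
```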
Here is a list of pytest markers you can use to filter them:
```
markers = [
    "skip_global_cleanup",
    "e2e: Tests using end-to-end engine spin-up",
    "basic: Basic correctness tests",
    "cb: Continuous batching tests",
    "chunked_prefill: Tests with chunked prefill enabled",
    "cpu: Tests using CPU (i.e. eager) backend",
    "compat: backward compatibility tests",
    "spyre: Tests using Spyre hardware backend",
    "decoder: Tests for decoder models",
    "embedding: Tests for embedding models",
    "quantized: Tests for quantized models",
    "multi: Tests that require >1 cards",
    "utils: Tests for utility functions",
    "worker: Tests for worker logic",
]
```
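Markers can be combined with pytest's `-m` option to select a subset of tests, for example (the `tests/` path is an assumption):

```bash
# Run only the basic correctness tests on the CPU (eager) backend
python -m pytest -v -m "cpu and basic" tests/
```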
### Testing Continuous Batching
Run the continuous batching tests:
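For example, by selecting the `cb` marker from the list above (the `tests/` path is an assumption):

```bash
python -m pytest -v -m cb tests/
```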
## Debugging
Tip
You can `oc edit` a pod and change the image without having the pod schedule to a different node. This can be useful for testing whether software or hardware is the issue.
- The script `/opt/sentient/bin/aiu-query-devices` in the pod can be used to see the connectivity between the AIUs on the machine. You can also infer this from environment variables with names like `AIU_TIER_\d_SET_\d_RANK_\d`.
- `SPYRE_DEVICES` can be used to select which device is assigned to each `RANK`. This is similar to how `CUDA_VISIBLE_DEVICES` works for GPU.
    - Example: `0,2,4,6` will assign rank `0` to AIU index `0`, rank `1` to AIU index `2`, rank `2` to AIU index `4`, and rank `3` to AIU index `6`.
    - An alternative is to use `AIU_WORLD_RANK_\d=0000:aa:00.0` to explicitly map ranks to PCI addresses (make sure there are no duplicates used at runtime).
- A bash script that uses `/opt/sentient/senlib/bin/senlib_unit_test` to check each AIU allocated to the pod to see if it works for a basic test:

    ```bash
    #!/bin/bash
    # A bash script that uses `/opt/sentient/senlib/bin/senlib_unit_test`
    # to check each AIU allocated to the pod to see if
    # they work for a basic test

    cleanup_done=0

    cleanup() {
        if [ "$cleanup_done" -eq 0 ] && [ -f ~/.senlib.json.bak ]; then
            echo "Restoring .senlib.json from backup"
            cp ~/.senlib.json.bak ~/.senlib.json
            cleanup_done=1
        fi
        kill -- -$PPID
        wait
        exit
    }
    trap cleanup EXIT SIGINT

    # Create backup .senlib.json if it doesn't exist
    if [ -f "$HOME"/.senlib.json ]; then
        if [ ! -f "$HOME"/.senlib.json.bak ]; then
            echo "Creating backup of $HOME/.senlib.json"
            cp "$HOME"/.senlib.json "$HOME"/.senlib.json.bak
        else
            echo "$HOME/.senlib.json.bak already exists"
        fi
    fi

    for device_id in $(jq -r .GENERAL.sen_bus_id[] /etc/aiu/senlib_config.json); do
        echo "======================================================================"
        echo "Checking AIU ${device_id}"
        echo "======================================================================"
        jq -n '{"GENERAL": { "sen_bus_id": "'"${device_id}"'" }}' > .senlib.json
        # run in background to not override bash signal handler
        timeout 10 /opt/sentient/senlib/bin/senlib_unit_test --gtest_filter=SmlPF1VF0.Open &
        wait
    done
    ```
### Logging levels
Various log levels can be configured via environment variables:
- `DTLOG_LEVEL` - `TRACE`, `DEBUG`, `INFO`, `WARNING`, `ERROR`
- `TORCH_SENDNN_LOG` - `WARNING`, `CRITICAL`
- `VLLM_LOGGING_LEVEL` - `DEBUG`, `INFO`, `WARNING`, `ERROR`
- `DT_DEEPRT_VERBOSE` - `0`, `-1`
Tip
`DTLOG_LEVEL=INFO` (piped to a file) can help you see which device addresses are actually in use. Look for the string `Opened: SEN:VFIO`.
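For example, assuming the server is launched with `vllm serve` (the log file name here is arbitrary):

```bash
DTLOG_LEVEL=INFO vllm serve ibm-ai-platform/micro-g3.3-8b-instruct-1b 2>&1 | tee dtlog.txt
grep "Opened: SEN:VFIO" dtlog.txt
```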
Tip
Set `DT_DEEPRT_VERBOSE` to `0` to enable verbose compiler prints for debugging.
Tip
To stop massive log spew, dial the log levels above down to less verbose settings.
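A minimal sketch using the variables listed above (the specific values here are assumptions, not a recommended configuration; tune to taste):

```bash
# Illustrative values only; adjust per debugging needs
export DTLOG_LEVEL=ERROR
export TORCH_SENDNN_LOG=CRITICAL
export VLLM_LOGGING_LEVEL=INFO
```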
For tensor-parallel debugging, you can enable an option to redirect all log output from each rank to an individual file.
Set VLLM_SPYRE_WORKER_LOG_REDIRECT_DIR to a local directory, and each rank will redirect stdout and stderr into their own file inside the directory.
This can be helpful to avoid having interleaved stack dumps from different ranks in stderr.
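For example (the directory path is arbitrary; it just needs to be writable):

```bash
# Each rank writes its stdout/stderr to its own file under this directory
export VLLM_SPYRE_WORKER_LOG_REDIRECT_DIR=/tmp/vllm-rank-logs
```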
## Performance Metrics
When deploying to Kubernetes clusters, Prometheus and Grafana can be installed and configured to scrape metrics from vLLM's `/metrics` endpoint.
vLLM can also be configured to log performance metrics about every request to a local file.
Setting both `VLLM_SPYRE_PERF_METRIC_LOGGING_ENABLED=1` and `VLLM_SPYRE_PERF_METRIC_LOGGING_DIR=/some/path` and ensuring that vLLM stat logging is enabled will generate metrics in `/some/path/request_metrics.jsonl`. A sample of this file looks like:
{"timestamp": "2025-10-10T12:25:17.544", "prefill_interrupt_seconds": 0, "decode_only_itl_seconds": 0.05045744727055232, "finish_reason": 1, "num_prompt_tokens": 1, "num_generation_tokens": 16, "max_tokens_param": 16, "e2e_latency_seconds": 0.9784879684448242, "queued_time_seconds": 6.0582999140024185e-05, "prefill_time_seconds": 0.220398832927458, "inference_time_seconds": 0.9772605419857427, "decode_time_seconds": 0.7568617090582848, "mean_time_per_output_token_seconds": 0.05045744727055232}
{"timestamp": "2025-10-10T12:25:19.632", "prefill_interrupt_seconds": 0, "decode_only_itl_seconds": 0.10008190000274529, "finish_reason": 1, "num_prompt_tokens": 1, "num_generation_tokens": 16, "max_tokens_param": 16, "e2e_latency_seconds": 2.0864057540893555, "queued_time_seconds": 0.2935298749944195, "prefill_time_seconds": 0.1466117500094697, "inference_time_seconds": 1.647840250050649, "decode_time_seconds": 1.5012285000411794, "mean_time_per_output_token_seconds": 0.10008190000274529}
{"timestamp": "2025-10-10T12:25:19.632", "prefill_interrupt_seconds": 0.14661192893981934, "decode_only_itl_seconds": 0.1000875825372835, "finish_reason": 1, "num_prompt_tokens": 1, "num_generation_tokens": 16, "max_tokens_param": 16, "e2e_latency_seconds": 2.0864808559417725, "queued_time_seconds": 0.1469848749693483, "prefill_time_seconds": 0.14646116609219462, "inference_time_seconds": 1.7943868330912665, "decode_time_seconds": 1.6479256669990718, "mean_time_per_output_token_seconds": 0.10986171113327145}
{"timestamp": "2025-10-10T12:25:19.632", "prefill_interrupt_seconds": 0.29317212104797363, "decode_only_itl_seconds": 0.10008799746477355, "finish_reason": 1, "num_prompt_tokens": 1, "num_generation_tokens": 16, "max_tokens_param": 16, "e2e_latency_seconds": 2.08658504486084, "queued_time_seconds": 0.0001724999165162444, "prefill_time_seconds": 0.14670966705307364, "inference_time_seconds": 1.9412017500726506, "decode_time_seconds": 1.794492083019577, "mean_time_per_output_token_seconds": 0.11963280553463847}
{"timestamp": "2025-10-10T12:25:19.632", "prefill_interrupt_seconds": 0.4400491714477539, "decode_only_itl_seconds": 0.10009045804229875, "finish_reason": 1, "num_prompt_tokens": 1, "num_generation_tokens": 16, "max_tokens_param": 16, "e2e_latency_seconds": 2.0868380069732666, "queued_time_seconds": 2.9250048100948334e-05, "prefill_time_seconds": 0.1447284579044208, "inference_time_seconds": 2.086134499986656, "decode_time_seconds": 1.9414060420822352, "mean_time_per_output_token_seconds": 0.12942706947214902}
## Topology Aware Allocation
This section is specific to the AIU operator and scheduling workloads onto specific cards.
(TODO: link to docs once they exist)
- This mode lets users request a specific set of AIU cards based on PCI topology. By using this mode, we can guarantee that AIU cards of a particular class are picked on the node:
    - `Tier0` provides a set of cards on the same PCI switch.
    - `Tier1` provides a set of cards from at most a one-hop-away PCI switch.
    - `Tier2` provides a set of cards from at most a two-hops-away PCI switch.
- Running a multi-AIU job using `ibm.com/aiu_pf_tier0`, `tier1`, `tier2`:
    - This resource type is used to pick a topology-aware card set, which is required to run tensor parallel (TP) workloads effectively. By using a `tierX` class resource, TP users automatically get the best-performing card set for the workload.
- The maximum number of allocatable resources in each tier depends on the platform and cluster, but we can get up to:
    - `Tier0` - 4 cards
    - `Tier1` - 8 cards
    - `Tier2` - 16 cards
- Devices in `tier0` can do peer-to-peer (P2P) RDMA; devices on different trees use Host DMA, sharing files through `/dev/shm`.

Warning

If you request more cards than the switch supports, the pod will never be scheduled. For example, if you specify `ibm.com/aiu_pf_tier0: 5` in your YAML, the pod will never be scheduled because the maximum set of cards in `tier0` is 4.
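For reference, a topology-aware card request in a pod spec might look like the following (a hypothetical Kubernetes container-resources fragment; only the resource name comes from the text above):

```yaml
# Request a set of 4 AIU cards that all sit on the same PCI switch (tier0)
resources:
  limits:
    ibm.com/aiu_pf_tier0: 4
```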
## Pull Requests
### Linting
When submitting a PR, please make sure your code passes all linting checks. You can install the linting requirements using either uv or pip.
Using uv:
Using pip:
After installing the requirements, run the formatting script:
Then, make sure to commit any changes made by the formatter:
### DCO and Signed-off-by
When contributing, you must agree to the DCO. Commits must include a `Signed-off-by:` header, which certifies agreement with the terms of the DCO.
Using `-s` with `git commit` will automatically add this header.
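For example:

```bash
git commit -s -m "Describe your change here"
```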
## License
See LICENSE.