New Feature: Axolotl Sparse Finetuning Integration
Easily finetune sparse LLMs through our seamless integration with Axolotl.
New Feature: AutoAWQ Integration
Perform low-bit weight-only quantization efficiently using AutoAWQ, now part of LLM Compressor.
LLM Compressor

LLM Compressor is an easy-to-use library for optimizing large language models for deployment with vLLM, enabling up to 5X faster, cheaper inference. It provides a comprehensive toolkit for:
- Applying a wide variety of compression algorithms, including weight and activation quantization, pruning, and more
- Seamlessly integrating with Hugging Face Transformers, Models, and Datasets
- Using a `safetensors`-based file format for compressed model storage that is compatible with vLLM
- Supporting performant compression of large models via `accelerate`
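
As a rough end-to-end sketch of that workflow, the snippet below loads a Hugging Face model, applies a one-shot INT8 (W8A8) quantization recipe, and saves a compressed checkpoint that vLLM can load. The specific model, dataset, and parameter values (`smoothing_strength`, `num_calibration_samples`, and so on) are illustrative placeholders, not required settings; check the Getting Started guide for the exact API in your installed version.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # illustrative small model

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Recipe: smooth activation outliers, then apply GPTQ for INT8 weights and activations.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

# One-shot (post-training) compression with a small calibration dataset.
oneshot(
    model=model,
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

# The checkpoint is written in a compressed, safetensors-based format that vLLM can load.
SAVE_DIR = MODEL_ID.split("/")[-1] + "-W8A8"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```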
Key Features
- Weight and Activation Quantization: Reduce model size and improve inference performance for general and server-based applications with the latest research.
    - Supported Algorithms: GPTQ, AWQ, SmoothQuant, RTN
    - Supported Formats: INT W8A8, FP W8A8
- Weight-Only Quantization: Reduce model size and improve inference performance for latency-sensitive applications with the latest research (see the example after this list).
    - Supported Algorithms: GPTQ, AWQ, RTN
    - Supported Formats: INT W4A16, INT W8A16
- Weight Pruning: Reduce model size and improve inference performance for all use cases with the latest research.
    - Supported Algorithms: SparseGPT, Magnitude, Sparse Finetuning
    - Supported Formats: 2:4 (semi-structured), unstructured
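
For the weight-only path, a minimal sketch might look like the following, applying GPTQ with a W4A16 scheme. The model, dataset, and output directory are placeholder choices, and argument names should be verified against the Examples section for your installed version.

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# Weight-only recipe: 4-bit weights, 16-bit activations (W4A16), skipping the LM head.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # illustrative small model
    dataset="open_platypus",
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-W4A16",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```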
Key Sections
- Getting Started: Install LLM Compressor and learn how to apply your first optimization recipe.
- Guides: Detailed guides covering compression schemes, algorithms, and advanced usage patterns.
- Examples: Step-by-step examples for different compression techniques and model types.
- Developer Resources: Information for contributors and developers extending LLM Compressor.