New Feature: Axolotl Sparse Finetuning Integration
Easily finetune sparse LLMs through our seamless integration with Axolotl.
New Feature: AutoAWQ Integration
Perform low-bit weight-only quantization efficiently using AutoAWQ, now part of LLM Compressor.
LLM Compressor

LLM Compressor is an easy-to-use library for optimizing large language models for deployment with vLLM, enabling up to 5X faster, cheaper inference. It provides a comprehensive toolkit for:
- Applying a wide variety of compression algorithms, including weight and activation quantization, pruning, and more
- Seamlessly integrating with Hugging Face Transformers, Models, and Datasets
- Using a `safetensors`-based file format for compressed model storage that is compatible with vLLM
- Supporting performant compression of large models via `accelerate`
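
As a rough end-to-end sketch of that workflow, the snippet below loads a Hugging Face model, applies a one-shot INT8 (W8A8) quantization recipe, and saves a compressed checkpoint that vLLM can load. The specific model, dataset, and parameter values (`smoothing_strength`, `num_calibration_samples`, and so on) are illustrative placeholders, not required settings; check the Getting Started guide for the exact API in your installed version.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # illustrative small model

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Recipe: smooth activation outliers, then apply GPTQ for INT8 weights and activations.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

# One-shot (post-training) compression with a small calibration dataset.
oneshot(
    model=model,
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

# The checkpoint is written in a compressed, safetensors-based format that vLLM can load.
SAVE_DIR = MODEL_ID.split("/")[-1] + "-W8A8"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```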
Key Features
- Weight and Activation Quantization: Reduce model size and improve inference performance for general and server-based applications with the latest research.
    - Supported Algorithms: GPTQ, AWQ, SmoothQuant, RTN
    - Supported Formats: INT W8A8, FP W8A8
- Weight-Only Quantization: Reduce model size and improve inference performance for latency-sensitive applications with the latest research (see the example after this list).
    - Supported Algorithms: GPTQ, AWQ, RTN
    - Supported Formats: INT W4A16, INT W8A16
- Weight Pruning: Reduce model size and improve inference performance for all use cases with the latest research.
    - Supported Algorithms: SparseGPT, Magnitude, Sparse Finetuning
    - Supported Formats: 2:4 (semi-structured), unstructured
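
For the weight-only path, a minimal sketch might look like the following, applying GPTQ with a W4A16 scheme. The model, dataset, and output directory are placeholder choices, and argument names should be verified against the Examples section for your installed version.

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# Weight-only recipe: 4-bit weights, 16-bit activations (W4A16), skipping the LM head.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # illustrative small model
    dataset="open_platypus",
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-W4A16",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```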
Key Sections
- Getting Started: Install LLM Compressor and learn how to apply your first optimization recipe.
- Guides: Detailed guides covering compression schemes, algorithms, and advanced usage patterns.
- Examples: Step-by-step examples for different compression techniques and model types.
- Developer Resources: Information for contributors and developers extending LLM Compressor.