# Saving a Model

The `llmcompressor` library extends Hugging Face's `save_pretrained` method with additional arguments to support model compression. This document explains these extra arguments and how to use them effectively.
## How It Works

When you use `llmcompressor`, the model's original `save_pretrained` method is wrapped with an enhanced version that supports compression. This happens in one of two ways:

- **Direct modification**: you call `modify_save_pretrained(model)` directly
- **Automatic wrapping**: you call `oneshot(...)`, which wraps `save_pretrained` under the hood

This means that after applying compression with `oneshot`, your model's `save_pretrained` method is already enhanced with compression capabilities, and you can use the additional arguments described below.
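As a quick sketch of the two paths (the model name is a placeholder; full, runnable examples appear later in this document):

```python
from transformers import AutoModelForCausalLM

from llmcompressor.transformers.sparsification import modify_save_pretrained

model = AutoModelForCausalLM.from_pretrained("your-model")

# Path 1: wrap save_pretrained explicitly
modify_save_pretrained(model)

# Path 2: running oneshot(...) on the model performs the same
# wrapping as a side effect, so no explicit call is needed
```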
## Additional Arguments

When saving your compressed models, you can pass the following extra arguments to `save_pretrained`:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `sparsity_config` | `Optional[SparsityCompressionConfig]` | `None` | Configuration for sparsity compression. Provide this if the model already contains sparsity. If `None` and `skip_sparsity_compression_stats` is `False`, the configuration is inferred from the model automatically. |
| `quantization_format` | `Optional[str]` | `None` | Format string for quantization. If not provided, it is inferred from the model. |
| `save_compressed` | `bool` | `True` | Whether to save the model in a compressed format. Set to `False` to save in the original dense format. |
| `skip_sparsity_compression_stats` | `bool` | `True` | Whether to skip calculating sparsity statistics (e.g., global sparsity and structure) when saving the model. Set to `False` to include these statistics. If you are not providing a `sparsity_config`, set this to `False` so the config is generated automatically. |
| `disable_sparse_compression` | `bool` | `False` | When `True`, skips any sparse compression during save, even if the model has been previously compressed. |
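For illustration, here is a call combining two of these arguments, assuming `model` has already had its `save_pretrained` wrapped as described above (the output directory is a placeholder):

```python
# Save in the original dense format and skip sparse compression entirely
model.save_pretrained(
    "your-model-dense",
    save_compressed=False,
    disable_sparse_compression=True,
)
```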
## Workflow for Models with Existing Sparsity

When working with models that already have sparsity:

- If you know the sparsity configuration, provide it directly via `sparsity_config`
- If you don't know the sparsity configuration, set `skip_sparsity_compression_stats` to `False` so it is inferred automatically from the model

This workflow ensures that the correct sparsity configuration is either provided or generated when saving models with existing sparsity. Both options are sketched below; full examples follow in the next section.
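In code, the two options look like this (paths are placeholders; `custom_config` stands for a `SparsityCompressionConfig` you have built yourself, as in the last example below):

```python
# Option 1: the sparsity configuration is known ahead of time
model.save_pretrained("your-sparse-model", sparsity_config=custom_config)

# Option 2: infer the configuration from the model's weights at save time
model.save_pretrained("your-sparse-model", skip_sparsity_compression_stats=False)
```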
## Examples

### Applying Compression with oneshot

The simplest approach is to use `oneshot`, which handles both compression and wrapping `save_pretrained`:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# Load model
model = AutoModelForCausalLM.from_pretrained("your-model")
tokenizer = AutoTokenizer.from_pretrained("your-model")

# Apply compression - this also wraps save_pretrained
oneshot(
    model=model,
    recipe=[GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"])],
    # Other oneshot parameters (e.g., the calibration dataset)...
)

# Now you can use the enhanced save_pretrained
SAVE_DIR = "your-model-W8A8-compressed"
model.save_pretrained(
    SAVE_DIR,
    save_compressed=True,  # Save in the compressed format
)
tokenizer.save_pretrained(SAVE_DIR)
```
### Manual Approach (Without oneshot)

If you need more control, you can wrap `save_pretrained` manually:
```python
from transformers import AutoModelForCausalLM

from llmcompressor.transformers.sparsification import modify_save_pretrained

# Load model
model = AutoModelForCausalLM.from_pretrained("your-model")

# Manually wrap save_pretrained
modify_save_pretrained(model)

# Now you can use the enhanced save_pretrained
model.save_pretrained(
    "your-model-path",
    save_compressed=True,
    skip_sparsity_compression_stats=False,  # Automatically infer the sparsity config
)
```
### Saving with a Custom Sparsity Configuration

If you already know your model's sparsity profile, you can construct a `SparsityCompressionConfig` from the `compressed-tensors` library and pass it in. The field values below (a 2:4 structure stored in the `sparse-24-bitmask` format) are illustrative; check which formats your installed version supports:

```python
from compressed_tensors.config import SparsityCompressionConfig

# Create a custom sparsity config for 2:4 structured sparsity
custom_config = SparsityCompressionConfig(
    format="sparse-24-bitmask",
    sparsity_structure="2:4",
)

# Save with the custom config
model.save_pretrained(
    "your-model-custom-sparse",
    sparsity_config=custom_config,
)
```
## Notes

- When loading compressed models with `from_pretrained`, the compression format is automatically detected.
- To use compressed models with vLLM, simply load them as you would any model (see the snippet below).
- Compression configurations are saved in the model's config file and are automatically applied when loading.
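For example, a minimal sketch using vLLM's `LLM` entry point (the path is the placeholder save directory from the first example):

```python
from vllm import LLM

# vLLM detects the compression format from the model's config automatically
llm = LLM("your-model-W8A8-compressed")
outputs = llm.generate("Hello, my name is")
```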
For more information about compression algorithms and formats, please refer to the documentation and examples in the `llmcompressor` repository.