llmcompressor.modifiers.quantization
GPTQModifier
Bases: Modifier, QuantizationMixin
Implements the GPTQ algorithm from https://arxiv.org/abs/2210.17323. This modifier uses activations to calibrate a Hessian matrix, which is then used to determine optimal quantization values and orderings for the model weights.
Sample yaml:

```yaml
test_stage:
    obcq_modifiers:
        GPTQModifier:
            block_size: 128
            dampening_frac: 0.001
            offload_hessians: False
            config_groups:
                group_0:
                    targets:
                        - "Linear"
                    input_activations: null
                    output_activations: null
                    weights:
                        num_bits: 8
                        type: "int"
                        symmetric: true
                        strategy: "tensor"
                        group_size: 128
                        actorder: False
```
Lifecycle:

- on_initialize
    - apply config to model
- on_start
    - add activation calibration hooks
    - add gptq weight calibration hooks
- on_sequential_epoch_end
    - quantize_weight
- on_finalize
    - remove_hooks()
    - model.apply(freeze_module_quantization)
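In practice the modifier is usually driven through a one-shot recipe rather than by calling the lifecycle methods directly. The following is a minimal sketch; the top-level `oneshot` entrypoint, the model id, the dataset name, and the `W4A16` preset are illustrative assumptions, not guarantees made by this reference.

```python
# Sketch: applying GPTQModifier in a one-shot run.
# The model/dataset identifiers below are placeholders.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(
    targets="Linear",       # quantize all Linear layers ...
    ignore=["lm_head"],     # ... except the output head
    scheme="W4A16",         # preset scheme name (assumed available)
    block_size=128,
    dampening_frac=0.01,
)

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model id
    dataset="open_platypus",                     # placeholder calibration set
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```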
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| sequential_targets |  | list of layer names to compress during GPTQ, or 'ALL' to compress every layer in the model | required |
| block_size |  | used to determine the number of columns to compress in one pass | required |
| dampening_frac |  | amount of dampening to apply to H, as a fraction of the diagonal norm | required |
| offload_hessians |  | set to True for decreased memory usage but increased runtime | required |
| config_groups |  | dictionary specifying quantization schemes to apply to target modules. Modules not matching a scheme target will NOT be quantized. | required |
| targets |  | list of layer names to quantize if a scheme is provided. Defaults to Linear layers | required |
| ignore |  | optional list of module class names or submodule names to not quantize even if they match a target in config_groups. Defaults to empty list. | required |
| scheme |  | a single quantization scheme to apply to the model. This is a dictionary that supports all keys from QuantizationScheme except targets, which will be set to the targets parameter set at the modifier level. Can also be set to a dictionary of the format | required |
| kv_cache_scheme |  | optional QuantizationArgs that specify the quantization of the kv cache. If None, kv cache is not quantized. When applying kv cache quantization to a transformer AutoModelForCausalLM, the kv_cache_scheme gets converted into a QuantizationScheme that: - targets the | required |
Source code in src/llmcompressor/modifiers/quantization/gptq/base.py
calibrate_module(module, args, _output)
Calibration hook used to accumulate the Hessian of the input to the module
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| module | Module | module being calibrated | required |
| args | Tuple[Tensor, ...] | inputs to the module, the first element of which is the canonical input | required |
| _output | Tensor | uncompressed module output, unused | required |
Source code in src/llmcompressor/modifiers/quantization/gptq/base.py
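To make the role of this hook concrete, the sketch below shows how a running Hessian estimate can be accumulated from calibration inputs (H ≈ 2·XᵀX averaged over observed samples). This is an illustration of the idea only; the helper name and the exact bookkeeping inside the library are assumptions and may differ from the actual implementation.

```python
import torch


def accumulate_hessian_sketch(H: torch.Tensor, num_samples: int, inp: torch.Tensor):
    """Illustrative running update of H ~ 2 * X^T X averaged over samples.

    `inp` is the canonical module input with shape (..., num_columns); it is
    flattened to (num_new_samples, num_columns) before the update.
    """
    inp = inp.reshape(-1, inp.shape[-1]).to(dtype=torch.float32)
    num_new = inp.shape[0]
    total = num_samples + num_new

    # Rescale the previous estimate, then add the contribution of the new batch
    H = H * (num_samples / total)
    H = H + (2.0 / total) * (inp.t() @ inp)
    return H, total


# Usage: start from a zero matrix sized to the layer's input features
H = torch.zeros(4096, 4096)
H, n = accumulate_hessian_sketch(H, 0, torch.randn(4, 512, 4096))
```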
compress_modules()
Quantize modules which have been calibrated
Source code in src/llmcompressor/modifiers/quantization/gptq/base.py
on_end(state, event, **kwargs)
Finish calibrating by removing observers and calibration hooks
Source code in src/llmcompressor/modifiers/quantization/gptq/base.py
on_finalize(state, **kwargs)
Disable the quantization observers used by the OBCQ algorithm
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| state | State | session state storing input model and calibration data | required |
Source code in src/llmcompressor/modifiers/quantization/gptq/base.py
on_initialize(state, **kwargs)
Initialize and run the GPTQ algorithm on the current state
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| state | State | session state storing input model and calibration data | required |
Source code in src/llmcompressor/modifiers/quantization/gptq/base.py
Observer
Bases: Module, RegistryMixin
Base Observer class to be subclassed for specific implementation. Subclasses should override calculate_qparams to return a scale, zero_point pair.
Source code in src/llmcompressor/observers/base.py
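As a rough illustration of the subclassing contract, the sketch below implements a plain abs-max-style calculate_qparams with pure torch. The registry name, the symmetric int8 math, and the exact register decorator usage are assumptions made for illustration; real observers derive scales and zero points from the module's QuantizationArgs.

```python
import torch
from llmcompressor.observers.base import Observer


@Observer.register(name="minmax_sketch")  # hypothetical registry name
class MinMaxSketchObserver(Observer):
    """Illustrative observer: symmetric int8 scale/zero point from abs-max."""

    def calculate_qparams(self, observed: torch.Tensor, reduce_dims=None):
        if reduce_dims is None:
            abs_max = observed.abs().max()
        else:
            # keepdim=True so scale/zero point are shaped (1,) along reduced dims
            abs_max = observed.abs().amax(dim=reduce_dims, keepdim=True)

        # Symmetric int8: zero point is always 0, scale maps abs-max to 127
        scale = (abs_max / 127.0).clamp(min=1e-8)
        zero_point = torch.zeros_like(scale, dtype=torch.int8)
        return scale, zero_point
```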
calculate_qparams(observed, reduce_dims=None)
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| observed | Tensor | observed tensor to calculate quantization parameters for | required |
| reduce_dims | Optional[Tuple[int]] | optional tuple of dimensions to reduce along; returned scale and zero point will be shaped (1,) along the reduced dimensions | None |

Returns:

| Type | Description |
| --- | --- |
| Tuple[FloatTensor, IntTensor] | tuple of scale and zero point derived from the observed tensor |
Source code in src/llmcompressor/observers/base.py
forward(observed, g_idx=None)
Maps directly to get_qparams

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| observed | Tensor | optional observed tensor from which to calculate quantization parameters | required |
| g_idx | Optional[Tensor] | optional mapping from column index to group index | None |

Returns:

| Type | Description |
| --- | --- |
| Tuple[FloatTensor, IntTensor] | tuple of scale and zero point based on last observed value |
Source code in src/llmcompressor/observers/base.py
get_qparams(observed=None, g_idx=None)
Convenience function that wraps the overridden calculate_qparams; adds support for an optional observed tensor and for tracking the latest calculated scale and zero point.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| observed | Optional[Tensor] | optional observed tensor to calculate quantization parameters from | None |
| g_idx | Optional[Tensor] | optional mapping from column index to group index | None |
Returns:
| Type | Description |
| --- | --- |
| Tuple[FloatTensor, IntTensor] | tuple of scale and zero point based on last observed value |
Source code in src/llmcompressor/observers/base.py
post_calculate_qparams()
record_observed_tokens(batch_tensor)
Counts the number of tokens observed during the forward passes. The count is aggregated in the _num_observed_tokens attribute of the class.
Note: The batch_tensor is expected to have two dimensions (batch_size * sequence_length, num_features). This is the general shape expected by the forward pass of the expert layers in a MOE model. If the input tensor does not have two dimensions, the _num_observed_tokens attribute will be set to None.
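A small illustration of the expected input shape and the resulting count; the variable names here are illustrative and not the class's internal attributes.

```python
import torch

# Hypothetical MoE expert input: (batch_size * seq_len, num_features)
batch_tensor = torch.randn(4 * 512, 4096)

# For a two-dimensional input, the leading dimension is the token count
observed_tokens = batch_tensor.shape[0] if batch_tensor.ndim == 2 else None
print(observed_tokens)  # 2048 tokens observed for this forward pass
```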
Source code in src/llmcompressor/observers/base.py
QuantizationMixin
Bases: HooksMixin
Mixin which enables a Modifier to act as a quantization config, attaching observers, calibration hooks, and compression wrappers to modules
Lifecycle:

- on_initialize: QuantizationMixin.initialize_quantization
    - Attach schemes to modules
    - Attach observers to modules
    - Disable quantization until calibration starts/finishes
- on_start: QuantizationMixin.start_calibration
    - Attach calibration hooks
    - Apply calibration status
    - Enable quantization during calibration
- on_end: QuantizationMixin.end_calibration
    - Remove calibration hooks
    - Apply freeze status
    - Keep quantization enabled for future steps
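The sketch below shows the call order of these lifecycle pieces using QuantizationModifier (which inherits this mixin) and a toy model. It illustrates ordering only, not a working calibration recipe: the `W8A8` preset, the toy model, and the manual calibration loop are assumptions, and in practice a session or one-shot run drives these steps through the modifier events.

```python
import torch
from llmcompressor.modifiers.quantization import QuantizationModifier

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.Linear(16, 4))

# QuantizationModifier inherits QuantizationMixin, so the lifecycle methods
# documented below are available on it directly.
modifier = QuantizationModifier(targets="Linear", scheme="W8A8")  # assumed preset

modifier.initialize_quantization(model)  # attach schemes + observers, quantization disabled
modifier.start_calibration(model)        # attach calibration hooks, enable quantization

with torch.no_grad():
    for _ in range(8):                   # toy calibration loop
        model(torch.randn(2, 16))

modifier.end_calibration(model)          # remove hooks, freeze, keep quantization enabled
```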
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| config_groups |  | dictionary specifying quantization schemes to apply to target modules. Modules not matching a scheme target will NOT be quantized. | required |
| targets |  | list of layer names to quantize if a scheme is provided. Defaults to Linear layers | required |
| ignore |  | optional list of module class names or submodule names to not quantize even if they match a target in config_groups. Defaults to empty list. | required |
| scheme |  | a single quantization scheme to apply to the model. This is a dictionary that supports all keys from QuantizationScheme except targets, which will be set to the targets parameter set at the modifier level. Can also be set to a dictionary of the format | required |
| kv_cache_scheme |  | optional QuantizationArgs that specify the quantization of the kv cache. If None, kv cache is not quantized. When applying kv cache quantization to a transformer AutoModelForCausalLM, the kv_cache_scheme gets converted into a QuantizationScheme that: - targets the | required |
Source code in src/llmcompressor/modifiers/quantization/quantization/mixin.py
end_calibration(model)
Remove calibration hooks and set the model status to frozen. Keep quantization enabled for future operations
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model | Module | model to end calibration for | required |
Source code in src/llmcompressor/modifiers/quantization/quantization/mixin.py
has_config()
Determine if the user has specified a quantization config on this modifier
Source code in src/llmcompressor/modifiers/quantization/quantization/mixin.py
initialize_quantization(model)
Attach quantization schemes and observers to modules in the model according to the quantization config specified on this modifier
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model | Module | model to attach schemes and observers to | required |
Source code in src/llmcompressor/modifiers/quantization/quantization/mixin.py
resolve_quantization_config()
Returns the quantization config specified by this modifier
Source code in src/llmcompressor/modifiers/quantization/quantization/mixin.py
start_calibration(model)
Register activation calibration hooks (including kv_cache quantization) and enable quantization as we calibrate
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| model | Module | model to prepare for calibration | required |
Source code in src/llmcompressor/modifiers/quantization/quantization/mixin.py
QuantizationModifier
Bases: Modifier, QuantizationMixin
Enables post training quantization (PTQ) and quantization aware training (QAT) for a given module or its submodules. After calibration (PTQ) or the start epoch (QAT), the forward pass of the specified module(s) will emulate quantized execution, and the modifier will remain enabled until training is completed.
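For a concrete PTQ flow, the following sketch applies the modifier through a one-shot run. The top-level `oneshot` entrypoint, the placeholder model id, and the `FP8_DYNAMIC` preset (which requires no calibration data) are assumptions used for illustration.

```python
# Sketch: data-free PTQ with QuantizationModifier.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",  # dynamic activation quantization: no calibration set needed
    ignore=["lm_head"],
)

oneshot(
    model="meta-llama/Llama-3.2-1B-Instruct",   # placeholder model id
    recipe=recipe,
    output_dir="Llama-3.2-1B-Instruct-FP8-Dynamic",
)
```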
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| config_groups |  | dictionary specifying quantization schemes to apply to target modules. Modules not matching a scheme target will NOT be quantized. | required |
| targets |  | list of layer names to quantize if a scheme is provided. Defaults to Linear layers | required |
| ignore |  | optional list of module class names or submodule names to not quantize even if they match a target in config_groups. Defaults to empty list. | required |
| scheme |  | a single quantization scheme to apply to the model. This is a dictionary that supports all keys from QuantizationScheme except targets, which will be set to the targets parameter set at the modifier level. Can also be set to a dictionary of the format | required |
| kv_cache_scheme |  | optional QuantizationArgs that specify the quantization of the kv cache. If None, kv cache is not quantized. When applying kv cache quantization to a transformer AutoModelForCausalLM, the kv_cache_scheme gets converted into a QuantizationScheme that: - targets the | required |
Source code in src/llmcompressor/modifiers/quantization/quantization/base.py
on_end(state, event, **kwargs)
Finish calibrating by removing observers and calibration hooks
Source code in src/llmcompressor/modifiers/quantization/quantization/base.py
on_initialize(state, **kwargs)
Prepare to calibrate activations and weights
According to the quantization config, a quantization scheme is attached to each targeted module. The module's forward call is also overwritten to perform quantization to inputs, weights, and outputs.
Then, according to the module's quantization scheme, observers and calibration hooks are added. These hooks are disabled until the modifier starts.
Source code in src/llmcompressor/modifiers/quantization/quantization/base.py
on_start(state, event, **kwargs)
Begin calibrating activations and weights. Calibrate weights only once on start
Source code in src/llmcompressor/modifiers/quantization/quantization/base.py
QuantizedKVParameterCache
Bases: DynamicCache
Quantized KV cache used in the forward call, based on HF's DynamicCache. The quantization strategy (tensor, group, channel) is set from the quantization args' strategy. Implemented as a singleton so that the same cache gets reused across all self_attn forward calls. Each time forward is called, .update() is invoked, and ._quantize() / ._dequantize() are called as appropriate. The cached tensors have size [batch_size, num_heads, seq_len - residual_length, head_dim].

Triggered by adding kv_cache_scheme in the recipe.
Example:
```python
recipe = '''
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
'''
```
Source code in src/llmcompressor/modifiers/quantization/cache.py
__new__(*args, **kwargs)
get_seq_length(layer_idx=0)
Returns the sequence length of the cached states. A layer index can be optionally passed.
Source code in src/llmcompressor/modifiers/quantization/cache.py
reset()
Reset the singleton instance; a new instance is created on the next initialization.
reset_states()
Reset the kv states (used in calibration)
Source code in src/llmcompressor/modifiers/quantization/cache.py
update(key_states, value_states, layer_idx, cache_kwargs=None)
Get the k_scale and v_scale and output the fake-quantized key_states and value_states