llmcompressor.modifiers.quantization.calibration
calibrate_activations(module, value, base_name)
Calibrate input or output activations by calling a module's attached observer.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| module | Module | torch.nn.Module | required |
| value | Tensor | torch.Tensor to be passed to the observer | required |
| base_name | str | substring used to fetch the observer, scales, and zp | required |
Source code in src/llmcompressor/modifiers/quantization/calibration.py
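What the observer ultimately produces is a scale and zero point for the observed tensor. A minimal pure-Python sketch of that arithmetic for asymmetric int8 quantization (the function name and exact formulas here are illustrative, not llmcompressor's Observer API):

```python
def minmax_scale_zp(values, qmin=-128, qmax=127):
    """Derive an asymmetric quantization scale/zero-point from observed values."""
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # quantization range must include zero
    scale = (hi - lo) / (qmax - qmin)
    zero_point = int(round(qmin - lo / scale))
    zero_point = max(qmin, min(qmax, zero_point))  # clamp into the int8 range
    return scale, zero_point
```

For example, observed values spanning [-1.0, 3.0] give a scale of 4/255 and a zero point of -64.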
calibrate_input_hook(module, args)
Hook to calibrate input activations. Will call the observers to update the scales/zp before applying input QDQ in the module's forward pass.
calibrate_kv_cache_input_hook(module, args, kwargs)
Hook to update inputs to attention layers when running kv_cache quantization. Replaces the passed-in kv_cache with the singleton QuantizedKVParameterCache.
calibrate_kv_cache_output_hook(module, _args, _output)
Hook to update k_scale and v_scale parameters when running kv_cache quantization.
calibrate_output_hook(module, _args, output)
Hook to calibrate output activations. Will call the observers to update the scales/zp before applying output QDQ.
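The QDQ ("quantize-dequantize") step these hooks precede can be sketched in pure Python. This is a generic fake-quantization round trip under the scale/zero-point convention above, not llmcompressor's implementation:

```python
def fake_quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Quantize a float to the integer grid, then dequantize it back to float."""
    q = round(x / scale) + zero_point
    q = max(qmin, min(qmax, q))        # clamp into the representable range
    return (q - zero_point) * scale    # dequantize: an approximation of x
```

With scale = 4/255 and zero_point = -64, fake_quantize(1.0, 4/255, -64) returns a value within one scale step of 1.0, which is the error bound QDQ introduces for in-range values.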
call_observer(module, base_name, value=None)
Call a module's attached input/weight/output observer using a provided value. Update the module's scale and zp using the observer's return values.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| module | Module | torch.nn.Module | required |
| base_name | str | substring used to fetch the observer, scales, and zp | required |
| value | Optional[Tensor] | torch.Tensor to be passed to the observer for activations. If base_name is "weight", the module's weight tensor is used instead | None |
freeze_module_quantization(module)
Deletes observers once calibration is complete.

Apply to the full model with model.apply(freeze_module_quantization).
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| module | Module | module to freeze quantization for | required |
initialize_observer(module, base_name)
Initialize an observer module and attach it as a submodule. The observer's name is fetched from the quantization_args and used to load the observer class from the registry; the attached attribute is named using the provided base_name.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| module | Module | torch.nn.Module that the observer is being attached to | required |
| base_name | str | str used to name the observer attribute | required |
initialize_quantized_kv_cache(module)
Initialize a quantized kv_cache on a module (analogous to initializing an observer). When a config specifying kv_cache quantization is applied to a model, the kv_cache args are redefined as the output_activations targeting attention modules.

This function should be called on attention modules with output_activations.
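The "singleton" behavior above means every attention module shares one cache object rather than each holding its own. A generic Python sketch of that pattern (illustrative only; QuantizedKVParameterCache's actual construction may differ):

```python
class SingletonCache:
    """One shared instance, returned from every construction."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            # First construction creates the instance; later calls reuse it
            cls._instance = super().__new__(cls)
        return cls._instance
```

Because construction always returns the same object, per-layer hooks can all write their k/v statistics into one shared cache.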
update_weight_zp_scale(module)
Marks a layer as ready for calibration, which activates observers to update scales and zero points on each forward pass.

Apply to the full model with model.apply(update_weight_zp_scale).
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| module | Module | module to set for calibration | required |
| quantize_weights_upfront | bool | whether to automatically run weight quantization at the start of calibration | required |