llmcompressor.transformers.compression.helpers
calculate_offload_device_map(model_stub, reserve_for_hessians=False, num_gpus=1, torch_dtype=torch.float16, model_cls=AutoModelForCausalLM, **model_kwargs)
Calculates the optimal GPU mappings for model_stub stored as torch_dtype. Takes into account the extra memory required for quantization and (optionally) GPTQ Hessians
Parameters:
Name | Type | Description | Default
---|---|---|---
model_stub | str | local path or HF stub to calculate mapping for | required
reserve_for_hessians | bool | whether to reserve memory for GPTQ Hessians | False
num_gpus | int | number of GPUs to utilize | 1
torch_dtype | dtype | dtype the model weights are stored as | torch.float16
model_cls | Type | model class to use when initializing the model structure | AutoModelForCausalLM
model_kwargs | dict | keyword arguments to pass to the model initializer | {}
Returns:
Type | Description |
---|---|
Dict[Union[int, str], Union[int, str]] | memory mapping for layers of model_stub to be passed to from_pretrained() |
Source code in src/llmcompressor/transformers/compression/helpers.py
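The shape of the returned mapping can be illustrated with a small, pure-Python sketch. This is not llmcompressor's implementation (which accounts for quantization and Hessian overhead); it is a hypothetical greedy layer placement, with `sketch_device_map` and its arguments invented for illustration:

```python
# Illustrative sketch (not llmcompressor's internals): greedily assign
# transformer layers to GPUs by memory, spilling the remainder to CPU offload.
from typing import Dict, List, Union


def sketch_device_map(
    layer_bytes: List[int],  # hypothetical per-layer footprint in bytes
    gpu_capacity: int,       # usable bytes per GPU after reserving headroom
    num_gpus: int = 1,
) -> Dict[int, Union[int, str]]:
    device_map: Dict[int, Union[int, str]] = {}
    gpu, used = 0, 0
    for idx, size in enumerate(layer_bytes):
        if gpu < num_gpus and used + size > gpu_capacity:
            gpu, used = gpu + 1, 0        # current GPU is full, move on
        if gpu < num_gpus:
            device_map[idx] = gpu         # place layer on GPU index `gpu`
            used += size
        else:
            device_map[idx] = "cpu"       # out of GPU memory: offload
    return device_map


print(sketch_device_map([4, 4, 4, 4], gpu_capacity=8, num_gpus=1))
# → {0: 0, 1: 0, 2: 'cpu', 3: 'cpu'}
```

The real function's result is intended to be passed as `device_map` to `from_pretrained()`, as the Returns row above notes.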
custom_offload_device_map(model_stub, max_memory_per_gpu, num_gpus=1, model_cls=AutoModelForCausalLM, **model_kwargs)
Calculates the optimal GPU mappings for model_stub, where each GPU is restricted to allocating a specific amount of memory.
Parameters:
Name | Type | Description | Default
---|---|---|---
model_stub | str | local path or HF stub to calculate mapping for | required
max_memory_per_gpu | Union[str, int] | max memory to allocate on each GPU, as either a string such as "10GB" or an integer number of bytes | required
num_gpus | int | number of GPUs to utilize | 1
model_cls | Type | model class to use when initializing the model structure | AutoModelForCausalLM
model_kwargs | dict | keyword arguments to pass to the model initializer | {}
Returns:
Type | Description |
---|---|
Dict[Union[int, str], Union[int, str]] | memory mapping for layers of model_stub to be passed to from_pretrained() |
Source code in src/llmcompressor/transformers/compression/helpers.py
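Since `max_memory_per_gpu` accepts either a string like `"10GB"` or an integer byte count, a sketch of how such a value could be normalized may help. This is an illustrative helper, not llmcompressor's parsing code, and it assumes decimal (SI) units:

```python
# Illustrative sketch: normalize a max-memory value like "10GB" or a raw
# integer byte count into bytes, assuming decimal (SI) unit suffixes.
def to_bytes(value) -> int:
    if isinstance(value, int):
        return value
    units = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}
    text = value.strip().upper()
    for suffix, factor in units.items():
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * factor)
    return int(text)  # plain number given as a string


print(to_bytes("10GB"))  # 10000000000
print(to_bytes(512))     # 512
```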
hessian_memory_requirements(model)
Determines the number of bytes needed to store Hessian data for a single transformer layer in model. This is used for reserving memory for GPTQ quantization
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model | Module | model to calculate requirements for | required |
Returns:
Type | Description |
---|---|
int | number of bytes required to reserve for GPTQ on a single layer |
Source code in src/llmcompressor/transformers/compression/helpers.py
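GPTQ accumulates, for each linear layer, a Hessian over the layer's inputs, which is square in the input dimension and held in fp32. A rough sketch of the resulting byte count, with hypothetical layer widths (this is an estimate of the idea, not llmcompressor's exact accounting):

```python
# Illustrative sketch: GPTQ keeps one (in_features, in_features) fp32 Hessian
# per linear layer, so the reservation is roughly sum(d*d) * 4 bytes over the
# linear modules in a single transformer layer. Widths below are hypothetical.
def sketch_hessian_bytes(in_features_per_linear, fp32_bytes=4) -> int:
    return sum(d * d * fp32_bytes for d in in_features_per_linear)


# e.g. a toy layer: four projections of width 4096, two MLP mats of width 11008
layers = [4096] * 4 + [11008] * 2
print(round(sketch_hessian_bytes(layers) / 1e9, 2), "GB")
```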
infer_sparse_targets_and_ignores(model, sparsity_structure, sparsity_threshold)
Infers the target and ignore layers in the given model to be used for sparsity compression
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model | Module | model to check | required |
sparsity_structure | str | sparsity structure to check against | required |
sparsity_threshold | float | threshold for sparsity | required |
Returns:
Type | Description |
---|---|
Tuple[List[str], List[str]] | tuple of target and ignore layers |
Source code in src/llmcompressor/transformers/compression/helpers.py
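The target/ignore split can be pictured with a small sketch: a layer whose fraction of zero weights meets the threshold is a target, everything else is ignored. The function name, the per-layer statistics, and the layer names below are all hypothetical, for illustration only:

```python
# Illustrative sketch: partition layers into sparsity targets vs ignores
# based on each layer's fraction of zero weights (values are hypothetical).
def split_targets_and_ignores(zero_fraction_by_layer, sparsity_threshold):
    targets, ignores = [], []
    for name, zero_fraction in zero_fraction_by_layer.items():
        (targets if zero_fraction >= sparsity_threshold else ignores).append(name)
    return targets, ignores


stats = {"model.layers.0.mlp": 0.55, "lm_head": 0.01}
print(split_targets_and_ignores(stats, sparsity_threshold=0.5))
# → (['model.layers.0.mlp'], ['lm_head'])
```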
infer_sparsity_structure_from_model(model)
Determines the sparsity structure, if any exists, given the model
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model | Module | model to check for sparsity structure | required |
Returns:
Type | Description |
---|---|
Optional[str] | sparsity structure as a string or None |
Source code in src/llmcompressor/transformers/compression/helpers.py
infer_sparsity_structure_from_stage_modifiers(stage_modifiers)
Determines the sparsity structure, if any exists, given the list of stage modifiers
Parameters:
Name | Type | Description | Default |
---|---|---|---|
stage_modifiers | List[StageModifier] | non-empty list of stage modifiers | required |
Returns:
Type | Description |
---|---|
Optional[str] | sparsity structure as a string or None |
Source code in src/llmcompressor/transformers/compression/helpers.py
is_sparse_compression_target(module, sparsity_threshold, sparsity_structure)
Checks whether a module is a target for sparsity compression
Parameters:
Name | Type | Description | Default |
---|---|---|---|
module | Module | module to check | required |
sparsity_threshold | float | threshold for sparsity | required |
sparsity_structure | str | sparsity structure to check against | required |
Returns:
Type | Description |
---|---|
bool | whether or not the module is a target for sparsity compression, i.e. True if it is sparse and follows the sparsity structure, else False
Source code in src/llmcompressor/transformers/compression/helpers.py
quantization_memory_requirement(model)
Determines the max number of bytes needed to store quantization scale and zp data
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model | Module | model to calculate requirements for | required |
Returns:
Type | Description |
---|---|
int | number of bytes required to reserve for quantization |
Source code in src/llmcompressor/transformers/compression/helpers.py
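For per-group quantization, the extra storage is one scale and one zero point per group of weights. A rough sketch of that overhead, under an assumed group size and fp16 scale/zp storage (both assumptions, not values taken from llmcompressor):

```python
# Illustrative sketch: per-group quantization stores one scale and one zero
# point per group, so the overhead is about
# 2 * ceil(num_elements / group_size) * bytes_per_value.
def sketch_quant_overhead_bytes(num_elements: int, group_size: int = 128,
                                bytes_per_value: int = 2) -> int:
    num_groups = -(-num_elements // group_size)  # ceiling division
    return 2 * num_groups * bytes_per_value


print(sketch_quant_overhead_bytes(4096 * 4096))  # one 4096x4096 weight matrix
```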
tensor_follows_mask_structure(tensor, mask='2:4')
Checks whether a tensor follows the given mask structure
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tensor | Tensor | tensor to check | required |
mask | str | mask structure to check for, in the format "n:m", also accepts "unstructured" as a valid mask structure | '2:4' |
Returns:
Type | Description |
---|---|
bool | True if the tensor follows the mask structure, False otherwise. Note: some weights can incidentally be zero, so we check for at least n zeros in each chunk of size m
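The n:m check described above can be sketched in pure Python (no torch): split the flattened values into chunks of m and require at least n zeros in each chunk. This is an illustrative re-implementation of the idea, not llmcompressor's code:

```python
# Illustrative sketch of the n:m mask check: every chunk of size m must
# contain at least n zeros; "unstructured" trivially passes.
def follows_mask_structure(values, mask: str = "2:4") -> bool:
    if mask == "unstructured":
        return True
    n, m = map(int, mask.split(":"))
    if len(values) % m != 0:
        return False  # tensor cannot be tiled into chunks of size m
    chunks = (values[i : i + m] for i in range(0, len(values), m))
    return all(sum(v == 0 for v in chunk) >= n for chunk in chunks)


print(follows_mask_structure([0, 0, 1, 2, 3, 0, 0, 0]))  # True: >=2 zeros per chunk of 4
print(follows_mask_structure([0, 1, 2, 3, 0, 0, 4, 5]))  # False: first chunk has only 1 zero
```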