llmcompressor.datasets.utils
format_calibration_data(tokenized_dataset, num_calibration_samples=None, do_shuffle=True, collate_fn=default_data_collator)
Creates a dataloader out of the calibration dataset split, trimming it to the desired number of calibration samples
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tokenized_dataset | Dataset | dataset to convert to dataloader | required |
num_calibration_samples | Optional[int] | number of data samples to convert | None |
do_shuffle | bool | whether to shuffle the dataset before selecting calibration samples | True |
collate_fn | Callable | optional custom collate function, or use default | default_data_collator |
Returns:
Type | Description |
---|---|
List[Tensor] | list of trimmed calibration data tensors |
Source code in src/llmcompressor/datasets/utils.py
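A minimal usage sketch, assuming a toy pre-tokenized dataset; the sample values are illustrative, and the result is iterated generically since the docstring above describes both a dataloader and a list of tensors.

```python
from datasets import Dataset
from llmcompressor.datasets.utils import format_calibration_data

# Toy pre-tokenized dataset standing in for a real calibration split.
tokenized = Dataset.from_dict({
    "input_ids": [[101, 2023, 102], [101, 2003, 102], [101, 1037, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
})

# Shuffle, trim to two samples, and wrap them for calibration.
calibration_data = format_calibration_data(
    tokenized_dataset=tokenized,
    num_calibration_samples=2,
    do_shuffle=True,
)
for batch in calibration_data:
    print(batch)  # one calibration batch per iteration
```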
get_calibration_dataloader(dataset_args, processor)
Get the dataloader used for oneshot calibration.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset_args | DatasetArguments | DatasetArguments that contains the dataset parameters. | required |
processor | Processor | Processor or the tokenizer of the model. | required |
Returns:
Type | Description |
---|---|
DataLoader | PyTorch dataloader object that contains the calibration dataset. |
Source code in src/llmcompressor/datasets/utils.py
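A hedged sketch of building the oneshot calibration dataloader. The model name, dataset name, and DatasetArguments field names here are assumptions, and the import path for DatasetArguments may vary by llmcompressor version.

```python
from transformers import AutoTokenizer
from llmcompressor.args import DatasetArguments  # import path may vary by version
from llmcompressor.datasets.utils import get_calibration_dataloader

# Assumed fields: "open_platypus" as a registered dataset name, and
# num_calibration_samples to cap how many samples are prepared.
dataset_args = DatasetArguments(
    dataset="open_platypus",
    num_calibration_samples=128,
)
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

calibration_loader = get_calibration_dataloader(dataset_args, tokenizer)
```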
get_processed_dataset(dataset_args, processor=None, do_oneshot=False, do_train=True)
Loads datasets based on dataset_args and stores a Dataset for each enabled flow (train, oneshot).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset_args | DatasetArguments | DatasetArguments that contain dataset loading and processing params | required |
processor | Optional[Processor] | processor or tokenizer to use for dataset tokenization | None |
do_oneshot | bool | True for oneshot pathway | False |
do_train | bool | True for train pathway | True |
Returns:
Type | Description |
---|---|
Optional[Dict[str, Dataset]] | dictionary mapping each enabled flow ("train" or "calibration") to its Dataset, or None if no dataset is configured |
Source code in src/llmcompressor/datasets/utils.py
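A sketch of the oneshot pathway, reusing the assumed DatasetArguments fields from the previous example; the "calibration" key follows the return description above.

```python
from transformers import AutoTokenizer
from llmcompressor.args import DatasetArguments  # import path may vary by version
from llmcompressor.datasets.utils import get_processed_dataset

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
dataset_args = DatasetArguments(dataset="open_platypus", num_calibration_samples=128)

# Enable only the oneshot (calibration) flow; do_train defaults to True,
# so it is disabled explicitly here.
datasets = get_processed_dataset(
    dataset_args=dataset_args,
    processor=tokenizer,
    do_oneshot=True,
    do_train=False,
)
if datasets is not None:
    calibration_dataset = datasets.get("calibration")
```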
make_dataset_splits(tokenized_datasets, do_oneshot=True, do_train=False)
Restructures the datasets dictionary based on which tasks will be run (train, oneshot).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tokenized_datasets | Dict[str, Any] | dictionary of processed datasets | required |
do_oneshot | bool | whether to store the calibration dataset | True |
do_train | bool | whether to store the train dataset | False |
Returns:
Type | Description |
---|---|
Dict[str, Dataset] | dictionary of the requested dataset splits, keyed by "train" and/or "calibration" (oneshot) |
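A minimal sketch of restructuring a processed-datasets dictionary; the toy dataset and the "calibration" key are assumptions based on the return description above.

```python
from datasets import Dataset
from llmcompressor.datasets.utils import make_dataset_splits

# Toy tokenized dataset standing in for real processed data.
toy = Dataset.from_dict({"input_ids": [[1, 2, 3]], "attention_mask": [[1, 1, 1]]})

# Keep only the calibration split for the oneshot pathway.
splits = make_dataset_splits({"calibration": toy}, do_oneshot=True, do_train=False)
calibration_ds = splits.get("calibration")
```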