
Compression Schemes

PTQ

Post-training quantization (PTQ) reduces the precision of quantizable weights (e.g., linear layers) to a lower bit width. Supported formats are:

W4A16

  • Uses GPTQ to compress weights to 4 bits; activations remain at 16 bits. Requires a calibration dataset for weight quantization.
  • Useful for speed-ups in low-QPS regimes, where inference is dominated by loading weights, and offers the most weight compression (see the sketch after this list).
  • Recommended for any GPU type.
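As a rough illustration of the W4A16 storage format, the sketch below quantizes a weight matrix to 4 bits with symmetric, group-wise round-to-nearest quantization while activations stay in 16-bit. This is a simplification: GPTQ additionally uses the calibration data to correct quantization error, and the group size of 128 and the use of plain PyTorch here are assumptions, not part of the format description above.

```python
import torch

def quantize_w4a16(weight: torch.Tensor, group_size: int = 128):
    """Symmetric round-to-nearest 4-bit quantization, one scale per group of input channels."""
    out_f, in_f = weight.shape
    w = weight.reshape(out_f, in_f // group_size, group_size)
    # Map the largest magnitude in each group onto the positive int4 limit (7).
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)  # int4 range, stored unpacked here
    return q, scale

def dequantize_w4a16(q: torch.Tensor, scale: torch.Tensor, shape) -> torch.Tensor:
    # Weights are expanded back to full precision for the matmul; activations were never quantized (A16).
    return (q.float() * scale).reshape(shape)

w = torch.randn(4096, 4096)
q, s = quantize_w4a16(w)
w_hat = dequantize_w4a16(q, s, w.shape)
print("mean abs quantization error:", (w - w_hat).abs().mean().item())
```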

W8A8-INT8

  • Uses channel-wise quantization to compress weights to 8 bits via GPTQ, and dynamic per-token quantization to compress activations to 8 bits. Requires a calibration dataset for weight quantization; activation quantization is carried out at inference time on vLLM.
  • Useful for speed-ups in high-QPS regimes or offline serving on vLLM (see the sketch after this list).
  • Recommended for NVIDIA GPUs with compute capability <8.9 (Ampere, Turing, Volta, Pascal, or older).
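The sketch below illustrates the arithmetic behind W8A8-INT8, assuming symmetric scales: weights get one static int8 scale per output channel, while each activation row (token) is quantized dynamically at inference time. It is neither the GPTQ procedure nor the fused vLLM kernel, just the quantize/dequantize math written out in PyTorch.

```python
import torch

def quantize_weight_per_channel(weight: torch.Tensor):
    # One symmetric scale per output channel (row of [out_features, in_features]).
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(weight / scale), -128, 127).to(torch.int8)
    return q, scale

def quantize_activation_per_token(x: torch.Tensor):
    # Dynamic: one scale per token (row), computed from the live activation at inference time.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

def int8_linear(x: torch.Tensor, w_q: torch.Tensor, w_scale: torch.Tensor):
    x_q, x_scale = quantize_activation_per_token(x)
    # Emulated here in float; a real kernel runs the GEMM in int8 with int32 accumulation.
    acc = x_q.float() @ w_q.float().t()
    return acc * x_scale * w_scale.t()  # rescale back to floating point

w_q, w_scale = quantize_weight_per_channel(torch.randn(4096, 4096))
y = int8_linear(torch.randn(8, 4096), w_q, w_scale)
```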

W8A8-FP8

  • Uses channel-wise quantization to compress weights to 8 bits, and dynamic per-token quantization to compress activations to 8 bits. Does not require a calibration dataset; activation quantization is carried out at inference time on vLLM.
  • Useful for speed-ups in high-QPS regimes or offline serving on vLLM (see the sketch after this list).
  • Recommended for NVIDIA GPUs with compute capability >=9.0 (Hopper and Blackwell).
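A minimal PyTorch sketch of W8A8-FP8, assuming PyTorch 2.1+ for the torch.float8_e4m3fn dtype: channel-wise FP8 scales for weights and dynamic per-token FP8 scales for activations, both derived directly from tensor statistics, which is why no calibration set is needed. A real deployment runs a fused FP8 GEMM on the GPU rather than the dequantize-then-matmul shown here.

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # ~448 for e4m3

def to_fp8(t: torch.Tensor, dim: int):
    # One scale per output channel (weights) or per token (activations).
    scale = t.abs().amax(dim=dim, keepdim=True).clamp(min=1e-8) / FP8_MAX
    q = (t / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q, scale

def fp8_linear(x: torch.Tensor, w_q: torch.Tensor, w_scale: torch.Tensor):
    x_q, x_scale = to_fp8(x, dim=-1)  # dynamic, per token, computed at inference time
    # Emulate the FP8 GEMM by dequantizing; a real kernel multiplies in FP8 directly.
    return (x_q.float() * x_scale) @ (w_q.float() * w_scale).t()

w_q, w_scale = to_fp8(torch.randn(4096, 4096), dim=1)  # static, per output channel
y = fp8_linear(torch.randn(8, 4096), w_q, w_scale)
```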

Sparsification

Sparsification reduces model complexity by pruning selected weight values to zero while retaining essential weights in a subset of parameters. Supported formats include:

2:4-Sparsity with FP8 Weight, FP8 Input Activation

  • Combines (1) semi-structured 2:4 sparsity via SparseGPT, in which two of every four contiguous weights in a tensor are pruned to zero, with (2) channel-wise quantization to compress weights to 8 bits and dynamic per-token quantization to compress activations to 8 bits (see the sketch after this list).
  • Useful for faster inference than W8A8-FP8, with almost no drop in evaluation scores (see blog). Note: small models may lose accuracy when the remaining non-zero weights are insufficient to recapitulate the original distribution.
  • Recommended for NVIDIA GPUs with compute capability >=9.0 (Hopper and Blackwell).
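The sketch below shows only the 2:4 pattern itself, using magnitude-based selection as a stand-in: SparseGPT actually chooses the mask from second-order (Hessian-based) saliency on calibration data and updates the surviving weights, which this simplification omits. The resulting sparse weights would then be quantized channel-wise to FP8 exactly as in the W8A8-FP8 sketch above.

```python
import torch

def apply_2_4_sparsity(weight: torch.Tensor) -> torch.Tensor:
    """Zero the two smallest-magnitude weights in every group of four contiguous weights."""
    out_f, in_f = weight.shape
    groups = weight.reshape(out_f, in_f // 4, 4)
    # Keep the two largest-magnitude entries per group of four.
    idx = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, idx, True)
    return (groups * mask).reshape(out_f, in_f)

w = torch.randn(4096, 4096)
w_sparse = apply_2_4_sparsity(w)
# Every group of four contiguous weights now has at most two non-zeros.
assert (w_sparse.reshape(-1, 4) != 0).sum(dim=-1).max() <= 2
```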