
Compression Schemes

PTQ

Post-training quantization (PTQ) reduces the precision of quantizable weights (e.g., linear layers) to a lower bit width. Supported formats are:

W4A16

  • Uses GPTQ to compress weights to 4 bits; activations remain at 16 bits. Requires a calibration dataset for weight quantization.
  • Useful for speed-ups in low-QPS regimes, where inference is dominated by loading weights, and offers the most weight compression (see the sketch after this list).
  • Recommended for any GPU type.
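As a rough illustration of the W4A16 storage format, the sketch below quantizes a weight matrix to 4 bits with symmetric, group-wise round-to-nearest quantization while activations stay in 16-bit. This is a simplification: GPTQ additionally uses the calibration data to correct quantization error, and the group size of 128 and the use of plain PyTorch here are assumptions, not part of the format description above.

```python
import torch

def quantize_w4a16(weight: torch.Tensor, group_size: int = 128):
    """Symmetric round-to-nearest 4-bit quantization, one scale per group of input channels."""
    out_f, in_f = weight.shape
    w = weight.reshape(out_f, in_f // group_size, group_size)
    # Map the largest magnitude in each group onto the positive int4 limit (7).
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)  # int4 range, stored unpacked here
    return q, scale

def dequantize_w4a16(q: torch.Tensor, scale: torch.Tensor, shape) -> torch.Tensor:
    # Weights are expanded back to full precision for the matmul; activations were never quantized (A16).
    return (q.float() * scale).reshape(shape)

w = torch.randn(4096, 4096)
q, s = quantize_w4a16(w)
w_hat = dequantize_w4a16(q, s, w.shape)
print("mean abs quantization error:", (w - w_hat).abs().mean().item())
```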

W8A8-INT8

  • Uses channel-wise quantization to compress weights to 8 bits via GPTQ, and dynamic per-token quantization to compress activations to 8 bits. Requires a calibration dataset for weight quantization; activation quantization is carried out at inference time on vLLM.
  • Useful for speed-ups in high-QPS regimes or offline serving on vLLM (see the sketch after this list).
  • Recommended for NVIDIA GPUs with compute capability <8.9 (Ampere, Turing, Volta, Pascal, or older).
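The sketch below illustrates the arithmetic behind W8A8-INT8, assuming symmetric scales: weights get one static int8 scale per output channel, while each activation row (token) is quantized dynamically at inference time. It is neither the GPTQ procedure nor the fused vLLM kernel, just the quantize/dequantize math written out in PyTorch.

```python
import torch

def quantize_weight_per_channel(weight: torch.Tensor):
    # One symmetric scale per output channel (row of [out_features, in_features]).
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(weight / scale), -128, 127).to(torch.int8)
    return q, scale

def quantize_activation_per_token(x: torch.Tensor):
    # Dynamic: one scale per token (row), computed from the live activation at inference time.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

def int8_linear(x: torch.Tensor, w_q: torch.Tensor, w_scale: torch.Tensor):
    x_q, x_scale = quantize_activation_per_token(x)
    # Emulated here in float; a real kernel runs the GEMM in int8 with int32 accumulation.
    acc = x_q.float() @ w_q.float().t()
    return acc * x_scale * w_scale.t()  # rescale back to floating point

w_q, w_scale = quantize_weight_per_channel(torch.randn(4096, 4096))
y = int8_linear(torch.randn(8, 4096), w_q, w_scale)
```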

W8A8-FP8

  • Uses channel-wise quantization to compress weights to 8 bits, and dynamic per-token quantization to compress activations to 8 bits. Does not require a calibration dataset; activation quantization is carried out at inference time on vLLM.
  • Useful for speed-ups in high-QPS regimes or offline serving on vLLM (see the sketch after this list).
  • Recommended for NVIDIA GPUs with compute capability >=9.0 (Hopper and Blackwell).
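A minimal PyTorch sketch of W8A8-FP8, assuming PyTorch 2.1+ for the torch.float8_e4m3fn dtype: channel-wise FP8 scales for weights and dynamic per-token FP8 scales for activations, both derived directly from tensor statistics, which is why no calibration set is needed. A real deployment runs a fused FP8 GEMM on the GPU rather than the dequantize-then-matmul shown here.

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # ~448 for e4m3

def to_fp8(t: torch.Tensor, dim: int):
    # One scale per output channel (weights) or per token (activations).
    scale = t.abs().amax(dim=dim, keepdim=True).clamp(min=1e-8) / FP8_MAX
    q = (t / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q, scale

def fp8_linear(x: torch.Tensor, w_q: torch.Tensor, w_scale: torch.Tensor):
    x_q, x_scale = to_fp8(x, dim=-1)  # dynamic, per token, computed at inference time
    # Emulate the FP8 GEMM by dequantizing; a real kernel multiplies in FP8 directly.
    return (x_q.float() * x_scale) @ (w_q.float() * w_scale).t()

w_q, w_scale = to_fp8(torch.randn(4096, 4096), dim=1)  # static, per output channel
y = fp8_linear(torch.randn(8, 4096), w_q, w_scale)
```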

Sparsification

Sparsification reduces model complexity by pruning selected weight values to zero while retaining essential weights in a subset of parameters. Supported formats include:

2:4-Sparsity with FP8 Weight, FP8 Input Activation

  • Combines (1) semi-structured 2:4 sparsity via SparseGPT, in which two of every four contiguous weights in a tensor are pruned to zero, with (2) channel-wise quantization to compress weights to 8 bits and dynamic per-token quantization to compress activations to 8 bits (see the sketch after this list).
  • Useful for faster inference than W8A8-FP8, with almost no drop in evaluation scores (see blog). Note: small models may lose accuracy when the remaining non-zero weights are insufficient to recapitulate the original distribution.
  • Recommended for NVIDIA GPUs with compute capability >=9.0 (Hopper and Blackwell).
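The sketch below shows only the 2:4 pattern itself, using magnitude-based selection as a stand-in: SparseGPT actually chooses the mask from second-order (Hessian-based) saliency on calibration data and updates the surviving weights, which this simplification omits. The resulting sparse weights would then be quantized channel-wise to FP8 exactly as in the W8A8-FP8 sketch above.

```python
import torch

def apply_2_4_sparsity(weight: torch.Tensor) -> torch.Tensor:
    """Zero the two smallest-magnitude weights in every group of four contiguous weights."""
    out_f, in_f = weight.shape
    groups = weight.reshape(out_f, in_f // 4, 4)
    # Keep the two largest-magnitude entries per group of four.
    idx = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, idx, True)
    return (groups * mask).reshape(out_f, in_f)

w = torch.randn(4096, 4096)
w_sparse = apply_2_4_sparsity(w)
# Every group of four contiguous weights now has at most two non-zeros.
assert (w_sparse.reshape(-1, 4) != 0).sum(dim=-1).max() <= 2
```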