
Quantization concepts
Key Points
- Quantization reduces the memory footprint and computational cost of large models by representing weights and activations with lower-precision data types like int8, leading to smaller model sizes and faster inference.
- This process typically involves affine quantization, which maps float32 values to integer ranges using scale (S) and zero-point (Z) parameters, with options for symmetric or asymmetric mapping and per-tensor or per-channel granularity.
- Quantization can be applied post-training (PTQ) or during training (QAT) to manage the efficiency-accuracy trade-off, with advanced formats like int4 and FP8 further reducing precision, all integrated into the Transformers library.
Quantization is a technique designed to reduce the memory footprint and computational cost of large machine learning models by representing weights and/or activations with lower-precision data types, such as 8-bit integers (int8), instead of 32-bit floating-point (float32). This leads to smaller model sizes (e.g., int8 models being approximately 4 times smaller than float32), faster inference due to specialized hardware instructions for lower-precision operations, and reduced energy consumption. The primary trade-off is between efficiency and accuracy, as reducing precision introduces quantization noise, which must be minimized to preserve model performance.
The core of quantization involves mapping a range of float32 values to a smaller range represented by a lower-precision integer type, typically int8. This mapping is primarily performed through affine quantization.
Affine Quantization Schemes:
This method identifies the minimum (x_min) and maximum (x_max) values in a float32 tensor and maps this range to the target integer range ([-128, 127] for int8). There are two main approaches:
- Symmetric Quantization: Assumes the float32 range is symmetric around zero (e.g., [-a, a] with a = max(|x_min|, |x_max|)), mapping it symmetrically to the integer range (e.g., [-127, 127]). In this scheme, the float32 value 0.0 maps directly to the int8 value 0, and it requires only a single parameter: the scale (S).
- Asymmetric (Affine) Quantization: Does not assume symmetry and maps the exact float32 range [x_min, x_max] to the full int8 range [-128, 127]. This method requires two parameters: a scale (S) and a zero-point (Z).
The formulas for these parameters are:
- Scale (S): A positive float32 number representing the ratio between the float32 and int8 ranges: S = (x_max - x_min) / (q_max - q_min), where [q_min, q_max] is the target integer range (e.g., [-128, 127]).
- Zero-point (Z): The int8 value that corresponds to the float32 value 0.0: Z = round(q_min - x_min / S). In symmetric quantization, Z is typically fixed at 0.
Given these parameters, a float32 value x can be quantized to an int8 value x_q using:
x_q = clamp(round(x / S + Z), q_min, q_max)
Conversely, an int8 value x_q can be dequantized back to an approximate float32 value x':
x' = S * (x_q - Z)
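These two mappings can be sketched in a few lines of NumPy (a minimal illustration; the helper names quantize/dequantize are ours, not a library API):

```python
import numpy as np

def quantize(x, S, Z, qmin=-128, qmax=127):
    # x_q = clamp(round(x / S + Z), q_min, q_max)
    return np.clip(np.round(x / S + Z), qmin, qmax).astype(np.int8)

def dequantize(x_q, S, Z):
    # x' = S * (x_q - Z)
    return S * (x_q.astype(np.float32) - Z)

# Asymmetric parameters derived from an observed tensor range
x = np.array([-1.5, 0.0, 0.4, 2.1], dtype=np.float32)
S = float(x.max() - x.min()) / 255.0     # (x_max - x_min) / (q_max - q_min)
Z = int(np.round(-128 - x.min() / S))    # int8 value representing float 0.0
x_q = quantize(x, S, Z)
x_hat = dequantize(x_q, S, Z)            # close to x: within ~S/2 per element
```

Note that dequantization recovers only an approximation of the original tensor; the gap per element is bounded by half the scale, which is the quantization noise the trade-off discussion refers to.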
During inference, computations like matrix multiplication are performed using the quantized int8 values, with results often accumulated in a higher-precision type (e.g., int32) before dequantization for subsequent layers.
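A toy version of this inference pattern, using symmetric per-tensor scales (Z = 0) for both weights and activations, with all variable names our own:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3)).astype(np.float32)   # float32 weights
x = rng.standard_normal(3).astype(np.float32)        # float32 activations

# Symmetric scales: map [-max|.|, max|.|] onto [-127, 127]
S_w = float(np.abs(W).max()) / 127.0
S_x = float(np.abs(x).max()) / 127.0
W_q = np.round(W / S_w).astype(np.int8)
x_q = np.round(x / S_x).astype(np.int8)

# Integer matmul with int32 accumulators, then a single dequantization step
acc = W_q.astype(np.int32) @ x_q.astype(np.int32)
y = (S_w * S_x) * acc.astype(np.float32)             # approximates W @ x
```

The int32 accumulator matters: a dot product of many int8 pairs easily overflows int8 or int16, so hardware int8 instructions accumulate into wider registers.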
Quantization Data Types Beyond int8:
- int4: Further reduces model size and memory usage by half compared to int8. It applies the same affine or symmetric principles, mapping float32 values to 16 possible levels (e.g., [-8, 7] for signed int4). Because hardware generally lacks a native 4-bit storage type, int4 values are typically combined through weight packing, where two int4 values are stored in a single int8 byte.
- 8-bit Floating-Point (FP8): Offers an alternative to integer quantization by retaining the floating-point structure (sign, exponent, mantissa) with fewer bits. Two common variants exist:
- E4M3: 1 sign bit, 4 exponent bits, 3 mantissa bits; offers higher precision but a smaller dynamic range.
- E5M2: 1 sign bit, 5 exponent bits, 2 mantissa bits; offers a wider dynamic range but lower precision.
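The int4 weight packing mentioned above can be sketched with NumPy bit operations (a minimal illustration; pack_int4/unpack_int4 are our own helper names, not a library API):

```python
import numpy as np

def pack_int4(vals):
    """Pack pairs of signed int4 values (each in [-8, 7]) into uint8 bytes."""
    assert vals.size % 2 == 0
    nibbles = vals.astype(np.uint8) & 0x0F       # keep only the low 4 bits
    return nibbles[0::2] | (nibbles[1::2] << 4)  # even index -> low nibble

def unpack_int4(packed):
    """Inverse of pack_int4: recover signed int4 values as int8."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    lo = np.where(lo > 7, lo - 16, lo)           # sign-extend 4 -> 8 bits
    hi = np.where(hi > 7, hi - 16, hi)
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    return out

vals = np.array([-8, 7, -3, 0, 5, -1], dtype=np.int8)
packed = pack_int4(vals)          # 6 int4 values stored in 3 bytes
restored = unpack_int4(packed)    # round-trips back to vals
```

Real int4 kernels fuse the unpacking into the matmul so the weights never materialize in 8-bit form; the storage layout, however, follows this two-nibbles-per-byte idea.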
Granularity:
Quantization parameters (S and Z) can be calculated with different granularities:
- Per-Tensor: A single set of and is computed for the entire tensor. Simpler but less accurate if value distributions within the tensor are broad.
- Per-Channel (or Per-Group/Block): Separate and values are computed for each channel, group, or block within a tensor. This approach is more accurate and generally leads to better performance at the cost of slightly increased complexity and metadata memory.
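The accuracy gap is easy to see on a weight matrix whose output channels have very different magnitudes (a contrived NumPy illustration with symmetric quantization; the helper name is ours):

```python
import numpy as np

rng = np.random.default_rng(1)
# Rows (output channels) with wildly different scales: 0.01x, 1x, 100x
W = rng.standard_normal((3, 8)).astype(np.float32)
W *= np.array([[0.01], [1.0], [100.0]], dtype=np.float32)

def sym_quant_dequant(w, scale):
    # Symmetric int8 round trip with the given scale(s)
    return np.clip(np.round(w / scale), -127, 127) * scale

# Per-tensor: one scale for everything, dominated by the largest channel
s_tensor = np.abs(W).max() / 127.0
err_tensor = np.abs(sym_quant_dequant(W, s_tensor) - W).max(axis=1)

# Per-channel: one scale per row, stored as extra metadata
s_channel = np.abs(W).max(axis=1, keepdims=True) / 127.0
err_channel = np.abs(sym_quant_dequant(W, s_channel) - W).max(axis=1)
# err_channel is far smaller on the small-magnitude channels, because the
# per-tensor scale rounds those channels' weights to (or near) zero
```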
Quantization Techniques:
- Post-Training Quantization (PTQ): Quantization is applied to a model *after* it has been fully trained in full precision. It's simpler to implement but may result in a larger accuracy degradation.
- Quantization-Aware Training (QAT): Quantization effects are simulated *during* the training process by inserting "fake quantization" operations. These operations simulate the rounding errors of quantization, allowing the model to adapt to and mitigate these errors. QAT typically yields better accuracy, especially at lower bit-widths, as the model "learns" to be robust to quantization.
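The core of a fake-quantization op is a quantize-dequantize round trip expressed entirely in float32 (a minimal sketch; the gradient handling, usually a straight-through estimator that treats the op as identity in the backward pass, is what a real QAT framework adds on top):

```python
import numpy as np

def fake_quantize(x, scale, qmin=-128, qmax=127):
    """Simulate int8 rounding while staying in float32: snap x to the
    int8 grid, then map straight back. The forward pass therefore sees
    the same rounding error real quantization would introduce."""
    q = np.clip(np.round(x / scale), qmin, qmax)
    return (q * scale).astype(np.float32)

x = np.linspace(-1.0, 1.0, 9, dtype=np.float32)
y = fake_quantize(x, scale=0.05)   # x plus its simulated quantization error
```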
In the Transformers library, quantization is integrated through a unified HfQuantizer API and QuantizationConfig classes, supporting various backends (e.g., bitsandbytes, torchao, compressed-tensors). The general workflow involves choosing a suitable quantization method and either loading a pre-quantized model or applying a specific quantization method to a float32/float16/bfloat16 model using QuantizationConfig.
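As an illustrative sketch of that workflow with the bitsandbytes backend (this mirrors the documented pattern; the model id is only an example, bitsandbytes must be installed, and running it downloads the checkpoint):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Describe the desired scheme via a QuantizationConfig subclass
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

# Weights are quantized on the fly as the float checkpoint is loaded
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",               # example model id
    quantization_config=bnb_config,
)
```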