
TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
Key Points
- TurboQuant is an online vector quantization method designed to achieve near-optimal distortion rates for both mean-squared error (MSE) and inner product estimation, overcoming limitations of existing techniques.
- It achieves this by randomly rotating input vectors to induce a Beta distribution per coordinate, enabling optimal scalar quantization of each coordinate for MSE, and employs a two-stage approach with 1-bit Quantized JL (QJL) on residuals for unbiased inner product estimation.
- The method provides provably near-optimal distortion bounds (within a factor of roughly 2.7 of the theoretical limits) and demonstrates strong empirical performance in KV cache quantization and nearest neighbor search tasks, outperforming product quantization while reducing indexing time.
The paper introduces TurboQuant, a novel vector quantization (VQ) framework designed for high-dimensional Euclidean vectors, aiming to minimize mean-squared error (MSE) and inner product distortion. It addresses limitations of existing methods, which often lack optimal distortion rates, computational efficiency, or accelerator compatibility, making them unsuitable for online, real-time AI applications like KV cache quantization. TurboQuant is data-oblivious, suitable for online applications, and highly accelerator-friendly, achieving near-optimal distortion rates across various bit-widths and dimensions.
The core methodology of TurboQuant is a two-stage process. First, it proposes an MSE-optimal vector quantizer that minimizes the ℓ₂ norm of the residual. Second, recognizing that MSE-optimal quantizers introduce bias in inner product estimation, it applies a 1-bit Quantized Johnson-Lindenstrauss (QJL) transform to the residual of the first stage's output, yielding an unbiased, low-distortion inner product quantizer.
1. MSE-Optimal Quantizer:
For an input vector x ∈ ℝ^d with ‖x‖₂ = 1, the method operates as follows:
- Random Rotation: The input vector is first transformed by a random rotation matrix Π, yielding z = Πx. This random rotation ensures that each coordinate z_i of the rotated vector follows a (scaled/shifted) Beta distribution on [-1, 1] (Lemma 1).
- Near-Independence in High Dimensions: As the dimension d grows, this Beta distribution converges to the Gaussian N(0, 1/d), and, crucially, distinct coordinates of z become nearly independent. This property simplifies the quantization problem significantly, allowing each coordinate to be scalar-quantized independently.
- Optimal Scalar Quantization: An optimal scalar quantizer is applied to each coordinate z_i. The optimal centroids c_1, …, c_{2^b} are determined by solving a continuous one-dimensional k-means problem (Lloyd-Max algorithm) for the coordinate's Beta distribution, i.e., minimizing the expected quantization cost E[min_j (z_i - c_j)²]. These centroids are precomputed and stored for efficiency.
- Quantization and Dequantization (Algorithm 1):
- Quantization: For each coordinate of the rotated vector z, the closest centroid is found and its index, encoded as a b-bit integer, is stored. This yields an index vector in {0, …, 2^b - 1}^d.
- Dequantization: To reconstruct, the stored indices are used to retrieve the corresponding centroids, forming a reconstructed vector ẑ. This vector is then rotated back by Π^T (= Π^{-1}) to obtain the final dequantized vector.
- Performance Guarantees (Theorem 1): At b bits per coordinate, the MSE distortion decays as 2^(-2b), within a small constant factor (at most roughly 2.7) of the information-theoretic lower bound; for small bit-widths the constants are tighter still.
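The pipeline above (rotate, quantize each coordinate against precomputed centroids, then invert the rotation) can be sketched in NumPy. This is an illustrative reconstruction, not the paper's code: the Lloyd-Max centroids are approximated by k-means on Gaussian samples (using the Gaussian limit of the Beta distribution noted above), and all helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def lloyd_max_centroids(bits, n_samples=100_000, iters=30):
    """Approximate Lloyd-Max centroids for N(0, 1) via k-means on samples.
    (The paper solves the continuous problem for the exact Beta density.)"""
    samples = rng.standard_normal(n_samples)
    k = 2 ** bits
    # Initialize at evenly spaced quantiles, then run Lloyd iterations.
    centroids = np.quantile(samples, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        idx = np.abs(samples[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            pts = samples[idx == j]
            if len(pts):
                centroids[j] = pts.mean()
    return np.sort(centroids)

def random_rotation(d):
    # QR decomposition of a Gaussian matrix gives a Haar-random rotation.
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def quantize(x, rotation, centroids):
    z = rotation @ x  # rotated coordinates are near-Gaussian, near-independent
    return np.abs(z[:, None] - centroids[None, :]).argmin(axis=1)  # b-bit codes

def dequantize(codes, rotation, centroids):
    return rotation.T @ centroids[codes]  # look up centroids, rotate back

d, b = 256, 3
# Coordinates of a rotated unit vector have std ~ 1/sqrt(d), so rescale.
centroids = lloyd_max_centroids(b) / np.sqrt(d)
rot = random_rotation(d)
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
codes = quantize(x, rot, centroids)
x_hat = dequantize(codes, rot, centroids)
mse = np.sum((x - x_hat) ** 2)  # should be within a small factor of 2^(-2b)
```

Because the rotation is orthogonal, the reconstruction error in the rotated space equals the error in the original space, so only the per-coordinate scalar quantizers matter for the MSE.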
2. Inner Product Optimal Quantizer:
The MSE-optimal quantizer minimizes ℓ₂ distance but introduces bias in inner product estimation. To provide an unbiased inner product estimate with low distortion, TurboQuant employs a two-stage approach:
- Stage 1: MSE Quantization for Residual Reduction: The input vector x is first quantized with the MSE-optimal quantizer from Section 1, yielding a dequantized vector x̂. The residual is then r = x - x̂. This step minimizes the ℓ₂ norm of the residual.
- Stage 2: 1-bit QJL on Residual: The residual vector is then quantized using the 1-bit Quantized Johnson-Lindenstrauss (QJL) transform.
- QJL Definition (Definition 1): The quantizer maps x ↦ sign(Sx) ∈ {-1, +1}^m for x ∈ ℝ^d, where S ∈ ℝ^{m×d} is a random matrix with i.i.d. N(0, 1) entries. The dequantization map sends a sign vector y ∈ {-1, +1}^m to (√(π/2)·‖x‖₂/m)·Sᵀy.
- QJL Performance (Lemma 4): QJL provides an unbiased estimate for inner products: E[⟨q, Dequant(sign(Sx))⟩] = ⟨q, x⟩. Its variance is bounded by O(‖q‖²‖x‖²/m), decaying linearly in the number of projections m.
- Performance Guarantees (Theorem 2): The combined two-stage quantizer is unbiased for inner product estimation: its dequantized output x̃ satisfies E[⟨q, x̃⟩] = ⟨q, x⟩. The inner product distortion decays as 2^(-b) at b bits per coordinate, again within a small constant factor of the lower bound.
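A minimal end-to-end sketch of the two-stage estimator follows. Hedges: the Stage-1 MSE quantizer is replaced by a simple per-coordinate grid rounding as a stand-in for the rotated Lloyd-Max quantizer, and the function names (`grid_quantize`, `qjl_encode`, `estimate_inner_product`) are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def grid_quantize(x, step=1 / 16):
    """Stand-in for the Stage-1 MSE quantizer: round each coordinate to a grid.
    (TurboQuant instead uses the rotation + Lloyd-Max quantizer.)"""
    return np.round(x / step) * step

def qjl_encode(r, S):
    """Stage 2, 1-bit QJL: store only the signs of m random Gaussian
    projections of the residual, plus the residual's norm."""
    return np.sign(S @ r), np.linalg.norm(r)

def estimate_inner_product(q, x_hat, signs, r_norm, S):
    """Unbiased estimate of <q, x>: the exact part <q, x_hat> plus the QJL
    estimate of <q, r>, using E[<s, q> * sign(<s, r>)] = sqrt(2/pi) * <q, r> / ||r||."""
    m = S.shape[0]
    qjl_part = np.sqrt(np.pi / 2) / m * r_norm * (S @ q) @ signs
    return q @ x_hat + qjl_part

d, m = 64, 20_000
x = rng.standard_normal(d); x /= np.linalg.norm(x)
q = rng.standard_normal(d); q /= np.linalg.norm(q)

x_hat = grid_quantize(x)           # Stage 1: coarse, biased reconstruction
r = x - x_hat                      # residual carries the Stage-1 bias
S = rng.standard_normal((m, d))    # shared random projection matrix
signs, r_norm = qjl_encode(r, S)   # Stage 2: 1 bit per projection

est = estimate_inner_product(q, x_hat, signs, r_norm, S)
true_ip = q @ x                    # est should match this up to O(1/sqrt(m)) noise
```

The design point to notice: because Stage 1 already makes ‖r‖ small, the QJL noise (proportional to ‖r‖/√m) is correspondingly small, which is why correcting only the residual recovers an unbiased estimate cheaply.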
3. Information-Theoretic Lower Bounds (Theorem 3):
The paper provides information-theoretic lower bounds on the best achievable distortion rates for any randomized vector quantizer, using Shannon's lower bound (SLB) and Yao's minimax principle. For a unit-norm input vector in ℝ^d and total bit budget B:
- MSE Lower Bound: any (possibly randomized) quantizer using B bits incurs MSE distortion Ω(2^(-2B/d)), i.e., Ω(2^(-2b)) at b = B/d bits per coordinate.
- Inner Product Lower Bound: the inner product distortion is at least Ω(2^(-B/d)) = Ω(2^(-b)).
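Reading the bounds asymptotically (the exact constants of Theorem 3 are not reproduced here), the distortion floors at a per-coordinate budget b = B/d can be tabulated:

```python
# Asymptotic distortion floors for a unit vector at b = B/d bits per coordinate.
# Only the 2^(-2b) (MSE) and 2^(-b) (inner product) decay rates are shown;
# the constant factors from Theorem 3 are omitted.
for b in (1, 2, 3, 4):
    mse_floor = 2.0 ** (-2 * b)   # MSE lower bound scaling
    ip_floor = 2.0 ** (-b)        # inner product lower bound scaling
    print(f"b={b}: MSE ~ {mse_floor:.4f}, inner product ~ {ip_floor:.4f}")
```

Note the gap between the two rates: an unbiased inner product estimate pays 2^(-b) where a biased MSE-optimal reconstruction achieves 2^(-2b), which is exactly the trade-off the two-stage design navigates.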
Experimental Validation:
Experimental results confirm the theoretical findings:
- Observed distortions closely align with predictions across various real-world datasets, approaching the established lower bounds.
- For KV cache quantization, TurboQuant achieves quality neutrality (perfect long-context retrieval) with 3.5 bits per channel and marginal quality degradation with 2.5 bits per channel, compressing the KV cache by more than 4× relative to a 16-bit baseline.
- In nearest neighbor search tasks, TurboQuant consistently outperforms existing data-dependent product quantization (PQ) techniques in recall while reducing indexing time to virtually zero, owing to its data-oblivious nature.