TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
Paper


Majid Hadian
2026.03.27
· arXiv · by 이호민
#AI #Algorithm #Data Compression #Machine Learning #Vector Quantization

Key Points

  • TurboQuant is an online vector quantization method designed to achieve near-optimal distortion rates for both mean-squared error (MSE) and inner product estimation, overcoming limitations of existing techniques.
  • It achieves this by randomly rotating input vectors to induce a Beta distribution on each coordinate, enabling optimal scalar quantization per coordinate for MSE, and employs a two-stage approach with 1-bit Quantized JL (QJL) on residuals for unbiased inner product estimation.
  • The method provides provably near-optimal distortion bounds (within a factor of roughly 2.7 of the theoretical limit) and demonstrates strong empirical performance in KV cache quantization and nearest neighbor search, outperforming product quantization while reducing indexing time.

The paper introduces TurboQuant, a novel vector quantization (VQ) framework designed for high-dimensional Euclidean vectors, aiming to minimize mean-squared error (MSE) and inner product distortion. It addresses limitations of existing methods, which often lack optimal distortion rates, computational efficiency, or accelerator compatibility, making them unsuitable for online, real-time AI applications like KV cache quantization. TurboQuant is data-oblivious, suitable for online applications, and highly accelerator-friendly, achieving near-optimal distortion rates across various bit-widths and dimensions.

The core methodology of TurboQuant is a two-stage process. First, it proposes an MSE-optimal vector quantizer, $\text{TurboQuant}_{\text{mse}}$, which minimizes the $\ell_2$ norm of the residual. Second, recognizing that MSE-optimal quantizers introduce bias in inner product estimation, it applies a 1-bit Quantized Johnson-Lindenstrauss (QJL) transform to the residual of the $\text{TurboQuant}_{\text{mse}}$ output to obtain an unbiased, low-distortion inner product quantizer, $\text{TurboQuant}_{\text{prod}}$.

1. MSE-Optimal Quantizer ($\text{TurboQuant}_{\text{mse}}$):
For an input vector $\mathbf{x} \in \mathbb{R}^d$ with $\|\mathbf{x}\|_2 = 1$, the $\text{TurboQuant}_{\text{mse}}$ method operates as follows:

  • Random Rotation: The input vector $\mathbf{x}$ is first transformed by a random rotation matrix $\mathbf{\Pi} \in \mathbb{R}^{d \times d}$, yielding $\mathbf{y} = \mathbf{\Pi} \mathbf{x}$. This random rotation ensures that each coordinate $y_j$ of the rotated vector $\mathbf{y}$ follows a (scaled/shifted) Beta distribution $f_X(x) = \frac{\Gamma(d/2)}{\sqrt{\pi}\,\Gamma((d-1)/2)} (1-x^2)^{(d-3)/2}$ for $x \in [-1, 1]$ (Lemma 1).
  • Near-Independence in High Dimensions: As the dimension $d$ grows, this Beta distribution converges to a Gaussian $\mathcal{N}(0, 1/d)$, and, crucially, distinct coordinates of $\mathbf{y}$ become nearly independent. This property simplifies the quantization problem significantly, allowing each coordinate to be scalar-quantized independently.
  • Optimal Scalar Quantization: An optimal scalar quantizer is applied to each coordinate $y_j$. The optimal centroids are determined by solving a continuous one-dimensional k-means problem (Lloyd-Max algorithm) for the Beta density $f_X(x)$, i.e., minimizing the quantization cost $C(f_X, b) = \min_{-1 \le c_1 \le \dots \le c_{2^b} \le 1} \sum_{i=1}^{2^b} \int_{\frac{c_{i-1}+c_i}{2}}^{\frac{c_i+c_{i+1}}{2}} |x-c_i|^2 f_X(x)\, dx$. These centroids are precomputed and stored for efficiency.
  • Quantization and Dequantization (Algorithm 1):
    • Quantization: For each coordinate $y_j$ of the rotated vector $\mathbf{y}$, the closest centroid $c_k$ is found and its index $k$ (encoded as a $b$-bit integer) is stored. This results in an index vector $\text{idx} \in \{0,1\}^{b \cdot d}$.
    • Dequantization: To reconstruct, the stored indices are used to retrieve the corresponding centroids, forming a reconstructed vector $\tilde{\mathbf{y}}$. This vector is then rotated back by $\mathbf{\Pi}^T$ to obtain the final dequantized vector $\tilde{\mathbf{x}} = \mathbf{\Pi}^T \tilde{\mathbf{y}}$.
  • Performance Guarantees (Theorem 1): For $\mathbf{x} \in \mathbb{S}^{d-1}$, the MSE distortion $D_{\text{mse}} = \mathbb{E}[\|\mathbf{x} - \tilde{\mathbf{x}}\|_2^2]$ is bounded by $D_{\text{mse}} \le \frac{\sqrt{3\pi^2}}{2} \cdot \frac{1}{4^b}$ for any $b \ge 0$. For small bit-widths the bound is even tighter (e.g., $D_{\text{mse}} \approx 0.36$ for $b=1$).
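The rotate-quantize-rotate-back pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the centroids are fitted with empirical 1-D Lloyd iterations on sampled coordinates of random unit vectors rather than by solving the continuous problem for the Beta density, and all function names are illustrative.

```python
import numpy as np

def lloyd_max_centroids(samples, b, iters=50):
    # 1-D Lloyd-Max on empirical samples: approximates the optimal
    # scalar quantizer for the coordinate distribution of Lemma 1.
    k = 2 ** b
    cents = np.quantile(samples, (np.arange(k) + 0.5) / k)  # quantile init
    for _ in range(iters):
        idx = np.abs(samples[:, None] - cents[None, :]).argmin(axis=1)
        for i in range(k):
            members = samples[idx == i]
            if members.size:
                cents[i] = members.mean()   # move centroid to cluster mean
        cents.sort()
    return cents

rng = np.random.default_rng(0)
d, b = 64, 2

# Random rotation: orthogonalize a Gaussian matrix via QR.
Pi, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Coordinates of random unit vectors provide samples from the Beta-type law.
coords = rng.standard_normal((20000, d))
coords /= np.linalg.norm(coords, axis=1, keepdims=True)
centroids = lloyd_max_centroids(coords[:, 0], b)  # precomputed once

def quantize(x):
    y = Pi @ x                                    # random rotation
    return np.abs(y[:, None] - centroids[None, :]).argmin(axis=1)  # b-bit codes

def dequantize(idx):
    return Pi.T @ centroids[idx]                  # lookup, then rotate back

x = rng.standard_normal(d)
x /= np.linalg.norm(x)
x_hat = dequantize(quantize(x))
mse = float(np.sum((x - x_hat) ** 2))
```

With $b=2$ bits per coordinate the empirical squared error lands comfortably below the Theorem 1 bound, since each rotated coordinate is close to $\mathcal{N}(0, 1/d)$ and the fitted 4-level quantizer is near-optimal for it.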

2. Inner Product Optimal Quantizer ($\text{TurboQuant}_{\text{prod}}$):
$\text{TurboQuant}_{\text{mse}}$ minimizes $\ell_2$ distance but introduces bias in inner product estimation. To provide an unbiased inner product estimate with low distortion, $\text{TurboQuant}_{\text{prod}}$ employs a two-stage approach:

  • Stage 1: MSE Quantization for Residual Reduction: The input vector $\mathbf{x}$ is first quantized using $\text{TurboQuant}_{\text{mse}}$ with $b-1$ bits per coordinate, yielding a dequantized vector $\tilde{\mathbf{x}}_{\text{mse}}$. The residual vector is then $\mathbf{r} = \mathbf{x} - \tilde{\mathbf{x}}_{\text{mse}}$; this step minimizes the $\ell_2$ norm of the residual.
  • Stage 2: 1-bit QJL on Residual: The residual vector $\mathbf{r}$ is then quantized using the 1-bit Quantized Johnson-Lindenstrauss (QJL) transform.
    • QJL Definition (Definition 1): $Q_{\text{qjl}}(\mathbf{v}) = \text{sign}(\mathbf{S}\mathbf{v})$ for $\mathbf{v} \in \mathbb{R}^d$, where $\mathbf{S} \in \mathbb{R}^{d \times d}$ is a random matrix with i.i.d. $\mathcal{N}(0,1)$ entries. The dequantization map is $Q^{-1}_{\text{qjl}}(\mathbf{z}) = \frac{\sqrt{\pi/2}}{d} \mathbf{S}^T \mathbf{z}$ for $\mathbf{z} \in \{-1, +1\}^d$.
    • QJL Performance (Lemma 4): QJL provides an unbiased estimate for inner products: $\mathbb{E}[\langle \mathbf{y}, Q^{-1}_{\text{qjl}}(Q_{\text{qjl}}(\mathbf{x})) \rangle] = \langle \mathbf{y}, \mathbf{x} \rangle$. Its variance is bounded by $\text{Var}(\langle \mathbf{y}, Q^{-1}_{\text{qjl}}(Q_{\text{qjl}}(\mathbf{x})) \rangle) \le \frac{\pi}{2d} \|\mathbf{y}\|_2^2$.
  • Performance Guarantees (Theorem 2): For $\mathbf{x}, \mathbf{y} \in \mathbb{S}^{d-1}$, $\text{TurboQuant}_{\text{prod}}$ is unbiased for inner product estimation: $\mathbb{E}[\langle \mathbf{y}, Q^{-1}_{\text{prod}}(Q_{\text{prod}}(\mathbf{x})) \rangle] = \langle \mathbf{y}, \mathbf{x} \rangle$. The inner product distortion $D_{\text{prod}} = \mathbb{E}[(\langle \mathbf{y}, \mathbf{x} \rangle - \langle \mathbf{y}, Q^{-1}_{\text{prod}}(Q_{\text{prod}}(\mathbf{x})) \rangle)^2]$ is bounded by $D_{\text{prod}} \le \frac{\sqrt{3\pi^2}}{2d} \cdot \frac{1}{4^b}$ for any $b \ge 0$.
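The unbiasedness claimed in Lemma 4 is easy to verify numerically. The sketch below implements only the 1-bit QJL map of Definition 1 (not the full two-stage $\text{TurboQuant}_{\text{prod}}$) and averages the inner product estimate over fresh draws of $\mathbf{S}$; function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 256

def qjl_quantize(S, v):
    # Definition 1: keep only the signs of a Gaussian projection.
    return np.sign(S @ v)

def qjl_dequantize(S, z):
    # Dequantization map: (sqrt(pi/2) / d) * S^T z.
    return np.sqrt(np.pi / 2) / d * (S.T @ z)

x = rng.standard_normal(d); x /= np.linalg.norm(x)
y = rng.standard_normal(d); y /= np.linalg.norm(y)

# Average the estimator over independent draws of S: by Lemma 4 it is
# unbiased, so the sample mean should approach the true <y, x>.
ests = [y @ qjl_dequantize(S, qjl_quantize(S, x))
        for S in (rng.standard_normal((d, d)) for _ in range(200))]
err = abs(float(np.mean(ests)) - float(x @ y))
```

Since Lemma 4 bounds the per-draw variance by $\frac{\pi}{2d}\|\mathbf{y}\|_2^2$, the mean over 200 draws concentrates tightly around $\langle \mathbf{y}, \mathbf{x} \rangle$.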

3. Information-Theoretic Lower Bounds (Theorem 3):
The paper provides information-theoretic lower bounds on the best achievable distortion rates for any randomized vector quantizer, using Shannon's lower bound (SLB) and Yao's minimax principle. For an input vector $\mathbf{x} \in \mathbb{S}^{d-1}$ and total bit complexity $B = b \cdot d$:

  • MSE Lower Bound: $D_{\text{mse}} \ge \frac{1}{4^b}$.
  • Inner Product Lower Bound: $D_{\text{prod}} \ge \frac{\|\mathbf{y}\|_2^2}{2d} \cdot \frac{1}{4^b}$.
Comparing TurboQuant's upper bounds to these lower bounds reveals near-optimality. For MSE, $\text{TurboQuant}_{\text{mse}}$'s distortion is within a factor of at most $\frac{\sqrt{3\pi^2}}{2} = \frac{\pi\sqrt{3}}{2} \approx 2.7$ of the information-theoretic lower bound, with this factor decreasing further for smaller bit-widths.
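As a quick arithmetic check of the ≈2.7 figure: dividing the MSE upper bound by the lower bound cancels the shared $1/4^b$ factor and leaves only the constant. Assuming the MSE bound carries the same $\sqrt{3\pi^2}$ constant as the inner product bound:

```python
import math

# Upper-bound constant divided by lower-bound constant (1/4^b cancels).
mse_gap = math.sqrt(3 * math.pi ** 2) / 2   # = pi * sqrt(3) / 2
```

This evaluates to about 2.72, matching the "factor of ~2.7" quoted above.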

Experimental Validation:
Experimental results confirm the theoretical findings:

  • Observed distortions closely align with predictions across various real-world datasets, approaching the established lower bounds.
  • For KV cache quantization, TurboQuant achieves absolute quality neutrality (perfect long-context retrieval) at 3.5 bits per channel and only marginal quality degradation at 2.5 bits per channel, compressing the KV cache by more than $5\times$.
  • In nearest neighbor search tasks, TurboQuant consistently outperforms existing data-dependent product quantization (PQ) techniques in recall while reducing indexing time to virtually zero due to its data-oblivious nature.