
TurboQuant-Prod is a two-stage quantizer optimized for unbiased inner product estimation. Stage 1 applies TurboQuant-MSE with $b-1$ bits. Stage 2 applies the QJL transform (1 bit) to the residual. The result is an unbiased inner product estimator with near-optimal distortion.

Why a Two-Stage Approach?

The MSE-optimal quantizer introduces a multiplicative bias of $2/\pi \approx 0.637$ in inner product estimates at $b = 1$. This bias diminishes with increasing $b$ but is problematic for attention-based models, where inner products must be accurately preserved.

The key insight: use $b-1$ bits for MSE-optimal quantization (reducing the reconstruction error), then use the remaining 1 bit to apply QJL to the residual. QJL is unbiased by construction, so the combined estimator is also unbiased.

MSE Bias vs Prod Unbiasedness

This visualization makes the bias tangible. For fixed vectors $\boldsymbol{x}$ and $\boldsymbol{y}$, we repeatedly quantize $\boldsymbol{x}$ and compute $\langle \boldsymbol{y}, \tilde{\boldsymbol{x}} \rangle$. The MSE histogram (purple) is centered away from the true value; that's the bias. The Prod histogram (green) is centered on the true value.

[Interactive visualization: histograms of MSE vs. Prod inner-product estimates]

Algorithm 2: TurboQuant-Prod

Setup (one-time, shared)

Input: dimension $d$, bit-width $b$

  1. Instantiate a TurboQuant-MSE instance with bit-width $b - 1$
  2. Generate a random projection matrix $\boldsymbol{S} \in \mathbb{R}^{d \times d}$ with i.i.d. entries $\boldsymbol{S}_{i,j} \sim \mathcal{N}(0, 1)$

Quantization: QUANT($\boldsymbol{x}$)

Input: $\boldsymbol{x} \in \mathbb{S}^{d-1}$

  1. Apply MSE quantization with $b-1$ bits: $\text{idx} \leftarrow \text{QUANT}_{\text{mse}}(\boldsymbol{x})$

  2. Compute the residual vector: $\boldsymbol{r} \leftarrow \boldsymbol{x} - \text{DEQUANT}_{\text{mse}}(\text{idx})$

  3. Apply QJL to the residual: $\text{qjl} \leftarrow \text{sign}(\boldsymbol{S} \cdot \boldsymbol{r})$

  4. Output: $(\text{idx},\ \text{qjl},\ \|\boldsymbol{r}\|_2)$

Storage: $(b-1) \cdot d$ bits for idx + $d$ bits for qjl + one scalar for $\|\boldsymbol{r}\|_2$ = $b \cdot d$ bits + $O(1)$. For example, at $d = 128$ and $b = 4$: 384 bits for idx plus 128 bits for qjl, i.e. 512 quantized bits, plus a single float for the residual norm.

Dequantization: DEQUANT($\text{idx}, \text{qjl}, \gamma$)

  1. Reconstruct the MSE estimate: $\tilde{\boldsymbol{x}}_{\text{mse}} \leftarrow \text{DEQUANT}_{\text{mse}}(\text{idx})$

  2. Reconstruct the QJL estimate of the residual: $\tilde{\boldsymbol{x}}_{\text{qjl}} \leftarrow \frac{\sqrt{\pi/2}}{d} \cdot \gamma \cdot \boldsymbol{S}^\top \cdot \text{qjl}$

  3. Output: $\tilde{\boldsymbol{x}} = \tilde{\boldsymbol{x}}_{\text{mse}} + \tilde{\boldsymbol{x}}_{\text{qjl}}$
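
To make the two-stage structure concrete, here is a minimal NumPy sketch of the QUANT/DEQUANT pair above. One loud assumption: `quant_mse`/`dequant_mse` stand in for the real TurboQuant-MSE stage (not described here) with a crude 1-bit sign quantizer. The unbiasedness of the combined estimator does not depend on which deterministic stage-1 quantizer is plugged in, which is what the empirical check at the end illustrates.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Stage-1 stand-in (NOT the real TurboQuant-MSE): a crude 1-bit
# sign quantizer per coordinate. Any deterministic stage-1 quantizer
# leaves the combined estimator unbiased; stage 2 handles its residual.
def quant_mse(x):
    return np.sign(x)

def dequant_mse(idx):
    return idx / np.sqrt(idx.size)           # unit-norm reconstruction

# --- TurboQuant-Prod QUANT: stage-1 code, then QJL on the residual.
def quant_prod(x, S):
    idx = quant_mse(x)
    r = x - dequant_mse(idx)                 # residual vector
    return idx, np.sign(S @ r), np.linalg.norm(r)

# --- TurboQuant-Prod DEQUANT: MSE estimate plus rescaled QJL estimate.
def dequant_prod(idx, qjl, gamma, S):
    d = S.shape[0]
    x_qjl = np.sqrt(np.pi / 2) / d * gamma * (S.T @ qjl)
    return dequant_mse(idx) + x_qjl

# Empirical unbiasedness check; the expectation is over the random S.
d = 32
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                       # x on the unit sphere
y = rng.standard_normal(d)

ests = []
for _ in range(20_000):
    S = rng.standard_normal((d, d))          # fresh projection each trial
    ests.append(y @ dequant_prod(*quant_prod(x, S), S))
print(y @ x, np.mean(ests))                  # agree up to Monte Carlo error
```

Because stage 2 corrects stage 1's residual in expectation, swapping in the true TurboQuant-MSE stage only shrinks the variance; the mean is unchanged.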

Two-Stage Pipeline

Step through the full Prod pipeline: MSE quantize with $b-1$ bits, compute the residual, apply QJL to the residual, then combine. Compare inner product estimates at each stage.

[Interactive visualization: the two-stage Prod pipeline, step by step]

Quantized Johnson-Lindenstrauss (QJL)

QJL is a 1-bit quantization scheme based on random projections. For $\boldsymbol{x} \in \mathbb{R}^d$:

$$Q_{\text{qjl}}(\boldsymbol{x}) := \text{sign}(\boldsymbol{S} \cdot \boldsymbol{x}) \in \{-1, +1\}^d$$

The dequantization map:

$$Q_{\text{qjl}}^{-1}(\boldsymbol{z}) := \frac{\sqrt{\pi/2}}{d} \cdot \boldsymbol{S}^\top \cdot \boldsymbol{z}$$

The scaling factor $\sqrt{\pi/2}/d$ comes from $\mathbb{E}[|Z|] = \sqrt{2/\pi}$ for $Z \sim \mathcal{N}(0,1)$.
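
That constant is easy to sanity-check numerically (a throwaway simulation, not part of the algorithm):

```python
import numpy as np

# E[|Z|] for Z ~ N(0,1) equals sqrt(2/pi) ≈ 0.7979.
Z = np.random.default_rng(0).standard_normal(1_000_000)
print(np.abs(Z).mean(), np.sqrt(2 / np.pi))
```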

Key property: for any $\boldsymbol{x} \in \mathbb{S}^{d-1}$ and any $\boldsymbol{y} \in \mathbb{R}^d$:

$$\mathbb{E}\left[\left\langle \boldsymbol{y}, Q_{\text{qjl}}^{-1}(Q_{\text{qjl}}(\boldsymbol{x})) \right\rangle\right] = \langle \boldsymbol{y}, \boldsymbol{x} \rangle$$

with variance bounded by $\frac{\pi}{2d} \cdot \|\boldsymbol{y}\|_2^2$.
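
Both properties are easy to probe numerically. A minimal sketch of $Q_{\text{qjl}}$ and $Q_{\text{qjl}}^{-1}$ as defined above (the function names are mine, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def qjl_quant(x, S):
    return np.sign(S @ x)                          # 1 bit per coordinate

def qjl_dequant(z, S):
    return np.sqrt(np.pi / 2) / S.shape[0] * (S.T @ z)

d = 32
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                             # x on the unit sphere
y = rng.standard_normal(d)

ests = []
for _ in range(50_000):
    S = rng.standard_normal((d, d))                # i.i.d. N(0,1) entries
    ests.append(y @ qjl_dequant(qjl_quant(x, S), S))
ests = np.array(ests)

print("true:", y @ x, " mean:", ests.mean())       # matches: unbiased
print("var :", ests.var(), " bound:", np.pi / (2 * d) * (y @ y))
```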

QJL Unbiasedness Proof

Let $\boldsymbol{s}_1, \ldots, \boldsymbol{s}_d$ be the rows of $\boldsymbol{S}$. The inner product estimate is:

$$\left\langle \boldsymbol{y}, Q_{\text{qjl}}^{-1}(Q_{\text{qjl}}(\boldsymbol{x})) \right\rangle = \frac{1}{d} \sum_{i \in [d]} \sqrt{\pi/2} \cdot \boldsymbol{s}_i^\top \boldsymbol{y} \cdot \text{sign}(\boldsymbol{s}_i^\top \boldsymbol{x})$$

Since $\boldsymbol{s}_i$ has i.i.d. $\mathcal{N}(0,1)$ entries, the pair $(\boldsymbol{s}_i^\top \boldsymbol{y}, \boldsymbol{s}_i^\top \boldsymbol{x})$ is jointly Gaussian with mean zero. By a standard Gaussian identity, for zero-mean jointly Gaussian $(U, V)$ with $\text{Var}(V) = 1$:

$$\mathbb{E}[U \cdot \text{sign}(V)] = \sqrt{2/\pi} \cdot \text{Cov}(U, V)$$

Why? Write $U = \frac{\text{Cov}(U,V)}{\text{Var}(V)} V + W$ where $W \perp V$. Then $\mathbb{E}[U \cdot \text{sign}(V)] = \text{Cov}(U,V) \cdot \mathbb{E}[V \cdot \text{sign}(V)] = \text{Cov}(U,V) \cdot \mathbb{E}[|V|] = \text{Cov}(U,V) \cdot \sqrt{2/\pi}$.

Therefore:

$$\mathbb{E}\left[\boldsymbol{s}_i^\top \boldsymbol{y} \cdot \text{sign}(\boldsymbol{s}_i^\top \boldsymbol{x})\right] = \sqrt{2/\pi} \cdot \langle \boldsymbol{y}, \boldsymbol{x} \rangle$$

Plugging back in with the $\sqrt{\pi/2}$ scaling:

$$\mathbb{E}\left[\left\langle \boldsymbol{y}, Q_{\text{qjl}}^{-1}(Q_{\text{qjl}}(\boldsymbol{x})) \right\rangle\right] = \frac{1}{d} \sum_{i=1}^d \sqrt{\pi/2} \cdot \sqrt{2/\pi} \cdot \langle \boldsymbol{y}, \boldsymbol{x} \rangle = \langle \boldsymbol{y}, \boldsymbol{x} \rangle \qquad \blacksquare$$
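
The Gaussian identity at the heart of this proof is also easy to verify by simulation. In this sketch, $\rho$ is an arbitrary covariance chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# E[U * sign(V)] = sqrt(2/pi) * Cov(U, V) for zero-mean jointly Gaussian
# (U, V) with Var(V) = 1. Build U = rho * V + W with W independent of V.
rho = 0.3                                  # Cov(U, V), since Var(V) = 1
V = rng.standard_normal(1_000_000)
W = rng.standard_normal(1_000_000)
U = rho * V + W
print(np.mean(U * np.sign(V)))             # empirical
print(np.sqrt(2 / np.pi) * rho)            # predicted: 0.23936...
```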

Theorem 2 (Performance Guarantee)

For any bit-width $b \geq 1$ and any $\boldsymbol{x} \in \mathbb{S}^{d-1}$, $\boldsymbol{y} \in \mathbb{R}^d$:

Unbiasedness

$$\mathbb{E}_{\tilde{\boldsymbol{x}}}\left[\langle \boldsymbol{y}, \tilde{\boldsymbol{x}} \rangle\right] = \langle \boldsymbol{y}, \boldsymbol{x} \rangle$$

Inner-Product Distortion Bound

$$D_{\text{prod}} := \mathbb{E}_{\tilde{\boldsymbol{x}}}\left[\left|\langle \boldsymbol{y}, \boldsymbol{x} \rangle - \langle \boldsymbol{y}, \tilde{\boldsymbol{x}} \rangle\right|^2\right] \leq \frac{\sqrt{3}\pi^2 \cdot \|\boldsymbol{y}\|_2^2}{d} \cdot \frac{1}{4^b}$$

For small bit-widths:

| $b$ | $D_{\text{prod}}$ ($\times\ \|\boldsymbol{y}\|_2^2 / d$) |
| --- | --- |
| 1 | 1.57 |
| 2 | 0.56 |
| 3 | 0.18 |
| 4 | 0.047 |

Full Proof of Theorem 2

Part A: Unbiasedness

Step 1: Since $\tilde{\boldsymbol{x}} = \tilde{\boldsymbol{x}}_{\text{mse}} + \tilde{\boldsymbol{x}}_{\text{qjl}}$, condition on $\tilde{\boldsymbol{x}}_{\text{mse}}$:

$$\mathbb{E}\left[\langle \boldsymbol{y}, \tilde{\boldsymbol{x}} \rangle \mid \tilde{\boldsymbol{x}}_{\text{mse}}\right] = \langle \boldsymbol{y}, \tilde{\boldsymbol{x}}_{\text{mse}} \rangle + \mathbb{E}\left[\langle \boldsymbol{y}, \tilde{\boldsymbol{x}}_{\text{qjl}} \rangle \mid \tilde{\boldsymbol{x}}_{\text{mse}}\right]$$

Step 2: QJL is unbiased, so conditioned on $\tilde{\boldsymbol{x}}_{\text{mse}}$ (which fixes the residual $\boldsymbol{r}$):

$$\mathbb{E}\left[\langle \boldsymbol{y}, \tilde{\boldsymbol{x}}_{\text{qjl}} \rangle \mid \tilde{\boldsymbol{x}}_{\text{mse}}\right] = \langle \boldsymbol{y}, \boldsymbol{r} \rangle$$

Step 3: Since $\boldsymbol{r} = \boldsymbol{x} - \tilde{\boldsymbol{x}}_{\text{mse}}$:

$$\mathbb{E}\left[\langle \boldsymbol{y}, \tilde{\boldsymbol{x}} \rangle \mid \tilde{\boldsymbol{x}}_{\text{mse}}\right] = \langle \boldsymbol{y}, \tilde{\boldsymbol{x}}_{\text{mse}} \rangle + \langle \boldsymbol{y}, \boldsymbol{x} - \tilde{\boldsymbol{x}}_{\text{mse}} \rangle = \langle \boldsymbol{y}, \boldsymbol{x} \rangle$$

By the law of total expectation: $\mathbb{E}[\langle \boldsymbol{y}, \tilde{\boldsymbol{x}} \rangle] = \langle \boldsymbol{y}, \boldsymbol{x} \rangle$ $\blacksquare$

Part B: Distortion Bound

Step 1: Since $\langle \boldsymbol{y}, \boldsymbol{x} \rangle - \langle \boldsymbol{y}, \tilde{\boldsymbol{x}} \rangle = \langle \boldsymbol{y}, \boldsymbol{r} \rangle - \langle \boldsymbol{y}, \tilde{\boldsymbol{x}}_{\text{qjl}} \rangle$ and the QJL estimate is conditionally unbiased for $\langle \boldsymbol{y}, \boldsymbol{r} \rangle$, the conditional distortion equals the QJL variance:

$$\mathbb{E}\left[\left|\langle \boldsymbol{y}, \boldsymbol{r} \rangle - \langle \boldsymbol{y}, \tilde{\boldsymbol{x}}_{\text{qjl}} \rangle\right|^2 \mid \tilde{\boldsymbol{x}}_{\text{mse}}\right] = \text{Var}\left(\langle \boldsymbol{y}, \tilde{\boldsymbol{x}}_{\text{qjl}} \rangle \mid \tilde{\boldsymbol{x}}_{\text{mse}}\right)$$

Step 2: QJL variance bound with rescaling ($\|\boldsymbol{r}\|_2$ is fixed given $\tilde{\boldsymbol{x}}_{\text{mse}}$):

$$\text{Var}\left(\langle \boldsymbol{y}, \tilde{\boldsymbol{x}}_{\text{qjl}} \rangle \mid \tilde{\boldsymbol{x}}_{\text{mse}}\right) \leq \frac{\pi}{2d} \cdot \|\boldsymbol{r}\|_2^2 \cdot \|\boldsymbol{y}\|_2^2$$

Step 3: Take the expectation over $\tilde{\boldsymbol{x}}_{\text{mse}}$, noting $\mathbb{E}[\|\boldsymbol{r}\|_2^2] = D_{\text{mse}}(b-1)$:

$$D_{\text{prod}} \leq \frac{\pi}{2d} \cdot \|\boldsymbol{y}\|_2^2 \cdot D_{\text{mse}}(b-1)$$

Step 4: By Theorem 1, $D_{\text{mse}}(b-1) \leq \frac{\sqrt{3}\pi}{2} \cdot \frac{4}{4^b}$:

$$D_{\text{prod}} \leq \frac{\pi}{2d} \cdot \|\boldsymbol{y}\|_2^2 \cdot \frac{2\sqrt{3}\pi}{4^b} = \frac{\sqrt{3}\pi^2 \cdot \|\boldsymbol{y}\|_2^2}{d} \cdot \frac{1}{4^b} \qquad \blacksquare$$

Comparing with the Lower Bound

The information-theoretic lower bound on inner-product distortion is:

$$D_{\text{prod}} \geq \frac{\|\boldsymbol{y}\|_2^2}{d} \cdot \frac{1}{4^b}$$

TurboQuant-Prod is within a factor of $\sqrt{3}\pi^2 \approx 17.1$ of this bound. The gap is larger than for MSE because the bound chains two levels of approximation (the MSE bound plus the QJL variance bound).
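
For concreteness, the constant gap between the two bounds, which is independent of $b$ and $d$ since both scale as $(\|\boldsymbol{y}\|_2^2/d) \cdot 4^{-b}$:

```python
import numpy as np

# Upper bound / lower bound = sqrt(3) * pi^2 for every b and d.
print(np.sqrt(3) * np.pi ** 2)   # ≈ 17.09
```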
