
TurboQuant-Prod is a two-stage quantizer optimized for unbiased inner product estimation. Stage 1 applies TurboQuant-MSE with $b-1$ bits. Stage 2 applies the QJL transform (1 bit) to the residual. The result is an unbiased inner product estimator with near-optimal distortion.

Why a Two-Stage Approach?

The MSE-optimal quantizer introduces a multiplicative bias of $2/\pi \approx 0.637$ in inner product estimates at $b = 1$. This bias diminishes with increasing $b$ but is problematic for attention-based models, where inner products must be accurately preserved.

The key insight: use $b-1$ bits for MSE-optimal quantization (reducing the reconstruction error), then use the remaining 1 bit to apply QJL to the residual. QJL is unbiased by construction, so the combined estimator is also unbiased.

MSE Bias vs Prod Unbiasedness

This visualization makes the bias tangible. For fixed vectors $\boldsymbol{x}$ and $\boldsymbol{y}$, we repeatedly quantize $\boldsymbol{x}$ and compute $\langle \boldsymbol{y}, \tilde{\boldsymbol{x}} \rangle$. The MSE histogram (purple) is centered away from the true value; that's the bias. The Prod histogram (green) is centered on the true value.

[Interactive visualization: histograms of MSE vs. Prod inner-product estimates]

Algorithm 2: TurboQuant-Prod

Setup (one-time, shared)

Input: dimension $d$, bit-width $b$

  1. Instantiate a TurboQuant-MSE instance with bit-width $b - 1$
  2. Generate a random projection matrix $\boldsymbol{S} \in \mathbb{R}^{d \times d}$ with i.i.d. entries $\boldsymbol{S}_{i,j} \sim \mathcal{N}(0, 1)$

Quantization: QUANT($\boldsymbol{x}$)

Input: $\boldsymbol{x} \in \mathbb{S}^{d-1}$

  1. Apply MSE quantization with $b-1$ bits: $\text{idx} \leftarrow \text{QUANT}_{\text{mse}}(\boldsymbol{x})$

  2. Compute the residual vector: $\boldsymbol{r} \leftarrow \boldsymbol{x} - \text{DEQUANT}_{\text{mse}}(\text{idx})$

  3. Apply QJL to the residual: $\text{qjl} \leftarrow \text{sign}(\boldsymbol{S} \cdot \boldsymbol{r})$

  4. Output: $(\text{idx},\ \text{qjl},\ \|\boldsymbol{r}\|_2)$

Storage: $(b-1) \cdot d$ bits for idx + $d$ bits for qjl + one scalar for $\|\boldsymbol{r}\|_2$ = $b \cdot d$ bits + $O(1)$. For example, at $d = 128$ and $b = 4$: 384 bits for idx plus 128 bits for qjl, i.e. 512 quantized bits, plus a single float for the residual norm.

Dequantization: DEQUANT($\text{idx}, \text{qjl}, \gamma$)

  1. Reconstruct the MSE estimate: $\tilde{\boldsymbol{x}}_{\text{mse}} \leftarrow \text{DEQUANT}_{\text{mse}}(\text{idx})$

  2. Reconstruct the QJL estimate of the residual: $\tilde{\boldsymbol{x}}_{\text{qjl}} \leftarrow \frac{\sqrt{\pi/2}}{d} \cdot \gamma \cdot \boldsymbol{S}^\top \cdot \text{qjl}$

  3. Output: $\tilde{\boldsymbol{x}} = \tilde{\boldsymbol{x}}_{\text{mse}} + \tilde{\boldsymbol{x}}_{\text{qjl}}$
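
To make the two-stage structure concrete, here is a minimal NumPy sketch of the QUANT/DEQUANT pair above. One loud assumption: `quant_mse`/`dequant_mse` stand in for the real TurboQuant-MSE stage (not described here) with a crude 1-bit sign quantizer. The unbiasedness of the combined estimator does not depend on which deterministic stage-1 quantizer is plugged in, which is what the empirical check at the end illustrates.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Stage-1 stand-in (NOT the real TurboQuant-MSE): a crude 1-bit
# sign quantizer per coordinate. Any deterministic stage-1 quantizer
# leaves the combined estimator unbiased; stage 2 handles its residual.
def quant_mse(x):
    return np.sign(x)

def dequant_mse(idx):
    return idx / np.sqrt(idx.size)           # unit-norm reconstruction

# --- TurboQuant-Prod QUANT: stage-1 code, then QJL on the residual.
def quant_prod(x, S):
    idx = quant_mse(x)
    r = x - dequant_mse(idx)                 # residual vector
    return idx, np.sign(S @ r), np.linalg.norm(r)

# --- TurboQuant-Prod DEQUANT: MSE estimate plus rescaled QJL estimate.
def dequant_prod(idx, qjl, gamma, S):
    d = S.shape[0]
    x_qjl = np.sqrt(np.pi / 2) / d * gamma * (S.T @ qjl)
    return dequant_mse(idx) + x_qjl

# Empirical unbiasedness check; the expectation is over the random S.
d = 32
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                       # x on the unit sphere
y = rng.standard_normal(d)

ests = []
for _ in range(20_000):
    S = rng.standard_normal((d, d))          # fresh projection each trial
    ests.append(y @ dequant_prod(*quant_prod(x, S), S))
print(y @ x, np.mean(ests))                  # agree up to Monte Carlo error
```

Because stage 2 corrects stage 1's residual in expectation, swapping in the true TurboQuant-MSE stage only shrinks the variance; the mean is unchanged.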

Two-Stage Pipeline

Step through the full Prod pipeline: MSE quantize with $b-1$ bits, compute the residual, apply QJL to the residual, then combine. Compare inner product estimates at each stage.

[Interactive visualization: the two-stage Prod pipeline, step by step]

Quantized Johnson-Lindenstrauss (QJL)

QJL is a 1-bit quantization scheme based on random projections. For $\boldsymbol{x} \in \mathbb{R}^d$:

$$Q_{\text{qjl}}(\boldsymbol{x}) := \text{sign}(\boldsymbol{S} \cdot \boldsymbol{x}) \in \{-1, +1\}^d$$

The dequantization map:

$$Q_{\text{qjl}}^{-1}(\boldsymbol{z}) := \frac{\sqrt{\pi/2}}{d} \cdot \boldsymbol{S}^\top \cdot \boldsymbol{z}$$

The scaling factor $\sqrt{\pi/2}/d$ comes from $\mathbb{E}[|Z|] = \sqrt{2/\pi}$ for $Z \sim \mathcal{N}(0,1)$.
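
That constant is easy to sanity-check numerically (a throwaway simulation, not part of the algorithm):

```python
import numpy as np

# E[|Z|] for Z ~ N(0,1) equals sqrt(2/pi) ≈ 0.7979.
Z = np.random.default_rng(0).standard_normal(1_000_000)
print(np.abs(Z).mean(), np.sqrt(2 / np.pi))
```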

Key property: for any $\boldsymbol{x} \in \mathbb{S}^{d-1}$ and any $\boldsymbol{y} \in \mathbb{R}^d$:

$$\mathbb{E}\left[\left\langle \boldsymbol{y}, Q_{\text{qjl}}^{-1}(Q_{\text{qjl}}(\boldsymbol{x})) \right\rangle\right] = \langle \boldsymbol{y}, \boldsymbol{x} \rangle$$

with variance bounded by $\frac{\pi}{2d} \cdot \|\boldsymbol{y}\|_2^2$.
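
Both properties are easy to probe numerically. A minimal sketch of $Q_{\text{qjl}}$ and $Q_{\text{qjl}}^{-1}$ as defined above (the function names are mine, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def qjl_quant(x, S):
    return np.sign(S @ x)                          # 1 bit per coordinate

def qjl_dequant(z, S):
    return np.sqrt(np.pi / 2) / S.shape[0] * (S.T @ z)

d = 32
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                             # x on the unit sphere
y = rng.standard_normal(d)

ests = []
for _ in range(50_000):
    S = rng.standard_normal((d, d))                # i.i.d. N(0,1) entries
    ests.append(y @ qjl_dequant(qjl_quant(x, S), S))
ests = np.array(ests)

print("true:", y @ x, " mean:", ests.mean())       # matches: unbiased
print("var :", ests.var(), " bound:", np.pi / (2 * d) * (y @ y))
```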

QJL Unbiasedness Proof

Let $\boldsymbol{s}_1, \ldots, \boldsymbol{s}_d$ be the rows of $\boldsymbol{S}$. The inner product estimate is:

$$\left\langle \boldsymbol{y}, Q_{\text{qjl}}^{-1}(Q_{\text{qjl}}(\boldsymbol{x})) \right\rangle = \frac{1}{d} \sum_{i \in [d]} \sqrt{\pi/2} \cdot \boldsymbol{s}_i^\top \boldsymbol{y} \cdot \text{sign}(\boldsymbol{s}_i^\top \boldsymbol{x})$$

Since $\boldsymbol{s}_i$ has i.i.d. $\mathcal{N}(0,1)$ entries, the pair $(\boldsymbol{s}_i^\top \boldsymbol{y}, \boldsymbol{s}_i^\top \boldsymbol{x})$ is jointly Gaussian with mean zero. By a standard Gaussian identity, for zero-mean jointly Gaussian $(U, V)$ with $\text{Var}(V) = 1$:

$$\mathbb{E}[U \cdot \text{sign}(V)] = \sqrt{2/\pi} \cdot \text{Cov}(U, V)$$

Why? Write $U = \frac{\text{Cov}(U,V)}{\text{Var}(V)} V + W$ where $W \perp V$. Then $\mathbb{E}[U \cdot \text{sign}(V)] = \text{Cov}(U,V) \cdot \mathbb{E}[V \cdot \text{sign}(V)] = \text{Cov}(U,V) \cdot \mathbb{E}[|V|] = \text{Cov}(U,V) \cdot \sqrt{2/\pi}$.

Therefore:

$$\mathbb{E}\left[\boldsymbol{s}_i^\top \boldsymbol{y} \cdot \text{sign}(\boldsymbol{s}_i^\top \boldsymbol{x})\right] = \sqrt{2/\pi} \cdot \langle \boldsymbol{y}, \boldsymbol{x} \rangle$$

Plugging back in with the $\sqrt{\pi/2}$ scaling:

$$\mathbb{E}\left[\left\langle \boldsymbol{y}, Q_{\text{qjl}}^{-1}(Q_{\text{qjl}}(\boldsymbol{x})) \right\rangle\right] = \frac{1}{d} \sum_{i=1}^d \sqrt{\pi/2} \cdot \sqrt{2/\pi} \cdot \langle \boldsymbol{y}, \boldsymbol{x} \rangle = \langle \boldsymbol{y}, \boldsymbol{x} \rangle \qquad \blacksquare$$
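
The Gaussian identity at the heart of this proof is also easy to verify by simulation. In this sketch, $\rho$ is an arbitrary covariance chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# E[U * sign(V)] = sqrt(2/pi) * Cov(U, V) for zero-mean jointly Gaussian
# (U, V) with Var(V) = 1. Build U = rho * V + W with W independent of V.
rho = 0.3                                  # Cov(U, V), since Var(V) = 1
V = rng.standard_normal(1_000_000)
W = rng.standard_normal(1_000_000)
U = rho * V + W
print(np.mean(U * np.sign(V)))             # empirical
print(np.sqrt(2 / np.pi) * rho)            # predicted: 0.23936...
```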

Theorem 2 (Performance Guarantee)

For any bit-width $b \geq 1$ and any $\boldsymbol{x} \in \mathbb{S}^{d-1}$, $\boldsymbol{y} \in \mathbb{R}^d$:

Unbiasedness

$$\mathbb{E}_{\tilde{\boldsymbol{x}}}\left[\langle \boldsymbol{y}, \tilde{\boldsymbol{x}} \rangle\right] = \langle \boldsymbol{y}, \boldsymbol{x} \rangle$$

Inner-Product Distortion Bound

$$D_{\text{prod}} := \mathbb{E}_{\tilde{\boldsymbol{x}}}\left[\left|\langle \boldsymbol{y}, \boldsymbol{x} \rangle - \langle \boldsymbol{y}, \tilde{\boldsymbol{x}} \rangle\right|^2\right] \leq \frac{\sqrt{3}\pi^2 \cdot \|\boldsymbol{y}\|_2^2}{d} \cdot \frac{1}{4^b}$$

For small bit-widths:

| $b$ | $D_{\text{prod}}$ ($\times\ \|\boldsymbol{y}\|_2^2 / d$) |
| --- | --- |
| 1 | 1.57 |
| 2 | 0.56 |
| 3 | 0.18 |
| 4 | 0.047 |

Full Proof of Theorem 2

Part A: Unbiasedness

Step 1: Since $\tilde{\boldsymbol{x}} = \tilde{\boldsymbol{x}}_{\text{mse}} + \tilde{\boldsymbol{x}}_{\text{qjl}}$, condition on $\tilde{\boldsymbol{x}}_{\text{mse}}$:

$$\mathbb{E}\left[\langle \boldsymbol{y}, \tilde{\boldsymbol{x}} \rangle \mid \tilde{\boldsymbol{x}}_{\text{mse}}\right] = \langle \boldsymbol{y}, \tilde{\boldsymbol{x}}_{\text{mse}} \rangle + \mathbb{E}\left[\langle \boldsymbol{y}, \tilde{\boldsymbol{x}}_{\text{qjl}} \rangle \mid \tilde{\boldsymbol{x}}_{\text{mse}}\right]$$

Step 2: QJL is unbiased, so conditioned on $\tilde{\boldsymbol{x}}_{\text{mse}}$ (which fixes the residual $\boldsymbol{r}$):

$$\mathbb{E}\left[\langle \boldsymbol{y}, \tilde{\boldsymbol{x}}_{\text{qjl}} \rangle \mid \tilde{\boldsymbol{x}}_{\text{mse}}\right] = \langle \boldsymbol{y}, \boldsymbol{r} \rangle$$

Step 3: Since $\boldsymbol{r} = \boldsymbol{x} - \tilde{\boldsymbol{x}}_{\text{mse}}$:

$$\mathbb{E}\left[\langle \boldsymbol{y}, \tilde{\boldsymbol{x}} \rangle \mid \tilde{\boldsymbol{x}}_{\text{mse}}\right] = \langle \boldsymbol{y}, \tilde{\boldsymbol{x}}_{\text{mse}} \rangle + \langle \boldsymbol{y}, \boldsymbol{x} - \tilde{\boldsymbol{x}}_{\text{mse}} \rangle = \langle \boldsymbol{y}, \boldsymbol{x} \rangle$$

By the law of total expectation: $\mathbb{E}[\langle \boldsymbol{y}, \tilde{\boldsymbol{x}} \rangle] = \langle \boldsymbol{y}, \boldsymbol{x} \rangle$ $\blacksquare$

Part B: Distortion Bound

Step 1: Since $\langle \boldsymbol{y}, \boldsymbol{x} \rangle - \langle \boldsymbol{y}, \tilde{\boldsymbol{x}} \rangle = \langle \boldsymbol{y}, \boldsymbol{r} \rangle - \langle \boldsymbol{y}, \tilde{\boldsymbol{x}}_{\text{qjl}} \rangle$ and the QJL estimate is conditionally unbiased for $\langle \boldsymbol{y}, \boldsymbol{r} \rangle$, the conditional distortion equals the QJL variance:

$$\mathbb{E}\left[\left|\langle \boldsymbol{y}, \boldsymbol{r} \rangle - \langle \boldsymbol{y}, \tilde{\boldsymbol{x}}_{\text{qjl}} \rangle\right|^2 \mid \tilde{\boldsymbol{x}}_{\text{mse}}\right] = \text{Var}\left(\langle \boldsymbol{y}, \tilde{\boldsymbol{x}}_{\text{qjl}} \rangle \mid \tilde{\boldsymbol{x}}_{\text{mse}}\right)$$

Step 2: QJL variance bound with rescaling ($\|\boldsymbol{r}\|_2$ is fixed given $\tilde{\boldsymbol{x}}_{\text{mse}}$):

$$\text{Var}\left(\langle \boldsymbol{y}, \tilde{\boldsymbol{x}}_{\text{qjl}} \rangle \mid \tilde{\boldsymbol{x}}_{\text{mse}}\right) \leq \frac{\pi}{2d} \cdot \|\boldsymbol{r}\|_2^2 \cdot \|\boldsymbol{y}\|_2^2$$

Step 3: Take the expectation over $\tilde{\boldsymbol{x}}_{\text{mse}}$, noting $\mathbb{E}[\|\boldsymbol{r}\|_2^2] = D_{\text{mse}}(b-1)$:

$$D_{\text{prod}} \leq \frac{\pi}{2d} \cdot \|\boldsymbol{y}\|_2^2 \cdot D_{\text{mse}}(b-1)$$

Step 4: By Theorem 1, $D_{\text{mse}}(b-1) \leq \frac{\sqrt{3}\pi}{2} \cdot \frac{4}{4^b}$:

$$D_{\text{prod}} \leq \frac{\pi}{2d} \cdot \|\boldsymbol{y}\|_2^2 \cdot \frac{2\sqrt{3}\pi}{4^b} = \frac{\sqrt{3}\pi^2 \cdot \|\boldsymbol{y}\|_2^2}{d} \cdot \frac{1}{4^b} \qquad \blacksquare$$

Comparing with the Lower Bound

The information-theoretic lower bound on inner-product distortion is:

$$D_{\text{prod}} \geq \frac{\|\boldsymbol{y}\|_2^2}{d} \cdot \frac{1}{4^b}$$

TurboQuant-Prod is within a factor of $\sqrt{3}\pi^2 \approx 17.1$ of this bound. The gap is larger than for MSE because the bound chains two levels of approximation (the MSE bound plus the QJL variance bound).
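
For concreteness, the constant gap between the two bounds, which is independent of $b$ and $d$ since both scale as $(\|\boldsymbol{y}\|_2^2/d) \cdot 4^{-b}$:

```python
import numpy as np

# Upper bound / lower bound = sqrt(3) * pi^2 for every b and d.
print(np.sqrt(3) * np.pi ** 2)   # ≈ 17.09
```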
