Skip to Content

TurboQuant

Work in progress — this series is actively being written.

TurboQuant  introduced two online algorithms to quantize vectors while achieving optiomal distortion rates (in both MSE and inner product) acorss all bit widths and dimensions.

The algorithm technique is

  1. Multiply every input vector with a shared random rotation matrix Π\boldsymbol{\Pi}.
  2. After rotation, each coordinate independently follows a known beta-distribution.
  3. Since coordinates are independent and follow a know distribution, design an optimal scalar quantizer for that distribution.

What is a Beta Distribution?!

Loading visualization...

Coordinate distribution

The key insight: for a vector uniformly distributed on Sd1\mathbb{S}^{d-1}, each coordinate independently follows Beta ⁣(d12,d12)\text{Beta}\!\left(\frac{d-1}{2}, \frac{d-1}{2}\right) mapped to [1,1][-1, 1]. As dd grows, this converges to N(0,1/d)\mathcal{N}(0, 1/d).

Simulation


Generate uniformly distributed vectors on a unit sphere and plot the coordinates. Click Stream to watch empirical coordinate samples converge to the theoretical distribution. Drag the dimension slider to see how the distribution concentrates in high dimensions.


Loading visualization...

Formal Proof

Proof

Step 1: The Slicing Argument

Consider the unit sphere Sd1={xRd:x=1}\mathbb{S}^{d-1} = \{\boldsymbol{x} \in \mathbb{R}^d : \|\boldsymbol{x}\| = 1\}. Fix a coordinate index jj (by symmetry, the choice of jj doesn’t matter — let’s take j=1j = 1 for concreteness).

Slice the sphere at height x1=xx_1 = x for some fixed x(1,1)x \in (-1, 1). The slice is the set of all points on Sd1\mathbb{S}^{d-1} whose first coordinate equals xx:

Slice(x)={(x,x2,,xd):x2+x22++xd2=1}\text{Slice}(x) = \{(x, x_2, \ldots, x_d) : x^2 + x_2^2 + \cdots + x_d^2 = 1\}

The constraint x22++xd2=1x2x_2^2 + \cdots + x_d^2 = 1 - x^2 means the remaining d1d-1 coordinates lie on a sphere of radius r(x)=1x2r(x) = \sqrt{1 - x^2} in Rd1\mathbb{R}^{d-1}. So:

Slice(x)Sd2(1x2)(a (d2)-sphere of radius 1x2)\text{Slice}(x) \cong \mathbb{S}^{d-2}\left(\sqrt{1-x^2}\right) \quad \text{(a $(d-2)$-sphere of radius $\sqrt{1-x^2}$)}
Step 2: Surface Area of a Sphere

We need the surface area of Sm1(r)\mathbb{S}^{m-1}(r), the (m1)(m-1)-dimensional sphere of radius rr in Rm\mathbb{R}^m. The formula is:

Am(r)=2πm/2Γ(m/2)rm1A_m(r) = \frac{2\pi^{m/2}}{\Gamma(m/2)} \cdot r^{m-1}
Step 3: From Surface Area to Density

Now we derive fX(x)f_X(x). The key idea: “uniform on Sd1\mathbb{S}^{d-1}” means probability is proportional to surface area. So the probability that x1[x,x+dx]x_1 \in [x, x + dx] is proportional to the surface area of the thin strip of Sd1\mathbb{S}^{d-1} between heights x1=xx_1 = x and x1=x+dxx_1 = x + dx.

The strip is NOT a flat disk. This is the subtle point. The sphere surface between x1=xx_1 = x and x1=x+dxx_1 = x + dx is a thin band on a curved surface. We need to account for the tilt of the surface relative to the horizontal.

Computing the strip width. Consider a point on Sd1\mathbb{S}^{d-1} at height x1=xx_1 = x. As we move along the sphere surface to increase x1x_1 by dxdx, we travel a distance dsds along the sphere. These are related by the geometry of the sphere:

At height x1=xx_1 = x, the sphere surface makes an angle with the horizontal. The relationship between the infinitesimal height change dxdx and the arc length dsds along the sphere is:

ds=dx1x2ds = \frac{dx}{\sqrt{1 - x^2}}

Why? Parameterize the sphere locally. A point on Sd1\mathbb{S}^{d-1} at height x1=xx_1 = x satisfies x12+x2:d2=1x_1^2 + \|\boldsymbol{x}_{2:d}\|^2 = 1. If we increase x1x_1 by dxdx while staying on the sphere, the transverse radius changes by dr=xdx1x2dr = \frac{-x\,dx}{\sqrt{1-x^2}}, and the total displacement along the sphere surface has length:

ds=dx2+dr2=dx1+x21x2=dx1x2ds = \sqrt{dx^2 + dr^2} = dx\sqrt{1 + \frac{x^2}{1-x^2}} = \frac{dx}{\sqrt{1-x^2}}

(We used 1+x21x2=11x21 + \frac{x^2}{1-x^2} = \frac{1}{1-x^2}.)

Area of the strip. The strip at height xx has:

  • “Circumference”: the (d2)(d-2)-dimensional area of the cross-section, which is Ad1(1x2)A_{d-1}(\sqrt{1-x^2})
  • Width (along the sphere surface): ds=dx1x2ds = \frac{dx}{\sqrt{1-x^2}}

So the surface area of the strip is:

d(area)=Ad1 ⁣(1x2)dx1x2d(\text{area}) = A_{d-1}\!\left(\sqrt{1-x^2}\right) \cdot \frac{dx}{\sqrt{1-x^2}}

The density is this strip area divided by the total surface area of Sd1\mathbb{S}^{d-1}:

fX(x)dx=Ad1 ⁣(1x2)dx1x2Ad(1)f_X(x)\,dx = \frac{A_{d-1}\!\left(\sqrt{1-x^2}\right) \cdot \frac{dx}{\sqrt{1-x^2}}}{A_d(1)}
Step 4: Substituting the Area Formula

Plug in Am(r)=2πm/2Γ(m/2)rm1A_m(r) = \frac{2\pi^{m/2}}{\Gamma(m/2)} \cdot r^{m-1}:

Numerator (cross-section area ×\times Jacobian):

Ad1 ⁣(1x2)11x2=2π(d1)/2Γ((d1)/2)(1x2)d211x2A_{d-1}\!\left(\sqrt{1-x^2}\right) \cdot \frac{1}{\sqrt{1-x^2}} = \frac{2\pi^{(d-1)/2}}{\Gamma((d-1)/2)} \cdot \left(\sqrt{1-x^2}\right)^{d-2} \cdot \frac{1}{\sqrt{1-x^2}}=2π(d1)/2Γ((d1)/2)(1x2)(d2)/2(1x2)1/2= \frac{2\pi^{(d-1)/2}}{\Gamma((d-1)/2)} \cdot (1-x^2)^{(d-2)/2} \cdot (1-x^2)^{-1/2}=2π(d1)/2Γ((d1)/2)(1x2)(d3)/2= \frac{2\pi^{(d-1)/2}}{\Gamma((d-1)/2)} \cdot (1-x^2)^{(d-3)/2}

(We combined the exponents: d2212=d32\frac{d-2}{2} - \frac{1}{2} = \frac{d-3}{2}.)

Denominator (total sphere area):

Ad(1)=2πd/2Γ(d/2)A_d(1) = \frac{2\pi^{d/2}}{\Gamma(d/2)}
Step 5: Simplification
fX(x)=2π(d1)/2Γ((d1)/2)(1x2)(d3)/22πd/2Γ(d/2)f_X(x) = \frac{\frac{2\pi^{(d-1)/2}}{\Gamma((d-1)/2)} \cdot (1-x^2)^{(d-3)/2}}{\frac{2\pi^{d/2}}{\Gamma(d/2)}}

The 22‘s cancel. For the π\pi terms:

π(d1)/2πd/2=π(d1)/2d/2=π1/2=1π\frac{\pi^{(d-1)/2}}{\pi^{d/2}} = \pi^{(d-1)/2 - d/2} = \pi^{-1/2} = \frac{1}{\sqrt{\pi}}

For the Γ\Gamma terms, the denominator’s Γ(d/2)\Gamma(d/2) moves to the numerator and vice versa:

fX(x)=Γ(d/2)πΓ((d1)/2)(1x2)(d3)/2f_X(x) = \frac{\Gamma(d/2)}{\sqrt{\pi} \cdot \Gamma((d-1)/2)} \left(1-x^2\right)^{(d-3)/2} \qquad \blacksquare
Connection to the Standard Beta Distribution

The standard Beta distribution Beta(α,β)\text{Beta}(\alpha, \beta) has density on [0,1][0, 1]:

g(t;α,β)=Γ(α+β)Γ(α)Γ(β)tα1(1t)β1,t[0,1]g(t; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \, t^{\alpha - 1}(1 - t)^{\beta - 1}, \quad t \in [0, 1]

with mean αα+β\frac{\alpha}{\alpha+\beta} and variance αβ(α+β)2(α+β+1)\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}.

Mapping from fXf_X to Beta\text{Beta}. The density fX(x)(1x2)(d3)/2f_X(x) \propto (1 - x^2)^{(d-3)/2} lives on [1,1][-1, 1]. To see the Beta connection, substitute t=x+12t = \frac{x + 1}{2} (mapping [1,1][0,1][-1, 1] \to [0, 1]):

1x2=1(2t1)2=4t(1t)1 - x^2 = 1 - (2t - 1)^2 = 4t(1 - t)(1x2)(d3)/2=4(d3)/2[t(1t)](d3)/2\Rightarrow (1 - x^2)^{(d-3)/2} = 4^{(d-3)/2}\, [t(1-t)]^{(d-3)/2}

Absorbing the constant 4(d3)/224^{(d-3)/2} \cdot 2 (Jacobian dx=2dtdx = 2\,dt) into the normalization:

fT(t)t(d3)/2(1t)(d3)/2=tα1(1t)β1f_T(t) \propto t^{(d-3)/2}(1 - t)^{(d-3)/2} = t^{\alpha - 1}(1 - t)^{\beta - 1}

with α=β=d12\alpha = \beta = \frac{d - 1}{2}. Therefore:

xj+12Beta ⁣(d12,d12)\boxed{\frac{\boldsymbol{x}_j + 1}{2} \sim \text{Beta}\!\left(\frac{d-1}{2},\, \frac{d-1}{2}\right)}

This is a symmetric Beta distribution (since α=β\alpha = \beta), centered at t=1/2t = 1/2 (i.e., x=0x = 0).

[!note] Why “Beta” matters The symmetric Beta form tells us that fXf_X is unimodal, sub-Gaussian, and supported on a compact interval. These properties guarantee that the Lloyd-Max quantizer converges quickly, the codebook can be precomputed exactly, and the Panter-Dite formula applies with well-controlled error.

Last updated on