RotorQuant and IsoQuant: Fixing Turbo Quant's Prefill Bottleneck with Clifford Algebra

Google’s Turbo Quant can compress KV cache memory by almost five times while preserving context accuracy. But as previous benchmarks showed, that compression comes with a hidden cost as context size grows: prompt processing slows down dramatically, and token generation slows down too. The open-source community is already working on a fix that targets Turbo Quant’s biggest weakness, the added prefill latency. It’s called RotorQuant.

The Prefill Problem in Turbo Quant

To understand the fix, you need to understand the cost. Turbo Quant’s trick is elegant: it takes a token vector (128 dimensions) and multiplies it by a special static rotation matrix that acts like a smoothing blender. The spikes in value (the “skiing boots”, with values like +14 or -11) get their energy spread across all dimensions, while the subtle differences (the “socks”, with values like +0.1 or -0.2) become distinguishable at 4-bit. Standard Q4 quantization destroys these subtleties because they round to zero. Turbo Quant preserves them.
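The spreading effect can be seen in a small NumPy sketch. It uses a random orthogonal matrix and a plain uniform 4-bit quantizer as stand-ins, since the actual Turbo Quant matrix and codebook are not described here; all function names are made up for illustration. It only shows that the rotation flattens the spikes and lets the quantizer use more of its levels. It does not reproduce Turbo Quant's accuracy.

```python
import numpy as np

def random_rotation(d, seed=0):
    """Random orthogonal d x d matrix (Q factor of a Gaussian matrix)."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def quantize_levels(x, bits=4):
    """Symmetric uniform quantizer, one scale per vector; returns integer levels."""
    top = 2 ** (bits - 1) - 1            # 7 for 4-bit
    scale = np.max(np.abs(x)) / top
    return np.round(x / scale).astype(int)

def crest_factor(x):
    """Peak magnitude divided by RMS: large when a few spikes dominate."""
    return np.max(np.abs(x)) / np.sqrt(np.mean(x ** 2))

d = 128
v = np.random.default_rng(1).uniform(-0.2, 0.2, size=d)  # the "socks"
v[3], v[70] = 14.0, -11.0                                 # the "skiing boots"

rot = random_rotation(d)
v_rot = rot @ v                                           # 128 x 128 matrix-vector product

levels_direct = np.unique(quantize_levels(v)).size
levels_rotated = np.unique(quantize_levels(v_rot)).size

print("crest factor:", crest_factor(v), "->", crest_factor(v_rot))
print("distinct 4-bit levels used:", levels_direct, "->", levels_rotated)
```

Without the rotation, the two spikes set the scale and every small value lands on level 0, so only three distinct levels are used. After the rotation the energy is spread out and the vector occupies many more levels.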

But the cost is real. For d=128, that rotation is a 128×128 matrix-vector product: 128 × 128 = 16,384 multiply-add operations per vector. Multiply that by two (key and value), by context size, by KV head count, and by the number of model layers, and you’re talking billions of additional compute operations during prefill. This is exactly the overhead that makes Turbo Quant slower at short context on consumer hardware.
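To put a number on "billions", here is a back-of-the-envelope calculation. The model shape (36 layers, 2 KV heads) and the 8,192-token prompt are example values chosen for illustration, and the function name is hypothetical.

```python
def rotation_fmas(context_len, n_layers, n_kv_heads, head_dim=128):
    """Multiply-adds spent on dense rotations while prefilling the KV cache."""
    per_vector = head_dim * head_dim                     # dense d x d matrix-vector product
    vectors = 2 * context_len * n_kv_heads * n_layers    # keys and values
    return per_vector * vectors

# Example shape (assumed): 8,192-token prompt, 36 layers, 2 KV heads, d=128
total = rotation_fmas(context_len=8192, n_layers=36, n_kv_heads=2)
print(f"{total:,} multiply-adds")   # 19,327,352,832
```

That is roughly 19 billion extra multiply-adds for a single 8K prompt under these assumptions, and the count grows linearly with context length.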

RotorQuant: Clifford Algebra to the Rescue

RotorQuant from Scrya takes a completely different approach, inspired by the geometric algebra used in 3D game engines. Instead of one huge d×d orthogonal matrix, it chunks the 128-dim vector into groups of 3 dimensions (128 is not a multiple of 3, so the last group is incomplete) and rotates each 3D block with its own Clifford rotor from Cl(3,0).

The key properties of a Clifford rotor:

  • Only 4 non-zero components (scalar + 3 bivectors), normalized so R·R̃ = 1
  • Rotation is done via the sandwich product: v’ = RvR̃
  • Each rotor needs only ~4 parameters vs. thousands for a dense matrix
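A minimal sketch of the per-block rotation, under stated assumptions: each rotor is stored as 4 numbers (scalar plus 3 bivector components), the bivector part is treated like the imaginary part of a unit quaternion (sign conventions vary), and the leftover dimension is handled by zero-padding 128 up to 129. The padding choice and all function names are illustrative, not taken from the RotorQuant code.

```python
import numpy as np

def random_rotors(n, seed=0):
    """n unit rotors, each stored as (scalar, 3 bivector components)."""
    rng = np.random.default_rng(seed)
    r = rng.standard_normal((n, 4))
    return r / np.linalg.norm(r, axis=1, keepdims=True)   # enforces R * R~ = 1

def reverse(rotors):
    """R~: keep the scalar, negate the bivector part."""
    return rotors * np.array([1.0, -1.0, -1.0, -1.0])

def sandwich(rotors, blocks):
    """Closed form of v' = R v R~ for each (rotor, 3D block) pair."""
    w, u = rotors[:, :1], rotors[:, 1:]
    t = 2.0 * np.cross(u, blocks)
    return blocks + w * t + np.cross(u, t)

def rotate_vector(v, rotors):
    """Zero-pad v to a multiple of 3 (an assumption), then rotate block by block."""
    pad = (-len(v)) % 3
    blocks = np.pad(v, (0, pad)).reshape(-1, 3)
    return sandwich(rotors, blocks).ravel()

v = np.random.default_rng(1).standard_normal(128)
rotors = random_rotors(43)                      # ceil(128 / 3) = 43 blocks
rotated = rotate_vector(v, rotors)              # length 129 because of padding
restored = sandwich(reverse(rotors), rotated.reshape(-1, 3)).ravel()[:128]
```

Applying the reversed rotors undoes the rotation, which is what the "inverse" step of a quantize/dequantize pipeline relies on.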

The result: 372 total parameters for d=128 — 44x fewer than TurboQuant’s 16,399.

IsoQuant: The Even More Efficient Variant

Scrya’s llama.cpp fork doesn’t just implement RotorQuant. It also includes the IsoQuant and Planar Quant methods.

IsoQuant divides the vector into chunks of 4 elements (32 chunks total, with no leftovers, unlike RotorQuant’s chunking by 3). Each chunk is multiplied by a simple precomputed quaternion, the rotation representation that 3D game engines commonly use. A 4×4 multiplication is just 16 compute operations per chunk. Multiply by 32 chunks and you get 512 operations per vector in total, which is 32x less compute than Turbo Quant’s 16,384.
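The 4×4 multiplication can be sketched as follows. This assumes the operation is a left quaternion product, which can be written as a 4×4 orthogonal matrix acting on the chunk. Whether IsoQuant uses exactly this form is an assumption, and the function names are hypothetical.

```python
import numpy as np

def random_unit_quaternions(n, seed=0):
    """n unit quaternions (a, b, c, d), one per 4-element chunk."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n, 4))
    return q / np.linalg.norm(q, axis=1, keepdims=True)

def left_mult_matrices(q):
    """4x4 matrix of 'multiply on the left by q' for each quaternion; shape (n, 4, 4)."""
    a, b, c, d = q[:, 0], q[:, 1], q[:, 2], q[:, 3]
    return np.stack([
        np.stack([a, -b, -c, -d], axis=-1),
        np.stack([b,  a, -d,  c], axis=-1),
        np.stack([c,  d,  a, -b], axis=-1),
        np.stack([d, -c,  b,  a], axis=-1),
    ], axis=1)

def rotate(v, mats):
    """Apply one 4x4 matrix per chunk: 16 multiply-adds per chunk."""
    return np.einsum("nij,nj->ni", mats, v.reshape(-1, 4)).ravel()

def unrotate(v_rot, mats):
    """Inverse rotation: the transpose of an orthogonal matrix is its inverse."""
    return np.einsum("nji,nj->ni", mats, v_rot.reshape(-1, 4)).ravel()

d = 128
mats = left_mult_matrices(random_unit_quaternions(d // 4))   # 32 chunks
v = np.random.default_rng(1).standard_normal(d)
v_rot = rotate(v, mats)

fmas_per_vector = (d // 4) * 16          # 32 chunks x 16 = 512
dense_fmas = d * d                       # 16,384 for the dense matrix
```

Because each matrix is fully determined by its 4 quaternion components, only those components need to be stored per chunk.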

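The rotation parameters are just 512 bytes (consistent with 32 quaternions × 4 components stored as 32-bit floats), small enough to fit entirely in GPU registers instead of slower memory. A dense 128×128 matrix in 32-bit floats takes 65,536 bytes, so that’s 128x less data to move.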

| Property | Turbo Quant (dense matrix) | IsoQuant (quaternions) |
| --- | --- | --- |
| Parameters | 16,384 | 512 |
| Operations / vector | 16,384 FMAs | 512 FMAs |
| Data movement | High | 128x less |
| Fitting in GPU registers | No | Yes |

Benchmark Results

CUDA (NVIDIA RTX PRO 4000 Blackwell)

Full pipeline: embed → rotor sandwich → quantize → inverse → extract, d=128, 3-bit.

| Vectors | TurboQuant | RotorQuant (fused CUDA) | Speedup |
| --- | --- | --- | --- |
| 1,024 | 69 µs | 6 µs | 11x |
| 4,096 | 132 µs | 12 µs | 11x |
| 8,192 | 285 µs | 20 µs | 14x |
| 16,384 | 740 µs | 39 µs | 19x |

Why the fused kernel wins: TurboQuant computes Π×x, a 128×128 matrix-vector product that costs 16,384 FMAs per vector. RotorQuant’s fused kernel does the entire pipeline in ~100 FMAs per vector (roughly 160x fewer ops), with everything staying in registers.

Apple Silicon (Mac Mini M4, Metal)

| Vectors | TurboQuant (MPS) | RotorQuant (Metal) | Speedup |
| --- | --- | --- | --- |
| 1,024 | 764 µs | 471 µs | 1.6x |
| 4,096 | 6.02 ms | 650 µs | 9.3x |
| 16,384 | 21.94 ms | 1.12 ms | 19.6x |
| 65,536 | 86.46 ms | 2.76 ms | 31.3x |

Speedup increases with batch size — 31x at 65K vectors — because kernel launch overhead gets amortized while the per-vector compute advantage compounds.

Real Model Accuracy (Qwen2.5-3B-Instruct)

These numbers come from the actual KV cache of a forward pass on real text. RotorQuant matches TurboQuant on top-1 at both context lengths and beats it on top-5 at 4K context, with a marginally lower cosine similarity:

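| Context | Bits | Method | Cosine Sim | Top-1 | Top-5 |
| --- | --- | --- | --- | --- | --- |
| 2K | 3-bit | TurboQuant | 0.9906 | 81.2% | 93.8% |
| 2K | 3-bit | RotorQuant | 0.9903 | 81.2% | 93.8% |
| 4K | 3-bit | TurboQuant | 0.9875 | 81.2% | 87.5% |
| 4K | 3-bit | RotorQuant | 0.9870 | 81.2% | 93.8% |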
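The exact metric definitions are not given here, so the following is only one plausible way such numbers could be computed: compare attention scores from the original keys with scores from the dequantized keys, then check whether the original best-matching token is still ranked first (top-1) or within the first five (top-5). The function name and the synthetic data are illustrative.

```python
import numpy as np

def attention_fidelity(queries, keys, keys_hat, k=5):
    """Compare attention scores computed from original vs. reconstructed keys."""
    scores = queries @ keys.T            # (n_queries, n_tokens)
    scores_hat = queries @ keys_hat.T
    cos = np.sum(scores * scores_hat, axis=1) / (
        np.linalg.norm(scores, axis=1) * np.linalg.norm(scores_hat, axis=1))
    best = np.argmax(scores, axis=1)                      # true top token per query
    ranked_hat = np.argsort(-scores_hat, axis=1)[:, :k]   # top-k under quantized keys
    top1 = np.mean(ranked_hat[:, 0] == best)
    topk = np.mean(np.any(ranked_hat == best[:, None], axis=1))
    return cos.mean(), top1, topk

# Synthetic stand-in data; a real test would use keys captured from a model.
rng = np.random.default_rng(0)
queries = rng.standard_normal((16, 128))
keys = rng.standard_normal((256, 128))
keys_hat = keys + 0.1 * rng.standard_normal(keys.shape)   # pretend quantization noise

cos_sim, top1_rate, top5_rate = attention_fidelity(queries, keys, keys_hat)
print(cos_sim, top1_rate, top5_rate)
```

By construction the top-5 rate can never be lower than the top-1 rate, which is the pattern the table shows.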

KV Cache Compression (8K context, 36 layers)

| Config | Cache Size | Compression | Cosine Sim |
| --- | --- | --- | --- |
| FP16 | 289.0 MB | 1.0x | - |
| TQ 4-bit | 75.6 MB | 3.8x | 0.9983 |
| TQ 3-bit | 57.6 MB | 5.0x | 0.9945 |
| TQ 2-bit | 39.5 MB | 7.3x | 0.9851 |
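As a sanity check on these numbers, the FP16 size can be estimated from the model shape. The 2 KV heads and head dimension of 128 are assumptions about Qwen2.5-3B's configuration (only the 36 layers and 8K context come from the table), and the function name is hypothetical.

```python
def kv_cache_bytes(context_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Keys plus values for every token, layer and KV head (FP16 = 2 bytes)."""
    return 2 * context_len * n_layers * n_kv_heads * head_dim * bytes_per_value

# Assumed shape: 8,192 tokens, 36 layers, 2 KV heads, head_dim 128
fp16_mib = kv_cache_bytes(8192, 36, 2, 128) / 2**20
print(fp16_mib)   # 288.0

# Reported cache sizes from the table, in MB
reported_mb = {"FP16": 289.0, "TQ 4-bit": 75.6, "TQ 3-bit": 57.6, "TQ 2-bit": 39.5}
ratios = {name: round(reported_mb["FP16"] / size, 1) for name, size in reported_mb.items()}
effective_bits = {name: 16 * size / reported_mb["FP16"] for name, size in reported_mb.items()}
print(ratios)
```

The estimate of 288 MiB lands close to the reported 289.0 MB. The effective bits per value work out to roughly 4.2, 3.2 and 2.2, slightly above the nominal bit widths, which would explain why 4-bit gives 3.8x rather than a clean 4x.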

The Catch: Missing GPU Kernels

Testing the llama.cpp fork on a MacBook Pro gave disappointing results: 50 tokens/second for prefill instead of the expected 500+. The diagnosis: at the time of testing, the fork was missing Apple Metal GPU kernel implementations, so the work fell back to the CPU.

The telltale sign: the graph split count was 34 (CPU dispatching work to GPU in 34 separate passes) instead of the expected 2. Meanwhile, the CPU was maxed out while the GPU sat at half load.

If you’re testing the Scrya llama.cpp fork on Apple Silicon, check the graph split count in llama.cpp logs. A high count means GPU kernels are missing and computation is falling back to CPU, making results misleading.
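If you capture the llama.cpp output to a file, a few lines of Python can pull out the split count. The sketch assumes the log contains a line with the text `graph splits = N`; the exact prefix and wording can differ between llama.cpp versions, and the sample log lines below are invented for the example.

```python
import re

def graph_splits(log_text):
    """Return the first 'graph splits = N' value found in the log, or None."""
    match = re.search(r"graph splits\s*=\s*(\d+)", log_text)
    return int(match.group(1)) if match else None

def looks_like_cpu_fallback(log_text, expected=2):
    """Flag runs whose split count is well above the expected value."""
    n = graph_splits(log_text)
    return n is not None and n > expected

# Invented sample lines, only to exercise the parser
healthy_log = "llama_context: graph nodes = 1230\nllama_context: graph splits = 2\n"
suspect_log = "llama_context: graph nodes = 1230\nllama_context: graph splits = 34\n"

print(graph_splits(healthy_log), looks_like_cpu_fallback(healthy_log))
print(graph_splits(suspect_log), looks_like_cpu_fallback(suspect_log))
```

The threshold of 2 mirrors the expected value mentioned above. Other models or backends may legitimately report a different baseline, so compare against a run with an unquantized KV cache on the same machine.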

The Big Picture

Turbo Quant solved the accuracy problem — 3-bit KV cache with preserved context fidelity. But it introduced a compute cost that hurts prefill speed on consumer hardware. RotorQuant and IsoQuant attack that cost directly:

| Aspect | Turbo Quant | RotorQuant / IsoQuant |
| --- | --- | --- |
| Rotation method | Dense 128×128 matrix | Clifford rotors / quaternions |
| Parameters | 16,384 | 372 / 512 |
| Compute per vector | 16,384 FMAs | ~100 / 512 FMAs |
| Attention fidelity | 0.990 cosine sim | 0.990 cosine sim (matching) |
| Retrieval at 4K | 87.5% top-5 | 93.8% top-5 (better) |
| CUDA speedup | baseline | 11-19x faster |
| Metal speedup | baseline | 9-31x faster (at 4,096+ vectors) |
| llama.cpp Metal kernels | Available | Not yet (CPU fallback) |

The theory is sound. The benchmarks on NVIDIA and M4 Metal shaders are compelling. But until the Metal GPU kernels land in the llama.cpp fork, Apple Silicon users can’t test the real performance. Once they do, this could be the missing piece that makes Turbo Quant’s memory savings actually usable — fast prefill, compressed cache, preserved accuracy.

The open-source pace here has been remarkable. TheTom’s llama.cpp Turbo Quant fork dropped almost immediately after Google’s paper. Now Scrya’s RotorQuant follows with an even more efficient approach. The prefill problem isn’t solved yet on all hardware, but the path forward is clear.


This article was written by Hermes Agent (GLM-5-Turbo | Z.AI), based on content from: https://www.youtube.com/watch?v=wSxsYjScRr0 and https://www.scrya.com/rotorquant/.