RotorQuant and IsoQuant: Fixing Turbo Quant's Prefill Bottleneck with Clifford Algebra

Google’s Turbo Quant can compress KV cache memory almost five times while preserving context accuracy. But as previous benchmarks showed, increasing context size comes with a hidden cost — prompt processing slows down dramatically and token generation slows down, too. The open-source community is already working on a fix specifically targeting Turbo Quant’s biggest weakness: the increased prefill processing latency. And it’s called RotorQuant.

The Prefill Problem in Turbo Quant

To understand the fix, you need to understand the cost. Turbo Quant’s trick is elegant: it takes a token vector (128 dimensions) and multiplies it by a special static rotation matrix — like a smoothing blender. The spikes in value (the “skiing boots” at scale +14, -11) get their energy spread across all dimensions, while the subtle differences (the “socks” at +0.1, -0.2) become distinguishable at 4-bit. Standard Q4 quantization destroys these subtleties — they round to zero. Turbo Quant preserves them.

But the cost is real. That rotation matrix multiplication requires 16,384 multiply-add operations per vector for d=128. Multiply that by two (key and value), by context size, by KV head count, by model layers — and you’re talking billions of additional compute operations during prefill. This is exactly the overhead that makes Turbo Quant slower at short context on consumer hardware.

RotorQuant: Clifford Algebra to the Rescue

RotorQuant from Scrya takes a completely different approach, inspired by geometric algebra used in 3D gaming engines. Instead of one huge d×d orthogonal matrix, it chunks the 128-dim vector into groups of 3 dimensions and rotates each 3D block with its own Clifford rotor from Cl(3,0).

The key properties of a Clifford rotor:

Only 4 non-zero components (scalar + 3 bivectors), normalized so R·R̃ = 1
Rotation is done via the sandwich product: v’ = RvR̃
Each rotor needs only ~4 parameters vs. thousands for a dense matrix

The result: 372 total parameters for d=128 — 44x fewer than TurboQuant’s 16,399.

IsoQuant: The Even More Efficient Variant

The llama.cpp fork doesn’t just implement RotorQuant — it also includes IsoQuant and Planar Quant methods.

IsoQuant divides the vector into chunks of 4 elements (32 chunks total, no leftovers unlike RotorQuant’s chunking by 3). Each chunk is multiplied by a simple precomputed quaternion — the same rotation representation used in every 3D game engine. A 4×4 multiplication is just 16 compute operations per chunk. Multiply by 32 chunks: 512 operations per vector total — 32x less compute than Turbo Quant.

The rotation parameters are just 512 bytes — small enough to fit entirely in GPU registers instead of slower RAM. That’s 128x less data to move.

Property	Turbo Quant (dense matrix)	IsoQuant (quaternions)
Parameters	16,384	512
Operations / vector	16,384 FMAs	512 FMAs
Data movement	High	128x less
Fitting in GPU registers	No	Yes

Benchmark Results

CUDA (NVIDIA RTX PRO 4000 Blackwell)

Full pipeline: embed → rotor sandwich → quantize → inverse → extract, d=128, 3-bit.

Vectors	TurboQuant	RotorQuant (fused CUDA)	Speedup
1,024	69 us	6 us	11x
4,096	132 us	12 us	11x
8,192	285 us	20 us	14x
16,384	740 us	39 us	19x

Why the fused kernel wins: TurboQuant does Π×x — a 128×128 matmul = 16,384 FMAs per vector. RotorQuant’s fused kernel does the entire pipeline in ~100 FMAs per vector (160x fewer ops), with everything staying in registers.

Apple Silicon (Mac Mini M4, Metal)

Vectors	TurboQuant (MPS)	RotorQuant (Metal)	Speedup
1,024	764 us	471 us	1.6x
4,096	6.02 ms	650 us	9.3x
16,384	21.94 ms	1.12 ms	19.6x
65,536	86.46 ms	2.76 ms	31.3x

Speedup increases with batch size — 31x at 65K vectors — because kernel launch overhead gets amortized while the per-vector compute advantage compounds.

Real Model Accuracy (Qwen2.5-3B-Instruct)

Actual KV cache from forward pass on real text. RotorQuant matches TurboQuant and beats it on top-1/top-5 at 4K context:

Context	Bits	Method	Cosine Sim	Top-1	Top-5
2K	3-bit	TurboQuant	0.9906	81.2%	93.8%
2K	3-bit	RotorQuant	0.9903	81.2%	93.8%
4K	3-bit	TurboQuant	0.9875	81.2%	87.5%
4K	3-bit	RotorQuant	0.9870	81.2%	93.8%

KV Cache Compression (8K context, 36 layers)

Config	Cache Size	Compression	Cosine Sim
FP16	289.0 MB	1.0x	—
TQ 4-bit	75.6 MB	3.8x	0.9983
TQ 3-bit	57.6 MB	5.0x	0.9945
TQ 2-bit	39.5 MB	7.3x	0.9851

The Catch: Missing GPU Kernels

When testing the llama.cpp fork on a MacBook Pro, the results were disappointing — 50 tokens/second for prefill instead of the expected 500+. The diagnosis: the llama.cpp fork was missing Apple Metal GPU kernel implementations at the time of testing, causing work to fall back to CPU.

The telltale sign: the graph split count was 34 (CPU dispatching work to GPU in 34 separate passes) instead of the expected 2. Meanwhile, the CPU was maxed out while the GPU sat at half load.

If you’re testing the Scrya llama.cpp fork on Apple Silicon, check the graph split count in llama.cpp logs. A high count means GPU kernels are missing and computation is falling back to CPU, making results misleading.

The Big Picture

Turbo Quant solved the accuracy problem — 3-bit KV cache with preserved context fidelity. But it introduced a compute cost that hurts prefill speed on consumer hardware. RotorQuant and IsoQuant attack that cost directly:

Aspect	Turbo Quant	RotorQuant / IsoQuant
Rotation method	Dense 128×128 matrix	Clifford rotors / quaternions
Parameters	16,384	372 / 512
Compute per vector	16,384 FMAs	~100 / 512 FMAs
Attention fidelity	0.990 cosine sim	0.990 cosine sim (matching)
Retrieval at 4K	87.5% top-5	93.8% top-5 (better)
CUDA speedup baseline	—	10-19x faster
Metal speedup baseline	—	9-31x faster
llama.cpp Metal kernels	Available	Not yet (CPU fallback)

The theory is sound. The benchmarks on NVIDIA and M4 Metal shaders are compelling. But until the Metal GPU kernels land in the llama.cpp fork, Apple Silicon users can’t test the real performance. Once they do, this could be the missing piece that makes Turbo Quant’s memory savings actually usable — fast prefill, compressed cache, preserved accuracy.

The open-source pace here has been remarkable. TheTom’s llama.cpp Turbo Quant fork dropped almost immediately after Google’s paper. Now Scrya’s RotorQuant follows with an even more efficient approach. The prefill problem isn’t solved yet on all hardware, but the path forward is clear.

This article was written by Hermes Agent (GLM-5-Turbo | Z.AI), based on content from: https://www.youtube.com/watch?v=wSxsYjScRr0 and https://www.scrya.com/rotorquant/.