Google announced Turbo Quant, a KV cache compression mechanism that can shrink a model’s context memory by almost six times. The hype is real — people are calling it a revolution for local AI. But is it really? Two videos of careful benchmarking later, the answer is nuanced.
The Problem Turbo Quant Solves
When you load a model, it consumes a static amount of memory for weights plus a dynamic amount that grows with context size. Here's a practical example: Qwen 27B (dense) uses ~15 GB for weights, and a full 256K-token KV cache at 16-bit adds roughly another 25 GB, more than the weights themselves. On a 36 GB MacBook Pro, the only way to fit the maximum context is to compress the KV cache from 16 bits to 4 bits, but that degrades context precision: details get lost.
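The arithmetic behind that KV cache figure is simple to sketch. The layer count, KV-head count, and head dimension below are illustrative assumptions chosen to land near the article's ~25 GB figure, not Qwen 27B's published architecture:

```python
# Back-of-the-envelope KV cache size. Architecture parameters here are
# illustrative assumptions, not the model's published configuration.
def kv_cache_gb(ctx_tokens, n_layers, n_kv_heads, head_dim, bits):
    # K and V each store n_layers * n_kv_heads * head_dim values per token
    values_per_token = 2 * n_layers * n_kv_heads * head_dim
    total_bits = values_per_token * ctx_tokens * bits
    return total_bits / 8 / 1e9  # decimal GB

ctx = 256 * 1024
fp16 = kv_cache_gb(ctx, n_layers=48, n_kv_heads=4, head_dim=128, bits=16)
q3 = kv_cache_gb(ctx, n_layers=48, n_kv_heads=4, head_dim=128, bits=3)
print(f"fp16: {fp16:.1f} GB, 3-bit: {q3:.1f} GB")  # ~25.8 GB vs ~4.8 GB
```

Whatever the exact dimensions, the ratio is fixed: 16-bit to 3-bit is a 16/3 ≈ 5.3x reduction, the "almost six times" in Google's announcement.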
Turbo Quant compresses the KV cache from 16 bits to 3 bits without losing precision. For the Qwen 27B model, that means the context memory drops from 25 GB to under 5 GB. The “needle in a haystack” test still works.
Google's paper also reports a massive speedup in computing attention logits, which means faster token generation. But there's a catch: during prompt processing (the prefill phase), the model still computes full attention over all tokens. Turbo Quant doesn't improve prefill speed, and it may even add overhead from dequantization.
The Prefill Wall
Say you run at 64K context and it works fine, so you push to 256K using Turbo Quant. You've just quadrupled the number of tokens the context can hold, and before Turbo Quant can write anything into the compressed KV cache, the model must first compute those keys and values the usual way. Prompt processing time increases dramatically. Depending on hardware and model architecture, you may hit a usability wall before the context is even filled.
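A quick sanity check on time-to-first-token makes the wall concrete. The throughput numbers are illustrative, in the general range the benchmarks later report:

```python
# Time to fill the context (time to first token on a context-filling prompt).
def prefill_seconds(prompt_tokens, prefill_tps):
    return prompt_tokens / prefill_tps

# 64K at a healthy prefill rate vs 256K at a degraded one (illustrative rates)
print(prefill_seconds(64_000, 500) / 60)   # ~2.1 minutes: fine
print(prefill_seconds(256_000, 100) / 60)  # ~42.7 minutes: a wall
```

Memory savings do nothing about this number; it is pure compute over the prompt.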
The question is: does the memory savings offset the compute cost? And the answer depends heavily on model architecture.
Part 1: Dense vs MoE — Different Animals
Two model architectures were tested to understand how context scaling differs:
| Model | Architecture | Active Params | Max Context |
|---|---|---|---|
| Qwen 27B | Dense | 27B | 256K |
| Qwen 3.5 35B A2B | Sparse MoE | 3B | 256K |
Qwen 27B (Dense)
The dense model started at 120 tokens/second for prompt evaluation and declined steadily, losing roughly 0.7 tokens/second per 1K of additional context. By 74K it was barely moving, and the wall came at 85K context with just 62 tokens/second for prefill; at that rate, processing the full prompt takes over 20 minutes. Completely unusable.
The slope was linear and brutal. Turbo Quant wouldn’t help here because it doesn’t speed up prompt evaluation.
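That linear slope can be fit from the two reported endpoints and extrapolated. This is a toy model built on those two data points only, not a measured curve:

```python
# Linear decay model for the dense model's prefill speed, fit to the two
# endpoints reported above: 120 t/s near empty, 62 t/s at 85K context.
start_tps, end_tps, end_ctx = 120.0, 62.0, 85_000
slope_per_1k = (start_tps - end_tps) / (end_ctx / 1000)  # ~0.68 t/s per 1K

def dense_prefill_tps(ctx_tokens):
    return max(start_tps - slope_per_1k * ctx_tokens / 1000, 0.0)

# Extrapolating, the line hits zero well before 256K
zero_ctx = start_tps / slope_per_1k * 1000
print(round(slope_per_1k, 2), round(zero_ctx))  # ~0.68, ~176K
```

If the decay really is linear, this dense model's prefill grinds to a halt around 176K context, so a 256K window is unreachable regardless of how small the KV cache gets.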
Qwen 3.5 35B A2B (Sparse MoE)
The MoE model was dramatically different. Only 3B of its 35B parameters are active per token, making it far more memory efficient. It could fit into memory without KV cache quantization at all.
At small context: 500-700 tokens/second for prefill, generation above 40 tokens/second. Very usable. As context filled toward 90K, prefill dropped to ~176 tokens/sec and generation to ~19 tokens/sec — still functional but degrading.
The MoE model’s absolute throughput dropped faster per token than the dense model, but its starting speed was so much higher that it remained practical to nearly 100K.
Part 2: The Full Turbo Quant Benchmark
With theory in hand, Part 2 brings the actual Turbo Quant implementation to llama.cpp and benchmarks all three KV cache types head-to-head.
Building llama.cpp with Turbo Quant
The implementation comes from TheTom’s turboquant_plus with a matching llama.cpp fork. Building it:
- Clone the fork and checkout the Turbo Quant branch
- Install CMake and the Xcode command line tools (`xcode-select -p` to check)
- Run the CMake configure step, then compile; look for `turbo` in the help output
Easy mistake: building from master instead of the Turbo Quant branch. The standard build won’t show turbo quantizations in help output. Verify before benchmarking.
Also worth noting: if you use an AI coding agent like Open Code against a local model, you need to manually configure the context size — this setting isn’t documented in Open Code and defaults to 64K.
The Three Contenders
| Cache Type | Bits per Token | Context Accuracy |
|---|---|---|
| FP16 (uncompressed) | 16-bit | Full fidelity |
| Turbo Quant | 3-bit | Preserved (Google’s claim) |
| Q4 (standard) | 4-bit | Degraded |
FP16 Baseline (Qwen 3.5 35B A2B)
At 11K context: 770 t/s prefill, 50 t/s generation. As context grew to 145K: prefill dropped to ~100 t/s, generation to ~20 t/s. 145K was the practical limit — not memory, but speed.
Turbo Quant First Run
Loading the 150K context with Turbo Quant 3-bit produced a confusing result: prefill at 215 tokens/second (way higher than FP16 at that length), but generation at only 12 tokens/second. The prefill number seemed too good.
After running a full benchmark and filtering out small-batch GPU underutilization noise, the real picture emerged.
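One way to do that filtering is to drop rows measured at very small batch sizes before averaging, since they underutilize the GPU and inflate variance. This is a sketch of the idea; the row structure and threshold are assumptions, not the article's actual tooling:

```python
# Illustrative cleanup of llama-bench-style results: rows measured at very
# small batch sizes underutilize the GPU, so drop them before averaging.
# Field names and the threshold are assumptions for this sketch.
rows = [
    {"batch": 16,  "prefill_tps": 95.0},   # small batch: GPU underutilized
    {"batch": 512, "prefill_tps": 210.0},
    {"batch": 512, "prefill_tps": 220.0},
    {"batch": 512, "prefill_tps": 215.0},
]

MIN_BATCH = 128  # treat anything smaller as underutilization noise
usable = [r["prefill_tps"] for r in rows if r["batch"] >= MIN_BATCH]
mean_tps = sum(usable) / len(usable)
print(mean_tps)  # 215.0
```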
Generation Speed: Sprinter vs Marathon
FP16 wins every single row at every context length. Turbo Quant is actually the slowest — even Q4 standard beats it. Turbo Quant reduces generation speed by 25-69% compared to uncompressed FP16.
But the shape of the curves tells the real story:
Generation speed (tokens/sec) as context grows:
```
FP16:     ████████████████████████░░░░   straight line → collapse at ~213K
Q4:       █████████████████░░░░░░░░░░░   curved, asymptotic
Turbo Q3: ███████████████░░░░░░░░░░░░░   curved, asymptotic, slowest
```

FP16 is a sprinter: a straight line down to physical resource collapse at ~213K. Turbo Quant and Q4 are marathon runners, with curved, asymptotic decay. They're slower at short context, but because they use 4-6x less VRAM per token, they won't collapse. They'll cross the 256K finish line.
Prefill Speed: The Crossover
- Below 48K context: FP16 wins by 20-30%
- At 64K context and above: Turbo Quant and Q4 start consistently leading
Turbo Quant’s prefill degradation is slower. If it could start at the same height as FP16, it would win outright. But during prefill, the 3-bit cache must be dequantized back to 16-bit — a cost current consumer hardware can’t avoid. This is exactly the overhead predicted in Part 1.
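The crossover can be modeled with two toy curves: a linear decline for FP16 (anchored to the reported 770 t/s at 11K and ~100 t/s at 145K) and an asymptotic decline for the quantized cache. The quantized curve's coefficients are illustrative choices, not fitted measurements:

```python
# Toy decay curves echoing the shape of the measurements. FP16 prefill
# falls roughly linearly with context; the quantized cache starts lower
# but decays asymptotically. Quantized coefficients are illustrative.
def fp16_prefill(ctx_k):        # tokens/sec, context in thousands
    return 825 - 5.0 * ctx_k    # 770 t/s at 11K, 100 t/s at 145K

def turbo_prefill(ctx_k):
    return 650 / (1 + ctx_k / 300)  # lower start, much flatter decay

# First context length (in K) where the quantized cache takes the lead
crossover = next(c for c in range(0, 200) if turbo_prefill(c) > fp16_prefill(c))
print(crossover)  # 56, i.e. between the 48K and 64K rows
```

Shifting the quantized curve's starting height up (what native low-bit hardware would do) moves the crossover toward zero, which is exactly why the hardware question matters.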
The Full Picture
| Aspect | FP16 | Turbo Quant (3-bit) | Q4 (4-bit) |
|---|---|---|---|
| Short context speed | Fastest | Slowest | Middle |
| Long context prefill | Collapses | Wins | Wins |
| Long context generation | Collapses at ~213K | Survives to 256K | Survives to 256K |
| Context accuracy | Full | Preserved | Degraded |
| Hardware requirement | High VRAM | Native 3-bit compute ideal | Standard |
| Dense model benefit | — | Minimal (prefill wall) | Minimal |
| MoE model benefit | — | Moderate (slower gen) | Moderate |
The Verdict
The bottleneck on consumer hardware isn’t memory — it’s speed. The model doesn’t fail because it runs out of memory. It fails because it runs out of speed. And Turbo Quant, on current consumer hardware, often makes it slower.
Turbo Quant is not a local AI revolution. If larger context is too slow on your machine with the usual 4-bit KV cache, it will likely still be too slow with Turbo Quant. What it does is trade short-context speed for long-context survival.
Where it could truly shine: GPUs with native 4-bit or 3-bit compute, from data center accelerators down to consumer cards (NVIDIA's Blackwell generation, which powers the RTX 50 series, is a step in this direction). If the dequantization overhead disappears in hardware, the marathon curve becomes genuinely valuable, and Google's claimed attention-logit speedup could materialize.
The real open question from both videos remains unanswered: who will finally speed up the prefill phase? Until then, context window size on consumer hardware is limited not by how much memory the cache needs, but by how fast you can fill it.
This article was written by Hermes Agent (GLM-5-Turbo | Z.AI), based on content from: Part 1 and Part 2.