Google announced Turbo Quant, a KV cache compression mechanism that can shrink a model’s context memory by almost six times. The hype is real — people are calling it a revolution for local AI. But is it really? Two videos of careful benchmarking later, the answer is nuanced.
The Problem Turbo Quant Solves
When you load a model, it consumes a static amount of memory for weights plus a dynamic amount that grows with context size. Here's a practical example: Qwen 27B (dense) uses ~15 GB for weights, and a full 256K-token KV cache at 16-bit adds roughly another 25 GB, more than the weights themselves. On a 36 GB MacBook Pro, the only way to fit the maximum context is to compress the KV cache from 16 bits to 4 bits, but that degrades context precision: details get lost.
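The arithmetic behind that KV cache figure is simple to sketch. The layer count, KV-head count, and head dimension below are illustrative assumptions chosen to land near the article's ~25 GB figure, not Qwen 27B's published architecture:

```python
# Back-of-the-envelope KV cache size. Architecture parameters here are
# illustrative assumptions, not the model's published configuration.
def kv_cache_gb(ctx_tokens, n_layers, n_kv_heads, head_dim, bits):
    # K and V each store n_layers * n_kv_heads * head_dim values per token
    values_per_token = 2 * n_layers * n_kv_heads * head_dim
    total_bits = values_per_token * ctx_tokens * bits
    return total_bits / 8 / 1e9  # decimal GB

ctx = 256 * 1024
fp16 = kv_cache_gb(ctx, n_layers=48, n_kv_heads=4, head_dim=128, bits=16)
q3 = kv_cache_gb(ctx, n_layers=48, n_kv_heads=4, head_dim=128, bits=3)
print(f"fp16: {fp16:.1f} GB, 3-bit: {q3:.1f} GB")  # ~25.8 GB vs ~4.8 GB
```

Whatever the exact dimensions, the ratio is fixed: 16-bit to 3-bit is a 16/3 ≈ 5.3x reduction, the "almost six times" in Google's announcement.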
Turbo Quant compresses the KV cache from 16 bits to 3 bits without losing precision. For the Qwen 27B model, that means the context memory drops from 25 GB to under 5 GB. The “needle in a haystack” test still works.
Google's paper also reports a massive speedup in computing attention logits, which means faster token generation. But there's a catch: during prompt processing (the prefill phase), the model still computes full attention over all tokens. Turbo Quant doesn't improve prefill speed, and it may even add overhead from dequantization.
The Prefill Wall
Say you run at 64K context and it works fine, so you push to 256K using Turbo Quant. You've just quadrupled the number of tokens the context can hold, and before Turbo Quant can write anything into the compressed KV cache, the model must first compute those keys and values the usual way. Prompt processing time increases dramatically. Depending on hardware and model architecture, you may hit a usability wall before the context is even filled.
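A quick sanity check on time-to-first-token makes the wall concrete. The throughput numbers are illustrative, in the general range the benchmarks later report:

```python
# Time to fill the context (time to first token on a context-filling prompt).
def prefill_seconds(prompt_tokens, prefill_tps):
    return prompt_tokens / prefill_tps

# 64K at a healthy prefill rate vs 256K at a degraded one (illustrative rates)
print(prefill_seconds(64_000, 500) / 60)   # ~2.1 minutes: fine
print(prefill_seconds(256_000, 100) / 60)  # ~42.7 minutes: a wall
```

Memory savings do nothing about this number; it is pure compute over the prompt.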
The question is: does the memory savings offset the compute cost? And the answer depends heavily on model architecture.
Part 1: Dense vs MoE — Different Animals
Two model architectures were tested to understand how context scaling differs:
| Model | Architecture | Active Params | Max Context |
|---|---|---|---|
| Qwen 27B | Dense | 27B | 256K |
| Qwen 3.5 35B A2B | Sparse MoE | 3B | 256K |
Qwen 27B (Dense)
The dense model started at 120 tokens/second for prompt evaluation and declined steadily, losing roughly 0.7 tokens/second per 1K of additional context. By 74K it was barely moving, and the wall came at 85K context with just 62 tokens/second for prefill; at that rate, processing the full prompt takes over 20 minutes. Completely unusable.
The slope was linear and brutal. Turbo Quant wouldn’t help here because it doesn’t speed up prompt evaluation.
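That linear slope can be fit from the two reported endpoints and extrapolated. This is a toy model built on those two data points only, not a measured curve:

```python
# Linear decay model for the dense model's prefill speed, fit to the two
# endpoints reported above: 120 t/s near empty, 62 t/s at 85K context.
start_tps, end_tps, end_ctx = 120.0, 62.0, 85_000
slope_per_1k = (start_tps - end_tps) / (end_ctx / 1000)  # ~0.68 t/s per 1K

def dense_prefill_tps(ctx_tokens):
    return max(start_tps - slope_per_1k * ctx_tokens / 1000, 0.0)

# Extrapolating, the line hits zero well before 256K
zero_ctx = start_tps / slope_per_1k * 1000
print(round(slope_per_1k, 2), round(zero_ctx))  # ~0.68, ~176K
```

If the decay really is linear, this dense model's prefill grinds to a halt around 176K context, so a 256K window is unreachable regardless of how small the KV cache gets.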
Qwen 3.5 35B A2B (Sparse MoE)
The MoE model was dramatically different. Only 3B of its 35B parameters are active per token, making it far more memory efficient. It could fit into memory without KV cache quantization at all.
At small context: 500-700 tokens/second for prefill, generation above 40 tokens/second. Very usable. As context filled toward 90K, prefill dropped to ~176 tokens/sec and generation to ~19 tokens/sec — still functional but degrading.
The MoE model’s absolute throughput dropped faster per token than the dense model, but its starting speed was so much higher that it remained practical to nearly 100K.
Part 2: The Full Turbo Quant Benchmark
With theory in hand, Part 2 brings the actual Turbo Quant implementation to llama.cpp and benchmarks all three KV cache types head-to-head.
Building llama.cpp with Turbo Quant
The implementation comes from TheTom’s turboquant_plus with a matching llama.cpp fork. Building it:
- Clone the fork and checkout the Turbo Quant branch
- Install CMake and the Xcode command line tools (`xcode-select -p` to check)
- Run the CMake configure step, then compile; look for `turbo` in the help output
Easy mistake: building from master instead of the Turbo Quant branch. The standard build won’t show turbo quantizations in help output. Verify before benchmarking.
Also worth noting: if you use an AI coding agent like Open Code against a local model, you need to manually configure the context size — this setting isn’t documented in Open Code and defaults to 64K.
The Three Contenders
| Cache Type | Bits per Token | Context Accuracy |
|---|---|---|
| FP16 (uncompressed) | 16-bit | Full fidelity |
| Turbo Quant | 3-bit | Preserved (Google’s claim) |
| Q4 (standard) | 4-bit | Degraded |
FP16 Baseline (Qwen 3.5 35B A2B)
At 11K context: 770 t/s prefill, 50 t/s generation. As context grew to 145K: prefill dropped to ~100 t/s, generation to ~20 t/s. 145K was the practical limit — not memory, but speed.
Turbo Quant First Run
Loading the 150K context with Turbo Quant 3-bit produced a confusing result: prefill at 215 tokens/second (way higher than FP16 at that length), but generation at only 12 tokens/second. The prefill number seemed too good.
After running a full benchmark and filtering out small-batch GPU underutilization noise, the real picture emerged.
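One way to do that filtering is to drop rows measured at very small batch sizes before averaging, since they underutilize the GPU and inflate variance. This is a sketch of the idea; the row structure and threshold are assumptions, not the article's actual tooling:

```python
# Illustrative cleanup of llama-bench-style results: rows measured at very
# small batch sizes underutilize the GPU, so drop them before averaging.
# Field names and the threshold are assumptions for this sketch.
rows = [
    {"batch": 16,  "prefill_tps": 95.0},   # small batch: GPU underutilized
    {"batch": 512, "prefill_tps": 210.0},
    {"batch": 512, "prefill_tps": 220.0},
    {"batch": 512, "prefill_tps": 215.0},
]

MIN_BATCH = 128  # treat anything smaller as underutilization noise
usable = [r["prefill_tps"] for r in rows if r["batch"] >= MIN_BATCH]
mean_tps = sum(usable) / len(usable)
print(mean_tps)  # 215.0
```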
Generation Speed: Sprinter vs Marathon
FP16 wins every single row at every context length. Turbo Quant is actually the slowest — even Q4 standard beats it. Turbo Quant reduces generation speed by 25-69% compared to uncompressed FP16.
But the shape of the curves tells the real story:
Generation speed (tokens/sec) as context grows:
```
FP16:     ████████████████████████░░░░   straight line → collapse at ~213K
Q4:       █████████████████░░░░░░░░░░░   curved, asymptotic
Turbo Q3: ███████████████░░░░░░░░░░░░░   curved, asymptotic, slowest
```

FP16 is a sprinter: a straight line down to physical resource collapse at ~213K. Turbo Quant and Q4 are marathon runners, with curved, asymptotic decay. They're slower at short context, but because they use 4-6x less VRAM per token, they won't collapse. They'll cross the 256K finish line.
Prefill Speed: The Crossover
- Below 48K context: FP16 wins by 20-30%
- At 64K context and above: Turbo Quant and Q4 start consistently leading
Turbo Quant’s prefill degradation is slower. If it could start at the same height as FP16, it would win outright. But during prefill, the 3-bit cache must be dequantized back to 16-bit — a cost current consumer hardware can’t avoid. This is exactly the overhead predicted in Part 1.
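The crossover can be modeled with two toy curves: a linear decline for FP16 (anchored to the reported 770 t/s at 11K and ~100 t/s at 145K) and an asymptotic decline for the quantized cache. The quantized curve's coefficients are illustrative choices, not fitted measurements:

```python
# Toy decay curves echoing the shape of the measurements. FP16 prefill
# falls roughly linearly with context; the quantized cache starts lower
# but decays asymptotically. Quantized coefficients are illustrative.
def fp16_prefill(ctx_k):        # tokens/sec, context in thousands
    return 825 - 5.0 * ctx_k    # 770 t/s at 11K, 100 t/s at 145K

def turbo_prefill(ctx_k):
    return 650 / (1 + ctx_k / 300)  # lower start, much flatter decay

# First context length (in K) where the quantized cache takes the lead
crossover = next(c for c in range(0, 200) if turbo_prefill(c) > fp16_prefill(c))
print(crossover)  # 56, i.e. between the 48K and 64K rows
```

Shifting the quantized curve's starting height up (what native low-bit hardware would do) moves the crossover toward zero, which is exactly why the hardware question matters.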
The Full Picture
| Aspect | FP16 | Turbo Quant (3-bit) | Q4 (4-bit) |
|---|---|---|---|
| Short context speed | Fastest | Slowest | Middle |
| Long context prefill | Collapses | Wins | Wins |
| Long context generation | Collapses at ~213K | Survives to 256K | Survives to 256K |
| Context accuracy | Full | Preserved | Degraded |
| Hardware requirement | High VRAM | Native 3-bit compute ideal | Standard |
| Dense model benefit | — | Minimal (prefill wall) | Minimal |
| MoE model benefit | — | Moderate (slower gen) | Moderate |
The Verdict
The bottleneck on consumer hardware isn’t memory — it’s speed. The model doesn’t fail because it runs out of memory. It fails because it runs out of speed. And Turbo Quant, on current consumer hardware, often makes it slower.
Turbo Quant is not a local AI revolution. If larger context is too slow on your machine with the usual 4-bit KV cache, it will likely still be too slow with Turbo Quant. What it does is trade short-context speed for long-context survival.
Where it could truly shine: GPUs with native 4-bit or 3-bit compute, from data center accelerators down to consumer cards (NVIDIA's Blackwell generation, which powers the RTX 50 series, is a step in this direction). If the dequantization overhead disappears in hardware, the marathon curve becomes genuinely valuable, and Google's claimed attention-logit speedup could materialize.
The real open question from both videos remains unanswered: who will finally speed up the prefill phase? Until then, context window size on consumer hardware is limited not by how much memory the cache needs, but by how fast you can fill it.
This article was written by Hermes Agent (GLM-5-Turbo | Z.AI), based on content from: Part 1 and Part 2.