TL;DR: Running a 35-billion-parameter MoE model on a GTX 1060 with 6GB VRAM is possible — but only with the right llama.cpp flags. Five working tricks yield 17 tokens/sec with 256K context. One failed trick (speculative decoding) and one upcoming trick (DFlash) documented for completeness.
The Setup
The video demonstrates running Qwen 3.6 35B A3B on an 8-year-old rig:
- GTX 1060 6GB VRAM (PCIe Gen 3)
- i3-8100 (4 cores, no hyperthreading)
- 24GB DDR4
This is a floor, not a ceiling. Most modern rigs will outperform these numbers.
The Problem (Baseline)
Naive approach: --n-gpu-layers 20. First 20 layers on GPU, rest on CPU.
Result: ~3 tokens/sec. Watching words appear over 20-30 seconds is unusable — you’re in “satellite phone” territory.
Why so slow? Every layer carries its expert blocks. When layers sit on CPU, data must cross PCIe per token, choking the bus.
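For reference, the baseline run looks roughly like this. A sketch only: the binary name and model filename are taken from the complete command at the end of the article, so treat them as placeholders for your own setup.

```bash
# Naive baseline: only the first 20 layers on the 6GB card, everything else on CPU.
# On the GTX 1060 this lands around 3 tokens/sec.
./main -m Qwen3-35B-A3B-Q4_K_M.gguf \
  --n-gpu-layers 20 \
  -p "Write a hello world in Python"
```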
Trick 1: Push MoE Experts to CPU
Flags: --n-gpu-layers 999 --n-cpu-moe 41
Mixture-of-experts models like Qwen 3.6 activate only ~3B parameters per token (from 256 experts, 8 wake up per token). The bulk of weights sit idle most of the time — dead weight on GPU but cheap in RAM.
Set --n-gpu-layers 999 with --n-cpu-moe 41 to put all layers on GPU while pinning the MoE expert weights to CPU. Per token, the GPU handles the attention and dense work, then fetches whichever 8 experts that token routes to.
Result: ~10 tokens/sec. 230% faster, no hardware change.
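A minimal sketch of the Trick 1 invocation, using the same placeholder binary and model filename as the complete command at the end:

```bash
# Offload every layer to the GPU, but keep the MoE expert weights of 41 layers in CPU RAM.
# Only the ~8 experts a given token routes to have to cross PCIe.
./main -m Qwen3-35B-A3B-Q4_K_M.gguf \
  --n-gpu-layers 999 \
  --n-cpu-moe 41 \
  -p "Write a hello world in Python"
```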
Trick 2: No Memory-Map
Flag: --no-mmap
By default, llama.cpp uses mmap — pretends the model is in RAM but pages from disk on demand. Sounds smart, but every few tokens an unloaded expert triggers disk I/O.
With --no-mmap, llama.cpp loads the entire 20GB model into RAM upfront. Every expert is already present — no page faults mid-token.
Result: ~13.5 tokens/sec. 35% bump from one flag.
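The same invocation with mmap disabled, plus a quick sanity check that the weights really are resident. The `free` check is my addition, not from the video.

```bash
# Load the full ~20GB into RAM up front instead of paging experts from disk on demand.
./main -m Qwen3-35B-A3B-Q4_K_M.gguf \
  --n-gpu-layers 999 --n-cpu-moe 41 \
  --no-mmap \
  -p "Write a hello world in Python"

# In another shell: "used" should be up by roughly the model size once loading finishes.
free -h
```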
Trick 3: More GPU Layers
Flag: --n-cpu-moe 35 (changed from 41)
This pulls 6 layers of experts from CPU back to GPU.
Result: ~17 tokens/sec. VRAM usage goes from 4GB to 5.5GB.
Trade-off: Context window drops from 100K to ~64K tokens. Fine for most chats, not ideal for whole code base ingestion.
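To watch the VRAM trade-off on your own card while this runs, something like the following works; the nvidia-smi query is standard, and the numbers will differ per setup.

```bash
# Pull six layers' worth of experts back onto the GPU: 41 -> 35 kept on CPU.
./main -m Qwen3-35B-A3B-Q4_K_M.gguf \
  --n-gpu-layers 999 --n-cpu-moe 35 \
  --no-mmap \
  -p "Write a hello world in Python"

# Poll VRAM every 2 seconds; expect roughly 5.5GB used on the 6GB card.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 2
```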
Trick 4: Turbo Quant KV Cache
Flags: --cache-type-k turbo4 --cache-type-v turbo3 --ctx-size 131072
Google DeepMind’s Turbo Quant paper introduces asymmetric KV cache quantization: 4-bit keys, 3-bit values. This works because the model uses grouped query attention with an 8:1 ratio, so keys can take heavier compression than values.
With Q8 quantization on the cache (the default), context is limited. Pushing to Q4/Q3 frees massive VRAM.
Results:
- Context bounces from 64K → 128K with turbo quant (--ctx-size 131072)
- Pushing further: --ctx-size 262144 --n-cpu-moe 36 with turbo quant frees enough VRAM
- Context stretches to 256,000 tokens, four times the model’s training context
- Speed stays at 17 tokens/sec: KV cache quantization doesn’t slow inference
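Put together, the 256K-context run looks roughly like this. The `turbo4`/`turbo3` cache type names are taken from the video; check `--help` on your llama.cpp build before relying on them.

```bash
# Asymmetric KV cache quantization (4-bit keys, 3-bit values) plus 256K context.
# One extra layer of experts goes back to CPU (--n-cpu-moe 36) to make room in VRAM.
./main -m Qwen3-35B-A3B-Q4_K_M.gguf \
  --n-gpu-layers 999 --n-cpu-moe 36 \
  --no-mmap \
  --cache-type-k turbo4 --cache-type-v turbo3 \
  --ctx-size 262144 \
  -p "Write a hello world in Python"
```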
Practical use cases:
- Paste a small book and ask questions about it
- Drop an entire code base as context without the model “forgetting” page 1 by page 50
Trick 5: Memory Locking (Production Stability)
Flag: --mlock (plus Docker/LXC configuration)
Without this, the kernel treats the expert weights sitting in RAM like any other reclaimable memory. Hours later, when memory gets tight or the system idles, it starts paging experts out to disk. The next inference triggers page faults, causing random slow tokens.
Requires three-layer configuration:
- LXC container: add mmlock.enabled: true
- Docker: add --cap-add IPC_LOCK and --ulimit memlock=-1:-1
- llama.cpp: add the --mlock flag
With all three in place, the mlocked total reads 16GB: every expert is pinned in place. The setup runs for a week without degrading.
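Two standard checks to confirm the lock actually took effect; these are my additions, not from the video.

```bash
# If --mlock worked, the kernel's Mlocked counter should sit near the model size (~16GB here).
grep -i mlocked /proc/meminfo

# Inside the container, the locked-memory ulimit should report "unlimited".
ulimit -l
```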
Note: Speed doesn’t change — this is about production stability, not performance.
Failed Trick: Speculative Decoding
What it is: Run a tiny model alongside the main model. The tiny model guesses the next 8 tokens, the main model verifies them in a single batch instead of 8 serial passes. Theoretical speedup: 2-4x.
What was tried: Qwen 3.5 800M as the drafter with the same tokenizer as the target.
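For reference, the attempt looks roughly like this. A sketch only: the drafter filename is a placeholder, and the draft-model flags (--model-draft, --draft-max) are from recent llama-server builds and may differ on yours.

```bash
# Speculative decoding attempt: an ~800M drafter proposes up to 8 tokens per step,
# and the 35B MoE target verifies them. On this MoE + SSM model it was a net loss.
./llama-server -m Qwen3-35B-A3B-Q4_K_M.gguf \
  --model-draft Qwen3-0.8B-Q4_K_M.gguf \
  --draft-max 8 \
  --n-gpu-layers 999 --n-cpu-moe 36 --no-mmap
```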
Result:
- Acceptance rate: ~65% (2 out of 3 guesses landed)
- Speed: dropped from 17 to 11 tokens/sec
Why it failed:
- MoE architectural issue: Each token in a batch picks its own 8 experts from 256. Eight tokens batched together can pull from 64 different experts per layer. Verification stops being a batch and turns into memory thrashing across PCIe.
- SSM layers: Qwen 3.6 uses state space layers (30 of 40 layers are SSM). SSM computes one position at a time — each step depends on the state before it. You can’t parallelize across a draft window.
The “verify in one pass” trick doesn’t apply: per-token verification cost stays the same when you batch, because expert loading dominates. Net negative.
Benchmarked elsewhere: someone tested this on an RTX 3090 across 19 configurations with the same result. Speculative decoding works for standard dense transformers, not for this MoE + SSM hybrid.
Upcoming Trick: DFlash (Block Diffusion Drafter)
A follow-up paper worth trying: DFlash (block diffusion drafter). Generates eight tokens in one shot instead of one at a time.
There’s a working drafter for Qwen 3.6’s 27 billion dense version. Different model, same trick — potentially cracking 25 tokens/sec on the same 1060.
Worth coming back to.
Complete Command
```bash
docker run -it --gpus all \
  --cap-add IPC_LOCK --ulimit memlock=-1:-1 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  ghcr.io/ggerganov/llama.cpp:latest \
  ./main -m Qwen3-35B-A3B-Q4_K_M.gguf \
  -n 256 \
  --ctx-size 262144 \
  --n-gpu-layers 999 \
  --n-cpu-moe 36 \
  --no-mmap \
  --cache-type-k turbo4 \
  --cache-type-v turbo3 \
  --mlock \
  -p "Write a hello world in Python"
```

Final stats: 35B parameters, 6GB VRAM, 256K context, 17 tokens/sec on a GTX 1060.
Performance Summary
| Trick | Flag(s) | Tokens/sec | Context |
|---|---|---|---|
| Baseline (problem) | --n-gpu-layers 20 | 3 | 100K |
| Trick 1: MoE to CPU | --n-gpu-layers 999 --n-cpu-moe 41 | 10 | 100K |
| Trick 2: No mmap | --no-mmap | 13.5 | 100K |
| Trick 3: More GPU | --n-cpu-moe 35 | 17 | 64K |
| Trick 4: Turbo Quant | --cache-type-k turbo4 --cache-type-v turbo3 --ctx-size 262144 --n-cpu-moe 36 | 17 | 256K |
| Trick 5: + mlock | --mlock | 17 | 256K (stable) |
References
- Can you run a 35-billion-parameter AI on an 8-year-old GPU? — https://www.youtube.com/watch?v=8F_5pdcD3HY
This article was written by opencode (minimax-m2.5-free | opencode), based on content from: https://www.youtube.com/watch?v=8F_5pdcD3HY


