TL;DR: Qwen3-Next-80B-A3B is a hybrid MoE + Gated DeltaNet + Gated Attention model with 3B active parameters out of 80B total. Unsloth’s UD-Q4_K_XL quantization fits it on as little as ~2 GB of dedicated VRAM by offloading the MoE FFN expert layers to system RAM via llama.cpp’s `-ot` regex flag. With 20 GB of VRAM you can keep 10+ layers on the GPU while the expert layers stay in CPU memory, which is dramatically faster than letting the driver spill into shared VRAM.
What is Qwen3-Next
Qwen3-Next (released September 2025) is an 80B-parameter MoE model with only 3B active parameters per token. It uses a hybrid architecture combining MoEs with Gated DeltaNet and Gated Attention — designed specifically for fast inference on long contexts. It supports 256K context natively and is 10x faster than Qwen3-32B on long sequences.
Available in two variants: Instruct and Thinking (with chain-of-thought).
The Problem: MoE Models and Shared VRAM
MoE models are appealing because only a fraction of parameters activate per token. But the full 80B model still needs to be loaded somewhere. The naive approach (`llama-server -ngl 99`) offloads everything to GPU until VRAM fills, then spills into shared VRAM (system RAM accessed through the GPU driver).
On AMD hardware this is particularly bad: the driver uses shared VRAM even when dedicated VRAM isn’t full, degrading performance significantly. NVIDIA users can sometimes disable this in driver settings, but shared VRAM is slow for inference regardless.
The solution is selective layer offloading: keep attention layers on GPU and route MoE FFN expert layers to system RAM via CPU, avoiding the shared VRAM path entirely.
Unsloth Dynamic Quantization
Unsloth’s “UD” (Unsloth Dynamic) quantization analyzes which layers are most sensitive and applies higher-precision quantization to those while keeping the rest at Q4. Updated December 2025 with iMatrix for improved performance.
Download from HuggingFace:
```
pip install huggingface_hub hf_transfer
huggingface-cli download unsloth/Qwen3-Coder-Next-GGUF Qwen3-Coder-Next-UD-Q4_K_XL.gguf --local-dir .
```
Understanding MoE Layer Naming
In the GGUF file, MoE expert layers are named `ffn_up`, `ffn_down`, and `ffn_gate` with an `_exps` suffix. These account for the bulk of the model’s weights. Attention layers (which benefit most from GPU compute) have different names and stay on GPU.
You can inspect the layer names on HuggingFace by browsing the model’s Files and versions tab.
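You can also dump the tensor names locally. A minimal sketch, assuming the `gguf` Python package (from llama.cpp’s gguf-py) and its `gguf-dump` utility are available:
```
pip install gguf
gguf-dump Qwen3-Coder-Next-UD-Q4_K_XL.gguf | grep ffn_ | head -20
```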
The -ot Flag: Regex-Based Layer Control
llama.cpp’s -ot flag accepts a regex that matches layer tensor names and assigns them to a device. The syntax is:
-ot "pattern=CPU"Layers matching the pattern go to CPU. Everything else stays on GPU. This is the key mechanism for selective offloading — there is no separate -coe flag.
Offload All MoE Layers to CPU
The simplest approach — offload every MoE FFN layer to CPU, keeping only attention on GPU:
-ot "\.ffn_.*_exps.=CPU"This matches all ffn_(up|down|gate)_exps layers across all transformer blocks.
Offload MoE Layers Starting from Layer N
Unsloth’s official docs provide this regex to offload MoE layers only from layer 6 onward:
-ot "\\.(6|7|8|9|[0-9][0-9]|[0-9][0-9][0-9])\\.ffn_(gate|up|down)_exps.=CPU"This keeps layers 0–5 fully on GPU while offloading layer 6+ expert layers to CPU.
Progressive FFN Offloading
If you have more GPU memory, you can offload fewer FFN sub-layers:
| `-ot` pattern | What gets offloaded | GPU memory saved |
|---|---|---|
| `\.ffn_(up\|down\|gate)_exps.=CPU` | All three FFN projections | Most |
| `\.ffn_(up\|down)_exps.=CPU` | Up and down projections | Less |
| `\.ffn_(up)_exps.=CPU` | Only the up projection | Least |
Host’s Setup: 10 Layers on GPU, Rest on CPU
The video host uses this regex on a 20 GB VRAM GPU (RX 7900 XT) with 64 GB system RAM to keep the first 10 layers on GPU:
-ot "\.([0-1][2-9]|[2-9][0-9]|[0-9][0-9][0-9])\.ffn_(up|down|gate)_exps.=CPU"This matches layers 10+ (two-digit numbers 12–19, then 20–99, then 100–999). Layers 0–9 stay fully on GPU. The result: ~17 GB dedicated VRAM, ~34 GB system RAM.
To keep 12 layers on GPU, exclude layers 0–11 instead:
```
-ot "\.([1][2-9]|[2-9][0-9]|[0-9][0-9][0-9])\.ffn_(up|down|gate)_exps.=CPU"
```
(Note that because layer numbers carry no leading zeros, this differs from the previous pattern only cosmetically; adjust the ranges if you want the split to actually change.)
Tuning Approach
- Start with all layers on GPU (`-ngl 99` without `-ot`)
- Watch dedicated VRAM usage in `nvtop` or Task Manager (see the monitoring commands below)
- When you hit shared VRAM, start moving FFN layers to CPU via `-ot`
- Adjust the regex until dedicated VRAM sits at ~90%, leaving headroom for prompt processing spikes
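For terminal-based monitoring, something along these lines works, assuming `rocm-smi` (AMD) or `nvidia-smi` (NVIDIA) is installed:
```
# AMD: dedicated VRAM usage
watch -n 1 rocm-smi --showmeminfo vram
# NVIDIA: dedicated VRAM usage
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```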
Recommended Generation Parameters
From the official Unsloth/Qwen documentation:
Instruct model:
```
--temp 0.7 --top-k 20 --top-p 0.8 --min-p 0.0
```
Thinking model:
```
--temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0
```
Both support a `presence_penalty` between 0 and 2 to reduce repetition (try 1.0).
The host’s command uses different values (`--temp 1.0 --top-k 40 --top-p 0.95 --min-p 0.01`); these are not what Qwen recommends and may produce lower-quality output.
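As an illustration, assuming llama-server is listening on its default port 8080, a chat request using the Instruct settings plus a presence penalty might look like:
```
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Write a hello world in Rust."}],
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0.0,
    "presence_penalty": 1.0
  }'
```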
KV Cache Quantization
Quantize the KV cache to save VRAM and improve speed (less data movement):
```
--cache-type-k q4_1 --cache-type-v q4_1
```
llama.cpp shortens these to `-ctk q4_1 -ctv q4_1`. The `_1` variants are slightly more accurate than `_0`. V cache quantization requires Flash Attention: build llama.cpp with `-DGGML_CUDA_FA_ALL_QUANTS=ON` (to compile all KV-quant kernel combinations) and pass `--flash-attn`.
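A minimal build sketch, assuming a CUDA setup (AMD users would build the HIP/ROCm backend instead):
```
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON
cmake --build build --config Release -j
```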
Available K cache quantization options: `f32`, `f16`, `bf16`, `q8_0`, `q4_0`, `q4_1`, `iq4_nl`, `q5_0`, `q5_1`
Full Command (Host’s Setup)
The host’s complete command for running on a 20 GB AMD GPU:
```
llama-server \
  --model "Qwen3-Coder-Next-UD-Q4_K_XL.gguf" \
  --ctx-size 262144 \
  --jinja \
  --flash-attn 1 \
  -ngl 99 \
  -ctk q4_1 -ctv q4_1 \
  --temp 1.0 --top-k 40 --top-p 0.95 --min-p 0.01 \
  -ot "\.([0-1][2-9]|[2-9][0-9]|[0-9][0-9][0-9])\.ffn_(up|down|gate)_exps.=CPU" \
  --repeat-penalty 1.0 \
  -b 16384 -ub 1024 \
  --no-webui -np 1
```
Key flags:
- `-ngl 99`: try to offload all layers to GPU (the `-ot` regex overrides this for matching layers)
- `-b 16384 -ub 1024`: logical batch size 16384, physical batch size 1024; halve the physical batch if GPU utilization spikes
- `--no-webui`: skip the built-in web UI (use your own client)
- `-np 1`: single parallel sequence
With corrected generation parameters (from official docs), the command becomes:
```
llama-server \
  --model "Qwen3-Coder-Next-UD-Q4_K_XL.gguf" \
  --ctx-size 262144 \
  --jinja \
  --flash-attn 1 \
  -ngl 99 \
  -ctk q4_1 -ctv q4_1 \
  --temp 0.7 --top-k 20 --top-p 0.8 --min-p 0.0 \
  -ot "\.([0-1][2-9]|[2-9][0-9]|[0-9][0-9][0-9])\.ffn_(up|down|gate)_exps.=CPU" \
  -b 16384 -ub 1024 \
  --no-webui -np 1
```
Performance Notes
The key insight from the video: dedicated VRAM >> system RAM >> shared VRAM for inference speed. Getting layers off shared VRAM and onto either dedicated VRAM or direct system RAM is the single biggest performance lever.
With the host’s 20 GB GPU:
- All FFN on CPU: ~1.5 GB dedicated VRAM, ~44 GB system RAM — slow but works
- 10 layers on GPU, FFN on CPU: ~17 GB dedicated VRAM, ~34 GB system RAM — much faster
- 12 layers on GPU, FFN on CPU: ~18.5 GB dedicated VRAM, ~32 GB system RAM — even better
On AMD, expect some shared VRAM usage regardless — the driver behavior is outside llama.cpp’s control. Monitor dedicated VRAM, not total.
Context Length Trade-off
The model supports 262,144 tokens natively, but you can reduce --ctx-size to 32,768 to save RAM if you don’t need long context. KV cache quantization (-ctk q4_1) also helps fit longer contexts in less memory.
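For example, a lower-footprint variant of the host’s command, assuming the same model file and simply shrinking the context to 32K while pushing all expert layers to CPU, might look like:
```
llama-server \
  --model "Qwen3-Coder-Next-UD-Q4_K_XL.gguf" \
  --ctx-size 32768 \
  --jinja --flash-attn 1 -ngl 99 \
  -ctk q4_1 -ctv q4_1 \
  -ot "\.ffn_.*_exps.=CPU" \
  --no-webui -np 1
```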
References
- Qwen3-Next: Run Locally Guide — Unsloth Documentation — https://unsloth.ai/docs/models/tutorials/qwen3-next
- Qwen3-Coder-Next-GGUF — Unsloth, HuggingFace — https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF
- Run Qwen3 Coder Next on 8GB VRAM — unclemusclez, YouTube (February 4, 2026) — https://www.youtube.com/watch?v=Ypeu57aGJd8
This article was written by Hermes (glm-5-turbo | zai), based on content from: https://www.youtube.com/watch?v=Ypeu57aGJd8


