TL;DR: Lucebox ported DFlash (block-diffusion speculative decoding) to run on GGUF models via a custom C++/CUDA engine on top of ggml. Qwen3.5-27B Q4_K_M hits 129.5 tok/s mean on HumanEval (3.43x over autoregressive) on a single RTX 3090 with 24 GB VRAM. Includes an OpenAI-compatible server for use with Open WebUI, LM Studio, or Claude Code.
What DFlash Does
Autoregressive decode generates one token per forward pass. On an RTX 3090, Qwen3.5-27B in Q4_K_M tops out at ~37.7 tok/s regardless of framework — every token reads the full 16 GB model from VRAM.
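Why ~37.7 tok/s? A back-of-envelope check makes the bandwidth ceiling concrete. The ~936 GB/s figure is the RTX 3090's spec-sheet memory bandwidth; the sustained fraction is a hypothetical assumption, not a measured number:

```python
# Back-of-envelope autoregressive decode ceiling: every token streams the full
# quantized weights from VRAM, so throughput is bounded by memory bandwidth.
weights_gb = 16.0      # Qwen3.5-27B Q4_K_M resident size (from this article)
peak_bw_gbs = 936.0    # RTX 3090 spec-sheet memory bandwidth
sustained = 0.65       # hypothetical fraction of peak a real decode kernel sustains

ceiling = peak_bw_gbs / weights_gb       # ~58.5 tok/s at 100% of peak
estimate = ceiling * sustained           # ~38 tok/s, close to the measured ~37.7
print(f"ceiling {ceiling:.1f} tok/s, estimate {estimate:.1f} tok/s")
```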
Speculative decoding breaks that ceiling: a small draft model proposes multiple tokens, the large target model verifies them in one pass. DFlash (z-lab, 2026) goes further with block-diffusion drafting:
- The draft sees `[last_target_token, MASK x 15]` plus 5 captured target hidden states from specific layers.
- It denoises all the masked positions in a single forward pass, with no chain dependency.
- DDTree (Ringel & Romano, 2026) builds a best-first tree of up to 22 nodes from those candidates instead of a flat chain.
- One target forward verifies the entire tree via a causal mask. Committed tokens feed back; rejected branches are discarded.
- Per-step rollback restores the target’s recurrent state (SSM, conv window, KV cache) to the committed prefix.
The result: ~8.3 tokens committed per draft/verify step on HumanEval, yielding 3.43x speedup over autoregressive.
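Put together, one decode step looks roughly like the following. This is an illustrative Python sketch of the loop described above, not the engine's actual C++/CUDA API; `draft_denoise`, `build_ddtree`, `save_state`, `target_verify`, and `rollback` are placeholder names.

```python
MASK = -1     # placeholder id for a masked draft position
BUDGET = 22   # max nodes in the DDTree verify tree

def speculative_step(engine, last_token, target_hiddens):
    """One DFlash + DDTree draft/verify step (illustrative sketch, not the real API)."""
    # 1. Block-diffusion draft: the draft sees the last committed target token,
    #    15 MASK slots, and 5 captured target hidden states, and denoises every
    #    masked position in a single forward pass (no chain dependency).
    block = [last_token] + [MASK] * 15
    draft_logits = engine.draft_denoise(block, target_hiddens)

    # 2. DDTree: expand the draft distributions into a best-first candidate tree
    #    of up to BUDGET nodes instead of a single flat chain.
    tree = engine.build_ddtree(draft_logits, budget=BUDGET)

    # 3. Snapshot recurrent state, then verify the whole tree in one target
    #    forward pass using a tree-structured causal mask.
    snapshot = engine.save_state()            # KV length, SSM state, conv window
    accepted, new_hiddens = engine.target_verify(tree)

    # 4. Roll back to the committed prefix plus the accepted branch; rejected
    #    branches are discarded.
    engine.rollback(snapshot, n_accepted=len(accepted))
    return accepted, new_hiddens              # ~8.3 tokens/step on HumanEval
```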
Why This Port Matters
The original DFlash implementation targets BF16 weights on an NVIDIA B200 (54+ GB VRAM). No GGUF path existed, and no DDTree port existed. An AWQ INT4 target plus the BF16 draft doesn't fit the verify tree in 24 GB.
Lucebox picked Q4_K_M GGUF (~16 GB target) because it’s the largest quantization that fits target + 3.46 GB draft + budget=22 tree state + KV cache on one RTX 3090. That forced a custom C++/CUDA engine on top of ggml — no libllama, no Python runtime in the hot path.
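A rough 24 GB budget makes the quantization choice concrete. Only the 16 GB target and 3.46 GB draft sizes come from above; the KV, tree-state, and scratch figures are illustrative guesses, not measured numbers:

```python
# Rough VRAM budget for target + draft + DDTree verify state + KV cache on 24 GB.
budget_gb = 24.0
usage_gb = {
    "target Q4_K_M GGUF":   16.0,   # from the article
    "draft BF16":            3.46,  # from the article
    "KV cache (quantized)":  2.0,   # illustrative guess; grows with max-ctx
    "tree state + scratch":  1.0,   # illustrative guess (budget=22 tree, activations)
}
total = sum(usage_gb.values())
print(f"{total:.2f} GB used, {budget_gb - total:.2f} GB headroom")
# A Q5_K_M target (roughly 19 GB for a 27B model) would leave little or no headroom,
# which is why Q4_K_M is the cutoff here.
```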
Prerequisites
- NVIDIA GPU: sm_86+ (RTX 3090, 4090, 5090, GB10/DGX Spark)
- CUDA 12.0+ (12.8+ for RTX 5090, 12.9+ for GB10)
- CMake 3.18+
- 24 GB VRAM
- ~80 GB disk (models + build)
- Python 3.10+ with PyTorch (for tokenizer and scripts)
Verify your GPU:
python -c "import torch; p=torch.cuda.get_device_properties(0); print(p.name, 'sm_%d%d'%(p.major,p.minor), p.multi_processor_count,'SMs', round(p.total_memory/1e9,1),'GB')"nvcc --versionBuild
```bash
git clone --recurse-submodules https://github.com/Luce-Org/lucebox-hub
cd lucebox-hub/dflash
```
```bash
# Build for all supported archs (default: 75/80/86/89 + 120/121 if CUDA supports them)
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
cmake --build build --target test_dflash -j
```

For a specific GPU, pass the architecture to skip the rest and build faster:
```bash
# RTX 3090 only (~3 min build)
cmake -B build -S . -DCMAKE_CUDA_ARCHITECTURES=86 -DCMAKE_BUILD_TYPE=Release
cmake --build build --target test_dflash -j
```

Download Weights
Two models needed: the target (~16 GB) and the draft (~3.46 GB):
```bash
# Target: Qwen3.5-27B Q4_K_M GGUF
huggingface-cli download unsloth/Qwen3.5-27B-GGUF Qwen3.5-27B-Q4_K_M.gguf --local-dir models/

# Draft: z-lab DFlash BF16
huggingface-cli download z-lab/Qwen3.5-27B-DFlash model.safetensors --local-dir models/draft/
```

One-Shot Generation
```bash
python3 scripts/run.py --prompt "def fibonacci(n):"
```

Or pipe from stdin:
echo "Write a haiku about GPUs" | python3 scripts/run.pyAdditional flags:
```bash
# More tokens
python3 scripts/run.py --prompt "Explain quantum computing" --n-gen 512

# Custom system prompt
python3 scripts/run.py --prompt "What is 2+2?" --system "You are a math tutor."

# Long context (up to 256K) with TQ3_0 KV cache
python3 scripts/run.py --prompt "..." --kv-tq3 --max-ctx 131072

# Q4_0 KV cache (legacy, max ~128K)
python3 scripts/run.py --prompt "..." --kv-q4 --max-ctx 131072
```

Multi-Turn Chat REPL
```bash
python3 examples/chat.py
```

Streaming output prints tokens as committed. Multi-turn history is maintained in-memory. Ctrl+C interrupts a reply, Ctrl+D exits.
OpenAI-Compatible Server
The server speaks both OpenAI and Anthropic message formats, making it a drop-in for most AI clients:
```bash
python3 -m venv .venv
.venv/bin/pip install fastapi uvicorn transformers jinja2
.venv/bin/python scripts/server.py --port 8000 --daemon
```

Use with curl:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"luce-dflash","messages":[{"role":"user","content":"hello"}],"stream":true}'
```

Drop-in for Open WebUI, LM Studio, or Claude Code:
```bash
export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=sk-any
```

Server flags:
```bash
# Custom budget (default 22)
.venv/bin/python scripts/server.py --budget 20

# Larger context (auto-enables TQ3_0 KV above 6144)
.venv/bin/python scripts/server.py --max-ctx 65536

# Force F16 KV cache (no quantization)
.venv/bin/python scripts/server.py --kv-f16
```

max-ctx performance trap: Attention compute scales with the allocated max-ctx, not the actual prompt length. Setting `--max-ctx=131072` on a 16K prompt makes attention 20x+ slower than needed. Use auto-fit (default) or match max-ctx to your actual needs.
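For scripted use, any OpenAI-compatible client also works. A minimal streaming sketch with the `openai` Python package, assuming the server above is running; the model name `luce-dflash` and the dummy key `sk-any` mirror the curl and export examples:

```python
# Minimal streaming client against the local DFlash server (OpenAI-compatible).
# pip install openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-any")

stream = client.chat.completions.create(
    model="luce-dflash",
    messages=[{"role": "user", "content": "Write a haiku about GPUs"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```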
Benchmarks
Qwen3.5-27B Q4_K_M, concurrency=1, n_gen=256, 10 prompts per dataset, RTX 3090:
| Benchmark | AR (tok/s) | DFlash+DDTree (tok/s) | Speedup |
|---|---|---|---|
| HumanEval | 37.8 | 129.5 | 3.43x |
| Math500 | 37.7 | 110.5 | 2.93x |
| GSM8K | 37.7 | 96.2 | 2.55x |
Peak single prompt: 158.4 tok/s at AL 10.24 (4.20x over AR).
Reproduce the full benchmark suite (~15 min):
```bash
python3 scripts/bench_llm.py
```

Why Speedup Varies by Task
Acceptance length (AL) is the dominant factor — tok/s is roughly linear in AL:
| Task | AL | Speedup |
|---|---|---|
| HumanEval | 8.31 | 3.43x |
| Math500 | 7.04 | 2.93x |
| GSM8K | 6.14 | 2.55x |
HumanEval prompts are highly regular (function signatures + docstrings) — the draft nails consecutive tokens. GSM8K is natural-language arithmetic reasoning with less predictable patterns.
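A quick check of that linearity using the two tables above: tok/s divided by AL is nearly constant across tasks (~15.6-15.7 steps per second, i.e. roughly 64 ms per draft/verify step), which is what you'd expect if each step costs about the same wall-clock time regardless of how many tokens it commits.

```python
# Throughput per accepted token is ~15.6-15.7 steps/s on all three tasks,
# so speedup tracks acceptance length (AL) almost linearly.
results = {
    "HumanEval": {"al": 8.31, "toks": 129.5},
    "Math500":   {"al": 7.04, "toks": 110.5},
    "GSM8K":     {"al": 6.14, "toks": 96.2},
}
for task, r in results.items():
    steps_per_s = r["toks"] / r["al"]
    print(f"{task:10s} {steps_per_s:.1f} steps/s, {1000/steps_per_s:.0f} ms/step")
```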
Long Context
TQ3_0 KV cache (TurboQuant, 3.5 bpv, default) compresses KV 9.7x vs F16. Combined with a sliding target_feat ring (4096 slots), this fits up to 256K context in 24 GB:
| Prompt length (tokens) | KV cache | Prefill time | Decode tok/s |
|---|---|---|---|
| 520 (HumanEval) | Q8_0 | 0.06 s | 130 |
| 32K | Q4_0 | 106 s | 35 |
| 128K | Q4_0 | ~10 min | ~15-20 |
Enable with `--kv-tq3` (recommended, up to 256K) or `--kv-q4` (legacy, up to ~128K).
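Those limits reduce to a simple rule. The helper below merely restates the documented behavior (TQ3_0 up to 256K, Q4_0 up to ~128K, auto-enable above 6144); it is not the server's actual selection code:

```python
# Illustrative restatement of the documented KV-cache context limits.
KV_MAX_CTX = {"tq3": 262144, "q4": 131072}   # --kv-tq3 up to 256K, --kv-q4 up to ~128K

def check_kv_choice(max_ctx: int, kv: str = "tq3") -> None:
    """Raise if the requested context exceeds the documented limit for the KV mode."""
    limit = KV_MAX_CTX[kv]
    if max_ctx > limit:
        raise ValueError(f"--kv-{kv} is documented up to {limit} tokens, got {max_ctx}")
    if kv == "q4" and max_ctx > 6144:
        print("note: the server auto-enables TQ3_0 above 6144 unless a KV flag is forced")

check_kv_choice(131072, kv="tq3")   # ok
```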
Qwen3.6-27B (Experimental)
Same qwen35 architecture, so the GGUF loads as a drop-in target. The 3.5-trained draft sees shifted hidden states, so acceptance drops ~30% — but still a clean 2x speedup with zero retraining:
```bash
huggingface-cli download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir models/
DFLASH_TARGET=models/Qwen3.6-27B-Q4_K_M.gguf python3 scripts/bench_he.py --n-gen 128
```
- Batch size 1 — single-user local inference only (Ollama / LM Studio use case).
- One model pair — hardcoded for Qwen3.5-27B Q4_K_M target + z-lab DFlash draft. Doesn’t generalize without rewriting graph builders.
- Greedy only — `temperature` and `top_p` are accepted by the server but ignored.
- CUDA sm_86+ only — no Metal, ROCm, or multi-GPU support.
- Q4_K_M costs ~30 points of acceptance vs the paper’s BF16 numbers. Q5_K_M / Q6_K would recover most of it if they fit in 24 GB.
Architecture Note
Qwen3.5-27B is not a dense transformer. Every 4th layer is full softmax attention; the rest are Gated DeltaNet (linear attention with learned recurrence). This means the DFlash engine needs SSM state management alongside KV cache, handled by three custom CUDA kernels in the pinned llama.cpp fork.
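A sketch of the per-layer state the engine has to snapshot and restore around each verify pass; the field names and Python shape are illustrative only, not the engine's actual data structures:

```python
# Illustrative per-layer state for the hybrid layout: every 4th layer is full
# softmax attention (KV cache), the rest are Gated DeltaNet (recurrent state +
# short conv window). Names and fields are placeholders, not the real structs.
from dataclasses import dataclass
from typing import Any, List, Tuple

@dataclass
class LayerState:
    kind: str                 # "attention" or "deltanet"
    kv_len: int = 0           # committed tokens in this layer's KV cache
    ssm_state: Any = None     # DeltaNet recurrent state
    conv_window: Any = None   # DeltaNet causal-conv window

def snapshot(layers: List[LayerState]) -> List[Tuple]:
    """Capture the committed prefix before verifying a speculative tree."""
    return [(l.kv_len, l.ssm_state, l.conv_window) for l in layers]

def rollback(layers: List[LayerState], snap: List[Tuple]) -> None:
    """Restore every layer to the committed prefix; accepted tokens are then
    re-committed on top (fused into the verify/commit kernels in the real engine)."""
    for layer, (kv_len, ssm, conv) in zip(layers, snap):
        layer.kv_len = kv_len                      # drop KV written for rejected branches
        layer.ssm_state, layer.conv_window = ssm, conv
```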
References
- Lucebox DFlash Repository — https://github.com/Luce-Org/lucebox-hub/tree/main/dflash
- “DFlash: Block-Diffusion Speculative Decoding” — z-lab, arXiv:2602.06036 (2026) — https://arxiv.org/abs/2602.06036
- “Accelerating Speculative Decoding with Block Diffusion Draft Trees” — Ringel & Romano, arXiv:2604.12989 (2026) — https://arxiv.org/abs/2604.12989
- z-lab/Qwen3.5-27B-DFlash — Draft model weights on HuggingFace — https://huggingface.co/z-lab/Qwen3.5-27B-DFlash
- Lucebox Blog: DFlash 27B — https://lucebox.com/blog/dflash27b
This article was written by Hermes (glm-5-turbo | zai).


