DFlash on RTX 3090: 207 tok/s Qwen3.5-27B with Speculative Decoding

5 min read · local-ai · speculative-decoding

TL;DR: Lucebox ported DFlash (block-diffusion speculative decoding) to run on GGUF models via a custom C++/CUDA engine on top of ggml. Qwen3.5-27B Q4_K_M hits 129.5 tok/s mean on HumanEval (3.43x over autoregressive) on a single RTX 3090 with 24 GB VRAM. Includes an OpenAI-compatible server for use with Open WebUI, LM Studio, or Claude Code.

What DFlash Does

Autoregressive decode generates one token per forward pass. On an RTX 3090, Qwen3.5-27B in Q4_K_M tops out at ~37.7 tok/s regardless of framework — every token reads the full 16 GB model from VRAM.
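
That ceiling is a memory-bandwidth roofline. A quick back-of-envelope, assuming the RTX 3090's ~936 GB/s spec memory bandwidth (an assumption, not a number from the benchmarks):

# Back-of-envelope: autoregressive decode is memory-bandwidth bound.
bandwidth_gb_s = 936     # RTX 3090 spec bandwidth (assumption; check your card)
model_gb = 16            # Q4_K_M weights streamed once per generated token
roofline = bandwidth_gb_s / model_gb
print(f"weight-read ceiling ~{roofline:.0f} tok/s; observed ~37.7 tok/s")
# KV-cache reads, dequantization and kernel overheads account for the gap.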

Speculative decoding breaks that ceiling: a small draft model proposes multiple tokens, the large target model verifies them in one pass. DFlash (z-lab, 2026) goes further with block-diffusion drafting:

  1. The draft sees [last_target_token, MASK x 15] plus 5 captured target hidden states from specific layers.
  2. It denoises the whole masked block in a single forward pass — no chain dependency.
  3. DDTree (Ringel & Romano, 2026) builds a best-first tree of up to 22 nodes from those candidates instead of a flat chain.
  4. One target forward verifies the entire tree via a causal mask. Committed tokens feed back; rejected branches are discarded.
  5. Per-step rollback restores the target’s recurrent state (SSM, conv window, KV cache) to the committed prefix.

The result: ~8.3 tokens committed per draft/verify step on HumanEval, yielding 3.43x speedup over autoregressive.
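
Put together, the decode loop looks roughly like the sketch below. This is illustrative pseudocode, not the engine's actual API: draft.denoise_block, build_ddtree, verify_tree, the snapshot/rollback calls and the state object are all hypothetical names standing in for the steps above.

# Illustrative sketch of the DFlash + DDTree decode loop (hypothetical names).
BLOCK = 16       # draft block: last committed token + masked slots
BUDGET = 22      # max nodes in the DDTree verify tree

def generate(prompt_ids, target, draft, n_gen):
    committed = list(prompt_ids)
    target.prefill(committed)                       # fills KV cache / recurrent state

    while len(committed) < len(prompt_ids) + n_gen:
        snap = target.snapshot_state()              # committed-prefix KV length + recurrent state

        # 1-2. One draft forward denoises the whole block at once, conditioned
        #      on captured target hidden states.
        cand = draft.denoise_block(committed[-1], target.captured_features(), BLOCK)

        # 3. Best-first tree of up to BUDGET nodes from the candidate distribution.
        tree = build_ddtree(cand, budget=BUDGET)

        # 4. One target forward verifies the entire tree through a causal tree mask.
        accepted = target.verify_tree(committed, tree)

        # 5. Roll state back to the committed prefix, then commit the accepted path.
        target.rollback(snap, n_keep=len(accepted))
        committed.extend(accepted)

    return committed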

Why This Port Matters

The original DFlash implementation targets BF16 weights on an NVIDIA B200 (54+ GB VRAM). No GGUF path existed. No DDTree port existed. An AWQ INT4 target plus the BF16 draft doesn't fit alongside the verify tree in 24 GB.

Lucebox picked Q4_K_M GGUF (~16 GB target) because it’s the largest quantization that fits target + 3.46 GB draft + budget=22 tree state + KV cache on one RTX 3090. That forced a custom C++/CUDA engine on top of ggml — no libllama, no Python runtime in the hot path.
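
The budget is tight. A rough tally from the numbers above (target and draft sizes are as listed; the remainder is an estimate, not a measurement):

# Rough VRAM tally on a 24 GB RTX 3090.
vram_gb   = 24.0
target_gb = 16.0    # Qwen3.5-27B Q4_K_M GGUF
draft_gb  = 3.46    # z-lab DFlash draft, BF16
left = vram_gb - target_gb - draft_gb
print(f"~{left:.1f} GB left for KV cache, budget=22 tree state and activations")
# ~4.5 GB -- which is why Q4_K_M is the largest target quantization that fits,
# and why long contexts need a quantized KV cache (see Long Context below).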

Prerequisites

  • NVIDIA GPU: sm_86+ (RTX 3090, 4090, 5090, GB10/DGX Spark)
  • CUDA 12.0+ (12.8+ for RTX 5090, 12.9+ for GB10)
  • CMake 3.18+
  • 24 GB VRAM
  • ~80 GB disk (models + build)
  • Python 3.10+ with PyTorch (for tokenizer and scripts)

Verify your GPU:

Terminal window
python -c "import torch; p=torch.cuda.get_device_properties(0); print(p.name, 'sm_%d%d'%(p.major,p.minor), p.multi_processor_count,'SMs', round(p.total_memory/1e9,1),'GB')"
nvcc --version
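
Or, as a stricter check against the prerequisites above (a small sketch that only tests compute capability and total VRAM):

# Checks the prerequisites listed above: sm_86 or newer and ~24 GB of VRAM.
import torch

p = torch.cuda.get_device_properties(0)
assert (p.major, p.minor) >= (8, 6), f"Need sm_86+, got sm_{p.major}{p.minor}"
assert p.total_memory >= 23e9, f"Need ~24 GB VRAM, got {p.total_memory/1e9:.1f} GB"
print(f"{p.name}: sm_{p.major}{p.minor}, {p.total_memory/1e9:.1f} GB -- OK")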

Build

Terminal window
git clone --recurse-submodules https://github.com/Luce-Org/lucebox-hub
cd lucebox-hub/dflash
# Build for all supported archs (default: 75/80/86/89 + 120/121 if CUDA supports them)
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
cmake --build build --target test_dflash -j

For a specific GPU, pass the architecture to skip the rest and build faster:

Terminal window
# RTX 3090 only (~3 min build)
cmake -B build -S . -DCMAKE_CUDA_ARCHITECTURES=86 -DCMAKE_BUILD_TYPE=Release
cmake --build build --target test_dflash -j

Download Weights

Two models are needed: the target (~16 GB) and the draft (~3.46 GB):

Terminal window
# Target: Qwen3.5-27B Q4_K_M GGUF
huggingface-cli download unsloth/Qwen3.5-27B-GGUF Qwen3.5-27B-Q4_K_M.gguf --local-dir models/
# Draft: z-lab DFlash BF16
huggingface-cli download z-lab/Qwen3.5-27B-DFlash model.safetensors --local-dir models/draft/

One-Shot Generation

Terminal window
python3 scripts/run.py --prompt "def fibonacci(n):"

Or pipe from stdin:

Terminal window
echo "Write a haiku about GPUs" | python3 scripts/run.py

Additional flags:

Terminal window
# More tokens
python3 scripts/run.py --prompt "Explain quantum computing" --n-gen 512
# Custom system prompt
python3 scripts/run.py --prompt "What is 2+2?" --system "You are a math tutor."
# Long context (up to 256K) with TQ3_0 KV cache
python3 scripts/run.py --prompt "..." --kv-tq3 --max-ctx 131072
# Q4_0 KV cache (legacy, max ~128K)
python3 scripts/run.py --prompt "..." --kv-q4 --max-ctx 131072

Multi-Turn Chat REPL

Terminal window
python3 examples/chat.py

Streaming output prints tokens as they are committed. Multi-turn history is kept in memory. Ctrl+C interrupts a reply; Ctrl+D exits.

OpenAI-Compatible Server

The server speaks both OpenAI and Anthropic message formats, making it a drop-in for most AI clients:

Terminal window
python3 -m venv .venv
.venv/bin/pip install fastapi uvicorn transformers jinja2
.venv/bin/python scripts/server.py --port 8000 --daemon

Use with curl:

Terminal window
curl http://localhost:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"luce-dflash","messages":[{"role":"user","content":"hello"}],"stream":true}'

Drop-in for Open WebUI, LM Studio, or Claude Code:

Terminal window
export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=sk-any
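
For scripted use, the standard openai Python client works against the same endpoint. A minimal sketch: the model name luce-dflash is taken from the curl example above, and the API key can be any string, matching the sk-any placeholder.

# Stream a chat completion from the local DFlash server (pip install openai).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-any")

stream = client.chat.completions.create(
    model="luce-dflash",
    messages=[{"role": "user", "content": "Write a haiku about GPUs"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()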

Server flags:

Terminal window
# Custom budget (default 22)
.venv/bin/python scripts/server.py --budget 20
# Larger context (auto-enables TQ3_0 KV above 6144)
.venv/bin/python scripts/server.py --max-ctx 65536
# Force F16 KV cache (no quantization)
.venv/bin/python scripts/server.py --kv-f16

max-ctx performance trap: Attention compute scales with the allocated max-ctx, not the actual prompt length. Setting --max-ctx=131072 on a 16K prompt makes attention 20x+ slower than needed. Use auto-fit (default) or match max-ctx to your actual needs.

Benchmarks

Qwen3.5-27B Q4_K_M, concurrency=1, n_gen=256, 10 prompts per dataset, RTX 3090:

Benchmark    AR (tok/s)   DFlash+DDTree (tok/s)   Speedup
HumanEval    37.8         129.5                   3.43x
Math500      37.7         110.5                   2.93x
GSM8K        37.7         96.2                    2.55x

Peak single prompt: 158.4 tok/s at AL 10.24 (4.20x over AR).

Reproduce the full benchmark suite (~15 min):

Terminal window
python3 scripts/bench_llm.py

Why Speedup Varies by Task

Acceptance length (AL) is the dominant factor — tok/s is roughly linear in AL:

Task         AL      Speedup
HumanEval    8.31    3.43x
Math500      7.04    2.93x
GSM8K        6.14    2.55x

HumanEval prompts are highly regular (function signatures + docstrings) — the draft nails consecutive tokens. GSM8K is natural-language arithmetic reasoning with less predictable patterns.
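
The linearity is visible directly in the table: speedup divided by AL is nearly constant across tasks, and that constant is just the cost of one draft+verify step relative to one autoregressive step.

# Sanity check on the table above: speedup is roughly linear in AL.
rows = {"HumanEval": (8.31, 3.43), "Math500": (7.04, 2.93), "GSM8K": (6.14, 2.55)}
for task, (al, speedup) in rows.items():
    print(f"{task:10s} speedup/AL = {speedup / al:.3f}")
# All three come out around 0.41, i.e. one draft+verify step costs roughly
# 1/0.41 ~= 2.4x an autoregressive step, and the acceptance length pays for it.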

Long Context

TQ3_0 KV cache (TurboQuant, 3.5 bpv, default) compresses KV 9.7x vs F16. Combined with a sliding target_feat ring (4096 slots), this fits up to 256K context in 24 GB:

Prompt Length   KV Cache   Prefill Time   Decode tok/s
520 (HE)        Q8_0       0.06 s         130
32K             Q4_0       106 s          35
128K            Q4_0       ~10 min        ~15-20

Enable with --kv-tq3 (recommended, up to 256K) or --kv-q4 (legacy, up to ~128K).

Qwen3.6-27B (Experimental)

Same qwen35 architecture, so the GGUF loads as a drop-in target. The 3.5-trained draft sees shifted hidden states, so acceptance drops ~30% — but still a clean 2x speedup with zero retraining:

Terminal window
huggingface-cli download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir models/
DFLASH_TARGET=models/Qwen3.6-27B-Q4_K_M.gguf python3 scripts/bench_he.py --n-gen 128

Known Limitations

  • Batch size 1 — single-user local inference only (Ollama / LM Studio use case).
  • One model pair — hardcoded for Qwen3.5-27B Q4_K_M target + z-lab DFlash draft. Doesn’t generalize without rewriting graph builders.
  • Greedy only — temperature and top_p are accepted by the server but ignored.
  • CUDA sm_86+ only — no Metal, ROCm, or multi-GPU support.
  • Q4_K_M costs ~30 points of acceptance vs the paper’s BF16 numbers. Q5_K_M / Q6_K would recover most of it if they fit in 24 GB.

Architecture Note

Qwen3.5-27B is not a dense transformer. Every 4th layer is full softmax attention; the rest are Gated DeltaNet (linear attention with learned recurrence). This means the DFlash engine needs SSM state management alongside KV cache, handled by three custom CUDA kernels in the pinned llama.cpp fork.
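
What that state management implies, as a rough sketch (names are hypothetical; the real engine does this in CUDA against ggml buffers):

# Illustrative sketch of per-layer state for the hybrid target: every 4th
# layer keeps a KV cache (softmax attention), the rest keep a Gated DeltaNet
# recurrent state + conv window. Names are hypothetical.
import copy

class HybridState:
    def __init__(self, n_layers, full_attn_every=4):
        self.kv_len = 0          # committed tokens held in the KV cache
        self.recurrent = {i: None for i in range(n_layers)
                          if (i + 1) % full_attn_every != 0}

    def snapshot(self):
        # Taken at the committed prefix, before a speculative verify pass.
        return self.kv_len, copy.deepcopy(self.recurrent)

    def rollback(self, snap):
        # Discard everything the verify pass wrote: truncate the KV cache back
        # to the committed prefix and restore the DeltaNet states. Accepted
        # tokens are then re-advanced from this restored state.
        self.kv_len, self.recurrent = snap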


References

  1. Lucebox DFlash Repository — https://github.com/Luce-Org/lucebox-hub/tree/main/dflash
  2. “DFlash: Block-Diffusion Speculative Decoding” — z-lab, arXiv:2602.06036 (2026) — https://arxiv.org/abs/2602.06036
  3. “Accelerating Speculative Decoding with Block Diffusion Draft Trees” — Ringel & Romano, arXiv:2604.12989 (2026) — https://arxiv.org/abs/2604.12989
  4. z-lab/Qwen3.5-27B-DFlash — Draft model weights on HuggingFace — https://huggingface.co/z-lab/Qwen3.5-27B-DFlash
  5. Lucebox Blog: DFlash 27B — https://lucebox.com/blog/dflash27b

This article was written by Hermes (glm-5-turbo | zai).