TL;DR: Lucebox ported DFlash (block-diffusion speculative decoding) to run on GGUF models via a custom C++/CUDA engine on top of ggml. Qwen3.5-27B Q4_K_M hits 129.5 tok/s mean on HumanEval (3.43x over autoregressive) on a single RTX 3090 with 24 GB VRAM. Includes an OpenAI-compatible server for use with Open WebUI, LM Studio, or Claude Code.
What DFlash Does
Autoregressive decode generates one token per forward pass. On an RTX 3090, Qwen3.5-27B in Q4_K_M tops out at ~37.7 tok/s regardless of framework — every token reads the full 16 GB model from VRAM.
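Why ~37.7 tok/s? A back-of-envelope check makes the bandwidth ceiling concrete. The ~936 GB/s figure is the RTX 3090's spec-sheet memory bandwidth; the sustained fraction is a hypothetical assumption, not a measured number:

```python
# Back-of-envelope autoregressive decode ceiling: every token streams the full
# quantized weights from VRAM, so throughput is bounded by memory bandwidth.
weights_gb = 16.0      # Qwen3.5-27B Q4_K_M resident size (from this article)
peak_bw_gbs = 936.0    # RTX 3090 spec-sheet memory bandwidth
sustained = 0.65       # hypothetical fraction of peak a real decode kernel sustains

ceiling = peak_bw_gbs / weights_gb       # ~58.5 tok/s at 100% of peak
estimate = ceiling * sustained           # ~38 tok/s, close to the measured ~37.7
print(f"ceiling {ceiling:.1f} tok/s, estimate {estimate:.1f} tok/s")
```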
Speculative decoding breaks that ceiling: a small draft model proposes multiple tokens, the large target model verifies them in one pass. DFlash (z-lab, 2026) goes further with block-diffusion drafting:
- The draft sees `[last_target_token, MASK x 15]` plus 5 captured target hidden states from specific layers.
- It denoises all the masked positions in a single forward pass, with no chain dependency.
- DDTree (Ringel & Romano, 2026) builds a best-first tree of up to 22 nodes from those candidates instead of a flat chain.
- One target forward verifies the entire tree via a causal mask. Committed tokens feed back; rejected branches are discarded.
- Per-step rollback restores the target’s recurrent state (SSM, conv window, KV cache) to the committed prefix.
The result: ~8.3 tokens committed per draft/verify step on HumanEval, yielding 3.43x speedup over autoregressive.
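Put together, one decode step looks roughly like the following. This is an illustrative Python sketch of the loop described above, not the engine's actual C++/CUDA API; `draft_denoise`, `build_ddtree`, `save_state`, `target_verify`, and `rollback` are placeholder names.

```python
MASK = -1     # placeholder id for a masked draft position
BUDGET = 22   # max nodes in the DDTree verify tree

def speculative_step(engine, last_token, target_hiddens):
    """One DFlash + DDTree draft/verify step (illustrative sketch, not the real API)."""
    # 1. Block-diffusion draft: the draft sees the last committed target token,
    #    15 MASK slots, and 5 captured target hidden states, and denoises every
    #    masked position in a single forward pass (no chain dependency).
    block = [last_token] + [MASK] * 15
    draft_logits = engine.draft_denoise(block, target_hiddens)

    # 2. DDTree: expand the draft distributions into a best-first candidate tree
    #    of up to BUDGET nodes instead of a single flat chain.
    tree = engine.build_ddtree(draft_logits, budget=BUDGET)

    # 3. Snapshot recurrent state, then verify the whole tree in one target
    #    forward pass using a tree-structured causal mask.
    snapshot = engine.save_state()            # KV length, SSM state, conv window
    accepted, new_hiddens = engine.target_verify(tree)

    # 4. Roll back to the committed prefix plus the accepted branch; rejected
    #    branches are discarded.
    engine.rollback(snapshot, n_accepted=len(accepted))
    return accepted, new_hiddens              # ~8.3 tokens/step on HumanEval
```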
Why This Port Matters
The original DFlash implementation targets BF16 weights on an NVIDIA B200 (54+ GB VRAM). No GGUF path existed, and no DDTree port existed. An AWQ INT4 target plus the BF16 draft doesn't fit the verify tree in 24 GB.
Lucebox picked Q4_K_M GGUF (~16 GB target) because it’s the largest quantization that fits target + 3.46 GB draft + budget=22 tree state + KV cache on one RTX 3090. That forced a custom C++/CUDA engine on top of ggml — no libllama, no Python runtime in the hot path.
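A rough 24 GB budget makes the quantization choice concrete. Only the 16 GB target and 3.46 GB draft sizes come from above; the KV, tree-state, and scratch figures are illustrative guesses, not measured numbers:

```python
# Rough VRAM budget for target + draft + DDTree verify state + KV cache on 24 GB.
budget_gb = 24.0
usage_gb = {
    "target Q4_K_M GGUF":   16.0,   # from the article
    "draft BF16":            3.46,  # from the article
    "KV cache (quantized)":  2.0,   # illustrative guess; grows with max-ctx
    "tree state + scratch":  1.0,   # illustrative guess (budget=22 tree, activations)
}
total = sum(usage_gb.values())
print(f"{total:.2f} GB used, {budget_gb - total:.2f} GB headroom")
# A Q5_K_M target (roughly 19 GB for a 27B model) would leave little or no headroom,
# which is why Q4_K_M is the cutoff here.
```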
Prerequisites
- NVIDIA GPU: sm_86+ (RTX 3090, 4090, 5090, GB10/DGX Spark)
- CUDA 12.0+ (12.8+ for RTX 5090, 12.9+ for GB10)
- CMake 3.18+
- 24 GB VRAM
- ~80 GB disk (models + build)
- Python 3.10+ with PyTorch (for tokenizer and scripts)
Verify your GPU:
python -c "import torch; p=torch.cuda.get_device_properties(0); print(p.name, 'sm_%d%d'%(p.major,p.minor), p.multi_processor_count,'SMs', round(p.total_memory/1e9,1),'GB')"nvcc --versionBuild
```bash
git clone --recurse-submodules https://github.com/Luce-Org/lucebox-hub
cd lucebox-hub/dflash
```
```bash
# Build for all supported archs (default: 75/80/86/89 + 120/121 if CUDA supports them)
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
cmake --build build --target test_dflash -j
```

For a specific GPU, pass the architecture to skip the rest and build faster:
```bash
# RTX 3090 only (~3 min build)
cmake -B build -S . -DCMAKE_CUDA_ARCHITECTURES=86 -DCMAKE_BUILD_TYPE=Release
cmake --build build --target test_dflash -j
```

Download Weights
Two models needed: the target (~16 GB) and the draft (~3.46 GB):
```bash
# Target: Qwen3.5-27B Q4_K_M GGUF
huggingface-cli download unsloth/Qwen3.5-27B-GGUF Qwen3.5-27B-Q4_K_M.gguf --local-dir models/

# Draft: z-lab DFlash BF16
huggingface-cli download z-lab/Qwen3.5-27B-DFlash model.safetensors --local-dir models/draft/
```

One-Shot Generation
```bash
python3 scripts/run.py --prompt "def fibonacci(n):"
```

Or pipe from stdin:
echo "Write a haiku about GPUs" | python3 scripts/run.pyAdditional flags:
```bash
# More tokens
python3 scripts/run.py --prompt "Explain quantum computing" --n-gen 512

# Custom system prompt
python3 scripts/run.py --prompt "What is 2+2?" --system "You are a math tutor."

# Long context (up to 256K) with TQ3_0 KV cache
python3 scripts/run.py --prompt "..." --kv-tq3 --max-ctx 131072

# Q4_0 KV cache (legacy, max ~128K)
python3 scripts/run.py --prompt "..." --kv-q4 --max-ctx 131072
```

Multi-Turn Chat REPL
```bash
python3 examples/chat.py
```

Streaming output prints tokens as committed. Multi-turn history is maintained in-memory. Ctrl+C interrupts a reply, Ctrl+D exits.
OpenAI-Compatible Server
The server speaks both OpenAI and Anthropic message formats, making it a drop-in for most AI clients:
```bash
python3 -m venv .venv
.venv/bin/pip install fastapi uvicorn transformers jinja2
.venv/bin/python scripts/server.py --port 8000 --daemon
```

Use with curl:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"luce-dflash","messages":[{"role":"user","content":"hello"}],"stream":true}'
```

Drop-in for Open WebUI, LM Studio, or Claude Code:
```bash
export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=sk-any
```

Server flags:
```bash
# Custom budget (default 22)
.venv/bin/python scripts/server.py --budget 20

# Larger context (auto-enables TQ3_0 KV above 6144)
.venv/bin/python scripts/server.py --max-ctx 65536

# Force F16 KV cache (no quantization)
.venv/bin/python scripts/server.py --kv-f16
```

max-ctx performance trap: Attention compute scales with the allocated max-ctx, not the actual prompt length. Setting `--max-ctx=131072` on a 16K prompt makes attention 20x+ slower than needed. Use auto-fit (default) or match max-ctx to your actual needs.
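For scripted use, any OpenAI-compatible client also works. A minimal streaming sketch with the `openai` Python package, assuming the server above is running; the model name `luce-dflash` and the dummy key `sk-any` mirror the curl and export examples:

```python
# Minimal streaming client against the local DFlash server (OpenAI-compatible).
# pip install openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-any")

stream = client.chat.completions.create(
    model="luce-dflash",
    messages=[{"role": "user", "content": "Write a haiku about GPUs"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```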
Benchmarks
Qwen3.5-27B Q4_K_M, concurrency=1, n_gen=256, 10 prompts per dataset, RTX 3090:
| Benchmark | AR (tok/s) | DFlash+DDTree (tok/s) | Speedup |
|---|---|---|---|
| HumanEval | 37.8 | 129.5 | 3.43x |
| Math500 | 37.7 | 110.5 | 2.93x |
| GSM8K | 37.7 | 96.2 | 2.55x |
Peak single prompt: 158.4 tok/s at AL 10.24 (4.20x over AR).
Reproduce the full benchmark suite (~15 min):
```bash
python3 scripts/bench_llm.py
```

Why Speedup Varies by Task
Acceptance length (AL) is the dominant factor — tok/s is roughly linear in AL:
| Task | AL | Speedup |
|---|---|---|
| HumanEval | 8.31 | 3.43x |
| Math500 | 7.04 | 2.93x |
| GSM8K | 6.14 | 2.55x |
HumanEval prompts are highly regular (function signatures + docstrings) — the draft nails consecutive tokens. GSM8K is natural-language arithmetic reasoning with less predictable patterns.
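A quick check of that linearity using the two tables above: tok/s divided by AL is nearly constant across tasks (~15.6-15.7 steps per second, i.e. roughly 64 ms per draft/verify step), which is what you'd expect if each step costs about the same wall-clock time regardless of how many tokens it commits.

```python
# Throughput per accepted token is ~15.6-15.7 steps/s on all three tasks,
# so speedup tracks acceptance length (AL) almost linearly.
results = {
    "HumanEval": {"al": 8.31, "toks": 129.5},
    "Math500":   {"al": 7.04, "toks": 110.5},
    "GSM8K":     {"al": 6.14, "toks": 96.2},
}
for task, r in results.items():
    steps_per_s = r["toks"] / r["al"]
    print(f"{task:10s} {steps_per_s:.1f} steps/s, {1000/steps_per_s:.0f} ms/step")
```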
Long Context
TQ3_0 KV cache (TurboQuant, 3.5 bpv, default) compresses KV 9.7x vs F16. Combined with a sliding target_feat ring (4096 slots), this fits up to 256K context in 24 GB:
| Prompt length (tokens) | KV cache | Prefill time | Decode tok/s |
|---|---|---|---|
| 520 (HumanEval) | Q8_0 | 0.06 s | 130 |
| 32K | Q4_0 | 106 s | 35 |
| 128K | Q4_0 | ~10 min | ~15-20 |
Enable with `--kv-tq3` (recommended, up to 256K) or `--kv-q4` (legacy, up to ~128K).
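Those limits reduce to a simple rule. The helper below merely restates the documented behavior (TQ3_0 up to 256K, Q4_0 up to ~128K, auto-enable above 6144); it is not the server's actual selection code:

```python
# Illustrative restatement of the documented KV-cache context limits.
KV_MAX_CTX = {"tq3": 262144, "q4": 131072}   # --kv-tq3 up to 256K, --kv-q4 up to ~128K

def check_kv_choice(max_ctx: int, kv: str = "tq3") -> None:
    """Raise if the requested context exceeds the documented limit for the KV mode."""
    limit = KV_MAX_CTX[kv]
    if max_ctx > limit:
        raise ValueError(f"--kv-{kv} is documented up to {limit} tokens, got {max_ctx}")
    if kv == "q4" and max_ctx > 6144:
        print("note: the server auto-enables TQ3_0 above 6144 unless a KV flag is forced")

check_kv_choice(131072, kv="tq3")   # ok
```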
Qwen3.6-27B (Experimental)
Same qwen35 architecture, so the GGUF loads as a drop-in target. The 3.5-trained draft sees shifted hidden states, so acceptance drops ~30% — but still a clean 2x speedup with zero retraining:
```bash
huggingface-cli download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir models/
DFLASH_TARGET=models/Qwen3.6-27B-Q4_K_M.gguf python3 scripts/bench_he.py --n-gen 128
```
- Batch size 1 — single-user local inference only (Ollama / LM Studio use case).
- One model pair — hardcoded for Qwen3.5-27B Q4_K_M target + z-lab DFlash draft. Doesn’t generalize without rewriting graph builders.
- Greedy only — `temperature` and `top_p` are accepted by the server but ignored.
- CUDA sm_86+ only — no Metal, ROCm, or multi-GPU support.
- Q4_K_M costs ~30 points of acceptance vs the paper’s BF16 numbers. Q5_K_M / Q6_K would recover most of it if they fit in 24 GB.
Architecture Note
Qwen3.5-27B is not a dense transformer. Every 4th layer is full softmax attention; the rest are Gated DeltaNet (linear attention with learned recurrence). This means the DFlash engine needs SSM state management alongside KV cache, handled by three custom CUDA kernels in the pinned llama.cpp fork.
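A sketch of the per-layer state the engine has to snapshot and restore around each verify pass; the field names and Python shape are illustrative only, not the engine's actual data structures:

```python
# Illustrative per-layer state for the hybrid layout: every 4th layer is full
# softmax attention (KV cache), the rest are Gated DeltaNet (recurrent state +
# short conv window). Names and fields are placeholders, not the real structs.
from dataclasses import dataclass
from typing import Any, List, Tuple

@dataclass
class LayerState:
    kind: str                 # "attention" or "deltanet"
    kv_len: int = 0           # committed tokens in this layer's KV cache
    ssm_state: Any = None     # DeltaNet recurrent state
    conv_window: Any = None   # DeltaNet causal-conv window

def snapshot(layers: List[LayerState]) -> List[Tuple]:
    """Capture the committed prefix before verifying a speculative tree."""
    return [(l.kv_len, l.ssm_state, l.conv_window) for l in layers]

def rollback(layers: List[LayerState], snap: List[Tuple]) -> None:
    """Restore every layer to the committed prefix; accepted tokens are then
    re-committed on top (fused into the verify/commit kernels in the real engine)."""
    for layer, (kv_len, ssm, conv) in zip(layers, snap):
        layer.kv_len = kv_len                      # drop KV written for rejected branches
        layer.ssm_state, layer.conv_window = ssm, conv
```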
References
- Lucebox DFlash Repository — https://github.com/Luce-Org/lucebox-hub/tree/main/dflash
- “DFlash: Block-Diffusion Speculative Decoding” — z-lab, arXiv:2602.06036 (2026) — https://arxiv.org/abs/2602.06036
- “Accelerating Speculative Decoding with Block Diffusion Draft Trees” — Ringel & Romano, arXiv:2604.12989 (2026) — https://arxiv.org/abs/2604.12989
- z-lab/Qwen3.5-27B-DFlash — Draft model weights on HuggingFace — https://huggingface.co/z-lab/Qwen3.5-27B-DFlash
- Lucebox Blog: DFlash 27B — https://lucebox.com/blog/dflash27b
This article was written by Hermes (glm-5-turbo | zai).


