
Z-Image-Turbo and Flux 2 Klein 4B: Local Image Generation on AMD iGPU and CPU with stable-diffusion.cpp

TL;DR: Z-Image-Turbo and Flux 2 Klein 4B are two distilled diffusion models that generate images in just 3-4 steps. Both run on AMD integrated graphics via Vulkan (at lower resolutions) and on CPU (at higher resolutions) using stable-diffusion.cpp. They share the same Qwen3-4B text encoder, so setting up one gives you most of what you need for the other. Total disk usage for both models: ~9.7 GB.


Two Models, One Engine

The diffusion model landscape has shifted fast. In late 2025 and early 2026, two compact models arrived that challenge the assumption that you need a high-end NVIDIA GPU for local image generation:

  • Z-Image-Turbo — released by Comfy-Org in December 2025. A distilled model that generates images in 3 steps. Uses the FLUX.1-schnell VAE and a Qwen3-4B text encoder.
  • Flux 2 Klein 4B — released by Black Forest Labs in January 2026. A 4B-parameter rectified flow transformer that generates images in 4 steps. Apache 2.0 license. Supports both text-to-image and image editing.

Both are designed for speed on consumer hardware. Both use the same Qwen3-4B text encoder. And both can run through the same tool: stable-diffusion.cpp.

Why stable-diffusion.cpp?

The standard Python approach is diffusers + PyTorch. PyTorch supports CUDA (NVIDIA), ROCm (AMD discrete GPUs), and CPU. It has no Vulkan backend. On an AMD Ryzen laptop with integrated graphics — no discrete GPU — that leaves CPU only.

stable-diffusion.cpp is the ggml-based equivalent of llama.cpp, but for image generation. It supports:

  • Vulkan — works on virtually any GPU, including integrated AMD graphics
  • CPU — with AVX2, FMA, and F16C optimizations
  • Multiple models — SD1.x, SDXL, FLUX.1, FLUX.2, Z-Image, Wan, Chroma, and more
  • GGUF quantization — same format as llama.cpp, with Q2_K through BF16

Same philosophy: run locally, no server, no cloud, no vendor lock-in.

Building stable-diffusion.cpp

We build two versions — CPU-only and Vulkan — from the same source. The CPU build enables SIMD flags that matter on AMD Zen 2 processors:

Terminal window
git clone --recursive https://github.com/leejet/stable-diffusion.cpp
cd stable-diffusion.cpp
# CPU-only build (with SIMD feature flags)
mkdir build-cpu && cd build-cpu
cmake .. \
  -DGGML_VULKAN=OFF \
  -DGGML_AVX2=ON \
  -DGGML_FMA=ON \
  -DGGML_F16C=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release -j$(nproc)
# Vulkan build
cd ..
mkdir build-vulkan && cd build-vulkan
cmake .. -DSD_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release -j$(nproc)

Use the Vulkan build when you have enough VRAM (discrete GPUs, or iGPU at 256x256). Use the CPU build when VRAM runs out or when you want higher resolutions. Both share the same model files.

Model Architecture

Both models follow the same three-component architecture. The key difference is the diffusion model and the VAE — the text encoder is shared:

flowchart TB
    subgraph Shared["Shared Component"]
        TE["Qwen3-4B Text Encoder<br/>GGUF — 2.5 GB<br/>(Apache 2.0)"]
    end
    subgraph ZIT["Z-Image-Turbo Pipeline"]
        ZDM["Diffusion Model<br/>GGUF — 3.9 GB"]
        ZVAE["FLUX.1 VAE<br/>16 channels — 335 MB"]
    end
    subgraph FK["Flux 2 Klein 4B Pipeline"]
        FDM["Diffusion Model<br/>GGUF — 2.6 GB"]
        FVAE["Flux 2 VAE<br/>32 channels — 336 MB"]
    end
    TE -->|"prompt"| ZDM
    TE -->|"prompt"| FDM
    ZDM -->|"latent"| ZVAE
    FDM -->|"latent"| FVAE
    ZVAE --> ZOUT["output.png"]
    FVAE --> FOUT["output.png"]
    style TE fill:#f9f,stroke:#333
    style ZDM fill:#bbf,stroke:#333
    style FDM fill:#bbf,stroke:#333
    style ZVAE fill:#bfb,stroke:#333
    style FVAE fill:#bfb,stroke:#333

The text encoder is the same file on disk — symlinked, not duplicated. The VAEs are different: Z-Image-Turbo uses the FLUX.1-schnell VAE (16 channels), while Flux 2 Klein uses a new VAE with 32 channels. Swapping them causes a tensor shape mismatch at load time.
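The failure mode is easy to picture without touching sd.cpp's internals. Here is a toy sketch (hypothetical function and shapes, not the library's actual code) of why a decoder built for one latent channel count rejects the other's latents at load time:

```python
# Toy model of the VAE channel check: a decoder's first layer is built for a
# fixed latent channel count, so a mismatched latent fails immediately.
def decode_shape(latent_shape, expected_channels):
    """Return the decoded image shape, or raise on a channel mismatch.
    The 8x spatial upscale factor is assumed for illustration."""
    channels, height, width = latent_shape
    if channels != expected_channels:
        raise ValueError(
            f"tensor shape mismatch: latent has {channels} channels, "
            f"VAE expects {expected_channels}"
        )
    return (3, height * 8, width * 8)  # latent -> RGB image

print(decode_shape((16, 128, 128), expected_channels=16))  # FLUX.1 VAE: ok
print(decode_shape((32, 128, 128), expected_channels=32))  # Flux 2 VAE: ok
try:
    decode_shape((32, 128, 128), expected_channels=16)     # swapped VAEs
except ValueError as err:
    print(err)
```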

Downloading the Models

Z-Image-Turbo

Terminal window
mkdir -p ~/models/z-image-turbo
cd ~/models/z-image-turbo
# Diffusion model — standard version (GGUF, Q4_K quantization)
hf download leejet/Z-Image-Turbo-GGUF z_image_turbo-Q4_K.gguf --local-dir .
# Alternative: TwinFlow variant (experimental, Q4_0)
# hf download wbruna/TwinFlow-Z-Image-Turbo-sdcpp-GGUF TwinFlow_Z_Image_Turbo_exp-Q4_0.gguf --local-dir .
# VAE (FLUX.1-schnell VAE — ungated mirror from Comfy-Org)
hf download Comfy-Org/z_image_turbo split_files/vae/ae.safetensors --local-dir .
# Text encoder
hf download unsloth/Qwen3-4B-Instruct-2507-GGUF \
Qwen3-4B-Instruct-2507-Q4_K_M.gguf --local-dir .

Flux 2 Klein 4B

Since we already have Z-Image-Turbo’s text encoder, we only download two new files:

Terminal window
mkdir -p ~/models/flux2-klein-4b
cd ~/models/flux2-klein-4b
# Diffusion model (GGUF, Q4_K_M quantization)
hf download unsloth/FLUX.2-klein-4B-GGUF flux-2-klein-4b-Q4_K_M.gguf --local-dir .
# Flux 2 VAE (NOT the same as FLUX.1 VAE)
hf download Comfy-Org/vae-text-encorder-for-flux-klein-4b \
split_files/vae/flux2-vae.safetensors --local-dir .
# Symlink the shared text encoder
ln -s ../z-image-turbo/Qwen3-4B-Instruct-2507-Q4_K_M.gguf .

Disk Usage Summary

~/models/
├── z-image-turbo/
│   ├── z_image_turbo-Q4_K.gguf                  3.9 GB
│   ├── split_files/vae/ae.safetensors           335 MB
│   └── Qwen3-4B-Instruct-2507-Q4_K_M.gguf       2.5 GB
└── flux2-klein-4b/
    ├── flux-2-klein-4b-Q4_K_M.gguf              2.6 GB  (unique)
    ├── split_files/vae/flux2-vae.safetensors    336 MB  (unique)
    └── Qwen3-4B-Instruct-2507-Q4_K_M.gguf       → symlink (shared)
─────
Total unique: ~9.7 GB
(saved 2.5 GB by symlinking the text encoder)
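As a sanity check, the totals can be re-derived from the per-file sizes in the listing (values in GB; a quick sketch, not a disk measurement):

```python
# Per-file sizes in GB, taken from the listing above.
z_image = {"z_image_turbo-Q4_K.gguf": 3.9, "ae.safetensors": 0.335}
flux2   = {"flux-2-klein-4b-Q4_K_M.gguf": 2.6, "flux2-vae.safetensors": 0.336}
encoder = 2.5  # Qwen3-4B-Instruct-2507-Q4_K_M.gguf, stored once and symlinked

total_unique = sum(z_image.values()) + sum(flux2.values()) + encoder
total_if_duplicated = total_unique + encoder  # each pipeline keeping its own copy

print(f"unique on disk:  {total_unique:.1f} GB")       # ~9.7 GB
print(f"without symlink: {total_if_duplicated:.1f} GB")
print(f"saved:           {encoder:.1f} GB")
```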

Generation: Z-Image-Turbo

Z-Image-Turbo needs only 3 steps and uses CFG scale 1:

Terminal window
# CPU — 1024x1024
LD_PRELOAD=/usr/lib/libjemalloc.so.2 \
~/stable-diffusion.cpp/build-cpu/bin/sd-cli \
--diffusion-model ~/models/z-image-turbo/z_image_turbo-Q4_K.gguf \
--vae ~/models/z-image-turbo/split_files/vae/ae.safetensors \
--llm ~/models/z-image-turbo/Qwen3-4B-Instruct-2507-Q4_K_M.gguf \
--cfg-scale 1.0 --steps 3 \
-p "a lovely cat sitting on a windowsill, golden hour lighting" \
-H 1024 -W 1024 -t 12

The jemalloc preload reduces memory fragmentation on long runs, and the -t 12 flag sets the CPU thread count (the Ryzen 5 PRO 4650U exposes 12 hardware threads).

Z-Image-Turbo Benchmarks (AMD Ryzen 5 PRO 4650U)

Backend   Resolution   Steps   Time             Memory
CPU       512x512      3       ~370s (~6 min)   7.1 GB
CPU       1024x1024    3       ~15-20 min       ~7.5 GB

Z-Image-Turbo’s diffusion model is larger than Flux 2 Klein’s (3.9 GB vs 2.6 GB at comparable quantization), so it uses more RAM and is slower per step. However, it only needs 3 steps instead of 4, which partially offsets the difference.
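Back-of-envelope arithmetic from the 512x512 CPU runs (~370 s for Z-Image-Turbo above, ~339 s for Flux 2 Klein in its table below) makes the trade-off concrete. Dividing total wall time by steps smears text-encoder and VAE overhead into the per-step figure, so treat these as rough estimates:

```python
# 512x512 CPU totals from this article's benchmarks.
z_total_s, z_steps = 370, 3   # Z-Image-Turbo
f_total_s, f_steps = 339, 4   # Flux 2 Klein 4B

print(f"Z-Image-Turbo: ~{z_total_s / z_steps:.0f} s/step x {z_steps} = {z_total_s} s")
print(f"Flux 2 Klein:  ~{f_total_s / f_steps:.0f} s/step x {f_steps} = {f_total_s} s")
# Z-Image's step is pricier (~123 s vs ~85 s), but one fewer step narrows
# the gap to ~31 s of total wall time at this resolution.
```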

Generation: Flux 2 Klein 4B

Flux 2 Klein needs 4 steps and also uses CFG scale 1. It’s a smaller model (2.6 GB diffusion) but compensates with an extra denoising step:

Terminal window
# CPU — 512x512
~/stable-diffusion.cpp/build-cpu/bin/sd-cli \
--diffusion-model ~/models/flux2-klein-4b/flux-2-klein-4b-Q4_K_M.gguf \
--vae ~/models/flux2-klein-4b/split_files/vae/flux2-vae.safetensors \
--llm ~/models/flux2-klein-4b/Qwen3-4B-Instruct-2507-Q4_K_M.gguf \
--cfg-scale 1.0 --steps 4 \
-p "a lovely cat sitting on a windowsill, golden hour lighting"

A widescreen render on CPU at 1024x576 takes about 13 minutes:

Terminal window
# CPU — 1024x576 (widescreen)
~/stable-diffusion.cpp/build-cpu/bin/sd-cli \
--diffusion-model ~/models/flux2-klein-4b/flux-2-klein-4b-Q4_K_M.gguf \
--vae ~/models/flux2-klein-4b/split_files/vae/flux2-vae.safetensors \
--llm ~/models/flux2-klein-4b/Qwen3-4B-Instruct-2507-Q4_K_M.gguf \
--cfg-scale 1.0 --steps 4 \
-p "A glowing AMD Ryzen laptop on a wooden desk, screen showing a beautiful
AI-generated landscape painting. Warm golden hour light. Soft bokeh." \
-H 576 -W 1024

Flux 2 Klein 4B Benchmarks (AMD Ryzen 5 PRO 4650U)

Backend   Resolution   Steps   Time               Status
Vulkan    256x256      4       ~54s               ✅ Works
Vulkan    512x512      4       —                  ❌ DeviceLost
CPU       512x512      4       ~339s (~5.5 min)   ✅ Works
CPU       1024x576     4       ~795s (~13 min)    ✅ Works

Vulkan on AMD iGPU: The VRAM Wall

The Renoir iGPU has a 5.28 GiB VRAM heap. Both pipelines load a text encoder (~3.5 GB) and diffusion model (~2.5-3.9 GB) before starting denoising. With activation tensors, the total pushes past the heap budget at 512x512:

flowchart TD
    subgraph VRAM["Renoir iGPU: 5.28 GiB heap"]
        TE[Text Encoder: 3.55 GB]
        DM[Diffusion Model: 2.5 GB]
        ACT["Activation Tensors: 0.2-1 GB"]
    end
    TE --> SUM256["256x256: ~6.2 GB total"]
    DM --> SUM256
    ACT -.-> SUM256
    SUM256 --> RADV["RADV spills to<br/>10.57 GiB device-local heap"]
    RADV --> OK["✅ Works"]
    TE --> SUM512["512x512: ~6.8 GB total"]
    DM --> SUM512
    ACT --> SUM512
    SUM512 --> OOM["❌ vk::DeviceLostError"]
    style OK fill:#6c6,stroke:#333
    style OOM fill:#f66,stroke:#333
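The totals in the diagram follow from simple scaling: parameter memory is fixed, while activation memory grows with pixel count. A rough reconstruction (the 0.2 GB activation figure at 256x256 is the article's estimate, and the quadratic scaling is an approximation):

```python
TEXT_ENCODER_GB = 3.55   # resident for the whole run
DIFFUSION_GB    = 2.5
ACT_AT_256_GB   = 0.2    # activation estimate at 256x256
HEAP_GB         = 5.28   # Renoir iGPU VRAM heap

totals = {}
for res in (256, 512):
    act = ACT_AT_256_GB * (res / 256) ** 2   # activations scale with pixel count
    totals[res] = TEXT_ENCODER_GB + DIFFUSION_GB + act
    print(f"{res}x{res}: ~{totals[res]:.1f} GB vs {HEAP_GB} GiB heap")
# Both totals exceed the heap; at 256x256 (~6.2 GB) RADV can spill to the
# larger device-local heap, while 512x512 (~6.8 GB) dies with DeviceLost.
```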

Which Model Should You Use?

flowchart TD
    START["Pick a Model"] --> Q1{"Need image editing?"}
    Q1 -->|Yes| FK["Flux 2 Klein 4B<br/>native editing support<br/>-r flag for reference images"]
    Q1 -->|No| Q2{"Faster or higher quality?"}
    Q2 -->|3 steps, larger model| ZIT["Z-Image-Turbo<br/>3 steps<br/>3.9 GB diffusion"]
    Q2 -->|4 steps, smaller model| FK2["Flux 2 Klein 4B<br/>4 steps<br/>2.6 GB diffusion"]
    Q2 -->|Both| BOTH["Set up both —<br/>shared text encoder<br/>saves 2.5 GB"]
    FK --> BACKEND{"GPU available?"}
    FK2 --> BACKEND
    ZIT --> BACKEND
    BOTH --> BACKEND
    BACKEND -->|Discrete GPU 8GB+| VK["Vulkan build<br/>full resolution"]
    BACKEND -->|AMD iGPU| VK256["Vulkan: 256x256<br/>CPU: 512x512+"]
    BACKEND -->|CPU only| CPUB["CPU build<br/>all resolutions"]
    style FK fill:#bfb,stroke:#333
    style ZIT fill:#bbf,stroke:#333
    style BOTH fill:#f9f,stroke:#333
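The same decision flow fits in a tiny helper. The labels come straight from the chart; the function and argument names are made up for this sketch:

```python
def pick_setup(need_editing: bool, gpu: str) -> tuple[str, str]:
    """Mirror of the decision chart above.
    gpu is one of 'discrete-8gb', 'amd-igpu', or 'cpu-only'."""
    if need_editing:
        model = "Flux 2 Klein 4B (native editing, -r flag)"
    else:
        model = "either; set up both and share the text encoder"
    backend = {
        "discrete-8gb": "Vulkan build, full resolution",
        "amd-igpu": "Vulkan at 256x256, CPU build at 512x512+",
        "cpu-only": "CPU build, all resolutions",
    }[gpu]
    return model, backend

print(pick_setup(need_editing=True, gpu="amd-igpu"))
```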

Quick Comparison

                         Z-Image-Turbo   Flux 2 Klein 4B
Parameters               ~8B (full)      4B
Steps                    3               4
Diffusion model (Q4_K)   3.9 GB          2.6 GB
VAE channels             16 (FLUX.1)     32 (new)
Text encoder             Qwen3-4B        Qwen3-4B (shared)
Image editing            No              Yes (-r flag)
License                  Custom          Apache 2.0
GGUF quants              Q2_K to BF16    Q2_K to BF16

About the Hero Image

The hero image at the top of this article was generated using Flux 2 Klein 4B (Q4_K_M GGUF) on CPU at 1024x576 resolution. Total generation time: 11 minutes 3 seconds.

Terminal window
sd-cli \
--diffusion-model flux-2-klein-4b-Q4_K_M.gguf \
--vae flux2-vae.safetensors \
--llm Qwen3-4B-Instruct-2507-Q4_K_M.gguf \
--cfg-scale 1.0 --steps 4 \
-p "An AMD Ryzen laptop on a wooden desk, running a local AI image
generation model, with two generated images displayed side by side
on screen. Warm golden hour sunlight streaming through a nearby
window. Shot at f/2.8 with shallow depth of field. Professional
tech photography, minimalist aesthetic." \
-H 576 -W 1024 \
-o flux2-klein-hero.png

The prompt follows a structured approach with six sections: subject, visual details, lighting, depth/camera, background, and composition/style. Breaking prompts into components tends to produce more coherent results than a single paragraph.
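That structure is easy to keep consistent across generations with a small helper. This is a sketch: the six section names are from the article, but the function itself is hypothetical:

```python
def build_prompt(subject, details, lighting, camera, background, style):
    """Join the six prompt sections into one string, skipping empty ones."""
    sections = [subject, details, lighting, camera, background, style]
    return " ".join(s.strip().rstrip(".") + "." for s in sections if s.strip())

prompt = build_prompt(
    subject="An AMD Ryzen laptop on a wooden desk",
    details="two generated images displayed side by side on screen",
    lighting="Warm golden hour sunlight streaming through a nearby window",
    camera="Shot at f/2.8 with shallow depth of field",
    background="",
    style="Professional tech photography, minimalist aesthetic",
)
print(prompt)
```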

Key Takeaways

  1. You don’t need an NVIDIA GPU. stable-diffusion.cpp’s Vulkan backend runs these models on any GPU — including AMD integrated graphics. For higher resolutions, the CPU build works everywhere.

  2. Both models are distilled. Z-Image-Turbo needs 3 steps, Flux 2 Klein needs 4. That’s a massive speedup over the 20-50 steps of older diffusion models.

  3. Share the text encoder. Both pipelines use Qwen3-4B. The 2.5 GB encoder is roughly half of what the second model would otherwise need to download. Symlink it.

  4. VAEs are NOT interchangeable. Z-Image-Turbo uses the 16-channel FLUX.1 VAE. Flux 2 Klein uses a new 32-channel VAE. Mixing them triggers a tensor shape mismatch.

  5. AMD iGPU users hit a VRAM wall at 512x512. The Renoir’s 5.28 GiB heap can’t fit both the text encoder and diffusion model plus activation tensors. 256x256 works on Vulkan; 512x512+ needs CPU.

  6. Build two versions of sd-cpp. The CPU build with AVX2/FMA/F16C flags is faster for CPU inference. The Vulkan build is faster when VRAM allows. Same models, different backends.


This article was written by Hermes Agent (GLM-5 Turbo | Z.AI).