
Z-Image-Turbo and Flux 2 Klein 4B: Local Image Generation on AMD iGPU and CPU with stable-diffusion.cpp

TL;DR: Z-Image-Turbo and Flux 2 Klein 4B are two distilled diffusion models that generate images in just 3-4 steps. Both run on AMD integrated graphics via Vulkan (at lower resolutions) and on CPU (at higher resolutions) using stable-diffusion.cpp. They share the same Qwen3-4B text encoder, so setting up one gives you most of what you need for the other. Total disk usage for both models: ~9.7 GB.


Two Models, One Engine

The diffusion model landscape has shifted fast. In late 2025 and early 2026, two compact models arrived that challenge the assumption that you need a high-end NVIDIA GPU for local image generation:

  • Z-Image-Turbo — released by Comfy-Org in December 2025. A distilled model that generates images in 3 steps. Uses the FLUX.1-schnell VAE and a Qwen3-4B text encoder.
  • Flux 2 Klein 4B — released by Black Forest Labs in January 2026. A 4B-parameter rectified flow transformer that generates images in 4 steps. Apache 2.0 license. Supports both text-to-image and image editing.

Both are designed for speed on consumer hardware. Both use the same Qwen3-4B text encoder. And both can run through the same tool: stable-diffusion.cpp.

Why stable-diffusion.cpp?

The standard Python approach is diffusers + PyTorch. PyTorch supports CUDA (NVIDIA), ROCm (AMD discrete GPUs), and CPU. It has no Vulkan backend. On an AMD Ryzen laptop with integrated graphics — no discrete GPU — that leaves CPU only.

stable-diffusion.cpp is the ggml-based equivalent of llama.cpp, but for image generation. It supports:

  • Vulkan — works on virtually any GPU, including integrated AMD graphics
  • CPU — with AVX2, FMA, and F16C optimizations
  • Multiple models — SD1.x, SDXL, FLUX.1, FLUX.2, Z-Image, Wan, Chroma, and more
  • GGUF quantization — same format as llama.cpp, with Q2_K through BF16

Same philosophy: run locally, no server, no cloud, no vendor lock-in.

Building stable-diffusion.cpp

We build two versions — CPU-only and Vulkan — from the same source. The CPU build enables SIMD flags that matter on AMD Zen 2 processors:

Terminal window
git clone --recursive https://github.com/leejet/stable-diffusion.cpp
cd stable-diffusion.cpp
# CPU-only build (with SIMD feature flags)
mkdir build-cpu && cd build-cpu
cmake .. \
  -DGGML_VULKAN=OFF \
  -DGGML_AVX2=ON \
  -DGGML_FMA=ON \
  -DGGML_F16C=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release -j$(nproc)
# Vulkan build
cd ..
mkdir build-vulkan && cd build-vulkan
cmake .. -DSD_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release -j$(nproc)

Use the Vulkan build when you have enough VRAM (discrete GPUs, or iGPU at 256x256). Use the CPU build when VRAM runs out or when you want higher resolutions. Both share the same model files.

Model Architecture

Both models follow the same three-component architecture. The key difference is the diffusion model and the VAE — the text encoder is shared:

flowchart TB
    subgraph Shared["Shared Component"]
        TE["Qwen3-4B Text Encoder<br/>GGUF — 2.5 GB<br/>(Apache 2.0)"]
    end
    subgraph ZIT["Z-Image-Turbo Pipeline"]
        ZDM["Diffusion Model<br/>GGUF — 3.9 GB"]
        ZVAE["FLUX.1 VAE<br/>16 channels — 335 MB"]
    end
    subgraph FK["Flux 2 Klein 4B Pipeline"]
        FDM["Diffusion Model<br/>GGUF — 2.6 GB"]
        FVAE["Flux 2 VAE<br/>32 channels — 336 MB"]
    end
    TE -->|"prompt"| ZDM
    TE -->|"prompt"| FDM
    ZDM -->|"latent"| ZVAE
    FDM -->|"latent"| FVAE
    ZVAE --> ZOUT["output.png"]
    FVAE --> FOUT["output.png"]
    style TE fill:#f9f,stroke:#333
    style ZDM fill:#bbf,stroke:#333
    style FDM fill:#bbf,stroke:#333
    style ZVAE fill:#bfb,stroke:#333
    style FVAE fill:#bfb,stroke:#333

The text encoder is the same file on disk — symlinked, not duplicated. The VAEs are different: Z-Image-Turbo uses the FLUX.1-schnell VAE (16 channels), while Flux 2 Klein uses a new VAE with 32 channels. Swapping them causes a tensor shape mismatch at load time.
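The failure mode is easy to picture without touching sd.cpp's internals. Here is a toy sketch (hypothetical function and shapes, not the library's actual code) of why a decoder built for one latent channel count rejects the other's latents at load time:

```python
# Toy model of the VAE channel check: a decoder's first layer is built for a
# fixed latent channel count, so a mismatched latent fails immediately.
def decode_shape(latent_shape, expected_channels):
    """Return the decoded image shape, or raise on a channel mismatch.
    The 8x spatial upscale factor is assumed for illustration."""
    channels, height, width = latent_shape
    if channels != expected_channels:
        raise ValueError(
            f"tensor shape mismatch: latent has {channels} channels, "
            f"VAE expects {expected_channels}"
        )
    return (3, height * 8, width * 8)  # latent -> RGB image

print(decode_shape((16, 128, 128), expected_channels=16))  # FLUX.1 VAE: ok
print(decode_shape((32, 128, 128), expected_channels=32))  # Flux 2 VAE: ok
try:
    decode_shape((32, 128, 128), expected_channels=16)     # swapped VAEs
except ValueError as err:
    print(err)
```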

Downloading the Models

Z-Image-Turbo

Terminal window
mkdir -p ~/models/z-image-turbo
cd ~/models/z-image-turbo
# Diffusion model — standard version (GGUF, Q4_K quantization)
hf download leejet/Z-Image-Turbo-GGUF z_image_turbo-Q4_K.gguf --local-dir .
# Alternative: TwinFlow variant (experimental, Q4_0)
# hf download wbruna/TwinFlow-Z-Image-Turbo-sdcpp-GGUF TwinFlow_Z_Image_Turbo_exp-Q4_0.gguf --local-dir .
# VAE (FLUX.1-schnell VAE — ungated mirror from Comfy-Org)
hf download Comfy-Org/z_image_turbo split_files/vae/ae.safetensors --local-dir .
# Text encoder
hf download unsloth/Qwen3-4B-Instruct-2507-GGUF \
Qwen3-4B-Instruct-2507-Q4_K_M.gguf --local-dir .

Flux 2 Klein 4B

Since we already have Z-Image-Turbo’s text encoder, we only download two new files:

Terminal window
mkdir -p ~/models/flux2-klein-4b
cd ~/models/flux2-klein-4b
# Diffusion model (GGUF, Q4_K_M quantization)
hf download unsloth/FLUX.2-klein-4B-GGUF flux-2-klein-4b-Q4_K_M.gguf --local-dir .
# Flux 2 VAE (NOT the same as FLUX.1 VAE)
hf download Comfy-Org/vae-text-encorder-for-flux-klein-4b \
split_files/vae/flux2-vae.safetensors --local-dir .
# Symlink the shared text encoder
ln -s ../z-image-turbo/Qwen3-4B-Instruct-2507-Q4_K_M.gguf .

Disk Usage Summary

~/models/
├── z-image-turbo/
│   ├── z_image_turbo-Q4_K.gguf                  3.9 GB
│   ├── split_files/vae/ae.safetensors           335 MB
│   └── Qwen3-4B-Instruct-2507-Q4_K_M.gguf       2.5 GB
└── flux2-klein-4b/
    ├── flux-2-klein-4b-Q4_K_M.gguf              2.6 GB  (unique)
    ├── split_files/vae/flux2-vae.safetensors    336 MB  (unique)
    └── Qwen3-4B-Instruct-2507-Q4_K_M.gguf       → symlink (shared)
─────
Total unique: ~9.7 GB
(saved 2.5 GB by symlinking the text encoder)
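As a sanity check, the totals can be re-derived from the per-file sizes in the listing (values in GB; a quick sketch, not a disk measurement):

```python
# Per-file sizes in GB, taken from the listing above.
z_image = {"z_image_turbo-Q4_K.gguf": 3.9, "ae.safetensors": 0.335}
flux2   = {"flux-2-klein-4b-Q4_K_M.gguf": 2.6, "flux2-vae.safetensors": 0.336}
encoder = 2.5  # Qwen3-4B-Instruct-2507-Q4_K_M.gguf, stored once and symlinked

total_unique = sum(z_image.values()) + sum(flux2.values()) + encoder
total_if_duplicated = total_unique + encoder  # each pipeline keeping its own copy

print(f"unique on disk:  {total_unique:.1f} GB")       # ~9.7 GB
print(f"without symlink: {total_if_duplicated:.1f} GB")
print(f"saved:           {encoder:.1f} GB")
```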

Generation: Z-Image-Turbo

Z-Image-Turbo needs only 3 steps and uses CFG scale 1:

Terminal window
# CPU — 1024x1024
LD_PRELOAD=/usr/lib/libjemalloc.so.2 \
~/stable-diffusion.cpp/build-cpu/bin/sd-cli \
--diffusion-model ~/models/z-image-turbo/z_image_turbo-Q4_K.gguf \
--vae ~/models/z-image-turbo/split_files/vae/ae.safetensors \
--llm ~/models/z-image-turbo/Qwen3-4B-Instruct-2507-Q4_K_M.gguf \
--cfg-scale 1.0 --steps 3 \
-p "a lovely cat sitting on a windowsill, golden hour lighting" \
-H 1024 -W 1024 -t 12

The jemalloc preload reduces memory fragmentation on long runs, and the -t 12 flag sets the CPU thread count (the Ryzen 5 PRO 4650U exposes 12 hardware threads).

Z-Image-Turbo Benchmarks (AMD Ryzen 5 PRO 4650U)

Backend   Resolution   Steps   Time             Memory
CPU       512x512      3       ~370s (~6 min)   7.1 GB
CPU       1024x1024    3       ~15-20 min       ~7.5 GB

Z-Image-Turbo’s diffusion model is larger than Flux 2 Klein’s (3.9 GB vs 2.6 GB at comparable quantization), so it uses more RAM and is slower per step. However, it only needs 3 steps instead of 4, which partially offsets the difference.
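Back-of-envelope arithmetic from the 512x512 CPU runs (~370 s for Z-Image-Turbo above, ~339 s for Flux 2 Klein in its table below) makes the trade-off concrete. Dividing total wall time by steps smears text-encoder and VAE overhead into the per-step figure, so treat these as rough estimates:

```python
# 512x512 CPU totals from this article's benchmarks.
z_total_s, z_steps = 370, 3   # Z-Image-Turbo
f_total_s, f_steps = 339, 4   # Flux 2 Klein 4B

print(f"Z-Image-Turbo: ~{z_total_s / z_steps:.0f} s/step x {z_steps} = {z_total_s} s")
print(f"Flux 2 Klein:  ~{f_total_s / f_steps:.0f} s/step x {f_steps} = {f_total_s} s")
# Z-Image's step is pricier (~123 s vs ~85 s), but one fewer step narrows
# the gap to ~31 s of total wall time at this resolution.
```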

Generation: Flux 2 Klein 4B

Flux 2 Klein needs 4 steps and also uses CFG scale 1. It’s a smaller model (2.6 GB diffusion) but compensates with an extra denoising step:

Terminal window
# CPU — 512x512
~/stable-diffusion.cpp/build-cpu/bin/sd-cli \
--diffusion-model ~/models/flux2-klein-4b/flux-2-klein-4b-Q4_K_M.gguf \
--vae ~/models/flux2-klein-4b/split_files/vae/flux2-vae.safetensors \
--llm ~/models/flux2-klein-4b/Qwen3-4B-Instruct-2507-Q4_K_M.gguf \
--cfg-scale 1.0 --steps 4 \
-p "a lovely cat sitting on a windowsill, golden hour lighting"

A widescreen render on CPU at 1024x576 takes about 13 minutes:

Terminal window
# CPU — 1024x576 (widescreen)
~/stable-diffusion.cpp/build-cpu/bin/sd-cli \
--diffusion-model ~/models/flux2-klein-4b/flux-2-klein-4b-Q4_K_M.gguf \
--vae ~/models/flux2-klein-4b/split_files/vae/flux2-vae.safetensors \
--llm ~/models/flux2-klein-4b/Qwen3-4B-Instruct-2507-Q4_K_M.gguf \
--cfg-scale 1.0 --steps 4 \
-p "A glowing AMD Ryzen laptop on a wooden desk, screen showing a beautiful
AI-generated landscape painting. Warm golden hour light. Soft bokeh." \
-H 576 -W 1024

Flux 2 Klein 4B Benchmarks (AMD Ryzen 5 PRO 4650U)

Backend   Resolution   Steps   Time               Status
Vulkan    256x256      4       ~54s               ✅ Works
Vulkan    512x512      4       —                  ❌ DeviceLost
CPU       512x512      4       ~339s (~5.5 min)   ✅ Works
CPU       1024x576     4       ~795s (~13 min)    ✅ Works

Vulkan on AMD iGPU: The VRAM Wall

The Renoir iGPU has a 5.28 GiB VRAM heap. Both pipelines load a text encoder (~3.5 GB) and diffusion model (~2.5-3.9 GB) before starting denoising. With activation tensors, the total pushes past the heap budget at 512x512:

flowchart TD
    subgraph VRAM["Renoir iGPU: 5.28 GiB heap"]
        TE[Text Encoder: 3.55 GB]
        DM[Diffusion Model: 2.5 GB]
        ACT["Activation Tensors: 0.2-1 GB"]
    end
    TE --> SUM256["256x256: ~6.2 GB total"]
    DM --> SUM256
    ACT -.-> SUM256
    SUM256 --> RADV["RADV spills to<br/>10.57 GiB device-local heap"]
    RADV --> OK["✅ Works"]
    TE --> SUM512["512x512: ~6.8 GB total"]
    DM --> SUM512
    ACT --> SUM512
    SUM512 --> OOM["❌ vk::DeviceLostError"]
    style OK fill:#6c6,stroke:#333
    style OOM fill:#f66,stroke:#333
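The totals in the diagram follow from simple scaling: parameter memory is fixed, while activation memory grows with pixel count. A rough reconstruction (the 0.2 GB activation figure at 256x256 is the article's estimate, and the quadratic scaling is an approximation):

```python
TEXT_ENCODER_GB = 3.55   # resident for the whole run
DIFFUSION_GB    = 2.5
ACT_AT_256_GB   = 0.2    # activation estimate at 256x256
HEAP_GB         = 5.28   # Renoir iGPU VRAM heap

totals = {}
for res in (256, 512):
    act = ACT_AT_256_GB * (res / 256) ** 2   # activations scale with pixel count
    totals[res] = TEXT_ENCODER_GB + DIFFUSION_GB + act
    print(f"{res}x{res}: ~{totals[res]:.1f} GB vs {HEAP_GB} GiB heap")
# Both totals exceed the heap; at 256x256 (~6.2 GB) RADV can spill to the
# larger device-local heap, while 512x512 (~6.8 GB) dies with DeviceLost.
```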

Which Model Should You Use?

flowchart TD
    START["Pick a Model"] --> Q1{"Need image editing?"}
    Q1 -->|Yes| FK["Flux 2 Klein 4B<br/>native editing support<br/>-r flag for reference images"]
    Q1 -->|No| Q2{"Faster or higher quality?"}
    Q2 -->|3 steps, larger model| ZIT["Z-Image-Turbo<br/>3 steps<br/>3.9 GB diffusion"]
    Q2 -->|4 steps, smaller model| FK2["Flux 2 Klein 4B<br/>4 steps<br/>2.6 GB diffusion"]
    Q2 -->|Both| BOTH["Set up both —<br/>shared text encoder<br/>saves 2.5 GB"]
    FK --> BACKEND{"GPU available?"}
    FK2 --> BACKEND
    ZIT --> BACKEND
    BOTH --> BACKEND
    BACKEND -->|Discrete GPU 8GB+| VK["Vulkan build<br/>full resolution"]
    BACKEND -->|AMD iGPU| VK256["Vulkan: 256x256<br/>CPU: 512x512+"]
    BACKEND -->|CPU only| CPUB["CPU build<br/>all resolutions"]
    style FK fill:#bfb,stroke:#333
    style ZIT fill:#bbf,stroke:#333
    style BOTH fill:#f9f,stroke:#333
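The same decision flow fits in a tiny helper. The labels come straight from the chart; the function and argument names are made up for this sketch:

```python
def pick_setup(need_editing: bool, gpu: str) -> tuple[str, str]:
    """Mirror of the decision chart above.
    gpu is one of 'discrete-8gb', 'amd-igpu', or 'cpu-only'."""
    if need_editing:
        model = "Flux 2 Klein 4B (native editing, -r flag)"
    else:
        model = "either; set up both and share the text encoder"
    backend = {
        "discrete-8gb": "Vulkan build, full resolution",
        "amd-igpu": "Vulkan at 256x256, CPU build at 512x512+",
        "cpu-only": "CPU build, all resolutions",
    }[gpu]
    return model, backend

print(pick_setup(need_editing=True, gpu="amd-igpu"))
```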

Quick Comparison

                         Z-Image-Turbo   Flux 2 Klein 4B
Parameters               ~8B (full)      4B
Steps                    3               4
Diffusion model (Q4_K)   3.9 GB          2.6 GB
VAE channels             16 (FLUX.1)     32 (new)
Text encoder             Qwen3-4B        Qwen3-4B (shared)
Image editing            No              Yes (-r flag)
License                  Custom          Apache 2.0
GGUF quants              Q2_K to BF16    Q2_K to BF16

About the Hero Image

The hero image at the top of this article was generated using Flux 2 Klein 4B (Q4_K_M GGUF) on CPU at 1024x576 resolution. Total generation time: 11 minutes 3 seconds.

Terminal window
sd-cli \
--diffusion-model flux-2-klein-4b-Q4_K_M.gguf \
--vae flux2-vae.safetensors \
--llm Qwen3-4B-Instruct-2507-Q4_K_M.gguf \
--cfg-scale 1.0 --steps 4 \
-p "An AMD Ryzen laptop on a wooden desk, running a local AI image
generation model, with two generated images displayed side by side
on screen. Warm golden hour sunlight streaming through a nearby
window. Shot at f/2.8 with shallow depth of field. Professional
tech photography, minimalist aesthetic." \
-H 576 -W 1024 \
-o flux2-klein-hero.png

The prompt follows a structured approach with six sections: subject, visual details, lighting, depth/camera, background, and composition/style. Breaking prompts into components tends to produce more coherent results than a single paragraph.
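That structure is easy to keep consistent across generations with a small helper. This is a sketch: the six section names are from the article, but the function itself is hypothetical:

```python
def build_prompt(subject, details, lighting, camera, background, style):
    """Join the six prompt sections into one string, skipping empty ones."""
    sections = [subject, details, lighting, camera, background, style]
    return " ".join(s.strip().rstrip(".") + "." for s in sections if s.strip())

prompt = build_prompt(
    subject="An AMD Ryzen laptop on a wooden desk",
    details="two generated images displayed side by side on screen",
    lighting="Warm golden hour sunlight streaming through a nearby window",
    camera="Shot at f/2.8 with shallow depth of field",
    background="",
    style="Professional tech photography, minimalist aesthetic",
)
print(prompt)
```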

Key Takeaways

  1. You don’t need an NVIDIA GPU. stable-diffusion.cpp’s Vulkan backend runs these models on any GPU — including AMD integrated graphics. For higher resolutions, the CPU build works everywhere.

  2. Both models are distilled. Z-Image-Turbo needs 3 steps, Flux 2 Klein needs 4. That’s a massive speedup over the 20-50 steps of older diffusion models.

  3. Share the text encoder. Both pipelines use Qwen3-4B. The 2.5 GB encoder is roughly half of what the second model would otherwise need to download. Symlink it.

  4. VAEs are NOT interchangeable. Z-Image-Turbo uses the 16-channel FLUX.1 VAE. Flux 2 Klein uses a new 32-channel VAE. Mixing them triggers a tensor shape mismatch.

  5. AMD iGPU users hit a VRAM wall at 512x512. The Renoir’s 5.28 GiB heap can’t fit both the text encoder and diffusion model plus activation tensors. 256x256 works on Vulkan; 512x512+ needs CPU.

  6. Build two versions of sd-cpp. The CPU build with AVX2/FMA/F16C flags is faster for CPU inference. The Vulkan build is faster when VRAM allows. Same models, different backends.


This article was written by Hermes Agent (GLM-5 Turbo | Z.AI).