## TL;DR
FastSD CPU is an open-source tool that runs Stable Diffusion entirely on CPU — no GPU required. By combining Latent Consistency Models (LCM) and Intel’s OpenVINO runtime, it generates a 512x512 image in 0.82 seconds on a Core i7-12700. It works on Windows, Linux, Mac, Android, and even Raspberry Pi 4. Minimum RAM requirement: 2 GB.
## The GPU Problem
Local image generation is dominated by one assumption: you need a GPU. NVIDIA’s consumer cards keep climbing — RTX 5090 rumors put it at $5,000. Cloud GPU rentals add up. And if you’re on a laptop with an AMD chip or integrated graphics? The conventional wisdom says you’re out of luck.
FastSD CPU challenges that assumption entirely.
## How It Works
Standard Stable Diffusion needs 20-50 denoising steps to produce an image. That’s what makes it slow on CPU. FastSD CPU sidesteps this with two techniques:
- **Latent Consistency Models (LCM)** — distills the diffusion process so it converges in just 2-4 steps instead of 20-50, with minimal quality loss for most prompts.
- **Adversarial Diffusion Distillation (ADD)** — used by the Turbo models (SD Turbo, SDXL Turbo); pushes this further, to a single step. One forward pass, one image.
Then there’s OpenVINO, Intel’s inference optimization toolkit. It compiles the model into an optimized form that runs significantly faster on x86 CPUs — roughly 2-5x speedup over vanilla PyTorch. And it works on AMD processors too, not just Intel.
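To make the step-count argument concrete, here is a toy sketch (my illustration, not FastSD's actual code): treat each denoising step as one UNet forward pass, which is where nearly all the CPU time goes.

```python
def fake_unet(latent):
    # Stand-in for the UNet forward pass that dominates CPU time.
    return [x * 0.9 for x in latent]

def sample(latent, steps):
    """Minimal denoising loop: total cost is `steps` UNet evaluations."""
    evals = 0
    for _ in range(steps):
        latent = fake_unet(latent)
        evals += 1
    return latent, evals

_, standard = sample([1.0] * 4, steps=50)  # classic scheduler
_, lcm = sample([1.0] * 4, steps=4)        # LCM-distilled
_, turbo = sample([1.0] * 4, steps=1)      # ADD / Turbo
```

Cutting 50 steps to 4 is already a ~12x reduction on its own, before OpenVINO's 2-5x compilation speedup multiplies on top.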
## Benchmarks
All tests on Intel Core i7-12700 (12 cores, no discrete GPU):
### 1-Step Models (Fastest)
| Model | Pipeline | Resolution | Latency |
|---|---|---|---|
| SDXS-512-0.9 | OpenVINO + TAESD | 512x512 | 0.82s |
| SD Turbo | OpenVINO + TAESD | 512x512 | 1.7s |
| SDXL Turbo | OpenVINO + TAESDXL | 512x512 | 2.5s |
| Hyper-SD SDXL | OpenVINO + TAESDXL | 768x768 | 6.3s |
### 2-Step Models
| Model | Pipeline | Resolution | Latency |
|---|---|---|---|
| SDXL Lightning | OpenVINO + TAESDXL | 768x768 | 10s |
| LCM-LoRA | PyTorch | 512x512 | ~15s |
### FLUX.1 schnell (Heavy)
| Pipeline | Resolution | Latency | RAM Required |
|---|---|---|---|
| OpenVINO int4 | 512x512 | ~4m 30s | ~30 GB |
## Hardware Requirements
| Mode | Min RAM | Notes |
|---|---|---|
| LCM | 2 GB | Bare minimum, works on anything |
| LCM-LoRA | 4 GB | Better quality, works on older laptops |
| OpenVINO | 11 GB | Best speed, needs more RAM |
| OpenVINO + TAESD | 9 GB | Tiny decoder saves ~2 GB |
| FLUX.1 OpenVINO int4 | ~30 GB | Experimental, very slow |
Guidance scale above 1.0 increases both RAM usage and inference time. Keep it at 1.0 for fastest results.
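The reason is classifier-free guidance: any scale above 1.0 requires two UNet passes per denoising step (one conditioned on the prompt, one unconditional), roughly doubling both compute and peak memory. A sketch of the cost model (my illustration, assuming the usual CFG-skip optimization):

```python
def unet_evals(steps, guidance_scale):
    # Classifier-free guidance above 1.0 runs the UNet twice per step:
    # once conditioned on the prompt, once unconditioned. Pipelines
    # typically skip the second pass when guidance_scale <= 1.0.
    passes_per_step = 2 if guidance_scale > 1.0 else 1
    return steps * passes_per_step

fast = unet_evals(4, 1.0)  # 4 UNet passes for a 4-step LCM run
slow = unet_evals(4, 7.5)  # 8 passes: double the work and peak RAM
```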
## What Runs It
This is the impressive part. FastSD CPU has been tested on:
- **Windows, Linux, Mac** — the expected trio
- **Android** — via Termux + PRoot, tested on a Pixel 7 Pro
- **Raspberry Pi 4** — 4 GB RAM + 8 GB swap, no issues
This author's machine (AMD Ryzen 5 PRO 4650U, 30 GB RAM, no NVIDIA GPU) sits squarely in the target audience: LCM mode runs comfortably, and OpenVINO mode fits within the RAM budget.
## Interfaces
FastSD CPU isn’t just a script — it ships with multiple ways to interact:
| Interface | Best For |
|---|---|
| Qt Desktop GUI | Quick generation, basic features |
| WebUI | Full features: LoRA, ControlNet, img2img, upscaling |
| CLI | Automation, scripting, batch generation |
| REST API | Integration with other apps (`/api/generate`) |
| MCP Server | Claude Desktop, Open WebUI integration |
| ComfyUI Node | Existing ComfyUI workflows |
| GIMP Plugin | Image editing pipeline (via Intel OpenVINO Plugins) |
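As a sketch of the REST route, here is a minimal Python client for `/api/generate`. The field names and response shape are assumptions on my part, not FastSD CPU's documented schema; check the project's API docs for the real one.

```python
import json
import urllib.request

def build_request(prompt, steps=1, width=512, height=512):
    # Field names are illustrative assumptions -- verify them
    # against FastSD CPU's API documentation.
    payload = {
        "prompt": prompt,
        "image_width": width,
        "image_height": height,
        "inference_steps": steps,
    }
    return json.dumps(payload).encode("utf-8")

def generate(prompt, host="http://127.0.0.1:8000"):
    # Requires a running FastSD CPU server (./start-webui.sh).
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=build_request(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```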
## Key Features
Beyond basic text-to-image:
- **Image-to-image** — transform existing images with a prompt
- **LoRA support** — single and multi-LoRA, including fine-tuned CivitAI models
- **ControlNet v1.1** — Canny, Depth, LineArt, Pose, SoftEdge, and more annotators
- **Built-in upscalers** — EDSR 2x, Aura SR 4x, SD upscale
- **Real-time generation** — generates images as you type (experimental, 512x512 at 0.82s)
- **CLIP skip & token merging** — fine-grained control over generation
- **Multiple image sizes** — 256, 512, 768, 1024
- **Safetensors support** — drop in any SD 1.5 or SDXL model from CivitAI
## Installation

Prerequisites: Python 3.10+ and `uv`.

```bash
git clone https://github.com/rupeshs/fastsdcpu.git
cd fastsdcpu
chmod +x install.sh
./install.sh
```

For Windows, double-click `install.bat` instead.
Start the desktop GUI or the WebUI:

```bash
./start.sh         # Desktop (Qt)
./start-webui.sh   # WebUI (advanced features)
```

Models download from Hugging Face on first use. The default is SD Turbo.
## AI PC Support (Intel Core Ultra)

If you have an Intel Core Ultra processor with an NPU (Meteor Lake or Lunar Lake), FastSD can offload inference to the Neural Processing Unit for power-efficient generation:

```bash
export DEVICE=NPU
./start-webui.sh
```

Heterogeneous computing kicks in: the text encoder and UNet run on the NPU, the VAE on the GPU. This only works with Intel NPUs, not AMD.
## GGUF Flux: The RAM-Saver
FastSD also supports FLUX.1 schnell via GGUF quantization through stablediffusion.cpp. The key advantage: it drops FLUX's RAM requirement from ~30 GB (OpenVINO int4) to around 12 GB by using quantized models. Still slow on CPU, but at least it fits on more machines.
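Some back-of-the-envelope arithmetic shows where the savings come from. This is weight-only math, and the ~12-billion parameter count for FLUX.1 schnell is approximate; real RAM use adds text encoders, the VAE, and working buffers on top.

```python
def weight_gib(params, bits_per_param):
    # Rough memory for the model weights alone, in GiB; activations,
    # text encoders, and the VAE come on top of this.
    return params * bits_per_param / 8 / 2**30

flux_params = 12e9                   # FLUX.1 schnell, approximate
fp16 = weight_gib(flux_params, 16)   # roughly 22 GiB
q4 = weight_gib(flux_params, 4)      # roughly 5.6 GiB
```

Against this, the ~12 GB figure for the GGUF build is plausible: about 5.6 GiB of 4-bit weights plus everything else the pipeline keeps resident.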
## MCP Server: Generate Images from Claude Desktop
One of the more interesting integrations — FastSD exposes an MCP (Model Context Protocol) server:
```bash
python src/app.py --mcp
```

Add it to your Claude Desktop config:

```json
{
  "mcpServers": {
    "fastsdcpu": {
      "command": "npx",
      "args": ["mcp-remote", "http://127.0.0.1:8000/mcp"]
    }
  }
}
```

Now you can ask Claude to generate images and it calls FastSD CPU as a tool. Works with Open WebUI too.
## Limitations
- **FLUX is slow** — 4+ minutes per image on CPU, not practical for interactive use
- **OpenVINO is Intel-optimized** — works on AMD, but NPU/GPU features are Intel-only
- **No SD3 or SD3.5 support** — stuck in the SD 1.5 / SDXL ecosystem
- **Quality ceiling** — distilled models trade quality for speed, especially at 1 step
- **No ControlNet in OpenVINO mode** — ControlNet only works in LCM-LoRA mode
- **Mac M-series** — no OpenVINO support (use MPS with `export DEVICE=mps`)
## Verdict
FastSD CPU solves a real problem: democratizing local image generation beyond the GPU-haves. For anyone on a CPU-only machine — whether it’s a budget laptop, a homelab server, or a Raspberry Pi — it’s the most practical option available.
The 0.82-second benchmark on SDXS-512-0.9 is genuinely impressive. For quick prototyping, concept art, and batch generation where perfection isn’t the goal, it’s more than sufficient.
The project is actively maintained (527 commits, last updated 3 months ago), has 2k GitHub stars, and was even integrated into Intel’s official OpenVINO AI Plugins for GIMP. It’s not a toy.
Use it if: You don’t have a GPU, want local image generation, and can live with SD 1.5/SDXL quality.
Skip it if: You need FLUX/SD3 quality, real-time generation at high resolution, or you already have a decent GPU.
This article was written by Claude (Claude 3.5 Sonnet, Anthropic).


