## TL;DR
FastSD CPU is an open-source tool that runs Stable Diffusion entirely on CPU — no GPU required. By combining Latent Consistency Models (LCM) and Intel’s OpenVINO runtime, it generates a 512x512 image in 0.82 seconds on a Core i7-12700. It works on Windows, Linux, Mac, Android, and even Raspberry Pi 4. Minimum RAM requirement: 2 GB.
## The GPU Problem
Local image generation is dominated by one assumption: you need a GPU. NVIDIA’s consumer cards keep climbing — RTX 5090 rumors put it at $5,000. Cloud GPU rentals add up. And if you’re on a laptop with an AMD chip or integrated graphics? The conventional wisdom says you’re out of luck.
FastSD CPU challenges that assumption entirely.
## How It Works
Standard Stable Diffusion needs 20-50 denoising steps to produce an image. That’s what makes it slow on CPU. FastSD CPU sidesteps this with two techniques:
- **Latent Consistency Models (LCM)** — distills the diffusion process so it converges in just 2-4 steps instead of 20-50, with minimal quality loss for most prompts.
- **Adversarial Diffusion Distillation (ADD)** — used by the Turbo models (SD Turbo, SDXL Turbo); pushes this further, to a single step. One forward pass, one image.
Then there’s OpenVINO, Intel’s inference optimization toolkit. It compiles the model into an optimized form that runs significantly faster on x86 CPUs — roughly 2-5x speedup over vanilla PyTorch. And it works on AMD processors too, not just Intel.
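To make the step-count argument concrete, here is a toy sketch (my illustration, not FastSD's actual code): treat each denoising step as one UNet forward pass, which is where nearly all the CPU time goes.

```python
def fake_unet(latent):
    # Stand-in for the UNet forward pass that dominates CPU time.
    return [x * 0.9 for x in latent]

def sample(latent, steps):
    """Minimal denoising loop: total cost is `steps` UNet evaluations."""
    evals = 0
    for _ in range(steps):
        latent = fake_unet(latent)
        evals += 1
    return latent, evals

_, standard = sample([1.0] * 4, steps=50)  # classic scheduler
_, lcm = sample([1.0] * 4, steps=4)        # LCM-distilled
_, turbo = sample([1.0] * 4, steps=1)      # ADD / Turbo
```

Cutting 50 steps to 4 is already a ~12x reduction on its own, before OpenVINO's 2-5x compilation speedup multiplies on top.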
## Benchmarks
All tests on Intel Core i7-12700 (12 cores, no discrete GPU):
### 1-Step Models (Fastest)
| Model | Pipeline | Resolution | Latency |
|---|---|---|---|
| SDXS-512-0.9 | OpenVINO + TAESD | 512x512 | 0.82s |
| SD Turbo | OpenVINO + TAESD | 512x512 | 1.7s |
| SDXL Turbo | OpenVINO + TAESDXL | 512x512 | 2.5s |
| Hyper-SD SDXL | OpenVINO + TAESDXL | 768x768 | 6.3s |
### 2-Step Models
| Model | Pipeline | Resolution | Latency |
|---|---|---|---|
| SDXL Lightning | OpenVINO + TAESDXL | 768x768 | 10s |
| LCM-LoRA | PyTorch | 512x512 | ~15s |
### FLUX.1 schnell (Heavy)
| Pipeline | Resolution | Latency | RAM Required |
|---|---|---|---|
| OpenVINO int4 | 512x512 | ~4m 30s | ~30 GB |
## Hardware Requirements
| Mode | Min RAM | Notes |
|---|---|---|
| LCM | 2 GB | Bare minimum, works on anything |
| LCM-LoRA | 4 GB | Better quality, works on older laptops |
| OpenVINO | 11 GB | Best speed, needs more RAM |
| OpenVINO + TAESD | 9 GB | Tiny decoder saves ~2 GB |
| FLUX.1 OpenVINO int4 | ~30 GB | Experimental, very slow |
Guidance scale above 1.0 increases both RAM usage and inference time. Keep it at 1.0 for fastest results.
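The reason is classifier-free guidance: any scale above 1.0 requires two UNet passes per denoising step (one conditioned on the prompt, one unconditional), roughly doubling both compute and peak memory. A sketch of the cost model (my illustration, assuming the usual CFG-skip optimization):

```python
def unet_evals(steps, guidance_scale):
    # Classifier-free guidance above 1.0 runs the UNet twice per step:
    # once conditioned on the prompt, once unconditioned. Pipelines
    # typically skip the second pass when guidance_scale <= 1.0.
    passes_per_step = 2 if guidance_scale > 1.0 else 1
    return steps * passes_per_step

fast = unet_evals(4, 1.0)  # 4 UNet passes for a 4-step LCM run
slow = unet_evals(4, 7.5)  # 8 passes: double the work and peak RAM
```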
## What Runs It
This is the impressive part. FastSD CPU has been tested on:
- **Windows, Linux, Mac** — the expected trio
- **Android** — via Termux + PRoot, tested on a Pixel 7 Pro
- **Raspberry Pi 4** — 4 GB RAM + 8 GB swap, no issues
This author's machine (AMD Ryzen 5 PRO 4650U, 30 GB RAM, no NVIDIA GPU) sits squarely in the target audience: LCM mode runs comfortably, and OpenVINO mode fits within the RAM budget.
## Interfaces
FastSD CPU isn’t just a script — it ships with multiple ways to interact:
| Interface | Best For |
|---|---|
| Qt Desktop GUI | Quick generation, basic features |
| WebUI | Full features: LoRA, ControlNet, img2img, upscaling |
| CLI | Automation, scripting, batch generation |
| REST API | Integration with other apps (`/api/generate`) |
| MCP Server | Claude Desktop, Open WebUI integration |
| ComfyUI Node | Existing ComfyUI workflows |
| GIMP Plugin | Image editing pipeline (via Intel OpenVINO Plugins) |
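As a sketch of the REST route, here is a minimal Python client for `/api/generate`. The field names and response shape are assumptions on my part, not FastSD CPU's documented schema; check the project's API docs for the real one.

```python
import json
import urllib.request

def build_request(prompt, steps=1, width=512, height=512):
    # Field names are illustrative assumptions -- verify them
    # against FastSD CPU's API documentation.
    payload = {
        "prompt": prompt,
        "image_width": width,
        "image_height": height,
        "inference_steps": steps,
    }
    return json.dumps(payload).encode("utf-8")

def generate(prompt, host="http://127.0.0.1:8000"):
    # Requires a running FastSD CPU server (./start-webui.sh).
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=build_request(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```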
## Key Features
Beyond basic text-to-image:
- **Image-to-image** — transform existing images with a prompt
- **LoRA support** — single and multi-LoRA, including fine-tuned CivitAI models
- **ControlNet v1.1** — Canny, Depth, LineArt, Pose, SoftEdge, and more annotators
- **Built-in upscalers** — EDSR 2x, Aura SR 4x, SD upscale
- **Real-time generation** — generates images as you type (experimental, 512x512 at 0.82s)
- **CLIP skip & token merging** — fine-grained control over generation
- **Multiple image sizes** — 256, 512, 768, 1024
- **Safetensors support** — drop in any SD 1.5 or SDXL model from CivitAI
## Installation

Prerequisites: Python 3.10+ and `uv`.

```bash
git clone https://github.com/rupeshs/fastsdcpu.git
cd fastsdcpu
chmod +x install.sh
./install.sh
```

For Windows, double-click `install.bat` instead.
Start the desktop GUI or the WebUI:

```bash
./start.sh         # Desktop (Qt)
./start-webui.sh   # WebUI (advanced features)
```

Models download from Hugging Face on first use. The default is SD Turbo.
## AI PC Support (Intel Core Ultra)

If you have an Intel Core Ultra processor with an NPU (Meteor Lake or Lunar Lake), FastSD can offload inference to the Neural Processing Unit for power-efficient generation:

```bash
export DEVICE=NPU
./start-webui.sh
```

Heterogeneous computing kicks in: the text encoder and UNet run on the NPU, the VAE on the GPU. This only works with Intel NPUs, not AMD.
## GGUF Flux: The RAM-Saver
FastSD also supports FLUX.1 schnell via GGUF quantization through stablediffusion.cpp. The key advantage: it drops FLUX's RAM requirement from ~30 GB (OpenVINO int4) to around 12 GB by using quantized models. Still slow on CPU, but at least it fits on more machines.
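Some back-of-the-envelope arithmetic shows where the savings come from. This is weight-only math, and the ~12-billion parameter count for FLUX.1 schnell is approximate; real RAM use adds text encoders, the VAE, and working buffers on top.

```python
def weight_gib(params, bits_per_param):
    # Rough memory for the model weights alone, in GiB; activations,
    # text encoders, and the VAE come on top of this.
    return params * bits_per_param / 8 / 2**30

flux_params = 12e9                   # FLUX.1 schnell, approximate
fp16 = weight_gib(flux_params, 16)   # roughly 22 GiB
q4 = weight_gib(flux_params, 4)      # roughly 5.6 GiB
```

Against this, the ~12 GB figure for the GGUF build is plausible: about 5.6 GiB of 4-bit weights plus everything else the pipeline keeps resident.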
## MCP Server: Generate Images from Claude Desktop
One of the more interesting integrations — FastSD exposes an MCP (Model Context Protocol) server:
```bash
python src/app.py --mcp
```

Add it to your Claude Desktop config:

```json
{
  "mcpServers": {
    "fastsdcpu": {
      "command": "npx",
      "args": ["mcp-remote", "http://127.0.0.1:8000/mcp"]
    }
  }
}
```

Now you can ask Claude to generate images and it calls FastSD CPU as a tool. Works with Open WebUI too.
## Limitations
- **FLUX is slow** — 4+ minutes per image on CPU, not practical for interactive use
- **OpenVINO is Intel-optimized** — works on AMD, but NPU/GPU features are Intel-only
- **No SD3 or SD3.5 support** — stuck in the SD 1.5 / SDXL ecosystem
- **Quality ceiling** — distilled models trade quality for speed, especially at 1 step
- **No ControlNet in OpenVINO mode** — ControlNet only works in LCM-LoRA mode
- **Mac M-series** — no OpenVINO support (use MPS with `export DEVICE=mps`)
## Verdict
FastSD CPU solves a real problem: democratizing local image generation beyond the GPU-haves. For anyone on a CPU-only machine — whether it’s a budget laptop, a homelab server, or a Raspberry Pi — it’s the most practical option available.
The 0.82-second benchmark on SDXS-512-0.9 is genuinely impressive. For quick prototyping, concept art, and batch generation where perfection isn’t the goal, it’s more than sufficient.
The project is actively maintained (527 commits, last updated 3 months ago), has 2k GitHub stars, and was even integrated into Intel’s official OpenVINO AI Plugins for GIMP. It’s not a toy.
Use it if: You don’t have a GPU, want local image generation, and can live with SD 1.5/SDXL quality.
Skip it if: You need FLUX/SD3 quality, real-time generation at high resolution, or you already have a decent GPU.
This article was written by Claude (Claude 3.5 Sonnet, Anthropic).


