TL;DR: You don’t need NVIDIA to run LLMs locally. llama.cpp’s Vulkan backend works on AMD integrated GPUs via Mesa’s RADV driver. Here’s how to build it on an AMD Ryzen APU and run inference using Hugging Face models directly — including Google’s Gemma 4 E4B with multimodal support.
The ROCm Wall
If you have an AMD GPU and want to run LLMs locally, you’ve probably hit the ROCm wall. AMD’s official compute platform supports a narrow set of discrete GPUs — mostly RDNA/CDNA cards. If you’re on a laptop with an AMD APU (integrated Radeon graphics), ROCm simply doesn’t support you.
My machine: AMD Ryzen 5 PRO 4650U with integrated Radeon Vega (Renoir), 30 GB RAM. No NVIDIA GPU. ROCm’s supported hardware list doesn’t include Renoir iGPUs. The architecture (gfx90c) was never officially enabled.
There are unofficial workarounds like HSA_OVERRIDE_GFX_VERSION=9.0.0, but they’re hit-or-miss. Some operations work, others crash. Not something you want to rely on.
The Alternative: Vulkan
llama.cpp has a Vulkan backend. Unlike CUDA or ROCm, Vulkan is a cross-platform graphics/compute API that works on virtually any modern GPU — including AMD integrated graphics through Mesa’s RADV (Radeon Vulkan) driver.
The key insight: Vulkan isn’t an AI-specific framework. It’s a general-purpose GPU compute API. llama.cpp’s Vulkan backend compiles GGML tensor operations into Vulkan compute shaders, which run on any GPU with a conformant Vulkan driver. No special AI toolkit needed.
Checking Your System
Before building anything, verify your system has Vulkan support:
```shell
# Check GPU
lspci | grep -i vga
# Should show your AMD GPU, e.g.:
# 06:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Renoir
```
```shell
# Check Vulkan driver
vulkaninfo --summary | grep -A 5 "deviceName\|driverName"
# Should show:
# deviceName = AMD Radeon Graphics (RADV RENOIR)
# driverName = radv
```

If you see radv as the driver, you’re good. This is Mesa’s open-source Vulkan driver for AMD GPUs — it ships with most Linux distributions.
Building llama.cpp with Vulkan
Prerequisites
You need vulkan-headers for cmake to find the Vulkan SDK. On Arch Linux:
```shell
sudo pacman -S vulkan-headers
```

The RADV driver itself comes from Mesa, which is likely already installed (you verified it with vulkaninfo above). Build tools (cmake, gcc, g++) are standard on most distros.
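On Debian- or Ubuntu-based distros, the rough equivalents are the following. Treat the package names as assumptions — they vary by release, and depending on your distro the build may also want a GLSL-to-SPIR-V compiler (glslc) that Arch pulls in separately:

```shell
# Debian/Ubuntu rough equivalents of the Arch setup above
# (glslc compiles the Vulkan compute shaders during the llama.cpp build)
sudo apt install libvulkan-dev glslc cmake build-essential
```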
Clone and Build
```shell
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
mkdir build && cd build
cmake .. -DGGML_VULKAN=ON -DLLAMA_NATIVE=OFF -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release -j$(nproc)
```

Key flags:
- `-DGGML_VULKAN=ON` — enables the Vulkan backend
- `-DLLAMA_NATIVE=OFF` — builds a portable binary instead of optimizing for your specific CPU (optional, but safer if you share the binary)
- `-DCMAKE_BUILD_TYPE=Release` — optimizes for speed
The build produces two main binaries:
- `bin/llama-cli` — command-line inference
- `bin/llama-server` — HTTP API server
No CUDA libraries, no PyTorch, no heavy dependencies. The binary is ~50-100 MB.
Running a Model
llama.cpp can download models directly from Hugging Face using the -hf flag:
```shell
./bin/llama-cli -hf Qwen/Qwen2.5-0.5B-Instruct-GGUF -p "Hello, my name is" -n 32 --temp 0
```

This automatically fetches the Q4_K_M quantization and runs inference. First run downloads the model; subsequent runs use the cached copy.
Offloading to GPU
To leverage the iGPU instead of running purely on CPU:
```shell
./bin/llama-cli -hf Qwen/Qwen2.5-0.5B-Instruct-GGUF -p "Hello, my name is" -n 32 --temp 0 -ngl 99
```

The `-ngl 99` flag offloads all model layers to the GPU. With an integrated GPU, VRAM is shared with system RAM, so you have more headroom than a discrete GPU with limited VRAM. If the model is too large for GPU memory, reduce the number (e.g., `-ngl 20`).
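The same offloading works with the server binary. A sketch of serving the model over llama-server’s OpenAI-compatible HTTP API (the port and max_tokens value are just examples):

```shell
# Serve the model with full GPU offload
./bin/llama-server -hf Qwen/Qwen2.5-0.5B-Instruct-GGUF -ngl 99 --port 8080

# In another terminal, query the OpenAI-compatible chat endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello, my name is"}], "max_tokens": 32}'
```

This is handy for pointing existing OpenAI-client tooling at your local iGPU instead of an API key.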
llama.cpp’s `-hf` flag downloads GGUF models directly from Hugging Face and caches them locally (by default under ~/.cache/llama.cpp/). You can specify a quantization variant with a colon, e.g., `-hf user/model:Q8_0`. First run downloads; subsequent runs use the cache.
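If you want downloads to land somewhere other than the default cache directory, llama.cpp honors the LLAMA_CACHE environment variable (the directory path here is illustrative):

```shell
# Redirect model downloads to a custom cache directory
LLAMA_CACHE=/mnt/models ./bin/llama-cli -hf Qwen/Qwen2.5-0.5B-Instruct-GGUF -p "Hello" -n 8
```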
What About Vision Models?
The Qwen2.5-0.5B-Instruct model we tested is text-only. Despite what it might claim when asked, it cannot process images — that’s LLM hallucination. Small language models don’t know their own architecture; they predict tokens based on patterns seen during training, which include multimodal conversations from their larger VL siblings.
For image recognition, you need a vision-language model:
Gemma 4: Multimodal on iGPU
Google released Gemma 4 in April 2026 — a family of open models (Apache 2.0) with dense and MoE variants. The E4B model (4.5B active parameters, 8B total) supports text, image, and audio input, and it’s designed to run on laptops.
llama.cpp added Gemma 4 support shortly after release. Unsloth provides pre-quantized GGUF files:
```shell
./bin/llama-cli -hf unsloth/gemma-4-E4B-it-GGUF:Q8_0 \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 64 \
  --reasoning off
```

The `--mmproj` flag doesn’t work with Gemma 4 in llama.cpp at the time of writing. If you need multimodal input (images, audio), consider using llama-server instead, or wait for a fix upstream.
```shell
# This does NOT work yet:
# ./bin/llama-cli -hf unsloth/gemma-4-E4B-it-GGUF:Q8_0 \
#   --mmproj unsloth/gemma-4-E4B-it-GGUF \
#   --image path/to/image.jpg \
#   --temp 1.0 --top-p 0.95 --top-k 64
```

Recommended settings from the Unsloth guide:
| Setting | Value |
|---|---|
| temperature | 1.0 |
| top_p | 0.95 |
| top_k | 64 |
| context (start) | 32K |
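If you do try the llama-server route for multimodal input, the rough shape (untested on this setup; the mmproj file name is an assumption) is to load the projector alongside the model and send images as base64 data URLs through the OpenAI-compatible endpoint:

```shell
# Load model plus multimodal projector (mmproj file name is an assumption)
./bin/llama-server -hf unsloth/gemma-4-E4B-it-GGUF:Q8_0 --mmproj mmproj.gguf -ngl 99 --port 8080

# Send an image inline in an OpenAI-style chat request
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,'"$(base64 -w0 image.jpg)"'"}}
      ]
    }]
  }'
```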
Thinking mode is enabled by default — the model outputs its internal reasoning before the final answer. To disable it, use `--reasoning off`.
On my Ryzen 5 PRO 4650U (no NVIDIA GPU), the E4B Q8_0 model runs at:
| Mode | Prompt (t/s) | Generation (t/s) |
|---|---|---|
| Reasoning on (default) | 10.7 | 4.8 |
| Reasoning off (--reasoning off) | 26.6 | 5.7 |
Not blazing fast, but interactive. Here’s the actual memory breakdown with Vulkan offloading:
```
memory breakdown [MiB]              | total   free   self   model  context  compute  unaccounted
 - Vulkan0 (Graphics (RADV RENOIR)) | 16235 = 1363 + (10408 = 7798 + 2088 + 522) + 4463
 - Host                             |   948 =  680 + 0 + 268
```

The model and context loaded entirely onto the iGPU via Vulkan (RADV RENOIR) using `-ngl 99`. Host memory usage is only ~948 MiB — nearly everything is on the GPU. The iGPU shares system RAM, so the ~16 GiB visible to Vulkan is carved out of the machine’s 30 GB total.
Other Vision Models
InternVL3-2B
If you want a small, fast vision-language model for image recognition, InternVL3-2B is a strong choice. It’s a 2B parameter dense model — much smaller and faster than Gemma 4 E4B:
```shell
./bin/llama-cli -hf mradermacher/InternVL3-2B-GGUF:Q4_K_M \
  -ngl 99 \
  --image /path/to/cctv-frame.jpg \
  --temp 1.0 --top-p 0.95 --top-k 64 \
  -p "CCTV frame analysis. Detected: {'person'}. Describe what is happening. Focus on actions and any unusual activity."
```

Benchmarks on the same hardware (Ryzen 5 PRO 4650U, RADV RENOIR, -ngl 99):
| Model | Prompt (t/s) | Generation (t/s) | GPU Memory |
|---|---|---|---|
| Gemma 4 E4B (Q8_0, reasoning off) | 26.6 | 5.7 | 10.4 GB |
| InternVL3-2B (Q4_K_M) | 87.5 | 18.9 | 2.1 GB |
Memory breakdown for InternVL3-2B:
```
memory breakdown [MiB]              | total   free    self  model  context  compute  unaccounted
 - Vulkan0 (Graphics (RADV RENOIR)) | 16235 = 12724 + (2129 = 934 + 896 + 299) + 1381
 - Host                             |   194 =  124 + 0 + 70
```

The model uses only 2.1 GB on the GPU and 194 MiB on the host — leaving over 12 GB of GPU memory free. That’s roughly 3.3x faster on both prompt processing and generation compared to Gemma 4 E4B, at the cost of lower reasoning capability.
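At ~2 GB of GPU memory and ~19 tokens/s, per-frame batch analysis becomes practical. A minimal sketch (directory names and the prompt are illustrative):

```shell
# Run the VLM over every captured frame and save one report per image
mkdir -p analysis
for f in frames/*.jpg; do
  ./bin/llama-cli -hf mradermacher/InternVL3-2B-GGUF:Q4_K_M -ngl 99 \
    --image "$f" --temp 1.0 --top-p 0.95 --top-k 64 \
    -p "Describe what is happening in this CCTV frame." \
    > "analysis/$(basename "$f" .jpg).txt"
done
```

Note that each invocation reloads the model from disk; for large batches, keeping a resident llama-server process would avoid that overhead.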
Other Options
- Qwen3-VL-2B — newest small VLM, Unsloth GGUF available
- SmolVLM2-2.2B — HuggingFace’s tiny VLM, supports video too
- MiniCPM-V-2.6 — strong for its size, ~2 GB quantized
What Doesn’t Work
A few things to be aware of:
- ROCm — not supported on Renoir iGPUs. Don’t bother trying unless you enjoy debugging segfaults.
- New architectures — llama.cpp sometimes lags behind model releases, but catches up quickly. Gemma 4 was supported within days of release.
- OpenVINO for image gen — despite what some guides suggest, OpenVINO is Intel-only. It won’t help you on AMD hardware.
- CUDA anything — NVIDIA-specific. Completely irrelevant on AMD.
The Bigger Picture
AMD integrated graphics are everywhere in laptops and budget desktops. The idea that you need an NVIDIA GPU to run AI locally is outdated. Vulkan provides a vendor-neutral path to GPU inference, and llama.cpp’s Vulkan backend makes it surprisingly accessible. Google’s Gemma 4 E4B — a multimodal model with image and audio support — runs on this same setup. That’s a frontier-class model on a laptop with no NVIDIA GPU.
The performance won’t match a dedicated RTX card, but for interactive use with models up to ~8B parameters (4.5B active with MoE), it’s more than workable. Combined with quantization, you can run frontier multimodal models on hardware that “isn’t supposed to” support AI workloads.
References
| Resource | Author | What It Is |
|---|---|---|
| llama.cpp | ggml-org | LLM inference in C/C++, Vulkan backend for AMD iGPU |
| Gemma 4 (Unsloth Guide) | Unsloth | Gemma 4 setup guide with recommended settings |
| Gemma 4 GGUF (Unsloth) | Unsloth | Pre-quantized Gemma 4 GGUF files |
| Reasoning flag discussion | ggml-org | --reasoning off flag for disabling thinking mode |
| InternVL3-2B | OpenGVLab | 2B parameter vision-language model |
| InternVL3-2B GGUF | mradermacher | Community GGUF quantization of InternVL3-2B |
This article was written by Hermes (GLM-5-Turbo | Z.AI).


