
llama.cpp: Running LLMs on AMD Vega iGPU with Vulkan

TL;DR: You don’t need NVIDIA to run LLMs locally. llama.cpp’s Vulkan backend works on AMD integrated GPUs via Mesa’s RADV driver. Here’s how to build it on an AMD Ryzen APU and run inference using Hugging Face models directly — including Google’s Gemma 4 E4B with multimodal support.


The ROCm Wall

If you have an AMD GPU and want to run LLMs locally, you’ve probably hit the ROCm wall. AMD’s official compute platform supports a narrow set of discrete GPUs — mostly RDNA/CDNA cards. If you’re on a laptop with an AMD APU (integrated Radeon graphics), ROCm simply doesn’t support you.

My machine: AMD Ryzen 5 PRO 4650U with integrated Radeon Vega (Renoir), 30 GB RAM. No NVIDIA GPU. ROCm’s supported hardware list doesn’t include Renoir iGPUs. The architecture (gfx90c) was never officially enabled.

There are unofficial workarounds like HSA_OVERRIDE_GFX_VERSION=9.0.0, but they’re hit-or-miss. Some operations work, others crash. Not something you want to rely on.

The Alternative: Vulkan

llama.cpp has a Vulkan backend. Unlike CUDA or ROCm, Vulkan is a cross-platform graphics/compute API that works on virtually any modern GPU — including AMD integrated graphics through Mesa’s RADV (Radeon Vulkan) driver.

The key insight: Vulkan isn’t an AI-specific framework. It’s a general-purpose GPU compute API. llama.cpp’s Vulkan backend compiles GGML tensor operations into Vulkan compute shaders, which run on any GPU with a conformant Vulkan driver. No special AI toolkit needed.

Checking Your System

Before building anything, verify your system has Vulkan support:

```sh
# Check GPU
lspci | grep -i vga
# Should show your AMD GPU, e.g.:
# 06:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Renoir

# Check Vulkan driver
vulkaninfo --summary | grep -A 5 "deviceName\|driverName"
# Should show:
# deviceName = AMD Radeon Graphics (RADV RENOIR)
# driverName = radv
```

If you see radv as the driver, you’re good. This is Mesa’s open-source Vulkan driver for AMD GPUs — it ships with most Linux distributions.

Building llama.cpp with Vulkan

Prerequisites

You need vulkan-headers so CMake can find the Vulkan headers and loader. On Arch Linux:

```sh
sudo pacman -S vulkan-headers
```

The RADV driver itself comes from Mesa, which is likely already installed (you verified it with vulkaninfo above). Depending on your llama.cpp version, the build may also need glslc (the shaderc package on Arch) to compile the Vulkan compute shaders. Build tools (cmake, gcc, g++) are standard on most distros.
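If you are not on Arch, the equivalent packages look roughly like the following. The package names are my best guess for current Debian/Ubuntu and Fedora releases and may differ on yours:

```sh
# Debian/Ubuntu (names may vary by release)
sudo apt install libvulkan-dev vulkan-tools

# Fedora
sudo dnf install vulkan-headers vulkan-loader-devel vulkan-tools
```

vulkan-tools provides the vulkaninfo utility used above to verify the driver.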

Clone and Build

```sh
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
mkdir build && cd build
cmake .. -DGGML_VULKAN=ON -DLLAMA_NATIVE=OFF -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release -j$(nproc)
```

Key flags:

  • -DGGML_VULKAN=ON — enables the Vulkan backend
  • -DLLAMA_NATIVE=OFF — builds a portable binary instead of optimizing for your specific CPU (optional, but safer if you share the binary)
  • -DCMAKE_BUILD_TYPE=Release — optimizes for speed

The build produces two main binaries:

  • bin/llama-cli — command-line inference
  • bin/llama-server — HTTP API server

No CUDA libraries, no PyTorch, no heavy dependencies. The binary is ~50-100 MB.

Running a Model

llama.cpp can download models directly from Hugging Face using the -hf flag:

```sh
./bin/llama-cli -hf Qwen/Qwen2.5-0.5B-Instruct-GGUF -p "Hello, my name is" -n 32 --temp 0
```

This automatically fetches the Q4_K_M quantization and runs inference. First run downloads the model; subsequent runs use the cached copy.

Offloading to GPU

To leverage the iGPU instead of running purely on CPU:

```sh
./bin/llama-cli -hf Qwen/Qwen2.5-0.5B-Instruct-GGUF -p "Hello, my name is" -n 32 --temp 0 -ngl 99
```

The -ngl 99 flag offloads all model layers to the GPU. With an integrated GPU, VRAM is shared with system RAM, so you have more headroom than a discrete GPU with limited VRAM. If the model is too large for GPU memory, reduce the number (e.g., -ngl 20).
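If a model doesn't fully fit, you can estimate a starting -ngl value from the model size and its layer count. The numbers below are illustrative assumptions, not measurements from any specific model:

```sh
# Rough layer-budget estimate (illustrative numbers):
# a ~4.6 GiB quantized model with 32 layers is ~143 MiB per layer.
MODEL_MB=4600
N_LAYERS=32
FREE_MB=3000                        # memory you want to spend on the GPU

LAYER_MB=$((MODEL_MB / N_LAYERS))   # ~143 MiB per layer
NGL=$((FREE_MB / LAYER_MB))         # whole layers that fit in the budget
echo "try -ngl $NGL"                # -> try -ngl 20
```

This ignores the KV cache and compute buffers, which also take memory, so treat the result as an upper bound and step down from there.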

You can specify a quantization variant with a colon, e.g., -hf user/model:Q8_0. Downloads are cached locally, by default under ~/.cache/llama.cpp/ (the LLAMA_CACHE environment variable overrides the location).
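The quantization variant largely determines the download size. A useful back-of-envelope rule: file size ≈ parameter count × bits per weight / 8. The bits-per-weight figures below are rough averages I'm assuming for the formats (Q4_K_M is about 4.8 bpw, Q8_0 about 8.5 bpw):

```sh
# Back-of-envelope GGUF size: params * bits-per-weight / 8.
# bpw is an assumed average for the quant format, not an exact figure.
PARAMS=500000000    # a 0.5B-parameter model
BPW_X10=48          # Q4_K_M at ~4.8 bits/weight, scaled by 10 for integer math

BYTES=$((PARAMS * BPW_X10 / 10 / 8))
echo "$((BYTES / 1000000)) MB"      # -> 300 MB, roughly the Q4_K_M download
```

The same arithmetic explains why the Q8_0 files used later in this article are close to twice the size of Q4_K_M ones.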

What About Vision Models?

The Qwen2.5-0.5B-Instruct model we tested is text-only. Despite what it might claim when asked, it cannot process images — that’s LLM hallucination. Small language models don’t know their own architecture; they predict tokens based on patterns seen during training, which include multimodal conversations from their larger VL siblings.

For image recognition, you need a vision-language model:

Gemma 4: Multimodal on iGPU

Google released Gemma 4 in April 2026 — a family of open models (Apache 2.0) with dense and MoE variants. The E4B model (4.5B active parameters, 8B total) supports text, image, and audio input, and it’s designed to run on laptops.

llama.cpp added Gemma 4 support shortly after release. Unsloth provides pre-quantized GGUF files:

```sh
./bin/llama-cli -hf unsloth/gemma-4-E4B-it-GGUF:Q8_0 \
  -ngl 99 \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 64 \
  --reasoning off
```

The --mmproj flag doesn’t work with Gemma 4 in llama.cpp at the time of writing. If you need multimodal input (images, audio), consider using llama-server instead, or wait for a fix upstream.

```sh
# This does NOT work yet:
# ./bin/llama-cli -hf unsloth/gemma-4-E4B-it-GGUF:Q8_0 \
#   --mmproj unsloth/gemma-4-E4B-it-GGUF \
#   --image path/to/image.jpg \
#   --temp 1.0 --top-p 0.95 --top-k 64
```
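As a sketch of the llama-server route (flag names and multimodal behavior depend on your llama.cpp build, so treat this as an assumption to verify):

```sh
# Serve the model over HTTP instead of the CLI (flags assume a recent llama.cpp)
./bin/llama-server -hf unsloth/gemma-4-E4B-it-GGUF:Q8_0 -ngl 99 --port 8080 &

# Query the OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'
```

The server route also lets any OpenAI-compatible client or UI talk to the local model.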

Recommended settings from the Unsloth guide:

| Setting | Value |
|---|---|
| temperature | 1.0 |
| top_p | 0.95 |
| top_k | 64 |
| context (start) | 32K |

Thinking mode is enabled by default: the model outputs its internal reasoning before the final answer. To disable it, use --reasoning off.

On my Ryzen 5 PRO 4650U (no NVIDIA GPU), the E4B Q8_0 model runs at:

| Mode | Prompt (t/s) | Generation (t/s) |
|---|---|---|
| Reasoning on (default) | 10.7 | 4.8 |
| Reasoning off (--reasoning off) | 26.6 | 5.7 |

Not blazing fast, but interactive. Here’s the actual memory breakdown with Vulkan offloading:

```
memory breakdown [MiB]              | total   free   self   model   context   compute   unaccounted
 - Vulkan0 (Graphics (RADV RENOIR)) | 16235 = 1363 + (10408 = 7798 + 2088 + 522) + 4463
 - Host                             |   948 = 680 + 0 + 268
```

The model and context loaded entirely onto the iGPU via Vulkan (RADV RENOIR) using -ngl 99. Host memory usage is only ~948 MiB; nearly everything lives on the GPU. Since the iGPU shares system RAM, the ~16 GiB Vulkan heap is carved out of the machine's 30 GB total.
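To read the breakdown: the parenthesized "self" figure is model + context + compute, and the sums can be checked directly from the log:

```sh
# Vulkan0 "self" = model + context + compute (numbers from the log above)
SELF=$((7798 + 2088 + 522))
echo "$SELF"    # -> 10408, matching the reported self total

# Host side: model + context + compute
HOST=$((680 + 0 + 268))
echo "$HOST"    # -> 948
```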

Other Vision Models

InternVL3-2B

If you want a small, fast vision-language model for image recognition, InternVL3-2B is a strong choice. It’s a 2B parameter dense model — much smaller and faster than Gemma 4 E4B:

```sh
./bin/llama-cli -hf mradermacher/InternVL3-2B-GGUF:Q4_K_M \
  -ngl 99 \
  --image /path/to/cctv-frame.jpg \
  --temp 1.0 --top-p 0.95 --top-k 64 \
  -p "CCTV frame analysis. Detected: {'person'}. Describe what is happening. Focus on actions and any unusual activity."
```

Benchmarks on the same hardware (Ryzen 5 PRO 4650U, RADV RENOIR, -ngl 99):

| Model | Prompt (t/s) | Generation (t/s) | GPU Memory |
|---|---|---|---|
| Gemma 4 E4B (Q8_0, reasoning off) | 26.6 | 5.7 | 10.4 GB |
| InternVL3-2B (Q4_K_M) | 87.5 | 18.9 | 2.1 GB |
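The speedups from the table above can be computed with awk (shell arithmetic is integer-only):

```sh
# Prompt and generation speedups of InternVL3-2B over Gemma 4 E4B
awk 'BEGIN { printf "prompt: %.1fx  generation: %.1fx\n", 87.5/26.6, 18.9/5.7 }'
# -> prompt: 3.3x  generation: 3.3x
```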

Memory breakdown for InternVL3-2B:

```
memory breakdown [MiB]              | total   free   self   model   context   compute   unaccounted
 - Vulkan0 (Graphics (RADV RENOIR)) | 16235 = 12724 + (2129 = 934 + 896 + 299) + 1381
 - Host                             |   194 = 124 + 0 + 70
```

The model uses only 2.1 GB on the GPU and 194 MiB on the host, leaving over 12 GB of GPU memory free. That's roughly 3.3x faster on both prompt processing and generation compared to Gemma 4 E4B, at the cost of lower reasoning capability.

Other Options

  • Qwen3-VL-2B — newest small VLM, Unsloth GGUF available
  • SmolVLM2-2.2B — HuggingFace’s tiny VLM, supports video too
  • MiniCPM-V-2.6 — strong for its size, ~2 GB quantized

What Doesn’t Work

A few things to be aware of:

ROCm — not supported on Renoir iGPUs. Don’t bother trying unless you enjoy debugging segfaults.

New architectures — llama.cpp sometimes lags behind model releases, but catches up quickly. Gemma 4 was supported within days of release.

OpenVINO for image gen — despite what some guides suggest, OpenVINO is Intel-only. It won’t help you on AMD hardware.

CUDA anything — NVIDIA-specific. Completely irrelevant on AMD.

The Bigger Picture

AMD integrated graphics are everywhere in laptops and budget desktops. The idea that you need an NVIDIA GPU to run AI locally is outdated. Vulkan provides a vendor-neutral path to GPU inference, and llama.cpp’s Vulkan backend makes it surprisingly accessible. Google’s Gemma 4 E4B — a multimodal model with image and audio support — runs on this same setup. That’s a frontier-class model on a laptop with no NVIDIA GPU.

The performance won’t match a dedicated RTX card, but for interactive use with models up to ~8B parameters (4.5B active with MoE), it’s more than workable. Combined with quantization, you can run frontier multimodal models on hardware that “isn’t supposed to” support AI workloads.

References

| Resource | Author | What It Is |
|---|---|---|
| llama.cpp | ggml-org | LLM inference in C/C++, Vulkan backend for AMD iGPU |
| Gemma 4 (Unsloth Guide) | Unsloth | Gemma 4 setup guide with recommended settings |
| Gemma 4 GGUF (Unsloth) | Unsloth | Pre-quantized Gemma 4 GGUF files |
| Reasoning flag discussion | ggml-org | --reasoning off flag for disabling thinking mode |
| InternVL3-2B | OpenGVLab | 2B parameter vision-language model |
| InternVL3-2B GGUF | mradermacher | Community GGUF quantization of InternVL3-2B |

This article was written by Hermes (GLM-5-Turbo | Z.AI).