Build a Self-Hosted Multimodal RAG Agent with Docling, n8n, and Ollama


TL;DR: Run a complete multimodal RAG pipeline entirely on your hardware — Docling extracts structured text and images from PDFs, n8n orchestrates ingestion and retrieval, Ollama serves local LLMs, and Qdrant stores embeddings. All services run in Docker Compose with no external API calls.

Every document sent to ChatGPT or Claude is a potential security liability. Legal contracts, medical records, financial statements — they all pass through a third-party service with its own data policies. For organizations handling sensitive information, the solution is full local deployment: an air-gapped RAG system where nothing leaves your infrastructure.

This guide walks through building exactly that — a multimodal RAG agent that processes PDFs with embedded images, tables, and diagrams, makes those images retrievable in chat, and runs entirely on Docker with zero external API calls.

What Is Multimodal RAG

Standard RAG retrieves text chunks from a vector store. Multimodal RAG retrieves across multiple data types — text documents, PDFs with embedded images and tables, audio transcripts, even video. The key advantage: when a user asks about a diagram in a PDF, the system retrieves and displays the actual image alongside the text, not just a text description of it.

The architecture for this guide uses five services:

  • Docling — IBM’s open-source document processing library. Feed it PDFs, Word docs, PowerPoint files, images, or audio; it outputs clean, structured markdown or JSON with recognized headers, tables, bullet points, and extracted diagrams.
  • n8n — Low-code workflow orchestrator. Handles the ingestion pipeline (watch folder, process documents, store vectors) and the chat agent.
  • Ollama — Local LLM runtime. Runs the language model and embedding model.
  • Qdrant — High-performance vector store. Stores document embeddings for semantic search.
  • PostgreSQL — Database backend for n8n state.

All five run as Docker containers orchestrated by a single Docker Compose file.

Docling: Two Processing Pipelines

Docling supports two distinct document processing approaches, and understanding the tradeoff between them matters for production deployments.

Standard Pipeline

Uses specialized non-generative models — layout analysis, table structure recognition, OCR — to extract content verbatim. No hallucinations because there is no generative step. The output is an exact copy of what is in the document, preserving semantic structure (headers, tables, bullet points, diagrams as images with searchable text).

VLM Pipeline

Breaks documents into pages and batch-processes each through a Vision Language Model (like Granite Docling, SmolDocling, or Qwen VL). The VLM extracts text into a structured format. More flexible with complex layouts but introduces potential hallucinations since it uses generative AI.

For most use cases, the standard pipeline is the right starting point — verbatim extraction with zero hallucination risk. The VLM pipeline is worth exploring for documents with complex visual layouts where the standard pipeline’s OCR struggles.

Hardware Requirements

Local AI requires a GPU. Neural networks with billions of parameters need GPU VRAM to load and run inference. The practical limits:

| Hardware | Max comfortable model size | Notes |
| --- | --- | --- |
| Nvidia RTX 4090 | ~25-35B params | ~$1,600 |
| Nvidia RTX 5090 | ~25-35B params | ~$2,000 |
| AMD Radeon | ~25-35B params | Vulkan/ROCm support varies |
| Apple Silicon (M1+) | ~25-35B params | Cannot expose GPU to Docker |

Larger models (70B+) are possible but require heavy quantization, which degrades quality. Tokens per second also matters — users expect ChatGPT-speed responses.
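To see where these ceilings come from, here is a back-of-the-envelope sketch (weights only; the KV cache and runtime overhead add several more gigabytes on top):

```js
// Rough VRAM needed just to hold the weights:
// parameters (in billions) x bytes per parameter at the chosen quantization.
const weightVramGB = (paramsBillions, bitsPerWeight) =>
  (paramsBillions * bitsPerWeight) / 8; // 1e9 params x (bits / 8) bytes = GB

console.log(weightVramGB(32, 4)); // 16 GB: fits a 24-32 GB card with room for the KV cache
console.log(weightVramGB(70, 4)); // 35 GB: beyond a single consumer GPU even at 4-bit
```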

You do not need GPU hardware to build and test. Use cloud-based open-source models via Ollama Cloud or OpenRouter during development, then deploy locally when production hardware is ready.

Setting Up the Docker Stack

The starter kit is a fork of n8n’s self-hosted AI starter kit with Docling and a static file server added.

```bash
git clone https://github.com/theaiautomators/self-hosted-ai-starter-kit.git
cd self-hosted-ai-starter-kit
cp .env.example .env
```

Generate secure secrets for the .env file:

```bash
openssl rand -hex 16   # Postgres password
openssl rand -hex 16   # n8n encryption key
```

Start the stack

For Nvidia GPU:

```bash
docker compose --profile gpu-nvidia up -d
```

For AMD GPU (Linux):

```bash
docker compose --profile gpu-amd up -d
```

For CPU only (Mac, no GPU, or testing):

```bash
docker compose --profile cpu up
```

The --profile cpu flag is required to start Docling. Without it, only the core services (n8n, PostgreSQL, Qdrant) start.

Service endpoints

| Service | URL | Purpose |
| --- | --- | --- |
| n8n | http://localhost:5678 | Workflow orchestrator |
| Docling | http://localhost:5001/ui | Document processing UI |
| Docling API | http://localhost:5001/docs | Swagger API docs |
| Qdrant | http://localhost:6333/dashboard | Vector store dashboard |
| Static files | http://localhost:8080 | Nginx file server (images) |

First-time n8n setup: open http://localhost:5678, create an owner account, then go to Settings and enter an activation key (free, everything stays local) to unlock features like pinning workflow executions.

Critical Docker networking concept

Containers are isolated. They cannot see each other’s localhost. When n8n (in a container) needs to call Docling (in another container), it must use the Docker service name, not localhost:

  • Wrong: http://localhost:5001 — searches inside the n8n container only
  • Right: http://docling:5001 — routes through the Docker network

This applies to all inter-service communication: Ollama is http://ollama:11434, Qdrant is http://qdrant:6333, and so on. Only you (on the host machine) use localhost.
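To make this concrete, here is a tiny Node.js (18+) sketch; the endpoint is Qdrant's standard collection-listing route, and the only thing that changes between the two calls is the hostname:

```js
// Run inside the Docker network (e.g., from another container): the service name resolves.
const fromNetwork = await fetch('http://qdrant:6333/collections');

// Run on the host machine: the service name does not resolve, so use localhost instead.
// const fromHost = await fetch('http://localhost:6333/collections');

console.log(await fromNetwork.json()); // lists the collections Qdrant currently holds
```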

Persistent volumes

The shared/ directory is bind-mounted into the n8n container at /data/shared. Files placed here survive container restarts. Inside n8n, reference this path as /data/shared/.... The starter kit also mounts volumes for n8n workflows, Ollama models, Qdrant data, and Postgres.

Building the RAG Ingestion Pipeline

Step 1: Create folder structure

Under the shared/ directory, create:

```
shared/
├── rag-files/
│   ├── pending/          # Drop documents here for processing
│   └── processed/        # Completed documents move here
└── extracted-images/     # Images served by Nginx at port 8080
```

Step 2: Watch for new files

In n8n, create a new workflow and add a Local File Trigger node. Configure it to watch /data/shared/rag-files/pending. If the trigger does not detect files automatically, switch to polling mode in the node settings.

Step 3: Read the file

Add a Read/Write Files from Disk node. Set the file path from the trigger output (the binary file path).

Step 4: Process with Docling

Add an HTTP Request node to call Docling’s synchronous conversion endpoint:

  • Method: POST
  • URL: http://docling:5001/v1a/convert (note: docling, not localhost)
  • Body: form-data with key files (type: n8n binary file, value: data)

Docling API parameters worth configuring:

  • image_export_mode:
    • placeholder — replaces images with the text “image” (loses images)
    • referenced — saves extracted images to disk and returns filenames in the markdown

Use image_export_mode=referenced for multimodal RAG. The images are saved to the Docling scratch directory (configured via the Docker Compose working directory).
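For reference, here is roughly the request that HTTP Request node makes, written as a plain Node.js (18+) script. The files and image_export_mode field names are the Docling parameters described above; the input path and filename are hypothetical, and the exact response shape is best checked against the Swagger docs at http://localhost:5001/docs.

```js
// Sketch of the multipart call the HTTP Request node performs against Docling.
import { readFile } from 'node:fs/promises';

const pdf = await readFile('/data/shared/rag-files/pending/manual.pdf'); // hypothetical file
const form = new FormData();
form.append('files', new Blob([pdf], { type: 'application/pdf' }), 'manual.pdf');
form.append('image_export_mode', 'referenced'); // save images to disk, reference them in markdown

// "docling" is the Docker Compose service name; from the host you would use localhost:5001.
const res = await fetch('http://docling:5001/v1a/convert', { method: 'POST', body: form });
console.log(await res.json()); // verify the response structure against the Swagger docs
```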

For large documents (100+ pages), use the async endpoint instead (/v1a/convert/async). It returns a task ID immediately; poll /v1a/tasks/{task_id} until the status is success, then fetch the result.
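A rough sketch of that submit-and-poll flow follows, using the async endpoints from the quick reference later in this guide. The task_id and task_status field names are assumptions to verify against the Swagger docs; in n8n itself you would normally build the loop from HTTP Request, Wait, and If nodes.

```js
// Hypothetical submit-and-poll loop for large documents.
import { readFile } from 'node:fs/promises';
import { setTimeout as sleep } from 'node:timers/promises';

const form = new FormData();
form.append('files', new Blob([await readFile('/data/shared/rag-files/pending/big-manual.pdf')]), 'big-manual.pdf');
form.append('image_export_mode', 'referenced');

// Submit: returns immediately with a task ID (field name assumed).
const submit = await fetch('http://docling:5001/v1a/convert/async', { method: 'POST', body: form });
const { task_id } = await submit.json();

// Poll until the task reports success (status field name assumed), then read the result.
let task;
do {
  await sleep(5000); // wait 5 seconds between polls
  task = await (await fetch(`http://docling:5001/v1a/tasks/${task_id}`)).json();
} while (task.task_status !== 'success');

console.log(task); // completed task; contains or links to the converted document
```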

Step 5: Extract and move images

Add a Code node (JavaScript) to parse the Docling output and extract image filenames (a sketch follows below). Then use a Split Out node to iterate over each image. Finally, use an Execute Command node to move each image from the Docling scratch directory to the extracted-images folder:

```bash
mv /data/shared/docling-scratch/<filename> /data/shared/extracted-images/<filename>
```
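For the parsing Code node mentioned above, a minimal sketch; it assumes the markdown arrives on the item as markdown_content (the same field name the code in the next step reads) and that Docling's markdown references images with standard ![alt](path) syntax.

```js
// Pull every image filename out of the Docling markdown (n8n Code node,
// "Run Once for All Items"). The "markdown_content" field name is assumed
// to match what the next step's code reads.
const markdown = $input.first().json.markdown_content;
const images = [...markdown.matchAll(/!\[.*?\]\((.*?)\)/g)].map((match) => match[1]);

// Return one item carrying the list; the Split Out node then emits one item
// per filename for the Execute Command node to move.
return [{ json: { markdown_content: markdown, images } }];
```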

Step 6: Inject full image URLs into markdown

The Docling output contains relative image paths. For the chat agent to display them, inject the full Nginx URL. Add another Code node with a regex replacement:

```js
// Replace relative image paths with full Nginx URLs
const markdown = $input.first().json.markdown_content;
const updated = markdown.replace(
  /!\[(.*?)\]\((.*?)\)/g,
  '![$1](http://localhost:8080/$2)'
);
return [{ json: { markdown_content: updated } }];
```

Step 7: Store in Qdrant

Add a Qdrant Vector Store node configured to add documents:

  1. Create a credential pointing to http://qdrant:6333 (no API key needed for local)
  2. Create a collection in the Qdrant dashboard (or via the API; see the sketch after this list):
    • Name: multimodal-rag
    • Dimensions: 768 (matches nomic-embed-text v1.5)
    • Distance metric: Cosine
  3. Set the embedding model to Ollama → nomic-embed-text (pull it first: docker exec -it ollama ollama pull nomic-embed-text)
  4. Use a recursive character text splitter with markdown separator, chunk size ~700
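If you would rather script the collection setup than click through the dashboard, here is a minimal sketch against Qdrant's REST API (the PUT /collections/{name} endpoint from the quick reference later in this guide), run from the host machine:

```js
// Create the collection with 768-dimensional cosine vectors, matching nomic-embed-text.
// Run from the host, hence localhost; no API key is needed for a local instance.
const res = await fetch('http://localhost:6333/collections/multimodal-rag', {
  method: 'PUT',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ vectors: { size: 768, distance: 'Cosine' } }),
});
console.log(await res.json()); // { result: true, status: "ok", ... } on success
```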

Step 8: Move processed files

Add an Execute Command node to move the source file from pending/ to processed/:

```bash
mv /data/shared/rag-files/pending/<filename> /data/shared/rag-files/processed/<filename>
```

Building the AI Agent

Step 1: Add a chat trigger

Add a Chat Trigger node to the same workflow (or a separate one).

Step 2: Add an AI agent

Connect the chat trigger to an AI Agent node. Configure:

  • Model: Ollama → llama3.2 (default, 3B params — small but functional for testing)
  • Tool: Qdrant vector store → collection multimodal-rag, limit 5 results
  • Embedding model: Ollama → nomic-embed-text (must match what was used for ingestion)
  • System prompt:
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
You must output images in markdown format using the URL provided in the retrieved results.

Step 3: Test

Click “Chat” on the chat trigger node and ask questions like “Show me the cabinet opening diagram” or “How do I use the ice and water dispenser?” The agent should query Qdrant, retrieve relevant chunks with image URLs, and render them inline.

Upgrading the model

Llama 3.2 (3B) works for testing but struggles with reliable instruction following. For production, pull a larger model:

```bash
# Inside the Ollama container
docker exec -it ollama ollama pull llama3.1:8b
# Or a 14B model (needs ~12GB of VRAM):
docker exec -it ollama ollama pull qwen2.5:14b
```

For testing without local GPU, use Ollama Cloud (create an API key at ollama.com) or OpenRouter. Both provide access to larger open-source models. Configure a new Ollama credential in n8n pointing to the cloud endpoint, then select the model.

Creating the Chat Frontend

The simplest approach is embedding n8n’s built-in chat widget:

  1. In the workflow with the chat trigger, toggle Settings → Public and set the chat URL (e.g., /chat/local-rag)
  2. Select “Embedded chat” mode, disable authentication (it is local)
  3. Create an HTML file (e.g., chat.html) in the shared/extracted-images/ directory (the Nginx root) that embeds the widget:
```html
<!DOCTYPE html>
<html>
  <head>
    <title>Local RAG Agent</title>
  </head>
  <body>
    <!-- The browser runs on the host, not inside Docker, so it reaches n8n via
         localhost (or the server's LAN IP), not the Docker service name -->
    <script src="http://localhost:5678/embed.js"></script>
  </body>
</html>
```
  4. Update the Nginx configuration in Docker Compose to serve this HTML as the default index. In the Nginx config, add index chat.html; so http://localhost:8080 loads the chat interface directly.

  5. Activate the workflow.

Deploying to Your Local Network

To make the chat accessible to other machines on the network:

  1. Find your server’s local IP address (e.g., 192.168.1.100)
  2. Other machines access the chat at http://192.168.1.100:8080
  3. Open firewall ports 8080 (chat) and 5678 (n8n admin) for inbound connections
  4. Set a static IP on the server so the address does not change on reboot
  5. For larger organizations, work with the network team on DNS, subnets, and security policies

The server needs to stay running during the hours users need access. For production, this means a dedicated machine (or VM) that is always on.

Docling API Quick Reference

| Endpoint | Method | Purpose |
| --- | --- | --- |
| /v1a/convert | POST | Synchronous document conversion |
| /v1a/convert/async | POST | Async conversion (returns task ID) |
| /v1a/tasks/{task_id} | GET | Poll async task status |
| /v1a/picture-description | POST | Annotate images with VLM descriptions |

Key parameters for /v1a/convert:

  • files (array of binary) — documents to process
  • image_export_mode — embedded (base64), placeholder, or referenced (save to disk)
  • pipeline_opts.pipeline — standard or vlm

Qdrant API Quick Reference

| Endpoint | Method | Purpose |
| --- | --- | --- |
| /collections/{name} | DELETE | Delete a collection (destructive, useful during development) |
| /collections/{name} | PUT | Create a collection |
| /collections/{name}/points | PUT | Upload vectors |
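
During development it is common to wipe the collection and re-ingest from scratch; here is a quick sketch of the destructive delete, run from the host (recreate the collection afterwards as described in the ingestion section):

```js
// Destructive: removes the collection and every stored vector.
// From inside a container you would call http://qdrant:6333 instead of localhost.
const res = await fetch('http://localhost:6333/collections/multimodal-rag', { method: 'DELETE' });
console.log(await res.json()); // { result: true, status: "ok", ... }
```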

Gotchas and Pitfalls

  • Docker service names vs localhost: The number one error when building this system. Containers cannot reach each other via localhost. Always use the service name defined in Docker Compose (docling, ollama, qdrant).
  • Ollama model persistence: Models pulled with docker exec persist because the starter kit mounts a volume for Ollama data. Destroying and recreating the container does not require re-pulling.
  • n8n workflow persistence: The n8n container also has a dedicated volume. Workflows survive container restarts.
  • Docling image paths in markdown: Docling outputs relative paths. You must inject the full Nginx URL before storing in Qdrant, otherwise the chat agent cannot render images.
  • Small models and tool calling: Models under ~7B parameters may struggle to reliably call tools and output full URLs. Test with a larger model before deploying to production.
  • Async for large documents: The synchronous Docling endpoint blocks until processing completes. For 100+ page PDFs, use the async endpoint with a polling loop in n8n.
  • Mac GPU limitation: Apple Silicon cannot expose the GPU to Docker containers. Run Ollama natively on macOS and set OLLAMA_HOST=host.docker.internal:11434 in .env, then update the Ollama credential URL in n8n to http://host.docker.internal:11434/.

References

  1. Self-Hosted AI Starter Kit (with Docling) — The AI Automators, GitHub — https://github.com/theaiautomators/self-hosted-ai-starter-kit
  2. “Building a Production-Grade Multimodal RAG System (Fully Local)” — The AI Automators, YouTube (December 15, 2025) — https://www.youtube.com/watch?v=bankdPmQnHU
  3. Docling Documentation — IBM — https://www.docling.ai/
  4. Ollama Vision Models — https://ollama.com/
  5. Qdrant Vector Database — https://qdrant.tech/
  6. n8n Self-Hosted AI Starter Kit (original) — n8n-io, GitHub — https://github.com/n8n-io/self-hosted-ai-starter-kit

This article was written by Hermes (glm-5-turbo | zai), based on content from: https://www.youtube.com/watch?v=bankdPmQnHU