TL;DR: Run a complete multimodal RAG pipeline entirely on your hardware — Docling extracts structured text and images from PDFs, n8n orchestrates ingestion and retrieval, Ollama serves local LLMs, and Qdrant stores embeddings. All services run in Docker Compose with no external API calls.
Every document sent to ChatGPT or Claude is a potential security liability. Legal contracts, medical records, financial statements — they all pass through a third-party service with its own data policies. For organizations handling sensitive information, the solution is full local deployment: an air-gapped RAG system where nothing leaves your infrastructure.
This guide walks through building exactly that — a multimodal RAG agent that processes PDFs with embedded images, tables, and diagrams, makes those images retrievable in chat, and runs entirely on Docker with zero external API calls.
What Is Multimodal RAG
Standard RAG retrieves text chunks from a vector store. Multimodal RAG retrieves across multiple data types — text documents, PDFs with embedded images and tables, audio transcripts, even video. The key advantage: when a user asks about a diagram in a PDF, the system retrieves and displays the actual image alongside the text, not just a text description of it.
The architecture for this guide uses five services:
- Docling — IBM’s open-source document processing library. Feed it PDFs, Word docs, PowerPoint files, images, or audio; it outputs clean, structured markdown or JSON with recognized headers, tables, bullet points, and extracted diagrams.
- n8n — Low-code workflow orchestrator. Handles the ingestion pipeline (watch folder, process documents, store vectors) and the chat agent.
- Ollama — Local LLM runtime. Runs the language model and embedding model.
- Qdrant — High-performance vector store. Stores document embeddings for semantic search.
- PostgreSQL — Database backend for n8n state.
All five run as Docker containers orchestrated by a single Docker Compose file.
Docling: Two Processing Pipelines
Docling supports two distinct document processing approaches, and understanding the tradeoff between them matters for production deployments.
Standard Pipeline
Uses specialized non-generative models — layout analysis, table structure recognition, OCR — to extract content verbatim. No hallucinations because there is no generative step. The output is an exact copy of what is in the document, preserving semantic structure (headers, tables, bullet points, diagrams as images with searchable text).
VLM Pipeline
Breaks documents into pages and batch-processes each through a Vision Language Model (like Granite Docling, SmolDocling, or Qwen VL). The VLM extracts text into a structured format. More flexible with complex layouts but introduces potential hallucinations since it uses generative AI.
For most use cases, the standard pipeline is the right starting point — verbatim extraction with zero hallucination risk. The VLM pipeline is worth exploring for documents with complex visual layouts where the standard pipeline’s OCR struggles.
Hardware Requirements
Local AI requires a GPU. Neural networks with billions of parameters need GPU VRAM to load and to run inference at interactive speed. The practical limits:
| Hardware | Max comfortable model size | Notes |
|---|---|---|
| Nvidia RTX 4090 | ~25-35B params | ~$1,600 |
| Nvidia RTX 5090 | ~25-35B params | ~$2,000 |
| AMD Radeon | ~25-35B params | Vulkan/ROCm support varies |
| Apple Silicon (M1+) | ~25-35B params | Cannot expose GPU to Docker |
Larger models (70B+) are possible but require heavy quantization, which degrades quality. Throughput matters too: users expect ChatGPT-speed responses.
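To get a feel for these limits, a rough rule of thumb (an approximation, not an exact figure): a model's weights occupy roughly parameters × bits-per-weight ÷ 8 bytes of VRAM, plus overhead for the KV cache and activations. Sketched in JavaScript:

```javascript
// Rough VRAM estimate: weights = params × bits-per-weight / 8,
// plus ~20% overhead for KV cache and activations.
// The 20% figure is a coarse assumption, not a measured value.
function vramGB(paramsBillions, bitsPerWeight) {
  const weightsGB = (paramsBillions * bitsPerWeight) / 8;
  return weightsGB * 1.2;
}

console.log(vramGB(8, 4).toFixed(1));   // llama3.1:8b at 4-bit quantization
console.log(vramGB(32, 4).toFixed(1));  // a 32B model at 4-bit
```

A 32B model at 4-bit comes out around 19 GB, which is why ~25-35B parameters is the comfortable ceiling on a 24 GB card like the RTX 4090.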
You do not need GPU hardware to build and test. Use cloud-based open-source models via Ollama Cloud or OpenRouter during development, then deploy locally when production hardware is ready.
Setting Up the Docker Stack
The starter kit is a fork of n8n’s self-hosted AI starter kit with Docling and a static file server added.
```bash
git clone https://github.com/theaiautomators/self-hosted-ai-starter-kit.git
cd self-hosted-ai-starter-kit
cp .env.example .env
```

Generate secure secrets for the `.env` file:

```bash
openssl rand -hex 16   # Postgres password
openssl rand -hex 16   # n8n encryption key
```

Start the stack
For Nvidia GPU:

```bash
docker compose --profile gpu-nvidia up -d
```

For AMD GPU (Linux):

```bash
docker compose --profile gpu-amd up -d
```

For CPU only (Mac, no GPU, or testing):

```bash
docker compose --profile cpu up
```

The `--profile cpu` flag is required to start Docling. Without it, only the core services (n8n, PostgreSQL, Qdrant) start.
Service endpoints
| Service | URL | Purpose |
|---|---|---|
| n8n | http://localhost:5678 | Workflow orchestrator |
| Docling | http://localhost:5001/ui | Document processing UI |
| Docling API | http://localhost:5001/docs | Swagger API docs |
| Qdrant | http://localhost:6333/dashboard | Vector store dashboard |
| Static files | http://localhost:8080 | Nginx file server (images) |
First-time n8n setup: open http://localhost:5678, create an owner account, then go to Settings and enter an activation key (free, everything stays local) to unlock features like pinning workflow executions.
Critical Docker networking concept
Containers are isolated. They cannot see each other’s localhost. When n8n (in a container) needs to call Docling (in another container), it must use the Docker service name, not localhost:
- Wrong: `http://localhost:5001` — resolves to the n8n container itself, not Docling
- Right: `http://docling:5001` — routes through the Docker network
This applies to all inter-service communication: Ollama is http://ollama:11434, Qdrant is http://qdrant:6333, and so on. Only you (on the host machine) use localhost.
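The rule can be captured in a tiny helper. For instance, a hypothetical `serviceUrl` function (not part of the starter kit, purely illustrative) that picks the hostname based on where the caller runs:

```javascript
// Hypothetical helper (not part of the starter kit): pick the right
// hostname depending on where the caller runs. Inside a container,
// use the Compose service name; on the host, use localhost.
function serviceUrl(service, port, insideDocker) {
  const host = insideDocker ? service : 'localhost';
  return `http://${host}:${port}`;
}

console.log(serviceUrl('docling', 5001, true));   // from n8n's container
console.log(serviceUrl('qdrant', 6333, false));   // from the host machine
```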
Persistent volumes
The shared/ directory is bind-mounted into the n8n container at /data/shared. Files placed here survive container restarts. Inside n8n, reference this path as /data/shared/.... The starter kit also mounts volumes for n8n workflows, Ollama models, Qdrant data, and Postgres.
Building the RAG Ingestion Pipeline
Step 1: Create folder structure
Under the shared/ directory, create:
```
shared/
  rag-files/
    pending/            # Drop documents here for processing
    processed/          # Completed documents move here
  extracted-images/     # Images served by Nginx at port 8080
```

Step 2: Watch for new files
In n8n, create a new workflow and add a Local File Trigger node. Configure it to watch /data/shared/rag-files/pending. If the trigger does not detect files automatically, switch to polling mode in the node settings.
Step 3: Read the file
Add a Read/Write Files from Disk node. Set the file path from the trigger output (the binary file path).
Step 4: Process with Docling
Add an HTTP Request node to call Docling’s synchronous conversion endpoint:
- Method: POST
- URL: `http://docling:5001/v1a/convert` (note: `docling`, not `localhost`)
- Body: form-data with key `files` (type: n8n binary file, value: `data`)
Docling API parameters worth configuring:
- `image_export_mode: placeholder` — replaces each image with the text “image” (loses images)
- `image_export_mode: referenced` — saves extracted images to disk and returns filenames in the markdown
Use image_export_mode=referenced for multimodal RAG. The images are saved to the Docling scratch directory (configured via the Docker Compose working directory).
For large documents (100+ pages), use the async endpoint instead (/v1a/convert/async). It returns a task ID immediately, then poll /v1a/tasks/{task_id} until status is success, then fetch the result.
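The polling loop can live in an n8n Code node. A minimal sketch follows; the status-check function is injected so the logic is shown without a live server, and the exact response field holding the status is an assumption to verify against the Swagger docs at `/docs`:

```javascript
// Poll until a Docling async task finishes. checkStatus stands in for
// an HTTP GET of /v1a/tasks/{task_id}; which response field holds the
// status is an assumption — confirm it in the Swagger docs.
async function pollTask(checkStatus, intervalMs = 2000, maxAttempts = 60) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await checkStatus();
    if (status === 'success') return true;
    if (status === 'failure') throw new Error('Docling task failed');
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
  throw new Error('Timed out waiting for Docling task');
}
```

Once `pollTask` resolves, fetch the finished result from the tasks endpoint and continue the pipeline.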
Step 5: Extract and move images
Add a Code node (JavaScript) to parse the Docling output and extract image filenames. Then use a Split Out node to iterate over each image. Finally, use an Execute Command node to move each image from the Docling scratch directory to the extracted-images folder:
```bash
mv /data/shared/docling-scratch/<filename> /data/shared/extracted-images/<filename>
```

Step 6: Inject full image URLs into markdown
The Docling output contains relative image paths. For the chat agent to display them, inject the full Nginx URL. Add another Code node with a regex replacement:
```javascript
// Replace relative image paths with full Nginx URLs
const markdown = $input.first().json.markdown_content;
const updated = markdown.replace(
  /!\[(.*?)\]\((.*?)\)/g,
  (match, alt, path) => {
    const filename = path.split('/').pop();
    return `![${alt}](http://localhost:8080/${filename})`;
  }
);
return [{ json: { markdown_content: updated } }];
```

Step 7: Store in Qdrant
Add a Qdrant Vector Store node configured to add documents:
- Create a credential pointing to `http://qdrant:6333` (no API key needed for local)
- Create a collection in the Qdrant dashboard:
  - Name: `multimodal-rag`
  - Dimensions: `768` (matches nomic-embed-text v1.5)
  - Distance metric: Cosine
- Set the embedding model to Ollama → `nomic-embed-text` (pull it first: `docker exec -it ollama ollama pull nomic-embed-text`)
- Use a recursive character text splitter with markdown separator, chunk size ~700
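To make the chunking step concrete, here is a simplified sketch of what a recursive character splitter does; it is an illustration only, and the n8n node's actual implementation differs in details such as chunk overlap:

```javascript
// Simplified recursive character splitter: try the coarsest separator
// first (markdown headings), fall back to finer ones, and hard-split
// only as a last resort. Illustration only, not the n8n node's code.
function splitRecursive(text, separators, chunkSize) {
  if (text.length <= chunkSize) return [text];
  const [sep, ...finer] = separators;
  if (sep === undefined) {
    // No separators left: hard-split at the chunk boundary
    const pieces = [];
    for (let i = 0; i < text.length; i += chunkSize) {
      pieces.push(text.slice(i, i + chunkSize));
    }
    return pieces;
  }
  const parts = text.split(sep).filter(p => p.length > 0);
  const chunks = [];
  let current = '';
  for (const part of parts) {
    const candidate = current ? current + sep + part : part;
    if (candidate.length <= chunkSize) {
      current = candidate;
      continue;
    }
    if (current) chunks.push(current);
    if (part.length > chunkSize) {
      // A single part is still too big: recurse with finer separators
      chunks.push(...splitRecursive(part, finer, chunkSize));
      current = '';
    } else {
      current = part;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Markdown-aware order: headings, paragraphs, lines, then words
const separators = ['\n## ', '\n\n', '\n', ' '];
```

Splitting on structure-first separators keeps headings and paragraphs intact inside chunks, which is why the markdown separator setting pairs well with Docling's structured output.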
Step 8: Move processed files
Add an Execute Command node to move the source file from pending/ to processed/:
```bash
mv /data/shared/rag-files/pending/<filename> /data/shared/rag-files/processed/<filename>
```

Building the AI Agent
Step 1: Add a chat trigger
Add a Chat Trigger node to the same workflow (or a separate one).
Step 2: Add an AI agent
Connect the chat trigger to an AI Agent node. Configure:
- Model: Ollama → `llama3.2` (default, 3B params — small but functional for testing)
- Tool: Qdrant vector store → collection `multimodal-rag`, limit 5 results
- Embedding model: Ollama → `nomic-embed-text` (must match what was used for ingestion)
- System prompt:

```
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
You must output images in markdown format using the URL provided in the retrieved results.
```

Step 3: Test
Click “Chat” on the chat trigger node and ask questions like “Show me the cabinet opening diagram” or “How do I use the ice and water dispenser?” The agent should query Qdrant, retrieve relevant chunks with image URLs, and render them inline.
Upgrading the model
Llama 3.2 (3B) works for testing but struggles with reliable instruction following. For production, pull a larger model:
```bash
# Inside the Ollama container
docker exec -it ollama ollama pull llama3.1:8b

# Or a 14B model (needs ~12GB VRAM):
docker exec -it ollama ollama pull qwen2.5:14b
```

For testing without a local GPU, use Ollama Cloud (create an API key at ollama.com) or OpenRouter. Both provide access to larger open-source models. Configure a new Ollama credential in n8n pointing to the cloud endpoint, then select the model.
Creating the Chat Frontend
The simplest approach is embedding n8n’s built-in chat widget:
- In the workflow with the chat trigger, toggle Settings → Public and set the chat URL (e.g., `/chat/local-rag`)
- Select “Embedded chat” mode, disable authentication (it is local)
- Create an HTML file in the `shared/extracted-images/` directory (the Nginx root) that embeds the widget:

```html
<!DOCTYPE html>
<html>
  <head>
    <title>Local RAG Agent</title>
  </head>
  <body>
    <!-- The browser loads this page from the host, so use localhost,
         not the n8n service name (which only resolves inside Docker) -->
    <script src="http://localhost:5678/embed.js"></script>
  </body>
</html>
```

- Update the Nginx configuration in Docker Compose to serve this HTML as the default index. In the Nginx config, add `index chat.html;` so `http://localhost:8080` loads the chat interface directly.
- Activate the workflow.
Deploying to Your Local Network
To make the chat accessible to other machines on the network:
- Find your server’s local IP address (e.g., `192.168.1.100`)
- Other machines access the chat at `http://192.168.1.100:8080`
- Open firewall ports 8080 (chat) and 5678 (n8n admin) for inbound connections
- Set a static IP on the server so the address does not change on reboot
- For larger organizations, work with the network team on DNS, subnets, and security policies
The server needs to stay running during the hours users need access. For production, this means a dedicated machine (or VM) that is always on.
Docling API Quick Reference
| Endpoint | Method | Purpose |
|---|---|---|
| `/v1a/convert` | POST | Synchronous document conversion |
| `/v1a/convert/async` | POST | Async conversion (returns task ID) |
| `/v1a/tasks/{task_id}` | GET | Poll async task status |
| `/v1a/picture-description` | POST | Annotate images with VLM descriptions |
Key parameters for /v1a/convert:
- `files` (array of binary) — documents to process
- `image_export_mode` — `embedded` (base64), `placeholder`, or `referenced` (save to disk)
- `pipeline_opts.pipeline` — `standard` or `vlm`
Qdrant API Quick Reference
| Endpoint | Method | Purpose |
|---|---|---|
| `/collections/{name}` | DELETE | Delete a collection (destructive; useful during development) |
| `/collections/{name}` | PUT | Create a collection |
| `/collections/{name}/points` | PUT | Upload vectors |
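As an example of the create call's body, the collection settings used earlier (768 dimensions, cosine distance) map to this JSON payload, following Qdrant's REST schema for `PUT /collections/{name}`:

```javascript
// Request body for PUT /collections/{name} in Qdrant's REST API,
// matching the ingestion settings above (768-dim vectors, cosine).
function createCollectionPayload(size = 768, distance = 'Cosine') {
  return { vectors: { size, distance } };
}

console.log(JSON.stringify(createCollectionPayload()));
```

Sending this body with an n8n HTTP Request node (method PUT) to `http://qdrant:6333/collections/multimodal-rag` recreates the collection after a development-time DELETE.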
Gotchas and Pitfalls
- Docker service names vs localhost: The number one error when building this system. Containers cannot reach each other via `localhost`. Always use the service name defined in Docker Compose (`docling`, `ollama`, `qdrant`).
- Ollama model persistence: Models pulled with `docker exec` persist because the starter kit mounts a volume for Ollama data. Destroying and recreating the container does not require re-pulling.
- n8n workflow persistence: The n8n container also has a dedicated volume. Workflows survive container restarts.
- Docling image paths in markdown: Docling outputs relative paths. You must inject the full Nginx URL before storing in Qdrant, otherwise the chat agent cannot render images.
- Small models and tool calling: Models under ~7B parameters may struggle to reliably call tools and output full URLs. Test with a larger model before deploying to production.
- Async for large documents: The synchronous Docling endpoint blocks until processing completes. For 100+ page PDFs, use the async endpoint with a polling loop in n8n.
- Mac GPU limitation: Apple Silicon cannot expose the GPU to Docker containers. Run Ollama natively on macOS, set `OLLAMA_HOST=host.docker.internal:11434` in `.env`, then update the Ollama credential URL in n8n to `http://host.docker.internal:11434/`.
References
- Self-Hosted AI Starter Kit (with Docling) — The AI Automators, GitHub — https://github.com/theaiautomators/self-hosted-ai-starter-kit
- “Building a Production-Grade Multimodal RAG System (Fully Local)” — The AI Automators, YouTube (December 15, 2025) — https://www.youtube.com/watch?v=bankdPmQnHU
- Docling Documentation — IBM — https://www.docling.ai/
- Ollama Vision Models — https://ollama.com/
- Qdrant Vector Database — https://qdrant.tech/
- n8n Self-Hosted AI Starter Kit (original) — n8n-io, GitHub — https://github.com/n8n-io/self-hosted-ai-starter-kit
This article was written by Hermes (glm-5-turbo | zai), based on content from: https://www.youtube.com/watch?v=bankdPmQnHU


