Permission-Aware RAG: Hybrid Retrieval with Secure Filtering

5 min read · rag · self-hosting

TL;DR: Permission filtering must happen inside the retrieval query, not after. Resolve user permissions once, pass them as filter predicates to both vector and BM25 indexes, merge results with Reciprocal Rank Fusion, and only send authorized chunks to the LLM.

Knowledge bases that serve multiple users or teams must answer two questions simultaneously: what is relevant and what is this user allowed to see. Getting either wrong means either poor answers or data leaks. This post lays out a concrete architecture for building a permission-aware RAG system with hybrid retrieval — vector similarity plus BM25 keyword search — where authorization is baked into the retrieval layer, not bolted on after.

Architecture Overview

flowchart TD
  A[Client] --> B[API Layer]
  B --> C[Identity Provider]
  B --> D[Authorization Service]
  B --> E[RAG Service]
  E --> D
  E --> F[Retrieval Layer]
  F --> G[(Vector Database)]
  F --> H[(Metadata Database)]
  H --> I[(Object Storage)]
  J[Documents] --> K[Parser]
  K --> L[Ingestion Processor]
  L --> G
  L --> H

The system separates concerns into four layers:

  • Identity — who the user is (handled by the identity provider)
  • Authorization — what they can access (resolved once per request)
  • Retrieval — what is relevant (vector + BM25 with permission filters)
  • LLM — how to answer (receives only authorized, relevant chunks)

Each layer is independently replaceable. Swap Qdrant for Pinecone, Ollama for OpenAI, or the identity provider for a different one — the retrieval logic stays the same.

Core Concept — Permission-Aware Retrieval

The single most important rule:

Permission filtering must happen inside retrieval, not after.

Filtering after retrieval is the most common security mistake in RAG systems. If you retrieve 50 chunks and then filter out 30 based on permissions, you have already leaked information (the system's behavior now depends on chunks the user is not allowed to see, which can surface through scores, timing, or the final answer) and reduced answer quality (fewer chunks reach the LLM than your retrieval budget intended).

Correct flow

1. Resolve permissions → user subjects
2. Apply filter in database query
3. Retrieve relevant chunks (already filtered)
4. Send to LLM
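
The whole flow fits in a few lines of orchestration code. A minimal sketch, where resolve_subjects, retrieve_chunks, and generate_answer are hypothetical wrappers around the authorization service, the retrieval layer, and the LLM:

def answer_query(user_id: str, app_id: str, query: str) -> str:
    # 1. Resolve permissions once, up front.
    subjects = resolve_subjects(user_id)  # e.g. ["user:123", "role:finance-admin"]
    # 2 + 3. The filter travels inside the database query, so only
    #        authorized chunks are ever retrieved and ranked.
    chunks = retrieve_chunks(query, app_id=app_id, subjects=subjects, limit=10)
    # 4. The LLM sees nothing the user is not allowed to see.
    return generate_answer(query, context=chunks)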

Permission resolution

When a user makes a request, the authorization service resolves their access into a flat list of subjects:

{
  "user_subjects": [
    "user:123",
    "role:finance-admin",
    "set:billing"
  ]
}

These subjects are passed directly to the retrieval layer as filter predicates. The retrieval layer never needs to call the authorization service again — it just checks whether any of the chunk’s permission tags intersect with the user’s subjects.
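
Conceptually the check is a plain set intersection. In production it is expressed as a database filter (shown in the sections below), but a short sketch makes the semantics explicit:

def chunk_is_visible(permission_tags: list[str], user_subjects: list[str]) -> bool:
    # True if the chunk carries at least one tag the user holds.
    return bool(set(permission_tags) & set(user_subjects))

chunk_is_visible(["role:finance-admin", "set:billing"], ["user:123", "role:finance-admin"])  # True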

Data Model

Each chunk stored in the system carries metadata required for filtering:

{
  "id": "chunk_123",
  "content": "Q3 revenue was $4.2M, a 12% increase...",
  "embedding": [0.012, -0.034, 0.056, "..."],
  "app_id": "app_1",
  "document_id": "doc_456",
  "permission_tags": ["role:admin", "set:finance"]
}

Design rules

  • No joins at query time — all filter data lives on the chunk itself
  • Permissions are flattened at ingestion — the ingestion pipeline resolves document-level permissions into chunk-level tags
  • Each chunk has a single permission scope — never mix chunks with different permission requirements in the same retrieval result

The app_id field provides tenant isolation. Every query filters on it first, which reduces the search space before any vector computation happens.
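
For reference, the same record as a Python dataclass (a sketch; ChunkRecord is a hypothetical name and the fields mirror the JSON above):

from dataclasses import dataclass, field

@dataclass
class ChunkRecord:
    id: str
    content: str
    embedding: list[float]
    app_id: str                       # tenant isolation, filtered first on every query
    document_id: str
    permission_tags: list[str] = field(default_factory=list)  # flattened at ingestion time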

Retrieval Flow

flowchart TD
  A[User Query] --> B[API]
  B --> C[Authorization]
  C --> D[User Subjects]
  B --> E[RAG Service]
  D --> E
  E --> F[Retrieval Layer]
  F --> G[Filter: app_id + permission_tags]
  G --> H[Search]
  H --> I[Top-K Chunks]
  I --> J[LLM]
  J --> K[Response]

The retrieval layer receives the user’s subjects and the query. It constructs a database filter that requires:

  1. app_id matches the requesting tenant
  2. permission_tags has at least one overlap with user_subjects

Only chunks passing both filters enter the similarity ranking. This means the vector index never sees unauthorized chunks, and the LLM never receives them.

Hybrid Search — Vector + BM25

Pure vector search misses exact term matches. Pure keyword search misses semantic similarity. Hybrid search runs both in parallel and merges the results.

Why both

| Query type | Vector search | BM25 search |
| --- | --- | --- |
| "quarterly revenue" (semantic) | Strong | Weak |
| "ACME-4421" (exact term) | Weak | Strong |
| "how to configure SSO" (semantic + keywords) | Medium | Medium |

Neither alone covers all query patterns. Running both and merging gives you the best of both worlds.

Parallel retrieval

flowchart TD
  A[Query] --> B[RAG Service]
  B --> C[Authorization]
  C --> D[User Subjects]
  B --> E[Hybrid Retrieval]
  D --> E
  E --> F[Vector Search]
  E --> G[BM25 Search]
  F --> H[(Vector DB)]
  G --> I[(Metadata DB)]
  H --> J[Merge + Rank]
  I --> J
  J --> K[Top Results]
  K --> L[LLM]

Both searches receive the same permission filter. Both run against the same reduced dataset (after app_id and permission filtering). The results are merged using a ranking strategy.
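
A sketch of the fan-out, assuming hypothetical async vector_search and bm25_search helpers that each receive the same app_id and subject filter, and the reciprocal_rank_fusion function defined in the next section:

import asyncio

async def hybrid_retrieve(query: str, app_id: str, subjects: list[str], limit: int = 20) -> list[str]:
    # Both searches carry the identical permission filter and run concurrently.
    vector_results, bm25_results = await asyncio.gather(
        vector_search(query, app_id=app_id, subjects=subjects, limit=limit),
        bm25_search(query, app_id=app_id, subjects=subjects, limit=limit),
    )
    # Merge by rank position rather than raw score.
    return reciprocal_rank_fusion(vector_results, bm25_results, limit=10)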

Ranking Strategy — Reciprocal Rank Fusion

Two common approaches for merging vector and BM25 results:

Weighted scoring

score = 0.6 * vector_score + 0.4 * bm25_score

Simple but fragile — vector and BM25 scores are on different scales, so you need to normalize them first, and the optimal weights change per dataset.

Reciprocal Rank Fusion

score = 1 / (k + rank_vector) + 1 / (k + rank_bm25)

Where k is a constant (typically 60). This uses only the rank position of each result, not the raw score. Three advantages:

  • No normalization needed — ranks are already on the same scale
  • Stable across datasets — no per-dataset weight tuning
  • Widely used in production — proven in information retrieval research

Implementation in Python:

def reciprocal_rank_fusion(
    vector_results: list[str],
    bm25_results: list[str],
    k: int = 60,
    limit: int = 10,
) -> list[str]:
    # Each input list is ordered best-first; only the rank position matters.
    scores: dict[str, float] = {}
    for rank, doc_id in enumerate(vector_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    for rank, doc_id in enumerate(bm25_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    # Documents that rank high in both lists accumulate the largest scores.
    return sorted(scores, key=scores.get, reverse=True)[:limit]
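
Example usage with two hypothetical result lists:

vector_hits = ["chunk_7", "chunk_3", "chunk_9"]   # ordered by vector similarity
bm25_hits = ["chunk_3", "chunk_12", "chunk_7"]    # ordered by BM25 score

reciprocal_rank_fusion(vector_hits, bm25_hits, limit=3)
# -> ["chunk_3", "chunk_7", "chunk_12"]; the chunks both lists agree on rise to the top.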

Indexing Strategy

Efficient indexing is what makes permission-aware retrieval fast at scale. The key principle:

Reduce the dataset before vector ranking.

Apply app_id and permission_tags filters first (B-tree and inverted index lookups, microseconds), then run vector similarity on the reduced set. This avoids computing embedding similarity against chunks the user cannot see.

Required indexes

| Index Type | Purpose | Speed |
| --- | --- | --- |
| Vector index (HNSW or IVF) | Semantic similarity search | Milliseconds |
| B-tree on app_id | Tenant isolation filter | Microseconds |
| Inverted index on permission_tags | Authorization filter | Microseconds |
| Full-text index on content | BM25 keyword search | Milliseconds |

The B-tree and inverted index filters run first, reducing the candidate set. Then vector search and BM25 run on the reduced set in parallel. This ordering is critical for performance at scale.
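
In Qdrant, which the concrete setup below uses, the role of the B-tree and inverted indexes is played by keyword payload indexes on the filter fields. A sketch with the qdrant-client Python package (the collection name chunks and the local URL are assumptions):

from qdrant_client import QdrantClient
from qdrant_client.models import PayloadSchemaType

client = QdrantClient(url="http://localhost:6333")  # placeholder; point at your own instance

# Keyword payload indexes make app_id and permission_tags cheap to filter on,
# so the candidate set shrinks before any vector scoring happens.
client.create_payload_index("chunks", field_name="app_id", field_schema=PayloadSchemaType.KEYWORD)
client.create_payload_index("chunks", field_name="permission_tags", field_schema=PayloadSchemaType.KEYWORD)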

Concrete setup with Qdrant + PostgreSQL

A practical stack for this architecture:

  • Qdrant — vector search with payload filtering. Supports filtering on app_id and permission_tags directly in the vector query using must clauses.
  • PostgreSQL with pgvector — metadata storage and BM25-style keyword ranking via tsvector columns. The same app_id and permission_tags columns are indexed with B-tree and GIN indexes (index setup sketched after this list).
  • Object storage (S3/MinIO) — original documents and extracted images.
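
On the PostgreSQL side, a sketch of the supporting indexes, assuming a chunks table with the columns from the data model plus a content_tsv tsvector column and an embedding vector column, pgvector 0.5+ for HNSW, and the psycopg driver (connection string is a placeholder):

import psycopg

INDEX_STATEMENTS = [
    # B-tree for the tenant filter
    "CREATE INDEX IF NOT EXISTS chunks_app_id_idx ON chunks (app_id)",
    # GIN (inverted) index for the permission-tag overlap filter
    "CREATE INDEX IF NOT EXISTS chunks_permission_tags_idx ON chunks USING gin (permission_tags)",
    # GIN index on the tsvector column for keyword search
    "CREATE INDEX IF NOT EXISTS chunks_content_tsv_idx ON chunks USING gin (content_tsv)",
    # HNSW index for vector similarity (pgvector)
    "CREATE INDEX IF NOT EXISTS chunks_embedding_idx ON chunks USING hnsw (embedding vector_cosine_ops)",
]

with psycopg.connect("postgresql://localhost/rag") as conn:
    for statement in INDEX_STATEMENTS:
        conn.execute(statement)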

Qdrant filter example:

{
  "filter": {
    "must": [
      { "key": "app_id", "match": { "value": "app_1" } },
      {
        "key": "permission_tags",
        "match": { "any": ["role:admin", "set:finance"] }
      }
    ]
  },
  "limit": 20
}
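
The same filter expressed through the Python client (a sketch; chunks is an assumed collection name and embed_query is a hypothetical helper that returns the query embedding):

from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchAny, MatchValue

client = QdrantClient(url="http://localhost:6333")
query_vector = embed_query("quarterly revenue")  # hypothetical helper

hits = client.search(
    collection_name="chunks",
    query_vector=query_vector,
    query_filter=Filter(
        must=[
            FieldCondition(key="app_id", match=MatchValue(value="app_1")),
            FieldCondition(key="permission_tags", match=MatchAny(any=["role:admin", "set:finance"])),
        ]
    ),
    limit=20,
)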

PostgreSQL BM25 with the same filter:

SELECT id, content, ts_rank_cd(content_tsv, query) AS rank
FROM chunks,
     plainto_tsquery('english', 'quarterly revenue') AS query  -- the raw user query as a tsquery
WHERE app_id = 'app_1'
  AND permission_tags && ARRAY['role:admin', 'set:finance']
  AND content_tsv @@ query
ORDER BY rank DESC
LIMIT 20;

Both queries apply the same permission filter before ranking.

Ingestion Pipeline

The ingestion pipeline is where permissions get flattened onto chunks. This is a precomputation step — the query path never resolves permissions from scratch.

flowchart TD
  J[Documents] --> K[Parser]
  K --> L[Ingestion Processor]
  L --> M[Resolve Permissions]
  M --> N[Attach Tags to Chunks]
  N --> O[Generate Embeddings]
  O --> P[Store in Vector DB]
  O --> Q[Store in Metadata DB]

Step-by-step

  1. Parse documents — extract text, tables, images (Docling, Unstructured, or custom parser)
  2. Normalize text — clean whitespace, fix encoding, strip boilerplate headers/footers
  3. Chunk content — recursive character splitting with markdown awareness, chunk size 500-1000 tokens with 100-token overlap
  4. Resolve permissions — look up the document’s access control list and flatten it into a list of subject tags (roles, groups, user IDs)
  5. Attach permission tags — write the resolved tags onto every chunk from this document
  6. Generate embeddings — embed each chunk using the same model used at query time
  7. Store in both databases — vector embeddings in Qdrant, full text + metadata in PostgreSQL

The permission resolution step is the critical one. If a document is visible to role:finance and user:alice, every chunk from that document gets permission_tags: ["role:finance", "user:alice"]. When user:bob (who has role:finance) queries, the filter matches on the shared tag.
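
A sketch of the flattening step, where get_document_acl is a hypothetical helper returning the document's access list:

def flatten_permissions(document_id: str) -> list[str]:
    # Resolve the document ACL into flat subject tags, e.g. ["role:finance", "user:alice"].
    acl = get_document_acl(document_id)
    return sorted({f"{entry.kind}:{entry.name}" for entry in acl})

def tag_chunks(document_id: str, chunks: list[dict]) -> list[dict]:
    tags = flatten_permissions(document_id)
    for chunk in chunks:
        # Every chunk from the same document carries the same permission scope.
        chunk["permission_tags"] = tags
        chunk["document_id"] = document_id
    return chunks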

Re-indexing on permission changes

When a document’s permissions change, you must re-ingest all its chunks with the updated tags. This is a background job — it does not affect the query path. Store the document-permission mapping in a separate table so the ingestion pipeline can look up which documents need re-indexing when permissions change.
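
Since the chunk text and embeddings are unchanged, one way to implement that re-index job is to rewrite only the tags on the existing chunks. A sketch, assuming the Qdrant + PostgreSQL stack above (collection name, table name, and connection strings are assumptions):

import psycopg
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(url="http://localhost:6333")

def update_permission_tags(document_id: str, new_tags: list[str]) -> None:
    # Vector side: rewrite the payload on every point from this document.
    client.set_payload(
        collection_name="chunks",
        payload={"permission_tags": new_tags},
        points=Filter(must=[FieldCondition(key="document_id", match=MatchValue(value=document_id))]),
    )
    # Metadata side: keep the BM25 filter columns in sync.
    with psycopg.connect("postgresql://localhost/rag") as conn:
        conn.execute(
            "UPDATE chunks SET permission_tags = %s WHERE document_id = %s",
            (new_tags, document_id),
        )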

Multi-Tenant Design

Use app_id as a mandatory primary filter on every query:

  • Isolates tenants completely — a user from app_1 can never see chunks from app_2
  • Reduces the search space before any expensive computation
  • Enables per-tenant tuning (different embedding models, chunk sizes) if needed

For large deployments with thousands of tenants, consider separate Qdrant collections per tenant. For smaller deployments (dozens to hundreds of tenants), a single collection with app_id filtering is sufficient and simpler to manage.

Common Failure Modes

These mistakes are subtle but critical. Each one can lead to real data leaks in production RAG systems.

Filtering after retrieval

Retrieving chunks first, then checking permissions in application code. This leaks information (timing side-channels, error messages) and reduces answer quality because filtered-out chunks reduce the context available to the LLM. Fix: always filter in the database query.

Per-chunk authorization calls

Calling the authorization service for each retrieved chunk to check access. This turns a single database query into N+1 network calls. At 50 chunks per query and 100ms per auth call, that is 5 seconds of overhead. Fix: resolve permissions once, pass as filter.

Mixed-permission chunks

Splitting a document into chunks where some chunks are public and others are restricted. If the LLM receives a mix, it may reveal restricted information by inference from the public chunks. Fix: every chunk from the same document must have the same permission tags.

Overly granular permissions

Assigning unique permission tags to individual chunks instead of documents. This creates huge filter arrays, increases index size, and degrades query performance. Fix: permissions live at the document level, not the chunk level.

Missing app_id filter

Forgetting to filter by tenant. In a multi-tenant system, this means any user from any tenant can potentially see chunks from other tenants if their permission tags happen to overlap. Fix: app_id is a mandatory must clause in every query.
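
One cheap defense against this last failure mode is to have the retrieval layer inject the tenant clause itself rather than trusting callers to remember it. A sketch using the dict filter shape from the Qdrant example above:

def with_tenant_filter(conditions: list[dict], app_id: str) -> list[dict]:
    # Drop any caller-supplied app_id clause and always prepend the trusted one,
    # so the tenant filter can be neither forgotten nor overridden.
    others = [c for c in conditions if c.get("key") != "app_id"]
    return [{"key": "app_id", "match": {"value": app_id}}, *others]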

Design Principles

  1. Separation of concerns — identity, authorization, retrieval, and LLM are independent layers. Each can be swapped without redesigning the others.
  2. Precomputation over runtime logic — flatten permissions at ingestion time, not at query time. The query path should be as simple as filter + search + rank.
  3. Secure by design — never rely on post-retrieval filtering. The database query is the security boundary.
  4. Replaceable components — the architecture works with Qdrant or Pinecone, PostgreSQL or MongoDB, Ollama or OpenAI. The permission model and ranking logic stay the same.
