Table of Contents
How does a Retrieval-Augmented Generation (RAG) pipeline work?
Why pre-filtering scales poorly
Post-retrieval filtering with an FGA service
How to handle starvation: the re-query loop
Why ReBAC (Zanzibar) is essential
Securing RAG results without sacrificing performance
Retrieval-Augmented Generation (RAG) has revolutionized how enterprises leverage Large Language Models (LLMs). By enriching LLMs with proprietary data, RAG pipelines turn generic models into powerful internal experts capable of answering complex questions about operations, codebases, and customer interactions. However, this capability introduces the critical security challenge of data leakage, the unintentional exposure of restricted information to individuals lacking the proper permissions.
Intricate, dynamic, and hierarchical access control policies govern enterprise data sources. If your RAG pipeline bypasses these controls, an LLM could share privileged customer data with an unauthorized user.
The risk is significant: data breaches, even within internal environments, can result in hefty fines and reputational damage. Yet organizations don't want to introduce latency or risk errors through heavy-handed filtering that slows the pipeline down. This is why deploying RAG safely requires a robust, fine-grained authorization strategy that respects existing permissions without sacrificing performance.
In this post, we will explore:
What a RAG pipeline is and why it is crucial to secure it properly.
Why traditional pre-filtering techniques don’t work at scale on the Vector DBs that are typically used in a RAG pipeline.
Lastly, how a post-retrieval filtering pattern using a Zanzibar-inspired service (Relationship-Based Access Control/Fine Grained Authorization) is the optimal solution.
It’s recommended that the reader be acquainted with fine-grained authorization (FGA) concepts to fully grasp the challenges and patterns laid out in this blog. An educational overview of FGA can be found here and Descope’s documentation on implementing FGA can be found here.
How does a Retrieval-Augmented Generation (RAG) pipeline work?
At a high level, a RAG pipeline orchestrates the flow of information between a user's query and an LLM to provide context-aware answers. The standard flow involves four stages (sketched in code after the list):
Query: A user submits a prompt.
Retrieve: The system converts the query into a vector embedding (a numerical representation of its meaning) and searches a Vector Database to find the most semantically similar document chunks (the top_k results).
Augment: The original prompt and the retrieved context are combined into a single input.
Generate: The LLM processes the augmented prompt to produce the final answer.
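To make the flow concrete, here is a minimal sketch of the four stages in Python. The embed, vector_db, and llm helpers are hypothetical stand-ins for your embedding model, Vector DB client, and LLM client, not any particular library's API.

```python
# A minimal sketch of the four RAG stages; embed(), vector_db.search(),
# and llm.generate() are hypothetical helpers, not a specific library.

def answer(query: str, top_k: int = 5) -> str:
    # 1. Query: the user's prompt arrives as plain text.
    # 2. Retrieve: embed the query and find the nearest chunks.
    query_vector = embed(query)                           # hypothetical embedding model
    chunks = vector_db.search(query_vector, top_k=top_k)  # hypothetical Vector DB client

    # 3. Augment: combine the prompt with the retrieved context.
    context = "\n\n".join(chunk.text for chunk in chunks)
    augmented_prompt = f"Context:\n{context}\n\nQuestion: {query}"

    # 4. Generate: the LLM answers using the augmented prompt.
    return llm.generate(augmented_prompt)                 # hypothetical LLM client
```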
The vulnerability lies in the Retrieve phase. Think of the Vector DB as a super-efficient librarian who can find conceptually similar paragraphs instantly. However, this librarian is focused purely on relevance, not clearance levels. The retrieval mechanism is mathematical; it finds neighbors in the embedding space and has no inherent understanding of enterprise authorization policies.
Why pre-filtering scales poorly
The intuitive solution to RAG security is pre-filtering: attempting to filter the data before or during the vector search by embedding permission information into the vector metadata.
In theory, you could add metadata like allowed_groups: ["finance"] to each vector chunk. When a user queries the database, the application adds the user's group memberships to the query filter. While this might work for very simple and static Role-Based Access Control (RBAC) policies, it quickly breaks down in real-world enterprise environments at scale:
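For illustration, a pre-filtered query might look like the sketch below. The vector_db client, its $in filter syntax, and get_groups_for_user are all hypothetical; filter capabilities and syntax vary across Vector DBs.

```python
# A sketch of metadata pre-filtering, assuming a hypothetical vector_db
# client whose query() accepts a metadata filter.

user_groups = get_groups_for_user(user_id)  # hypothetical directory lookup

results = vector_db.query(
    vector=embed(query),
    top_k=5,
    # Only return chunks whose allowed_groups overlaps the user's groups.
    # With hundreds of groups, this filter itself becomes unwieldy.
    filter={"allowed_groups": {"$in": user_groups}},
)
```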
1. Synchronization lag
Permissions in an enterprise organization are dynamic. Users change roles, join projects, and share or restrict documents in real time. Authorization systems must instantly reflect these changes.
Metadata filtering requires you to sync permissions from your source of truth into the vector database, which introduces latency. If the vector database only syncs periodically, permission changes in the source data aren't immediately reflected. This gap creates a significant security vulnerability where users might retain access long after it should have been revoked.
2. Metadata explosion and complexity
Enterprise permissions are rarely simple. They involve nested groups, folder hierarchies, ownership, and specific sharing relationships (e.g., "Alice can view this document because Bob shared it with her, and Bob owns the folder it resides in").
Attempting to flatten these complex, graph-like relationships (the natural domain of ReBAC) into simple key-value metadata tags (like ACLs or RBAC roles) is incredibly difficult. It leads to a metadata explosion, increasing storage requirements and complexity.
If a user belongs to hundreds of groups, including all those identifiers in the metadata filter can exceed query size limits or severely degrade performance. And every time someone's role or organization changes, the metadata on every affected chunk must be rewritten.
3. Performance overhead
Vector databases are optimized for Approximate Nearest Neighbor (ANN) search. Forcing the database to also evaluate complex metadata filters against every potential match significantly degrades query performance.
The database must now reconcile two competing goals: finding the nearest semantic neighbors and satisfying the complex authorization constraints. This adds latency, reducing the responsiveness of the LLM application.
Post-retrieval filtering with an FGA service
The most robust and scalable approach to securing RAG pipelines is to decouple authorization from retrieval. This strategy is known as post-retrieval filtering.
Instead of trying to force the Vector DB to handle permissions (a job it's ill-suited for), we leverage a dedicated, high-performance authorization service to filter the results after they are retrieved but before they are sent to the LLM. Here is the secure architectural pattern, with a sketch in code after the list:
Retrieve broadly: The application queries the Vector DB based purely on semantic similarity, requesting the top N relevant document chunks, without any permission filters.
Check authorization: The application takes the resulting set of document IDs (the "candidate set") and performs a real-time authorization check for the current user against a centralized authorization service.
Filter: Only the documents the user is authorized to "view" are retained.
Augment and generate: The authorized documents are passed to the LLM as context for generating the final response.
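A minimal sketch of the pattern, assuming hypothetical vector_db, fga, and llm clients; the fga.check_many batch call is illustrative, and your authorization service's API may differ.

```python
# A sketch of post-retrieval filtering; vector_db, fga, and llm are
# hypothetical clients standing in for your actual services.

def answer_securely(user_id: str, query: str, top_k: int = 5) -> str:
    # 1. Retrieve broadly: no permission filters, purely semantic.
    candidates = vector_db.search(embed(query), top_k=top_k)

    # 2. Check authorization: batch-check the candidate set against
    #    the central FGA service (one round trip, not N).
    allowed_ids = fga.check_many(
        user=user_id,
        relation="viewer",
        resources=[c.document_id for c in candidates],
    )

    # 3. Filter: keep only the chunks the user may view.
    authorized = [c for c in candidates if c.document_id in allowed_ids]

    # 4. Augment and generate.
    context = "\n\n".join(c.text for c in authorized)
    return llm.generate(f"Context:\n{context}\n\nQuestion: {query}")
```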
How to handle starvation: the re-query loop
A common challenge with post-filtering is the "insufficient results" problem, also referred to as “starvation”. Suppose the LLM requires 5 documents (k=5) for context. You query the Vector DB for the top 5 results, but the user is only authorized to see 1 of them. The LLM will lack sufficient information.
To solve this issue, we implement an iterative re-query loop using pagination (offsets): If the number of authorized documents is less than some threshold, the system automatically re-queries the Vector DB for the next batch of results (e.g., results 20-40) and repeats the filtering process until the target k is met or a predefined safety limit is reached. This ensures we provide maximum allowable context without re-processing the same unauthorized documents.
Note: It's often wise to over-sample the initial query—that is, set the top_k value to something higher than the LLM needs to make it more likely to hit the target without having to re-query. Re-queries, particularly those involving pagination, will add latency and overhead to the RAG pipeline.
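Putting the loop and the over-sampling advice together, a sketch might look like the following; it reuses the hypothetical vector_db and fga clients from the earlier example, and the offset pagination parameter is illustrative.

```python
# A sketch of the starvation-handling re-query loop, assuming the
# hypothetical vector_db and fga clients from the previous sketch.

def retrieve_authorized(user_id, query_vector, k=5, batch_size=20, max_batches=5):
    authorized, offset = [], 0
    for _ in range(max_batches):  # predefined safety limit
        # Over-sample: fetch batch_size > k candidates per round.
        batch = vector_db.search(query_vector, top_k=batch_size, offset=offset)
        if not batch:
            break  # the index is exhausted
        allowed_ids = fga.check_many(
            user=user_id,
            relation="viewer",
            resources=[c.document_id for c in batch],
        )
        authorized += [c for c in batch if c.document_id in allowed_ids]
        if len(authorized) >= k:
            break  # target context size reached
        offset += batch_size  # paginate past already-checked chunks
    return authorized[:k]
```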

Why ReBAC (Zanzibar) is essential
For this pattern to work effectively, the authorization service must be extremely fast and capable of handling complex policies. A slow permission check would negate the speed benefits of the vector database.
This is where Relationship-Based Access Control (ReBAC), inspired by Google's Zanzibar paper, excels. A Zanzibar-style service, like the one offered by Descope, is designed specifically for this challenge:
Performance at scale: Zanzibar-style systems are built for massive scale and low latency, utilizing sophisticated caching and parallel graph traversal algorithms. This ensures the authorization checks don't become a bottleneck; Google's Zanzibar paper reports serving millions of authorization requests per second with 95th-percentile latency under 10 milliseconds.
Flexibility: ReBAC models permissions as relationships (e.g., "User is member of Group"; "Group is viewer of Folder"). This allows for expressing complex scenarios like inheritance, ownership, and dynamic sharing that are impossible to represent in static metadata (see the sketch after this list).
Real-time accuracy: Checks happen at query time against the central source of truth, ensuring that stale data never compromises security.
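To illustrate the relationship model, here is a toy Python sketch of Zanzibar-style tuples and a recursive check. A production Zanzibar-style service evaluates an equivalent graph at scale with caching and parallel traversal rather than naive recursion; the tuples and inheritance rule below are invented for the example.

```python
# Toy ReBAC illustration: (subject, relation, object) tuples and a
# check that walks them. The data and the folder-inheritance rule
# are illustrative, not a real service's schema.

tuples = {
    ("user:bob", "owner", "folder:q3"),
    ("user:alice", "viewer", "folder:q3"),    # Bob shared the folder with Alice
    ("folder:q3", "parent", "doc:forecast"),  # the doc lives in the folder
}

def check(user: str, relation: str, obj: str) -> bool:
    # Direct relationship?
    if (user, relation, obj) in tuples:
        return True
    # Inherited: a viewer or owner of a parent folder can view its docs.
    for (subject, rel, o) in tuples:
        if rel == "parent" and o == obj:
            return check(user, "viewer", subject) or check(user, "owner", subject)
    return False

print(check("user:alice", "viewer", "doc:forecast"))  # True, via the shared folder
```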
Securing RAG results without sacrificing performance
As organizations rush to adopt RAG to unlock the value of their enterprise data, security cannot be an afterthought. The potential for data leakage through LLM interfaces is significant if permissions are not strictly enforced.
While pre-filtering seems intuitive, it quickly succumbs to the complexity of enterprise permissions—leading to performance issues, synchronization challenges, and critical security gaps.
The optimal architecture is post-retrieval filtering. By decoupling semantic search from authorization, organizations can leverage the speed of Vector Databases and the robust security of a Zanzibar-style ReBAC system. This approach provides the best combination of security, scalability, and flexibility, ensuring that users only access the data they are authorized to see.
To learn more about how Descope can help organizations implement fine-grained authorization, check out our documentation and watch this video example within an IoT device context. Ready to get started? Sign up for a free account or book a demo with our team.


