How are people ensuring secure access to RAG data?

I am curious about the patterns people are using to apply RBAC to RAG data in vector DBs. Most vector DBs do not offer any row-level security for RAG data; the only approach I know of is metadata filtering. I would be interested to hear how others are implementing RBAC on RAG data in their vector DBs.

1 Like

This is a great question and one I’ve personally agonised over.

My Chatbot (incidentally the first ever AI Chatbot on the Discourse platform) can talk in public channels so you have to be very careful it doesn’t leak privileged information to third parties.

I’ve ended up implementing two optional solutions:

  • The bot sees whatever a specified benchmark user sees
  • Information the bot is able to see is limited by an explicit Category scope.

This determines what is embedded, and it is then assumed the bot can retrieve anything that has been embedded. That way I can perform split-second matches easily.

So essentially I’m being careful only to handle more or less “public” information.
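
In rough terms, the access decision happens at embedding time rather than at query time. A minimal sketch of that kind of ingest-time gate, with placeholder names and helpers (this is not the plugin’s actual code):

    // Decide whether a topic should be embedded at all, so retrieval
    // never needs a per-query permission filter.
    function shouldEmbed(array $topic, string $mode, array $allowedCategoryIds, callable $benchmarkUserCanSee): bool
    {
        if ($mode === 'benchmark_user') {
            // Option 1: embed only what a chosen benchmark user can see.
            return $benchmarkUserCanSee($topic);
        }
        // Option 2: embed only topics inside an explicit category scope.
        return in_array($topic['category_id'], $allowedCategoryIds, true);
    }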

My bot can also have one-to-one chats, where arguably you could share privileges with the bot, but I’ve yet to implement a purely one-to-one RBAC system. This obviously gets complex fast, and it reduces the performance of data retrieval for all scenarios if you leverage the same table.

I may return to this one day but surprisingly it’s not been a priority for me or the community using the plugin.

1 Like

If your RAG data doesn’t change too fast, you could build a little compressed file for each user periodically. This file gets read in at runtime for in-memory search specific to that user, and contains the vectors, metadata, and possibly hashes/indices into the DB rows containing the text, unless you want to store the text in memory too.

The faster the RAG data changes, the more your infrastructure has to update. Updating once a minute for each user could get prohibitive, but updating once a day would work. Or update whenever a new kilobyte/megabyte/etc. of data becomes available to the user.

Just a function of resources, the firehose information rate, and what’s really needed from the user’s perspective.
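
A minimal sketch of the idea, assuming a gzipped JSON snapshot per user and a brute-force cosine-similarity scan (file layout and field names are just illustrative):

    // Build step, run on a schedule (e.g. daily, or after N new KB of data).
    function writeUserSnapshot(string $userId, array $chunks): void
    {
        // $chunks: [['id' => ..., 'vector' => [...], 'metadata' => [...]], ...]
        file_put_contents("snapshots/$userId.json.gz", gzencode(json_encode($chunks)));
    }

    // Query step: load the user's snapshot and rank it by cosine similarity.
    function searchUserSnapshot(string $userId, array $queryVec, int $topK = 5): array
    {
        $chunks = json_decode(gzdecode(file_get_contents("snapshots/$userId.json.gz")), true);
        usort($chunks, fn ($a, $b) =>
            cosine($queryVec, $b['vector']) <=> cosine($queryVec, $a['vector']));
        return array_slice($chunks, 0, $topK);
    }

    function cosine(array $a, array $b): float
    {
        $dot = $na = $nb = 0.0;
        foreach ($a as $i => $v) {
            $dot += $v * $b[$i];
            $na  += $v * $v;
            $nb  += $b[$i] * $b[$i];
        }
        return $dot / ((sqrt($na) * sqrt($nb)) ?: 1.0);
    }

Since each user only ever searches their own snapshot, there is nothing to filter at query time; the cost shifts to the periodic build step.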

1 Like

You could also perform an insecure global search then filter the results “in post”.
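
Something like this, assuming each result chunk carries its groups in metadata (the shapes here are assumptions, not any particular plugin’s data model):

    // Over-fetch from the global index, then keep only what the user may see.
    function filterByAccess(array $results, array $userGroups, int $topK = 5): array
    {
        $allowed = array_filter($results, function (array $chunk) use ($userGroups) {
            // Keep a chunk only if the user shares at least one group with it.
            return count(array_intersect($chunk['metadata']['groups'], $userGroups)) > 0;
        });
        return array_slice(array_values($allowed), 0, $topK);
    }

One practical detail: you have to over-fetch (ask the index for more results than you need) so that enough survive the permission filter.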

1 Like

The only thing I don’t like about the global search, if you have lots of data, is all the resources expended for one user, assuming the user’s data is a tiny fraction of the big pile of data.

On the flip side …

If all we’re talking about is, say, 100k users, and each user can see 10% of the data, then I think it’s cleaner to do the global search and then filter.

The overhead of maintaining a separate, isolated 10% slice for each user (on the order of 10,000 copies of the corpus for 100k users) would outweigh whatever you gain over a single global search.

So where you sit on the percentage of data a user can see vs. how many users you have will pretty much make the decision for you. :sweat_smile:

Your Discourse example certainly qualifies for global search to be the cleaner option. :rofl:

1 Like

I guess it depends on your indexing and table size. My vector searches use in-memory HNSW graph search and queries take milliseconds.

I think a bigger problem comes when you also need to add keyword hybrid capability. :sweat_smile:

1 Like

Don’t you just save the data as you generally would in your DB, and use the vectors only for identifying which chunks to grab? The vector search sends back the IDs, and then you just take the matching text and add it in.

That’s what I do.
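
Roughly like this, assuming the chunk text lives in a relational table keyed by the same IDs the vector index returns (table and column names here are made up):

    // Turn the IDs returned by the vector search back into text via the main DB.
    function fetchChunkText(PDO $pdo, array $ids): array
    {
        if (empty($ids)) {
            return [];
        }
        $placeholders = implode(',', array_fill(0, count($ids), '?'));
        $stmt = $pdo->prepare("SELECT id, body FROM chunks WHERE id IN ($placeholders)");
        $stmt->execute($ids);
        return $stmt->fetchAll(PDO::FETCH_KEY_PAIR); // [id => body]
    }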

That’s not RBAC (Role-Based Access Control).

The OP is asking about user access privileges, where different users have different levels of access to the stored information.

2 Likes

What’s wrong with Metadata Filtering? I start with a platform that already has role-based access control (Drupal CMS), use Groups and Taxonomy to apply it, and mirror those classifications in the embeddings metadata. So, each query is, in fact, filtered by the access privileges of the user in the system. Works beautifully.

That is not a problem per se, but instead of adding that filter every time, can we not have a row-level access policy like we have in traditional DBs? For example, I could create a policy:

    select cosine_similarity from table
    when role = metadata["role"]
      return text
    else
      return null

If I could create a policy like this, I would not have to add the filter to every query.
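
For what it’s worth, this is already possible when the vector store is plain Postgres with pgvector, because row-level security applies to any table. A minimal sketch, with hypothetical table and column names (rag_chunks, metadata, embedding) and assuming the caller’s role is passed in as a session setting:

    // One-time setup: turn on row-level security for the embeddings table and
    // attach a policy, so non-matching rows are simply never returned.
    $pdo = new PDO('pgsql:host=localhost;dbname=rag');
    $pdo->exec("ALTER TABLE rag_chunks ENABLE ROW LEVEL SECURITY");
    $pdo->exec("ALTER TABLE rag_chunks FORCE ROW LEVEL SECURITY"); // apply to the table owner too (superusers still bypass)
    $pdo->exec("
        CREATE POLICY role_matches ON rag_chunks
        USING (metadata->>'role' = current_setting('app.current_role', true))
    ");

    // Per request: set the caller's role once, then run the similarity search
    // with no explicit filter clause; the policy does the filtering.
    $pdo->exec("SET app.current_role = " . $pdo->quote('editor'));
    $stmt = $pdo->query("
        SELECT text
        FROM rag_chunks
        ORDER BY embedding <=> '[0.1, 0.2, 0.3]'
        LIMIT 5
    ");
    $contexts = $stmt->fetchAll(PDO::FETCH_COLUMN);

With a policy like that in place, every query from the application is filtered automatically; the catch is that most dedicated vector DBs don’t expose an equivalent, so there you are back to adding the metadata filter yourself.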

I’m sorry, but what you are doing is exactly what I refer to as “filtering”.

Currently, I check access like this:

			// This is where permissions check goes
			// Check $context which is an array
			// We will be looking for $context_item['nid']
			// We need object type = file, node or comment
			// And object public = 'Y' or 'N'
			if (!empty($solrResults)) {
				$solrResults = $this->solrai_checkAccess($solrResults);
			}

This is code from my “cosine similarity” search query in Weaviate.

  $query = '
  {
    Get {
      ' . $className . ' (
        limit: ' . $this->limit . '
        nearText: {
          concepts: ["' . $concept . '"],
          distance: ' . $this->distance . '
        }
        where: {
          operator: And,
          operands: [
            { path: ["site"], operator: Equal, valueText:"' . $baseSite . '"}'
            . (!empty($checkedGroups) ? ',' . $this->solraiService->solrai_addOperands('groups', 'Equal', $checkedGroups, 'Or') : '')
            . (!empty($this->keywords) ? ',' . '{
                operator: Or,
                operands: [
                  ' . $this->solraiService->solrai_addOperands('content', 'Like', $this->keywords, 'And') . ',
                  ' . $this->solraiService->solrai_addOperands('title', 'Like', $this->keywords, 'And') . ',
                  ' . $this->solraiService->solrai_addOperands('summary', 'Like', $this->keywords, 'And') . '
                ]
              }' : '')
            . (!empty($checkedTags) ? ',' . $this->solraiService->solrai_addOperands('taxonomy', 'Equal', $checkedTags, 'Or') : '') . '
          ]
        }
      ){

My plan is to use “groups” to identify the “roles” a user can/cannot access in this query, the exact same way you’ve done it above. I can do that because my system knows what group(s) any user has access to.