RAG on private dataset via LangChain, does OpenAI / ChatGPT get access to the documents?

Hi, I am planning to use the RAG (Retrieval Augmented Generation) approach for developing a Q&A solution with GPT. In this approach, I will convert a private wiki of documents into OpenAI embeddings (counting tokens with tiktoken) and store them in a vector DB (Pinecone). At query time, I will retrieve similar documents from the DB and pass them to the prompt as additional context. Will my documents be exposed to ChatGPT, or do they stay private? I’d like to avoid leaking my document data to OpenAI / ChatGPT.
Specifically, take this LangChain Python code as an example:

query = "What is the meaning of life?"
docs = pinecone.similarity_search(query, include_metadata=True)
llm = OpenAI(temperature=0, model_name='text-davinci-003')
print(llm(query))
chain = load_qa_chain(llm, chain_type='stuff')
response = chain.run(input_documents=docs, question=query)
print(response)

Context: Question Answering Over Documents | 🦜️🔗 LangChain

4 Likes

Yes, your documents are read by OpenAI, and also by Pinecone if you store the plain text in metadata.

4 Likes

Thanks for the info. What approach should be taken to keep it truly private? Use an alternative open source LLM? Is there an API option that you can use to tell OpenAI to not read it?

For your example LangChain usage, data will be sent 1) to OpenAI to create the embeddings, 2) possibly to Pinecone if you include the content in metadata, and 3) to OpenAI when the content is stuffed into the prompt. Per their policy they don’t use it for training and don’t retain it beyond 30 days, so it’s up to you to decide whether that is sufficient for your data.

  1. OpenAI will not use data submitted by customers via our API to train or improve our models, unless you explicitly decide to share your data with us for this purpose.
  2. Any data sent through the API will be retained for abuse and misuse monitoring purposes for a maximum of 30 days, after which it will be deleted (unless otherwise required by law).

If you don’t want to send the data to a third-party API, you’re going to need to find replacement libraries for calculating the embeddings, storing the vectors, and running an LLM.
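
As a rough illustration, here is a minimal sketch of a fully local stack built with LangChain: HuggingFace sentence-transformer embeddings plus a FAISS vector store, with a local model served through a pipeline. The model names and the wiki_texts variable are assumptions for illustration, not specific recommendations; check what fits your hardware.

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import HuggingFacePipeline
from langchain.chains.question_answering import load_qa_chain

# Embeddings are computed locally, so the raw text never leaves your machine.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
docsearch = FAISS.from_texts(wiki_texts, embeddings)  # wiki_texts: your wiki pages as strings

# A local model replaces the OpenAI completion call.
local_llm = HuggingFacePipeline.from_model_id(
    model_id="google/flan-t5-large", task="text2text-generation"
)

query = "What is the meaning of life?"
docs = docsearch.similarity_search(query)
chain = load_qa_chain(local_llm, chain_type="stuff")
print(chain.run(input_documents=docs, question=query))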

3 Likes

Check this video (Talk to YOUR DATA without OpenAI APIs: LangChain - YouTube). They present a solution with free tools, but I’m not sure it will keep your data inside your premises. If you finally find a way to do it, please share it.

1 Like

Thanks for the responses everyone, I’m going to research a bit more and see if there’s a truly private approach. I’ll report back here. Thanks!

If you are using a third party as a proxy for API calls into OpenAI, and that third party has decided to opt in its data for API calls (which is possible), then your assumption that your data is not being used is wrong.

Any party issuing API calls to OpenAI on your behalf is free to modify OpenAI’s default opt-out policy. They might do it unwittingly, or intentionally to cut a deal that lowers inferencing costs. You must read their privacy policy to understand whether there is a risk.

3 Likes

You could do all of this using Azure OpenAI, which would address your security concerns.

Depending on what wiki site you’re using, Mantium has a Notion connector (along with PDF, DOCX, etc.) that would automate your pipelines. Here’s a video: Full Tutorial: Chat with your Data Using OpenAI ChatGPT Plugins and Mantium - YouTube

1 Like

I don’t know if this will solve your issue, but one way I’ve found to interact with data using ChatGPT without ever sending anything except table headers is to use ChatGPT only to draft the query, then run that query on your system to get the output. Any follow-up questions can be based on the query rather than the output data, meaning the data never leaves your system.
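
To make that concrete, here is a minimal sketch of the pattern against a local SQLite database. The orders table, its columns, and the prompt wording are hypothetical; the point is that only the schema, never the rows, is sent to the API.

import sqlite3
import openai

SCHEMA = "orders(id, customer, order_date, total)"  # hypothetical table; only headers leave your system

# Ask the model to draft SQL from the schema alone; no row data is sent.
completion = openai.Completion.create(
    model="text-davinci-003",
    prompt=f"Given the table {SCHEMA}, write a SQL query that returns "
           f"total sales per customer. Reply with SQL only.",
    temperature=0,
    max_tokens=200,
)
sql = completion.choices[0].text.strip()

# Review the generated SQL before running it, then execute it locally:
# the results never leave your system.
conn = sqlite3.connect("local.db")
for row in conn.execute(sql):
    print(row)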

1 Like

If your data/app is in the Azure cloud and you want to use OpenAI models, one way to avoid leaking your private data is to use the Azure OpenAI Service. This way OpenAI does not get to read/use your data.
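
For what it’s worth, pointing the earlier LangChain example at Azure OpenAI is mostly a configuration change. A minimal sketch, assuming you have already created a deployment in your Azure OpenAI resource (the endpoint, key, API version, and deployment name below are placeholders):

import os
from langchain.llms import AzureOpenAI

# Placeholders: use the values from your own Azure OpenAI resource.
os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_BASE"] = "https://<your-resource>.openai.azure.com/"
os.environ["OPENAI_API_VERSION"] = "2023-05-15"
os.environ["OPENAI_API_KEY"] = "<your-azure-openai-key>"

# deployment_name is whatever you named the model deployment in Azure.
llm = AzureOpenAI(deployment_name="text-davinci-003", model_name="text-davinci-003")
print(llm("What is the meaning of life?"))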

Yeah thanks for the help everyone. I ended up going with the Azure OpenAI for privacy reasons, and it’s working great!

1 Like

We do this in many cases, and our favored approach is to insulate our data from the LLM through embeddings. Vectors, after all, are like anonymized pointers. They allow us to understand the intent of the user with respect to the schema and the query.
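
As a sketch of that idea, assuming the embeddings are computed locally with the sentence-transformers library (the model choice and sample documents are illustrative), the raw text stays on your machine and only anonymous vectors get compared:

from sentence_transformers import SentenceTransformer, util

# Runs entirely locally; no document text is sent to any API.
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Reset your password from the account settings page.",
    "Invoices are emailed on the first business day of each month.",
]
doc_vectors = model.encode(docs)  # the "anonymized pointers": just arrays of floats

query_vector = model.encode("how do I change my password?")
scores = util.cos_sim(query_vector, doc_vectors)[0]
best = int(scores.argmax())
print(docs[best])  # only now do you decide what, if anything, to send to an LLM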

2 Likes

That’s a very cool way to describe embeddings.

I’m on the same(ish) page. GPT to translate/filter the problem.

1 Like

Yes. Yes. Yes. Identify the user’s wants and needs first; then take relevant steps to deliver the correct information.

2 Likes

I highly recommend Evervault to anonymise prompts. For example, here is a support ticket before redaction:

{
  "id": 1,
  "subject": "Computer Order Issue",
  "description": "Hi, my name is **Claude** and I placed an order for 3 supercomputers on **June 13th**. The tracking portal says they were delivered to **123 Front Street**, but I never received the shipping containers. Can I please have a refund?",
  "category": " Refund"
}

and the same ticket after Evervault's redaction:


{
  "id": 1,
  "subject": "Computer Order Issue",
  "description": "Hi, my name is **Cole** and I placed an order for 3 supercomputers on **<REDACTED DATE_TIME>**. The tracking portal says they were delivered to **Hettinger berg, Arizona**, but I never received the shipping containers. Can I please have a refund?",
  "category": " Refund"
}
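
If you would rather keep the redaction step in-house, the open-source Microsoft Presidio library gives a similar effect. A rough sketch, doing plain redaction only (Evervault's pseudonym substitution, like Claude becoming Cole above, is not reproduced here, and the sample text is illustrative):

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = ("Hi, my name is Claude and I placed an order for 3 supercomputers "
        "on June 13th. They were delivered to 123 Front Street.")

# Detect PII locally, then replace it before the prompt goes to any API.
results = AnalyzerEngine().analyze(text=text, language="en")
redacted = AnonymizerEngine().anonymize(text=text, analyzer_results=results)
print(redacted.text)  # e.g. "... my name is <PERSON> ... on <DATE_TIME> ... <LOCATION>."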