RAG on private dataset via LangChain, does OpenAI / ChatGPT get access to the documents?

Hi, I am planning to use the RAG (Retrieval Augmented Generation) approach for developing a Q&A solution with GPT. In this approach, I will convert a private wiki of documents into OpenAI embeddings (counting tokens with tiktoken) and store them in a vector DB (Pinecone). At query time, I will retrieve similar documents from the DB and pass them to the prompt as additional context. Will my documents be exposed to ChatGPT, or do they stay private? I’d like to avoid leaking my document data to OpenAI / ChatGPT.
Specifically, take this LangChain Python code as an example:

query = "What is the meaning of life?"
docs = pinecone.similarity_search(query, include_metadata=True)
llm = OpenAI(temperature=0, model_name='text-davinci-003')
print(llm(query))
chain = load_qa_chain(llm, chain_type='stuff')
response = chain.run(input_documents=docs, question=query)
print(response)

Context: Question Answering Over Documents | 🦜️🔗 LangChain

4 Likes

Yes, your documents are read by OpenAI, and also by Pinecone if you store the plain text in metadata.

4 Likes

Thanks for the info. What approach should be taken to keep it truly private? Use an alternative open source LLM? Is there an API option that you can use to tell OpenAI to not read it?

For your example LangChain usage, data will be sent 1) to OpenAI to create the embeddings, 2) possibly to Pinecone if you include the content in metadata, and 3) to OpenAI when the content is stuffed into the prompt. Per their policy they don’t use it for training and don’t retain it beyond 30 days, so it’s up to you to decide whether that is sufficient for your data.

  1. OpenAI will not use data submitted by customers via our API to train or improve our models, unless you explicitly decide to share your data with us for this purpose.
  2. Any data sent through the API will be retained for abuse and misuse monitoring purposes for a maximum of 30 days, after which it will be deleted (unless otherwise required by law).

If you don’t want to send the data to a third-party API, you’re going to need to find replacement libraries for calculating the embeddings, storing the vectors, and running an LLM.
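
As a rough illustration, here is a minimal sketch of a fully local stack built with LangChain: HuggingFace sentence-transformer embeddings plus a FAISS vector store, with a local model served through a pipeline. The model names and the wiki_texts variable are assumptions for illustration, not specific recommendations; check what fits your hardware.

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import HuggingFacePipeline
from langchain.chains.question_answering import load_qa_chain

# Embeddings are computed locally, so the raw text never leaves your machine.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
docsearch = FAISS.from_texts(wiki_texts, embeddings)  # wiki_texts: your wiki pages as strings

# A local model replaces the OpenAI completion call.
local_llm = HuggingFacePipeline.from_model_id(
    model_id="google/flan-t5-large", task="text2text-generation"
)

query = "What is the meaning of life?"
docs = docsearch.similarity_search(query)
chain = load_qa_chain(local_llm, chain_type="stuff")
print(chain.run(input_documents=docs, question=query))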

3 Likes

Check this video (Talk to YOUR DATA without OpenAI APIs: LangChain - YouTube). They present a solution with free tools, but I’m not sure it will keep your data inside your premises. If you finally find a way to do it, please share it.

1 Like

Thanks for the responses everyone, I’m going to research a bit more and see if there’s a truly private approach. I’ll report back here. Thanks!

If you are using a third party as a proxy for API calls into OpenAI, and that third party has decided to opt in its data for API calls (which is possible), then your assumption that your data is not being used is wrong.

Any party issuing API calls to OpenAI on your behalf is free to modify OpenAI’s default opt-out policy. They might do it unwittingly, or intentionally to cut a deal that lowers inferencing costs. You must read their privacy policy to understand whether there is a risk.

3 Likes

You could do all of this using Azure OpenAI, which would address your security concerns.

Depending on what wiki site you’re using, Mantium has a Notion connector (along with PDF, DOCX, etc.) that would automate your pipelines. Here’s a video: Full Tutorial: Chat with your Data Using OpenAI ChatGPT Plugins and Mantium - YouTube

1 Like

I don’t know if this will solve your issue, but one way I’ve found to interact with data using ChatGPT without ever sending anything except table headers is to use ChatGPT only to draft the query, then run that query on your system to get the output. Any follow-up questions can be based on the query rather than the output data, meaning the data never leaves your system.
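
To make that concrete, here is a minimal sketch of the pattern against a local SQLite database. The orders table, its columns, and the prompt wording are hypothetical; the point is that only the schema, never the rows, is sent to the API.

import sqlite3
import openai

SCHEMA = "orders(id, customer, order_date, total)"  # hypothetical table; only headers leave your system

# Ask the model to draft SQL from the schema alone; no row data is sent.
completion = openai.Completion.create(
    model="text-davinci-003",
    prompt=f"Given the table {SCHEMA}, write a SQL query that returns "
           f"total sales per customer. Reply with SQL only.",
    temperature=0,
    max_tokens=200,
)
sql = completion.choices[0].text.strip()

# Review the generated SQL before running it, then execute it locally:
# the results never leave your system.
conn = sqlite3.connect("local.db")
for row in conn.execute(sql):
    print(row)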

1 Like

If your data/app is in the Azure cloud and you want to use OpenAI models, one way to avoid leaking your private data is to use the Azure OpenAI Service. This way OpenAI does not get to read/use your data.
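
For what it’s worth, pointing the earlier LangChain example at Azure OpenAI is mostly a configuration change. A minimal sketch, assuming you have already created a deployment in your Azure OpenAI resource (the endpoint, key, API version, and deployment name below are placeholders):

import os
from langchain.llms import AzureOpenAI

# Placeholders: use the values from your own Azure OpenAI resource.
os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_BASE"] = "https://<your-resource>.openai.azure.com/"
os.environ["OPENAI_API_VERSION"] = "2023-05-15"
os.environ["OPENAI_API_KEY"] = "<your-azure-openai-key>"

# deployment_name is whatever you named the model deployment in Azure.
llm = AzureOpenAI(deployment_name="text-davinci-003", model_name="text-davinci-003")
print(llm("What is the meaning of life?"))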

Yeah thanks for the help everyone. I ended up going with the Azure OpenAI for privacy reasons, and it’s working great!

1 Like

We do this in many cases, and our favored approach is to insulate our data from the LLM through embeddings. Vectors, after all, are like anonymized pointers. They allow us to understand the intent of the user with respect to the schema and the query.
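
As a sketch of that idea, assuming the embeddings are computed locally with the sentence-transformers library (the model choice and sample documents are illustrative), the raw text stays on your machine and only anonymous vectors get compared:

from sentence_transformers import SentenceTransformer, util

# Runs entirely locally; no document text is sent to any API.
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Reset your password from the account settings page.",
    "Invoices are emailed on the first business day of each month.",
]
doc_vectors = model.encode(docs)  # the "anonymized pointers": just arrays of floats

query_vector = model.encode("how do I change my password?")
scores = util.cos_sim(query_vector, doc_vectors)[0]
best = int(scores.argmax())
print(docs[best])  # only now do you decide what, if anything, to send to an LLM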

2 Likes

That’s a very cool way to describe embeddings.

I’m on the same(ish) page. GPT to translate/filter the problem.

1 Like

Yes. Yes. Yes. Identify the user’s wants and needs first; then take relevant steps to deliver the correct information.

2 Likes

I highly recommend Evervault to anonymise prompts. For example, here is a support ticket before redaction:

{
  "id": 1,
  "subject": "Computer Order Issue",
  "description": "Hi, my name is **Claude** and I placed an order for 3 supercomputers on **June 13th**. The tracking portal says they were delivered to **123 Front Street**, but I never received the shipping containers. Can I please have a refund?",
  "category": " Refund"
}

and the same ticket after Evervault's redaction:


{
  "id": 1,
  "subject": "Computer Order Issue",
  "description": "Hi, my name is **Cole** and I placed an order for 3 supercomputers on **<REDACTED DATE_TIME>**. The tracking portal says they were delivered to **Hettinger berg, Arizona**, but I never received the shipping containers. Can I please have a refund?",
  "category": " Refund"
}
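
If you would rather keep the redaction step in-house, the open-source Microsoft Presidio library gives a similar effect. A rough sketch, doing plain redaction only (Evervault's pseudonym substitution, like Claude becoming Cole above, is not reproduced here, and the sample text is illustrative):

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = ("Hi, my name is Claude and I placed an order for 3 supercomputers "
        "on June 13th. They were delivered to 123 Front Street.")

# Detect PII locally, then replace it before the prompt goes to any API.
results = AnalyzerEngine().analyze(text=text, language="en")
redacted = AnonymizerEngine().anonymize(text=text, analyzer_results=results)
print(redacted.text)  # e.g. "... my name is <PERSON> ... on <DATE_TIME> ... <LOCATION>."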