Hi, I am planning to use the RAG (Retrieval Augmented Generation) approach for developing a Q&A solution with GPT. In this approach, I will convert a private wiki of documents into OpenAI / tiktoken embeddings and store in a vector DB (Pinecone). During prompting, I will retrieve similar documents from the DB, and pass that to the prompt as additional context. Will my documents be exposed to ChatGPT? Or do they stay private? I’d like to avoid leaking my Document data to OpenAI / ChatGPT.
Specifically this langchain python code as an example:
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain

# `pinecone` here is a LangChain Pinecone vector-store instance built earlier
query = "What is the meaning of life?"
docs = pinecone.similarity_search(query, include_metadata=True)
llm = OpenAI(temperature=0, model_name='text-davinci-003')
print(llm(query))  # plain completion, no retrieved context
chain = load_qa_chain(llm, chain_type='stuff')  # 'stuff' packs the docs into the prompt
response = chain.run(input_documents=docs, question=query)
print(response)
Thanks for the info. What approach should be taken to keep it truly private? Use an alternative open source LLM? Is there an API option that you can use to tell OpenAI to not read it?
For your example LangChain usage, data will be sent 1) to OpenAI to create the embeddings, 2) possibly to Pinecone if you include the document content in metadata, and 3) to OpenAI again when the retrieved content is stuffed into the prompt. Per their API policy they don't use it for training and retain it only briefly for abuse monitoring, so it's up to you to decide whether that is sufficient for your data.
OpenAI will not use data submitted by customers via our API to train or improve our models, unless you explicitly decide to share your data with us for this purpose.
Any data sent through the API will be retained for abuse and misuse monitoring purposes for a maximum of 30 days, after which it will be deleted (unless otherwise required by law).
If you don’t want to send the data to a third-party API you’re going to need to find a replacement library for calculating the embeddings, storing the vectors, and a LLM.
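As a toy illustration of what that replacement stack looks like, here is a fully local retrieval loop using nothing but the standard library. The hash-based `embed` function is just a stand-in for a real local embedding model (e.g. sentence-transformers), and the in-memory list stands in for Pinecone; this is a sketch of the shape of the pipeline, not a production solution:

```python
import hashlib
import math

def embed(text: str, dim: int = 256) -> list[float]:
    # Toy hashed bag-of-words embedding -- replace with a real local
    # model (e.g. sentence-transformers) for actual use.
    vec = [0.0] * dim
    for token in text.lower().split():
        h = int(hashlib.md5(token.strip(".,?!").encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already normalized, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

# In-memory "vector DB" standing in for Pinecone -- nothing leaves the process.
docs = [
    "Our VPN requires two-factor authentication.",
    "Expense reports are due by the fifth of each month.",
]
index = [(d, embed(d)) for d in docs]

query = "How do I set up VPN two-factor authentication?"
q = embed(query)
best = max(index, key=lambda item: cosine(q, item[1]))
print(best[0])  # most similar document, retrieved entirely locally
```

The generation step would then feed `best[0]` to a locally hosted LLM instead of the OpenAI API, completing the loop without any third-party calls.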
Check this video (Talk to YOUR DATA without OpenAI APIs: LangChain - YouTube). It presents a solution with free tools, but I'm not sure it will keep your data inside your premises. If you do find a way to do it, please share it.
If you are using a third party as a proxy for API calls into OpenAI, and that third party has decided to opt its data in for API calls (which is possible), then your assumption that API calls don't use your data is wrong.
Any party issuing API calls to OpenAI on your behalf is free to modify the default opt-out policy of OpenAI. They might do it unwittingly or intentionally to cut a deal to lower inferencing costs. You must read their own privacy policy to understand if there is a risk.
I don't know if this will solve your issue, but one way I've found to interact with data using ChatGPT without ever sending anything but table headers is to use ChatGPT only to draft the query, then run it on your system to get the output. Any follow-up questions can be based on the query rather than the output data, meaning the data never leaves your system.
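A minimal sketch of that pattern, with the LLM call stubbed out so it runs offline. The `draft_sql_with_llm` helper and its canned SQL are hypothetical; in a real app that function would send only the schema string and the question to the chat API, never any rows:

```python
import sqlite3

def draft_sql_with_llm(schema: str, question: str) -> str:
    # Hypothetical helper: in a real app this would call the chat API
    # with ONLY the schema text and the question. Stubbed for this sketch.
    return "SELECT name, salary FROM employees WHERE salary > 50000"

# The actual data lives only in a local database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("Ada", 90000), ("Bob", 40000)])

schema = "employees(name TEXT, salary INTEGER)"
sql = draft_sql_with_llm(schema, "Who earns more than 50k?")
rows = conn.execute(sql).fetchall()  # executed locally; rows never leave
print(rows)
```

Only the schema crosses the wire; the query results stay inside your own process.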
If your data/app is in Azure cloud and you want to use OpenAI models, one way not to leak your private data is to use Azure OpenAI Service. This way OpenAI does not get to read/use your data.
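For reference, pointing the `openai` Python library at an Azure OpenAI resource is mostly a configuration change. This is a config fragment with assumed placeholder values (the resource URL, key, and deployment name are yours to fill in from the Azure portal):

```python
import openai

# Assumed placeholders -- substitute your Azure resource's values.
openai.api_type = "azure"
openai.api_base = "https://YOUR-RESOURCE.openai.azure.com/"
openai.api_version = "2023-05-15"
openai.api_key = "..."  # key from the Azure portal, not platform.openai.com

# With Azure you address a *deployment* you created, not a raw model name.
response = openai.Completion.create(
    engine="my-davinci-deployment",  # hypothetical deployment name
    prompt="Hello",
    temperature=0,
)
```

Requests then stay within your Azure tenancy's boundary rather than going to OpenAI's own API endpoint.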
We do this in many cases, and our favored approach is to insulate our data from the LLM through embeddings. Vectors, after all, are like anonymized pointers. They allow us to understand the intent of the user with respect to the schema and the query.
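A toy sketch of that "anonymized pointer" idea: the remote index holds only IDs and vectors, while the plaintext stays in a local store and is only dereferenced on your side. (Note the caveat from earlier in the thread still applies if OpenAI computes the embeddings for you; the dictionaries below are hypothetical stand-ins for Pinecone and an on-premises store.)

```python
remote_index = {}  # stands in for Pinecone: id -> vector only
local_store = {}   # stays on-premises: id -> plaintext

def upsert(doc_id, text, vector):
    remote_index[doc_id] = vector  # the only thing that leaves the building
    local_store[doc_id] = text     # the content never does

def query(vector):
    # Nearest neighbour by dot product over the remote vectors...
    best_id = max(remote_index,
                  key=lambda i: sum(a * b for a, b in zip(remote_index[i], vector)))
    # ...then dereference the "pointer" locally.
    return local_store[best_id]

upsert("doc-1", "salary table for Q3", [1.0, 0.0])
upsert("doc-2", "customer churn report", [0.0, 1.0])
print(query([0.9, 0.1]))
```

The vector service can rank similarity without ever seeing the documents themselves.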