Is it possible to train a model from my own private documents?

Hi there!
I am working on a project where we have tens of thousands of pages of information, and need to do e-Discovery on them.
What I would like to do is use the OpenAI API to send questions, have it review only the documents in my binder, and get back answers.
Is this possible using OpenAI? I don’t want information to come from the vast ocean of all content - just the content in my binder.

Then there are some privacy concerns around confidential information getting leaked…but one question at a time.
Thank you for any insights.

Hi and welcome to the Developer Forum!

Data sent to the API is not used for training, so it will not make its way into any training data; it is retained for up to 30 days for legal compliance and then deleted.

There is a method called RAG (Retrieval-Augmented Generation) that takes advantage of a data storage technique called a vector database. These vector stores hold semantic vectors of text and can be searched by similarity: not on the words themselves, but on the underlying meaning those words carry. This makes them ideal for pulling only the relevant information from a large corpus.
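As a rough sketch of that similarity-search idea in plain Python (the tiny 3-dimensional vectors and document names here are just toy placeholders; real embeddings from the embeddings API have over a thousand dimensions):

```python
import math

def cosine_similarity(a, b):
    # Similarity of two embedding vectors: closer to 1.0 = closer in meaning.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    # Rank stored document chunks by semantic similarity to the query vector.
    scored = [(cosine_similarity(query_vec, v), doc_id)
              for doc_id, v in doc_vecs.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Toy "embeddings" standing in for vectors you would store per chunk of a binder.
docs = {
    "contract_p12": [0.9, 0.1, 0.0],
    "email_p340":   [0.1, 0.8, 0.1],
    "memo_p77":     [0.85, 0.2, 0.05],
}
query = [1.0, 0.0, 0.0]
print(top_k(query, docs))  # -> ['contract_p12', 'memo_p77']
```

A real vector database (or the Assistants file search tooling) does this ranking for you at scale, but the principle is the same: the nearest vectors are the most relevant chunks, regardless of exact word overlap.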

You can then pass that information as context to the AI to answer queries.

The method for storing the data is called embedding; details can be found here:

https://platform.openai.com/docs/guides/embeddings/what-are-embeddings
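Putting the pieces together, here is a minimal sketch of the "pass retrieved chunks as context" step. The prompt-building function is the only part that actually runs below; the commented-out calls show how it would slot into the official OpenAI Python SDK (the model names and chunk text are illustrative examples, not requirements):

```python
def build_prompt(question, retrieved_chunks):
    # Restrict the model to the retrieved binder content only, so answers
    # come from your documents rather than the model's general knowledge.
    context = "\n\n".join(retrieved_chunks)
    return (
        "Answer using ONLY the context below. If the answer is not in the "
        "context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# With the OpenAI Python SDK, the surrounding calls would look roughly like:
#
#   from openai import OpenAI
#   client = OpenAI()
#   q_vec = client.embeddings.create(
#       model="text-embedding-3-small", input=question
#   ).data[0].embedding
#   # ...similarity-search your stored chunk vectors with q_vec...
#   answer = client.chat.completions.create(
#       model="gpt-4o-mini",
#       messages=[{"role": "user", "content": build_prompt(question, chunks)}],
#   ).choices[0].message.content

prompt = build_prompt("Who signed the agreement?",
                      ["Page 12: The agreement was signed by A. Smith."])
print(prompt)
```

The instruction to answer only from the supplied context is what keeps responses grounded in your binder rather than "the vast ocean of all content", though you should still verify outputs for anything legally sensitive.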

The amount of data you have may be too large for the current Assistants API, which is a pre-built system that packages all of the above elements into an easy-to-use API, so some experimentation may be required. Standard vector database setups have effectively unlimited storage, so there are options to suit your use case.

Let me know if you need any more information.