Is it possible to train a model from my own private documents?

Hi there!
I am working on a project where we have tens of thousands of pages of information, and need to do e-Discovery on them.
What I would like to do is use OpenAI API to send questions, review the documents only in my binder, and get back answers.
Is this possible using OpenAI? I don’t want information to come from the vast ocean of all content - just the content in my binder.

Then there are also some privacy concerns around confidential information getting leaked… but one question at a time.
Thank you for any insights.

Hi and welcome to the Developer Forum!

Data sent to the API is not used for training, so it will not make its way into any training data. It is retained for 30 days for legal compliance and then deleted.

There is a method called RAG, or Retrieval-Augmented Generation, which takes advantage of a data storage technique called a vector database. These vector stores hold semantic vectors of text and can be searched by similarity: not on the words themselves, but on the underlying meaning those words carry. This makes them ideal for pulling only the relevant information from a large corpus.

You can then pass that information as context to the AI to answer queries.
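To make the retrieval step concrete, here is a minimal sketch. The document texts and their 3-dimensional vectors are made-up stand-ins; in a real system each chunk's embedding would come from OpenAI's embeddings endpoint and the store would be a proper vector database, but the ranking idea is the same.

```python
import math

def cosine_similarity(a, b):
    """Similarity of two vectors by the angle between them (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# (chunk text, embedding) pairs: a toy stand-in for a vector database.
documents = [
    ("Exhibit A: the contract was signed on 2021-03-05.", [0.9, 0.1, 0.0]),
    ("Deposition of J. Smith regarding the merger.",      [0.1, 0.8, 0.2]),
    ("Invoice #442 for consulting services.",             [0.0, 0.2, 0.9]),
]

def retrieve(query_embedding, top_k=1):
    """Return the top_k chunks most similar to the query embedding."""
    ranked = sorted(documents,
                    key=lambda d: cosine_similarity(query_embedding, d[1]),
                    reverse=True)
    return [text for text, _ in ranked[:top_k]]

# A query whose (made-up) embedding sits closest to the deposition chunk.
context = retrieve([0.2, 0.9, 0.1])
print(context[0])  # → Deposition of J. Smith regarding the merger.
```

The retrieved chunk(s) are then pasted into the prompt as context, which is what keeps the answers grounded in your binder rather than the model's general knowledge.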

The method for turning text into these vectors is called embedding; details can be found here:

https://platform.openai.com/docs/guides/embeddings/what-are-embeddings

The amount of data you have may be too large for the current Assistants offering, which is a pre-built system that bundles all of the above elements into an easy-to-use API, so some experimentation may be required. Standard vector database methods have effectively unlimited storage, so there are options to cover your use case.

Let me know if you need any more information.


Hi Jason,

I encountered the same issue as you. My goal was to supplement ChatGPT’s knowledge with specific documents through fine-tuning, using the API endpoint https://api.openai.com/v1/fine_tuning/jobs. However, the results were unexpected. For example, I trained the model with entries like: “My name is Oibaf Ilegna and I’m a writer. My last novel is this…” Despite this, ChatGPT could not identify ‘Oibaf Ilegna’ after the training.

Here’s an example of my training data:

{
  "messages": [
    {"role": "system", "content": "My first test"},
    {"role": "user", "content": "What's your name?"},
    {"role": "assistant", "content": "My name is Oibaf Ilegna"}
  ]
}
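For reference, the fine-tuning endpoint expects a JSONL file with one such JSON object per line. Below is a hedged sketch of writing and sanity-checking that file before uploading; the file name and the second training example are illustrative additions, not part of the original data.

```python
import json

# Training examples in the chat fine-tuning format (system/user/assistant turns).
examples = [
    {"messages": [
        {"role": "system", "content": "My first test"},
        {"role": "user", "content": "What's your name?"},
        {"role": "assistant", "content": "My name is Oibaf Ilegna"},
    ]},
    # Repeating the same fact in varied phrasings is one way to reinforce it.
    {"messages": [
        {"role": "system", "content": "My first test"},
        {"role": "user", "content": "Who wrote this novel?"},
        {"role": "assistant", "content": "It was written by Oibaf Ilegna."},
    ]},
]

# JSONL: each example serialized onto its own line.
with open("training_data.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# Quick validation pass: every line must parse and carry a "messages" list.
with open("training_data.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        assert isinstance(record["messages"], list)
```

Even with a correctly formatted file, though, the behaviour described below shows why a handful of such examples rarely teaches the model a new fact reliably.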

When asked, ChatGPT responded inaccurately. This illustrates key aspects of fine-tuning with GPT-3.5:

  1. Learning and Memory: GPT-3.5 doesn’t store information traditionally but learns patterns to generate plausible responses. Even with fine-tuning, there’s no retention of specific facts.

  2. Generalization: Despite fine-tuning, models tend to generalize from their extensive initial training, often overlooking specific names or details introduced in limited examples unless heavily emphasized.

  3. Training Data Nature: The content of fine-tuning can affect how well the model recognizes or recalls specific information. Sparse question-answer pairs might not provide enough context for the model to learn specific details reliably.

  4. Probabilistic Responses: Post fine-tuning, responses are still probabilistic based on the training data and not explicit understanding. The model might ignore or contradict inputs from training if not sufficiently reinforced.

Tips to Improve Detail Recognition in Fine-Tuning:

  • Increase Example Volume: Providing more examples featuring the name and relevant information could help make these details more prominent and recognizable.

  • Contextualized Approach: Include the name and details in various sentence types and contexts to aid the model in generalizing the recognition across different queries.

  • Evaluate with Multiple Tests: Testing the model with various prompts related to the fine-tuned information helps assess how it performs under different conditions and identify any confusing patterns or prompt types.

In summary, while fine-tuning can enhance the model’s responses to reflect training data better, it doesn’t ensure consistent recognition of specific details like names or personal information unless they are a strong and repeated focus of the training set.

Considering these outcomes, I’m exploring another approach, such as Retrieval Augmented Generation. Have you tried this method? Let’s keep in touch!

Best,
Fabio

Hi, may I ask a question?

OpenAI has GPTs for users to build their own custom GPT. In its configuration, it allows users to upload files. So when the custom GPT retrieves knowledge from the content of an uploaded file, is RAG performed? Or, in order to perform RAG to enhance my custom GPT, do I need to build another retrieval system?

When the custom GPT retrieves knowledge from the content of an uploaded file, it’s done through OpenAI’s built-in RAG solution. If you’d like to have more control over this behavior, you’ll need to build your own retrieval system :laughing:


OK. Thank you very much for your help.
