Fine-Tune GPT-3.5 for Your Own Knowledge/Facts

We are having a hard time trying to fine-tune ChatGPT on our product’s knowledge. The fine-tuning mimics the exact tone of the answers, but the facts are not correct. We want it to memorize the information we pass in our training data.

Here are three sample entries of our training data:
(Replaced product name with XYZ to post here)

{"messages": [{"role": "system", "content": "You are XYZ Help Assistant and answer queries in context of XYZ."}, {"role": "user", "content": "Is there a MBs limit on import size?"}, {"role": "assistant", "content": "Yes, XYZ only allows imports of upto 10 MBs."}]}
{"messages": [{"role": "system", "content": "You are XYZ Help Assistant and answer queries in context of XYZ."}, {"role": "user", "content": "What is the file size limit on import?"}, {"role": "assistant", "content": "XYZ only allows files sizes of upto 10 MBs."}]}
{"messages": [{"role": "system", "content": "You are XYZ Help Assistant and answer queries in context of XYZ."}, {"role": "user", "content": "Are there limitations on the number of records or file size for imports?"}, {"role": "assistant", "content": "The limit on number of records is 1000 and file sizes of 10 MBs are allowed in XYZ."}]}
....
....
....

Here is an entry in our validation data:

{"messages": [{"role": "system", "content": "You are XYZ Help Assistant and answer queries in context of XYZ."}, {"role": "user", "content": "Are there any restrictions on the volume of records or the size of files I can import?"}, {"role": "assistant", "content": "In XYZ, the import limits include a maximum of 1000 records and a file size cap of 10 MBs."}]}

Here are the results we get:

The system message was the same in all these conversations:

Test1:

    {
      "role": "user",
      "content": "Are there any restrictions on the volume of records or the size of files I can import?"
    },
    {
      "role": "assistant",
      "content": "XYZ only limits file sizes to 10 MBs. There are no specific limitations on the volume of records you can import as long as the file sizes are within this limit."
    }

Test2:

    {
      "role": "user",
      "content": "Is there any restrict on the file size I can import?"
    },
    {
      "role": "assistant",
      "content": "Yes, XYZ allows files sizes of up to 25MBs."
    }

Test3:

    {
      "role": "user",
      "content": "Are there limitations on the number of records or file size for imports?"
    },
    {
      "role": "assistant",
      "content": "XYZ allows imports of up to 100 records at a time, and each file must be less than 1 MB in size. If you have more than 100 records, you would need to split them across multiple files and import them separately."
    }

We can see that there are errors in the information the Chat Completions API returned when we tested our fine-tuned model.

Any clues on how we can make ChatGPT memorize our knowledge and facts/figures?

Any guidance will be appreciated!

Hi there - fine-tuning is not intended to enable knowledge/fact retention; instead, it allows you to train the model to adopt certain behaviors.

Hi! You should look into RAG (retrieval-augmented generation). Also have a look at this thread.
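
A minimal sketch of the RAG idea (the snippets, question, and model names are placeholders; assumes the openai Python package, v1+, and numpy):

    # Minimal RAG sketch: embed the knowledge snippets, retrieve the most
    # similar one for a question, and answer only from that context.
    import numpy as np
    from openai import OpenAI

    client = OpenAI()

    docs = [
        "XYZ imports are limited to 1000 records and a 10 MB file size.",
        "XYZ supports CSV and XLSX import formats.",
    ]

    def embed(texts):
        resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
        return np.array([d.embedding for d in resp.data])

    doc_vecs = embed(docs)

    question = "Is there a file size limit on import?"
    q_vec = embed([question])[0]

    # ada-002 vectors are unit-normalised, so a dot product is cosine similarity
    best = docs[int(np.argmax(doc_vecs @ q_vec))]

    answer = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": f"You are XYZ Help Assistant. Answer only from this context:\n{best}"},
            {"role": "user", "content": question},
        ],
    )
    print(answer.choices[0].message.content)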

Depending on the size of the knowledge base, you can consider:

  1. Sending the data in each chat completion request as the system prompt and using the regular Chat Completions API, or
  2. Using the Assistants API, which supports file upload (see the sketch below).

Fine-tuning is generally not the best fit for new knowledge.
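
A sketch of option 2 (the Assistants API is in beta, so this surface may change; the file name, instructions, and model id are placeholders):

    # Sketch of option 2: Assistants API with file upload (beta; may change).
    from openai import OpenAI

    client = OpenAI()

    # Upload a knowledge-base document for the built-in retrieval tool
    kb_file = client.files.create(file=open("xyz_docs.md", "rb"), purpose="assistants")

    assistant = client.beta.assistants.create(
        name="XYZ Help Assistant",
        instructions="You are XYZ Help Assistant and answer queries in context of XYZ.",
        model="gpt-3.5-turbo-1106",
        tools=[{"type": "retrieval"}],
        file_ids=[kb_file.id],
    )
    print(assistant.id)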

  1. Our knowledge base is pretty big: several documents, each containing around 1000-2000 words.
  2. The Assistants API sounds promising, but it’s in beta.

The embedding technique would be expensive for us to maintain and use to retrieve the documents. Could someone from OpenAI tell us the best practices for fine-tuning models to remember/memorize information?

FYI

I have experienced the same problem with the syntax of a scripting language, and I am thinking of using another solution: a normal sort of database/script to find the piece of knowledge that is currently needed.

And then use the AI just to pick the right part out of it and formulate it nicely.

The problem is that if you give the AI a large bunch of data (like a user manual for the SPR), it currently does not seem able to make anything useful out of it.

Especially if we are talking about exact science.
Of course the AI can act as if it knows about the topic, but the results have no factual foundation.
But if you give the AI just the piece of knowledge the question is about, and ask it to formulate it nicely and rework it, this will work.
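
Something like this is what I have in mind (a rough sketch; the facts table and the naive keyword matching are placeholders for a real database/script):

    # Sketch of "look up first, let the model phrase it": a plain keyword
    # lookup (no embeddings) followed by a rewrite-only request to the model.
    from openai import OpenAI

    client = OpenAI()

    facts = {
        "import": "Limit: 1000 records per import, 10 MB max file size.",
        "export": "Exports are capped at 5000 records per file.",
    }

    def lookup(question: str) -> str:
        # naive keyword match; a real system might use SQL full-text search
        for keyword, fact in facts.items():
            if keyword in question.lower():
                return fact
        return "No matching fact found."

    question = "What limits apply when I import a file?"
    fact = lookup(question)

    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Rephrase the given fact as a friendly answer. Do not add information."},
            {"role": "user", "content": f"Question: {question}\nFact: {fact}"},
        ],
    )
    print(resp.choices[0].message.content)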

In syntax’s tangled web, I too have fought, :spider_web:
To script a language, a vexing thought. :thinking:
A novel solution now I seek, :hammer_and_wrench:
A database normal, robust, not meek. :books:

To harvest knowledge’s ripest fruit, :apple:
The AI’s hand shall deftly suit. :robot:
To pluck, to choose, to articulate, :art:
With elegance, the facts we state. :memo:

Yet when data vast we pour, :ocean:
The AI falters, can’t explore. :no_entry_sign:
In exacting science, we must confide, :microscope:
That precision’s key, with no room to slide. :old_key:

#SyntaxStruggle #ScriptingSaga #AIChallenges #DataDilemma #ExactScience