Fine-tuning myths / OpenAI documentation

A very common use case for GPT involves question answering with external data. Wherever you look, people inquire about the best way to do this. Alongside those inquiries are heated arguments about whether or not fine-tuning is a viable option for this use case. And, if it is, then why are all of the services that offer question answering on custom data based on retrieval augmented generation rather than fine-tuning?

People like David Shapiro are adamant that fine-tuning cannot be used to reliably add knowledge to a model. At around 2:20 in this video he begins his assertions about this.

Then there are people claiming to have successfully added data to OpenAI models through fine-tuning, such as in this Tutorial from this forum.

Unfortunately, the OpenAI documentation on fine-tuning is not very helpful here. The explanation of what fine-tuning can accomplish includes “Ability to train on more examples than can fit in a prompt”, but this is vague. While I feel the statement refers to few-shot examples for pattern-conditioning tasks like classification, I’ve seen others take it to mean examples of factual statements.

Can someone from OpenAI chime in here with an authoritative answer as to whether fine-tuning can be used to reliably add knowledge to a model? And, can the documentation be updated with a clear answer to this question?

EDIT: my hypothesis is that fine-tuning will not reliably assimilate knowledge into any OpenAI LLMs. More specifically, fine-tuning is categorically not a viable way to incorporate new knowledge to enable question answering on data that the model was not trained on. You may be able to hammer on a model and get it to spit out a couple of correct answers, but you will not be able to get results that are anywhere near as reliable as if you used retrieval augmented generation (which, itself, is far from perfect).


It’s rare that OpenAI employees respond to these forum posts, so I’ll chime in.

Just because you could do something doesn’t mean you should.

Plus it depends on your use-case. Your post is too broad to recommend one solution or the other. Both have their merits. But it depends on exactly what you’re trying to achieve.

Lastly, a sufficiently complex solution will likely use both embeddings and fine-tuning. Sure, many solutions right now use one or the other, but those are the low-hanging fruit. Very soon people will expect more and you’ll have to design systems that leverage the capabilities of both techniques.

Thanks for the reply.

Well, I linked to two specific examples of what people were trying to achieve. Below are the links I provided with a relevant excerpt for each. This would seem to be enough information to provide specific guidance as to whether fine-tuning is viable for the use case. If it’s only viable under certain refinements of those examples, then what might such refinements consist of?

  1. Can this api be used to query internal data?
    I would like users in my company to be able to consult the thousands of pdf, doc and xls files that have been generated for more than 30 years, making offers through a chat bot and through natural language and I don’t know if this is possible .

  2. Reddit - Dive into anything
    Hey guys, I want to train any LLM on my company’s data we have stored in Azure and Snowflake
    It’s all in tabular form, and I was wondering how can I train an LLM on the data, and be able to ask it questions about it. No computations required from the model, but at least be able to tell answer questions such as: What was Apple’s return compared to it’s sector last month ( we have financial data)


The OpenAI documentation itself recommends using embeddings instead of fine-tuning for knowledge. This is straight from OpenAI. The documentation has stated this for a while, which is why we are adamant about it.

Fine-tuning is not knowledge.

I can confidently say that people who believe it is tend to end up with fine-tuned models that spit out absolute nonsense.

You can find the message (from OpenAI) below in their cookbook.

Note: To answer questions based on text documents, we recommend the procedure in Question Answering using Embeddings.

Not only is fine-tuning much more expensive, you are also using a much lesser model than the one you are used to.


That’s my sense as well. But as you can see from @wfhbrian’s response, some people feel it is not this cut and dried. Whichever position you take (i.e. fine-tuning is viable for adding new knowledge vs. fine-tuning does not reliably add new knowledge vs. it depends), you can find support for that position in one or another forum post / article / video.

This seems like a really important thing to pin down in a much more cogent fashion than I’ve seen.

It’s very easy to say “not in all situations” as a catch-all. However, in the case of adding knowledge, embeddings are the way to go. Of course, eventually I would think that any person would say “Okay, now I can use fine-tuning with my embeddings for better results”, which is what I do. However, this is not in the sense of sharing the load, but as a completely separate refinement step that works alongside embeddings. It’s like having an assembly line. I use a separate (ada) model for classification so that my embeddings come back more accurate.

  • Embeddings are much cheaper
    If you are attempting to augment your chatbot with knowledge through fine-tuning, you are probably going to use Davinci, as it’s the best possible model. The lesser models aren’t very communicative. Not only will the model still have less ability than GPT3.5 and GPT4, it will cost over 10x more than simply using GPT3.5 with embeddings.

To repeat, you’re using a lesser model, at over 10x the price point.

  • The model you are using is a much lesser version.
    If you want a communicative chatbot, you want GPT3.5 (as of today) or 4. If you want to see the obvious differences between GPT3.5 and Davinci (not text-davinci-003, Davinci), try it yourself in the playground and see how different they are.

  • Information can’t be modified
    If you fine-tune your model you are left with knowledge that cannot be selectively “un-fine-tuned”. You would either need some seriously repetitive checkpoints, or you would need to re-train the model with the changed information. With embeddings, it takes 5 seconds.

  • Fine-tuning is not “Knowledge in, knowledge absorbed”
    There seems to be this idea that you fine-tune knowledge, and it’s learned. Which is why most people fail at fine-tuning their model. While the model may learn some factual information during fine-tuning, its primary focus is on learning the structure, style, and patterns in the data. This can cause it to corrupt easily.

These are just a couple of obvious points.
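To make the “information can’t be modified” point concrete, here is a minimal sketch of why changing a fact is cheap with embeddings. Everything in it is an assumption for illustration: `toy_embed` is a word-count stand-in for a real embedding model, and `FactStore` is a hypothetical in-memory store, not any particular vector database.

```python
import re
from collections import Counter


def toy_embed(text):
    # Hypothetical stand-in for a real embedding model call:
    # a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


class FactStore:
    """Hypothetical in-memory store: one (text, vector) row per fact id."""

    def __init__(self):
        self._rows = {}

    def upsert(self, fact_id, text):
        # Changing a fact means re-embedding one row; nothing is retrained.
        self._rows[fact_id] = (text, toy_embed(text))

    def delete(self, fact_id):
        self._rows.pop(fact_id, None)

    def search(self, question):
        # Return the stored text sharing the most words with the question.
        if not self._rows:
            return None
        q = toy_embed(question)

        def overlap(row):
            _, vec = row
            return sum(min(q[t], vec[t]) for t in q)

        return max(self._rows.values(), key=overlap)[0]


store = FactStore()
store.upsert("price", "The basic plan costs 10 dollars per month.")
store.upsert("hours", "Support is available on weekdays.")

before = store.search("What does the basic plan cost?")
store.upsert("price", "The basic plan costs 12 dollars per month.")
after = store.search("What does the basic plan cost?")
```

The same question returns the old fact before the `upsert` and the new one after it. With a fine-tuned model, the equivalent change would mean another training run.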

Regardless, you asked for OpenAI’s statement, and it’s there, clear as day in their documentation.

Where? Are you referring to the recommendation of using RAG for question answering? If so, I don’t think that’s at all dispositive. It provides a solution to a use case, but it doesn’t state that the alternative (i.e. fine-tuning) is not viable for that use case.

Literally saying “To answer questions based on text documents, we recommend the procedure of using embeddings” in a tutorial on using fine-tuning for knowledge is not satisfactory for you?

Just look at their cookbook. All the tutorials for “knowledge by fine-tuning” directly say: “use embeddings”

Can you seriously justify spending over 10x to use a much lesser model just to use fine-tuning for knowledge?

I believe you want it to be done by fine-tuning more than you are looking for an answer. Not even a devil’s advocate could ignore this.

Another one

Note: To answer questions based on text documents, we recommend the procedure in […]

These are written by OpenAI.

Quite the opposite. I want to establish once and for all that it can NOT be done through fine-tuning.

Like @RonaldGRuckus said, OpenAI themselves add knowledge with embeddings, not fine-tunes! In particular: run a semantic search with embeddings, stuff the prompt with the retrieved information, and ask GPT to use it as context when answering the question.
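A rough sketch of that search-then-stuff flow, for anyone who wants to see the shape of it. This is not OpenAI's implementation: `toy_embed` is a bag-of-words stand-in for a real embedding model, the sample chunks are made up, and the prompt wording is just one plausible choice.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    # Assumption: stands in for a real embedding model call.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=2):
    # Semantic search step: rank chunks by similarity to the question.
    q = toy_embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question, chunks, k=2):
    # Prompt-stuffing step: put the top-k chunks in front of the question.
    context = "\n".join(f"- {c}" for c in retrieve(question, chunks, k))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up example data.
CHUNKS = [
    "The warranty on the X100 pump lasts 24 months.",
    "Offers older than 90 days must be re-approved by sales.",
    "The office cafeteria opens at 8 am.",
]

QUESTION = "How long is the warranty on the X100 pump?"
prompt = build_prompt(QUESTION, CHUNKS, k=1)
```

The resulting `prompt` string is what you would send to the chat or completion model. The model never has to "know" the warranty fact; it only has to read it from the context.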

NOW, however, we have seen GPT answer questions via fine-tunes if, when you train it, you set your epochs really high, at least 16 (up from the default 4). You can certainly try that too! It is possible, but then you are locked into a fine-tune that you can’t easily add knowledge to, assuming your knowledge changes.

You can do it with fine-tunes! But it’s not optimal.

Try both! But I, and most folks, prefer the embedding route, even though it is more work.

Here is the Mega-Thread on high-epoch fine-tunes, good luck, and don’t overfit!

The points should be enough to justify it. It costs much more, is less effective, it uses a lesser model, and it isn’t mutable like embeddings are. You can do it, but why would you?

I should clarify why this topic is so important to me. I’ve started to invest a lot of time developing my ability to implement RAG systems for question answering. I am also thinking of productizing some of what I’ve built. If it turns out, however, that fine-tuning is viable for question answering and OpenAI releases the ability to fine-tune the recent models, then I fear I will have done much of this for no reason.

Assuming they do, what benefits do you find in fine-tuning over embeddings?

@wfhbrian Thanks for the clarification. I read it that way as well.

In case my first post wasn’t clear, I completely agree with what @RonaldGRuckus is saying. Fine-tuning should not be used for recalling facts.

The main challenge/shortcoming I see with embeddings is that the final output is only as good as your ability to pull the relevant chunks during the IR step. For certain types of questions (e.g. describe the arc of the author’s narrative from where he begins to where he ends up), it can be quite difficult to pull all relevant chunks. Recursive summarization and other techniques (as well as the growing size of context windows) can be helpful here, but you still risk missing the requisite knowledge/data to answer the question.

By comparison, if the knowledge is absorbed into the model, you have the full force of the network’s inference to access everything relevant. I know the inference is far from perfect (and can lead to outright hallucinations), but it seems like getting the model to absorb new knowledge is the holy grail of question answering.

I did a fine-tune on a 70K-word book. My initial expectation was to get the desired QA, and at that point I didn’t know any better. But this fine-tune showed me the limits of this approach. It just learned the style and stayed more or less within the corpus, but hallucinated a lot.

Then I split the book into sentences, worked my way through embeddings, and now I have a very decent QA system for the book, but for narrow questions. It is not as good for questions that need the context of the entire book.
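For reference, the sentence-splitting step can be as simple as the sketch below. The regex rule is a naive assumption (it will stumble on abbreviations like "Dr."), and the `windows` helper is an optional variation that is not part of what was described above: it groups neighboring sentences so each chunk carries a little more context.

```python
import re


def split_sentences(text):
    # Naive rule: break after ., ! or ? when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def windows(sentences, size=3, stride=1):
    # Optional variation: overlapping groups of consecutive sentences.
    if not sentences:
        return []
    if len(sentences) <= size:
        return [" ".join(sentences)]
    return [
        " ".join(sentences[i:i + size])
        for i in range(0, len(sentences) - size + 1, stride)
    ]


sample = "A b. C d! E f? G."
sentences = split_sentences(sample)
chunks = windows(sentences, size=2)
```

Each sentence (or window) then gets its own embedding. Larger windows may help a bit with questions that span several sentences, but they do not solve the whole-book case.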

I am confident that the future will bring the grail we want.

That’s very cool. Would you mind sharing some of your prompts that you used for training?

For the fine-tuning this is what I did:

For each sentence, I would have gpt-4 produce 3 questions and I would create 3 JSONL entries with these:

{"prompt": "{question}\n\n###\n\n", "completion": "{sentence} ENDEND"}

This created a JSONL file of about 9K entries, and I would fine-tune with it.

To create the questions, that’s a simple completion with a simple prompt in my case:

prompt: “Please create 3 questions based on the following sentence: {sentence}”

No magic :)
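In code, the recipe above looks roughly like this. It is a sketch and not the exact script that was used: `generate_questions` is a hypothetical callable standing in for the gpt-4 call, and `fake_generate_questions` only exists so the example runs without an API key.

```python
import json

SEPARATOR = "\n\n###\n\n"  # marks the end of the prompt
STOP = " ENDEND"           # marks the end of the completion

QUESTION_PROMPT = (
    "Please create 3 questions based on the following sentence: {sentence}"
)


def make_entries(sentence, questions):
    # One training entry per question; every entry completes with the sentence.
    return [
        {"prompt": q + SEPARATOR, "completion": sentence + STOP}
        for q in questions
    ]


def to_jsonl(sentences, generate_questions):
    # generate_questions is a hypothetical callable: prompt text -> list of questions.
    lines = []
    for sentence in sentences:
        questions = generate_questions(QUESTION_PROMPT.format(sentence=sentence))
        for entry in make_entries(sentence, questions):
            lines.append(json.dumps(entry))
    return "\n".join(lines)


def fake_generate_questions(prompt):
    # Placeholder for the model call; a real run would parse the model's reply.
    return ["Q1?", "Q2?", "Q3?"]


jsonl_text = to_jsonl(["The sky is blue.", "Water is wet."], fake_generate_questions)
```

Two sentences give six lines here, so roughly 3K sentences gives the 9K entries mentioned above.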


Possibly, but @curt.kennedy drives home a key point. I don’t want to speak for him, but I took away the idea that the agility to support rapid change in facts through an embedding approach [in most cases] is vastly preferred over fine-tuning which is akin to pouring concrete. Sorry for the metaphors, but I get the sense that the concrete is the LLM; embeddings allow us to build cool stuff on that concrete.

If by RAG you mean Retrieval Augmented Generation, then go with embeddings. Embedding-based retrieval is what RAG systems typically use!

Fine-tunes act more like filters in the traditional digital signal processing sense. If your filter is narrowband, we call this a classifier. If it is broadband, we call this a personality or tone matcher.

But, as in DSP, filters are not a source of information, they can only shape it. So you need an external source of information generation. Currently this comes from the prompt and LLM. Hence embeddings.

As the window size gets bigger and summary tricks improve, you could get the LLM to follow the story arc and appear to “absorb” the book. When you look at how these LLMs are trained, they are trained on books, but their objective is to impersonate, to become a sensible stochastic parrot. Luckily, the “fake it till you make it” objective is all that is required for a minimum viable product.

But there are other model architectures that aren’t just good babblers. When you link facts and truths in a more structured graph manner, for example, you get a much more logical and sophisticated AI. To me, this is the holy grail!