Fine-tuned model unable to answer prompts from training data

Hello,

I’ve followed the tutorial about fine-tuning and was surprised that the model (based on davinci) could not correctly answer the questions present in the jsonl file.

For instance, the jsonl file contains:
{"prompt": "What is the color of the XXX heat pump ?", "completion": " Blue\n"}

But the response given by the model to the very same question is “It is black”.

It seems that the model is drawing on outside knowledge. How can it be made to prioritize the training data?

Regards,
Kevin

Fine-tuning teaches the model how to answer.

Embeddings provide a way for models to access information outside their initial training data.
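
To make that concrete, here is a minimal sketch of the embeddings approach, assuming the legacy (pre-1.0) openai Python SDK, an OPENAI_API_KEY in the environment, and a small placeholder corpus (document contents and model names are illustrative, not from this thread):

```python
import numpy as np
import openai

# Placeholder corpus of domain snippets; in practice these would be
# chunks of your heat-pump documentation.
documents = [
    "The XXX heat pump has a blue casing and a 5 kW output.",
    "The YYY heat pump has a black casing and uses R32 refrigerant.",
]

def embed(text):
    # Legacy (pre-1.0) SDK call; reads OPENAI_API_KEY from the environment.
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(resp["data"][0]["embedding"])

doc_vectors = [embed(d) for d in documents]

question = "What is the color of the XXX heat pump ?"
q_vec = embed(question)

# Pick the document closest to the question (cosine similarity).
scores = [v @ q_vec / (np.linalg.norm(v) * np.linalg.norm(q_vec)) for v in doc_vectors]
context = documents[int(np.argmax(scores))]

# Answer from the retrieved context instead of relying on fine-tuned weights.
answer = openai.Completion.create(
    model="davinci",
    prompt=f"Context: {context}\n\nQuestion: {question}\nAnswer:",
    max_tokens=20,
    temperature=0,
    stop=["\n"],
)["choices"][0]["text"]
print(answer)
```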

Hi Kevin, how many epochs did you use? I found in some cases that if I increase the epochs, the model starts referring to the training completions.

Are you alluding to the “embeddings” approach, where the prompt is matched against different corpora of text? The problem with that approach is that the answer to the prompt has to be found within the selected corpus, while some prompts may span several different corpora.

I only ran the tutorial, and the number of epochs was 4 by default. I’ll check how it can be modified.
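
For reference, with the legacy fine-tunes API the epoch count is controlled by the n_epochs hyperparameter; a minimal sketch assuming the pre-1.0 openai Python SDK (the file name is a placeholder):

```python
import openai

# Upload the training file (placeholder name), then launch a fine-tune
# with an explicit epoch count instead of the default of 4.
train_file = openai.File.create(
    file=open("heat_pump_qa.jsonl", "rb"),
    purpose="fine-tune",
)

openai.FineTune.create(
    training_file=train_file["id"],
    model="davinci",
    n_epochs=50,
)
```

The CLI used in the tutorial exposes the same setting, as an --n_epochs flag if I recall correctly.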

How are you training your data? i.e. how are you training the model to know that the XXX heat pump is blue?
If it’s just from the Q&A samples alone, it probably won’t help - as @elmtedt noted, you’re essentially just training the model to respond a certain way rather than “understand” why the XXX heat pump is blue.

If you want to fine-tune the model to understand different heat pumps and their properties (colors, etc) I think there are two options:

  1. If the data you want to train on is concise enough, put it into each training sample prompt. For example, you could start each prompt with a list of heat pumps and their properties. Then the Q&A training would be able to refer directly to that data and associate the answers with it. The downside to this approach is that the latest fine-tunable models (e.g. davinci) only allow up to 2048 tokens for prompt+completion, which may not be enough to fit all of your domain knowledge.
  2. If you need to draw from a larger corpus of text about the subject matter (heat pumps), then you should try the sequential fine-tuning approach outlined in OpenAI’s draft guide. That is, first train with unstructured documents that describe your domain knowledge about heat pumps. Then run another fine-tune job on that pre-trained model with the Q&A pairs. This should “teach” the model to associate the XXX heat pump in the questions with the XXX heat pump it learned about from the unstructured data. (Both sample formats are sketched below.)
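
To make the two options concrete, the training samples could look roughly like this (a hedged sketch; the heat-pump facts are placeholders):

Option 1, domain facts embedded in each prompt:
{"prompt": "Heat pump properties:\n- XXX: blue, 5 kW\n- YYY: black, 8 kW\n\nWhat is the color of the XXX heat pump ?", "completion": " Blue\n"}

Option 2, a first pass over unstructured documents (empty prompt), followed by a second fine-tune with the Q&A pairs:
{"prompt": "", "completion": " The XXX heat pump is a residential unit with a blue casing and a 5 kW output.\n"}
{"prompt": "What is the color of the XXX heat pump ?", "completion": " Blue\n"}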

I just followed the tutorial about fine-tuning models, so I just used a jsonl file as explained in my post. My intent was not to make the model understand why the heat pump is blue, just to have it learn by heart that it is blue. Following @joyasree78’s advice, I increased the number of epochs to 50, and the model now gives the right answers, followed by gobbledygook. I still need to investigate that.
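
On the gobbledygook: the training completions end with "\n", and the same stop sequence normally has to be passed at inference time so generation halts after the answer. A minimal sketch with the pre-1.0 openai Python SDK (the fine-tuned model ID is a placeholder):

```python
import openai

response = openai.Completion.create(
    model="davinci:ft-your-org-2023-01-01-00-00-00",  # placeholder fine-tuned model ID
    prompt="What is the color of the XXX heat pump ?",
    max_tokens=10,
    temperature=0,
    stop=["\n"],  # matches the "\n" that ends every training completion
)
print(response["choices"][0]["text"])
```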

The sequential fine-tuning approach looks interesting; I guess that learning from unstructured documents would allow the model to pick up the business language. But how can you train a model on such data? I haven’t found any section about model training in the OpenAI guides.

Look into Hypothetical Document Embeddings (HyDE).

I don’t see how that proposal (generating a hallucinated document to generate an embedding to then do embedding based retrieval) would help with the original problem of “fine tuning a few steps doesn’t losslessly encode knowledge into the model.” It looks to me like a totally different mechanism, solving a totally different problem?
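
For readers who haven’t seen it, the core of HyDE is just the following (a hedged sketch, pre-1.0 openai SDK; the resulting vector would replace the question embedding in an ordinary retrieval pipeline like the one sketched earlier in the thread):

```python
import openai

question = "What is the color of the XXX heat pump ?"

# HyDE step 1: let the model write a plausible (possibly hallucinated)
# passage that answers the question.
hypothetical = openai.Completion.create(
    model="davinci",
    prompt=f"Write a short passage answering the question: {question}\n\nPassage:",
    max_tokens=100,
)["choices"][0]["text"]

# HyDE step 2: embed the hallucinated passage instead of the raw question,
# and use this vector for the nearest-neighbour search over the real corpus.
hyde_vector = openai.Embedding.create(
    model="text-embedding-ada-002",
    input=hypothetical,
)["data"][0]["embedding"]
```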

Yes, if you increase the epochs it will increase the accuracy of the model. Make sure the number of epochs is above 8.
