Should prompts be unique for fine-tuning?

I created a fine-tuned Davinci model with 508 prompt/completion pairs. It was based on synthetic data I created using the Davinci model.

However, I find the quality of outputs is worse with my fine-tuned model than with the regular Davinci model.

I have a lot of duplicate prompts in my prompt/completion pairs (not duplicate completions).

There are 13 different prompts, all with different completions.

Would this be causing it to perform worse?
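One way to quantify the duplication before retraining is to count how often each distinct prompt appears in the training file. The following is a minimal sketch, assuming JSONL records with `prompt` and `completion` keys like the examples in this thread; the function name `prompt_counts` and the sample records are hypothetical, made up for illustration.

```python
import json
from collections import Counter

def prompt_counts(jsonl_lines):
    """Count how many times each prompt appears in an iterable of JSONL lines."""
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        counts[record["prompt"]] += 1
    return counts

# Hypothetical sample records, not the real training data.
sample = [
    '{"prompt":"Input: A\\nOutput:","completion":" One END"}',
    '{"prompt":"Input: A\\nOutput:","completion":" Two END"}',
    '{"prompt":"Input: B\\nOutput:","completion":" Three END"}',
]
counts = prompt_counts(sample)
print(len(counts), "unique prompts across", sum(counts.values()), "pairs")
```

To run it on a real file, pass an open file handle instead of the list, since iterating over a file yields one line at a time.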

The demarcation (separator) token you use could be degrading performance. Even whether or not you put a space after the separator can cause strange changes in output quality.

Here are a couple of example prompt/completion pairs. Would you suggest I rerun the training with any changes to the prompt/completion structure?

{"prompt":"Input: An ebook on how to use CRM to manage your team more effectively\nOutput:","completion":" From Chaos to Customer Management in 30 Days or Less END"}
{"prompt":"Input: An ebook on how to use CRM to manage your team more effectively\nOutput:","completion":" You're not managing your team effectively if you're not using CRM END"}

I’ve seen performance suffer when the same prompt is included on every line.
Put the shared prompt in the normal request instead, and fine-tune only on the unique parts.
But don’t take my word for it, since I’m still quite a novice. Listen to @daveshapautomator instead. He’s great!

EDIT: Maybe try generating hundreds of variations of the phrase “An ebook on how to use CRM to manage your team more effectively” in the Playground, then use those as the Input prompts for fine-tuning.

But from my perspective, this doesn’t seem like a task that needs fine-tuning. Coming up with catchphrases is already something Davinci does extremely well. Fine-tuning just clamps the creativity down (the output becomes deterministic) unless you have thousands of unique, curated examples.

Thanks for the ideas @daveshapautomator & @Fusseldieb.

I ended up rerunning the same training with some variations to the epochs and learning rate multiplier, and the results are much better now.

Interesting!

Did you use fewer epochs?
I’ve never tried fiddling with the learning rate multiplier either. What did it do, and which config did you use (and for how many examples)?

Why did you add a space in front of the completion results? Just curious.

Yes, I tried it with fewer epochs. Total prompt/completion pairs: 508.

The default is 4 epochs, which didn’t give me great results. I reran it with 1 epoch and with 2 epochs. Both gave better results than 4 epochs for creative text, and 1 epoch was slightly better than 2.

I also tried adjusting the prompt learning rate, but I don’t have anything definitive from that. I only tried setting it manually to 0.02 and 0.05. Both produced good results.
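One possible reading of why fewer epochs helped (an interpretation, not something confirmed in this thread) is that with only 13 distinct prompts across 508 pairs, each prompt is already repeated many times within a single epoch. The back-of-the-envelope sketch below assumes the pairs are spread evenly across prompts, which is not stated above; `exposures_per_prompt` is a hypothetical helper name.

```python
def exposures_per_prompt(n_pairs, n_unique_prompts, n_epochs):
    """Average number of times each distinct prompt is seen during training.

    Assumes pairs are spread evenly across prompts (an illustrative assumption).
    """
    return n_pairs * n_epochs / n_unique_prompts

# 508 pairs and 13 distinct prompts are the figures given in this thread.
for epochs in (1, 2, 4):
    print(epochs, "epoch(s):", round(exposures_per_prompt(508, 13, epochs), 1))
```

Under that assumption, the default of 4 epochs shows the model each prompt four times as often as a single epoch does, which would be consistent with the more repetitive output reported at the default setting.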

I ran the Python fine-tuning data preparation tool on my dataset, and it recommended adding a space.

It has to do with how the tokenizer handles words that are preceded by whitespace.

Here is more info from the docs: OpenAI API
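The conventions visible in the example pairs in this thread (prompt ending in `\nOutput:`, completion starting with a space and ending with ` END`) can be checked mechanically before training. This is only a rough sketch: `check_pair` is a hypothetical helper, not part of the OpenAI tooling, and the suffix and stop values are simply read off the examples.

```python
import json

def check_pair(record, prompt_suffix="\nOutput:", stop=" END"):
    """Return a list of formatting problems for one prompt/completion record."""
    problems = []
    if not record["prompt"].endswith(prompt_suffix):
        problems.append("prompt does not end with the separator")
    if not record["completion"].startswith(" "):
        problems.append("completion does not start with a space")
    if not record["completion"].endswith(stop):
        problems.append("completion does not end with the stop sequence")
    return problems

# Hypothetical records: one well-formed, one missing the space and the stop sequence.
good = json.loads('{"prompt":"Input: A\\nOutput:","completion":" One END"}')
bad = json.loads('{"prompt":"Input: A\\nOutput:","completion":"One"}')
print(check_pair(good))
print(check_pair(bad))
```

An empty list means the record follows all three conventions; anything else names what to fix.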
