Fine-tuning and worse results than base Davinci

I have a file with a few thousand reviews that we would like to use in our own internal chatbot.
When I created the jsonl file to fine-tune the model, I used something like this for the prompts:
“This review was [positive|negative]. Written in December 2022. What do you think of [FeatureX]?”
[FeatureX] was extracted from the review text itself, to try and make the prompt match the review. For example, if the review mentioned anything to do with “music”, the prompt would be “What do you think of the music?”

The completion part was the review itself.
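
For reference, each line of the jsonl looked roughly like this (the review text here is made up for illustration):

```json
{"prompt": "This review was positive. Written in December 2022. What do you think of the music?", "completion": " The music was great. Every track fit the mood, and the main theme stuck with me for days."}
```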

Now, there are some issues with the model:

  1. Compared to base davinci, my model returns very short sentences, despite some of the reviews being paragraphs long.
  2. If I ask my model to list 5 complaints about the music, it doesn’t know how to handle it and just spews back a line or two from the reviews. In general, it seems to perform worse than the base model.

Did I do something wrong during the fine-tuning process? Are the prompts causing the model to over-fit?
Is there anything I can do to remedy these?

What model parameters are you passing when you make the call to the completions API now? There are likely some things you can do to tune it there.

Without seeing the data, it is hard to know if it is over-fitting or not. My general suggestion tends to be to start with a small set of data (~25 examples) and see how the model behaves as you increase the number of examples in the set.
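
For example, you could carve a small trial file out of the full jsonl like this (file names are just placeholders):

```python
# Copy the first 25 examples from the full training file into a trial file.
with open("reviews_full.jsonl") as src, open("reviews_trial.jsonl", "w") as dst:
    for i, line in enumerate(src):
        if i >= 25:
            break
        dst.write(line)
```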

Hey Logan,

The parameters I’m setting are:
temperature: between 0.2 and 0.6 (doesn’t seem to make much difference, honestly)
max_tokens: 500
frequency_penalty: 0.75
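
Here is roughly the call I’m making, in case it helps (API key and model name are placeholders):

```python
import openai  # legacy 0.x Python client, from the davinci fine-tuning era

openai.api_key = "sk-..."  # set via an environment variable in practice

response = openai.Completion.create(
    model="davinci:ft-personal-2023-01-01",  # placeholder fine-tuned model name
    prompt="This review was positive. Written in December 2022. What do you think of the music?",
    temperature=0.4,         # I have tried values between 0.2 and 0.6
    max_tokens=500,
    frequency_penalty=0.75,
)
print(response["choices"][0]["text"])
```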

Do you have any pointers on adjusting these for better results?

That frequency penalty is likely your problem… or part of it.

Set it to 0.02, or maybe even 0, and give it a try.

As Logan said, it’s also about how you formatted your training data and what prompt you’re using.

Good luck!


If the penalty is set very low, then I get a lot of repeated sentences that go on forever.

EDIT: Setting it to 0.1 improves it considerably though :)


Yeah, it’s a balancing act for sure.

All those repeated sentences at a low freq_penalty might mean you’re overfitting, though. A low temperature (< 0.7) will cause repetition too, so you might try raising that rather than the freq_penalty.
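
Putting that together, something like this might be a better starting point (same placeholder model name as before, and the values are just a suggestion to iterate from):

```python
import openai  # legacy 0.x Python client

openai.api_key = "sk-..."

response = openai.Completion.create(
    model="davinci:ft-personal-2023-01-01",  # placeholder fine-tuned model name
    prompt="This review was positive. Written in December 2022. What do you think of the music?",
    temperature=0.7,        # raised to cut down on repetition...
    max_tokens=500,
    frequency_penalty=0.1,  # ...instead of leaning on a high frequency penalty
)
```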

Hey Logan - what’s the best source/aggregation for these practical fine-tuning tips? Advice and conversations on tweaking the params?

The fine-tuning guide in the OpenAI API docs is the best right now, but it will likely be updated in the coming weeks.
