Fine-tuning and worse results than base Davinci

I have a file with a few thousand reviews that we would like to use in our own internal chatbot.
When I created the jsonl file to fine-tune the model, I used something similar for the prompts:
“This review was [positive|negative]. Written in December 2022. What do you think of [FeatureX]?”
[FeatureX] was extracted from the review text itself, to try and make the prompt match the review. For example, if the review mentioned anything to do with “music”, the prompt would be “What do you think of the music?”

The completion part was the review itself.
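For readers following along, here is a minimal sketch of how prompt/completion pairs like these could be written out as JSONL. The helper names (build_record, to_jsonl) and the sample review text are invented for illustration, and the leading space on the completion is an assumption based on common advice for the legacy prompt/completion format, not something the original poster confirmed doing.

```python
import json

def build_record(sentiment, month_year, feature, review_text):
    """Return one prompt/completion pair in the layout described above."""
    prompt = (
        f"This review was {sentiment}. Written in {month_year}. "
        f"What do you think of {feature}?"
    )
    # Assumption: a leading space on the completion, as commonly advised
    # for the legacy prompt/completion fine-tuning format.
    return {"prompt": prompt, "completion": " " + review_text.strip()}

def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

# Hypothetical review text, used only to show the shape of a record.
example = build_record(
    "negative", "December 2022", "the music",
    "The soundtrack loops every thirty seconds.",
)
print(to_jsonl([example]))
```

Note that every prompt built this way has the same shape, which may be part of why the model struggles with differently phrased requests such as "list 5 complaints".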

Now, there are some issues with the model:

  1. Compared to base davinci, my model returns very short sentences, despite some of the reviews being paragraphs long.
  2. If I ask my model to list me 5 complaints about the music, it doesn’t know how to handle it and just spews back a line or two from the reviews. In general it seems to perform worse than the base model.

Did I do something wrong during the fine-tuning process? Are the prompts causing the model to overfit?
Is there anything I can do to remedy these?

What model parameters are you passing when you make the call to the completions API now? There are likely some things you can do to tune it there.

Without seeing the data, it is hard to know whether it is overfitting or not. My general suggestion tends to be to start with a small set of data (~25 examples) and see how the model behaves as you increase the number of examples in the set.
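One way to try the "start small and grow" suggestion is to cut the dataset into nested subsets and fine-tune on each in turn. The sketch below is only an illustration: the function name growing_subsets, the doubling schedule, and the fixed seed are assumptions, and only the starting size of 25 comes from the post above.

```python
import random

def growing_subsets(examples, start=25, factor=2, seed=0):
    """Yield nested training subsets of increasing size."""
    shuffled = list(examples)
    # A fixed seed keeps the subsets nested and reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    size = start
    while size < len(shuffled):
        yield shuffled[:size]
        size *= factor  # assumed schedule: double each round
    yield shuffled  # the final round uses the full dataset

# With 3000 stand-in examples, print the subset sizes for each round.
sizes = [len(subset) for subset in growing_subsets(range(3000))]
print(sizes)
```

Because each subset contains the previous one, any change in behavior between rounds can be attributed to the added examples rather than to a different sample.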

Hey Logan,

The parameters I’m setting are:
Temperature: Between 0.2 and 0.6 (doesn’t seem to make much difference honestly)
max_tokens: 500,
Frequency Penalty: 0.75

Do you have any pointers on adjusting these for better results?

That’s likely your problem… or part of it.

Set the frequency penalty to .02 or maybe even 0… and give it a try.

As Logan said, it’s also about how you formatted your training data and what prompt you’re using.

Good luck!


If the penalty is set very low, then I get a lot of repeated sentences that go on forever.

EDIT: Setting it to 0.1 improves it considerably though :slight_smile:


Yeah, it’s a balancing act for sure.

Lots of repeated sentences with a .1 freq_penalty might mean you’re overfitting, though. A lower temperature (< .7) will cause repetition too, so you might try raising that rather than the freq_penalty.
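To make this balancing act less of a guessing game, it may help to sweep a few settings and score each output for repetition. This is a rough sketch under stated assumptions: repeated_sentence_ratio and parameter_grid are hypothetical helpers, the metric only catches verbatim repeats, and the specific values in the grid are illustrative. The keys temperature, frequency_penalty, and max_tokens are the completions parameters mentioned earlier in the thread; the API call itself is left out.

```python
import itertools
import re

def repeated_sentence_ratio(text):
    """Fraction of sentences that repeat an earlier sentence verbatim."""
    sentences = [s.strip().lower() for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return 1 - len(set(sentences)) / len(sentences)

def parameter_grid(temperatures, frequency_penalties, max_tokens=500):
    """Return one request-settings dict per temperature/penalty combination."""
    return [
        {"temperature": t, "frequency_penalty": p, "max_tokens": max_tokens}
        for t, p in itertools.product(temperatures, frequency_penalties)
    ]

# Illustrative values only; pick ranges that suit your own model.
grid = parameter_grid([0.2, 0.6, 0.9], [0.0, 0.1, 0.75])

# Made-up sample output that loops, to show how the score behaves.
looping = "The music is bad. The music is bad. The music is bad. I liked the art."
print(len(grid), repeated_sentence_ratio(looping))
```

Each dict in the grid can be passed along with the prompt to whatever client you use, and the resulting completion scored with repeated_sentence_ratio so the settings can be compared side by side.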

hey logan - what’s the best source / aggregation for these practical tips on fine-tuning? any advice and conversations on twiddling with the params?

The fine-tuning guide (OpenAI API) is the best right now, but it will likely need to be updated in the coming weeks.
