"davinci-002" performing worse than deprecated "curie" model

Hello, I’m using the deprecated curie model for intent classification. It is fine-tuned on a dataset of about 35,000 examples in the following format:
{"prompt": "<question> ->", "completion": " <intent_id>|"}, where intent_id is an integer that maps to an intent.
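For reference, a minimal sketch of how one training line in that JSONL format could be built and checked in Python. The question text and intent id are placeholders, and `build_example` is a hypothetical helper, not part of any OpenAI SDK:

```python
import json

def build_example(question: str, intent_id: int) -> str:
    """Serialize one fine-tuning example in the prompt/completion JSONL
    format described above: the prompt ends with the ' ->' separator and
    the completion starts with a space and ends with the '|' stop token."""
    record = {
        "prompt": f"{question} ->",
        "completion": f" {intent_id}|",
    }
    return json.dumps(record)

line = build_example("How do I reset my password?", 17)
parsed = json.loads(line)
print(parsed["prompt"])      # How do I reset my password? ->
print(parsed["completion"])  # prints " 17|" (leading space, trailing stop token)
```

One such line per example, written to a `.jsonl` file, matches the prompt/completion format that both the legacy fine-tuning flow and the newer babbage-002/davinci-002 fine-tuning accept.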

This works flawlessly with curie using 8 epochs, and it fits within my budget. After the deprecation notice I trained the recommended replacement, “davinci-002”, only to find that the performance is awful: it fails to recognize almost any of the intents. Looking at the logprobs, the first token of the sequence gets about 4% probability, where curie gave 99% for the same question. Is this expected? Should I tweak the hyperparameters? If more epochs (say 16) are necessary, that would be too expensive for me. Is there anything else I can do, or is gpt-turbo my only option?

Edit: I’m using temperature 0 with all other parameters at their defaults; both models are tested identically.
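For completeness, here is a sketch of how either model could be queried under those settings (temperature 0, stop on the `|` token) while reading back the first-token probability being compared. The `parse_intent` and `classify_intent` helpers are my own illustration, not an official pattern; the API surface is the `completions` endpoint of the v1 `openai` Python library, and running it requires an API key:

```python
import math

def parse_intent(completion_text: str) -> int:
    """Extract the integer intent id from a completion such as ' 17|'.
    The trailing '|' is normally absorbed by the stop sequence, but we
    strip it defensively."""
    return int(completion_text.strip().rstrip("|"))

def classify_intent(question: str, model: str) -> tuple[int, float]:
    """Query one model with the settings described in the post:
    temperature 0, everything else default, stopping on '|'.
    Returns the intent id and the probability of the first completion
    token (exp of its logprob), the number compared between curie
    and davinci-002."""
    from openai import OpenAI  # needs the openai package and OPENAI_API_KEY

    client = OpenAI()
    resp = client.completions.create(
        model=model,              # e.g. your fine-tuned davinci-002 model id
        prompt=f"{question} ->",  # same ' ->' separator used in training
        max_tokens=4,
        temperature=0,
        logprobs=1,               # return per-token logprobs
        stop=["|"],
    )
    choice = resp.choices[0]
    first_token_prob = math.exp(choice.logprobs.token_logprobs[0])
    return parse_intent(choice.text), first_token_prob

# The parsing step can be checked without an API key:
print(parse_intent(" 17|"))  # 17
```

Calling `classify_intent` with the same question against both fine-tuned models would reproduce the 99% vs ~4% first-token comparison described above.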


Yep. It’s just bad.

davinci (legacy), role and 1-shot:

The only downside is that the GPT-3 models like to end output prematurely (the new ones are the opposite and can’t stop).

davinci-002 (Jason Mraz lyrics to a hallucinated song):

davinci-002 does still know the lyrics, but it takes writing out the AI’s reply plus two lines of lyrics before it will continue the completion.


I see, that’s too bad. Anyway, I just wanted to be sure I wasn’t doing something terribly wrong. I guess I’ll have to look into a solution. Thanks for answering!