"davinci-002" performing worse than deprecated "curie" model

Hello, I’m using the deprecated curie model for intent classification. It is fine-tuned on a dataset of about 35,000 examples in the following format:
{"prompt": "<question> ->", "completion": " <intent_id>|"}, where intent_id is an integer that maps to an intent.
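For reference, a minimal sketch of how one training line in that JSONL format could be built and checked in Python. The question text and intent id are placeholders, and `build_example` is a hypothetical helper, not part of any OpenAI SDK:

```python
import json

def build_example(question: str, intent_id: int) -> str:
    """Serialize one fine-tuning example in the prompt/completion JSONL
    format described above: the prompt ends with the ' ->' separator and
    the completion starts with a space and ends with the '|' stop token."""
    record = {
        "prompt": f"{question} ->",
        "completion": f" {intent_id}|",
    }
    return json.dumps(record)

line = build_example("How do I reset my password?", 17)
parsed = json.loads(line)
print(parsed["prompt"])      # How do I reset my password? ->
print(parsed["completion"])  # prints " 17|" (leading space, trailing stop token)
```

One such line per example, written to a `.jsonl` file, matches the prompt/completion format that both the legacy fine-tuning flow and the newer babbage-002/davinci-002 fine-tuning accept.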

This works flawlessly with curie using 8 epochs, and it fits within my budget. After the deprecation notice I trained the recommended replacement, “davinci-002”, only to find that the performance is awful: it fails to recognize almost any of the intents. Looking at the logprobs, the first token of the sequence gets about 4% probability, where curie gave 99% for the same question. Is this expected? Should I tweak the hyperparameters? If more epochs (say 16) are necessary, that would be too expensive for me. Is there anything else I can do, or is gpt-turbo my only option?

Edit: I’m using temperature 0 with all other parameters at their defaults; both models are tested identically.
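For completeness, here is a sketch of how either model could be queried under those settings (temperature 0, stop on the `|` token) while reading back the first-token probability being compared. The `parse_intent` and `classify_intent` helpers are my own illustration, not an official pattern; the API surface is the `completions` endpoint of the v1 `openai` Python library, and running it requires an API key:

```python
import math

def parse_intent(completion_text: str) -> int:
    """Extract the integer intent id from a completion such as ' 17|'.
    The trailing '|' is normally absorbed by the stop sequence, but we
    strip it defensively."""
    return int(completion_text.strip().rstrip("|"))

def classify_intent(question: str, model: str) -> tuple[int, float]:
    """Query one model with the settings described in the post:
    temperature 0, everything else default, stopping on '|'.
    Returns the intent id and the probability of the first completion
    token (exp of its logprob), the number compared between curie
    and davinci-002."""
    from openai import OpenAI  # needs the openai package and OPENAI_API_KEY

    client = OpenAI()
    resp = client.completions.create(
        model=model,              # e.g. your fine-tuned davinci-002 model id
        prompt=f"{question} ->",  # same ' ->' separator used in training
        max_tokens=4,
        temperature=0,
        logprobs=1,               # return per-token logprobs
        stop=["|"],
    )
    choice = resp.choices[0]
    first_token_prob = math.exp(choice.logprobs.token_logprobs[0])
    return parse_intent(choice.text), first_token_prob

# The parsing step can be checked without an API key:
print(parse_intent(" 17|"))  # 17
```

Calling `classify_intent` with the same question against both fine-tuned models would reproduce the 99% vs ~4% first-token comparison described above.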


Yep. It’s just bad.

davinci (legacy), role and 1-shot:

The only downside is that the GPT-3 models like to end output prematurely (the new ones are the opposite and can’t stop).

davinci-002 (Jason Mraz lyrics to a hallucinated song):

davinci-002 does still know the lyrics, but it takes writing out the AI’s reply plus two lines of lyrics before it will continue the completion.


I see, that’s too bad. Anyway, I just wanted to be sure I wasn’t doing something terribly wrong. I guess I’ll have to look into a solution. Thanks for answering!