Struggling with poor performance on fine-tuned davinci model

Hi community,

I’m having an issue with my fine-tuned davinci model: it performs very poorly on completions for questions that are unrelated to the training data. For example, if I ask for business ideas from a model that was fine-tuned for therapist conversations, the results are very poor.

I was expecting the model to behave more like text-davinci-003, but it behaves closer to the original davinci base model when facing unexpected prompts.

I’m wondering if I’m missing something, or if there are any strategies I could try to improve the model’s performance on these types of inputs. Any thoughts or suggestions would be greatly appreciated.

Thank you so much.

Here is my model working the way it should work…


Here is the model behaving differently than I expected…

very much like the davinci base model

differently from text-davinci-003, which handles the prompt nicely

Can you give us an example or two of your training data? How big was your dataset? Did you change any fine-tuning settings?

Hi Paul. Sure!

Here is my training data. It’s not big. Around 250 examples.

I did not change the fine-tuning settings. Here is how I called the API:

Thanks. The training data looks okay, but…

  1. Might not be enough samples. If 250 isn’t working, try 500… then 1,000…

  2. You might also want to use a stop token at the end of the completions (something like <|endoftext|>) so you can use it as a stop sequence at generation time.
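For what it’s worth, here is a minimal sketch of what training examples with that stop token could look like in the prompt/completion JSONL format. The therapist lines and the `\n\n###\n\n` separator are made-up placeholders, not the poster’s actual data:

```python
import json

# Hypothetical training examples: each prompt ends with a fixed separator,
# and each completion ends with the stop token <|endoftext|>.
examples = [
    {"prompt": "I feel anxious all the time.\n\n###\n\n",
     "completion": " That sounds difficult. When did the anxiety start?<|endoftext|>"},
    {"prompt": "I can't sleep at night.\n\n###\n\n",
     "completion": " Sleep trouble is often linked to stress. What is on your mind at bedtime?<|endoftext|>"},
]

# Write one JSON object per line, as the fine-tuning tooling expects.
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

At generation time you would then pass the same separator at the end of your prompt and `<|endoftext|>` as a stop sequence.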

What about the settings when you’re trying to generate, i.e. temperature, frequency_penalty, etc.?
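As a rough illustration of which knobs matter at generation time, here is what a Completions request payload might look like. The model name and all values are placeholders, not recommendations:

```python
# Hypothetical request payload for the legacy Completions endpoint.
payload = {
    "model": "davinci:ft-your-org-2023-01-01",   # placeholder fine-tune name
    "prompt": "I feel anxious all the time.\n\n###\n\n",
    "max_tokens": 150,
    "temperature": 0.7,         # lower values give more deterministic output
    "frequency_penalty": 0.5,   # discourages verbatim repetition
    "stop": ["<|endoftext|>"],  # matches the stop token used in training
}
```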

Might just not be enough samples, though.

Hope this helps!

Got it. Thank you for helping out, Paul.

Let’s assume I do not have enough training data for the model to perform as expected, i.e. to give the user the right answer in the “therapist” role.

I’m wondering if the base model that was fine-tuned is not “text-davinci-003”, but rather the most basic version “davinci”. If that is the case, it may explain the model’s poor performance.

Could you tell me how your fine-tuned models perform with other prompts? Do they exhibit similar difficulties, or is this issue specific to the “therapist” prompt?

I have the same problem with a dataset of more than 500 lines.
OpenAI support advised me to use Embeddings rather than Completions for better results with Q&As.

Everything is explained here openai-cookbook/Question_answering_using_embeddings.ipynb at main · openai/openai-cookbook · GitHub
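The idea behind the embeddings approach, in a toy sketch with made-up three-dimensional vectors (real ones come from the embeddings endpoint and are much longer):

```python
import math

# Instead of fine-tuning, embed your documents once, embed the incoming
# question, and answer from the closest document. The vectors below are
# invented placeholders for illustration only.
docs = {
    "therapy": [0.9, 0.1, 0.0],
    "business": [0.1, 0.9, 0.1],
}
question_vec = [0.85, 0.15, 0.05]  # pretend embedding of a therapy question

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Pick the document most similar to the question.
best = max(docs, key=lambda name: cosine_similarity(docs[name], question_vec))
print(best)  # -> therapy
```

The retrieved document is then pasted into the prompt as context for a normal completion.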

Actually, that cookbook isn’t very easy to understand. You might want to start with this:


I was struggling with a bug (manipulating arrays) in the code available here: em013 Doing and Embedded Semantic Search - YouTube

It works better using

df['similarities'] = df.ada_embedding.apply(lambda x: np.dot(x, searchvector) / (np.linalg.norm(x) * np.linalg.norm(searchvector)))

or

df['similarities'] = df.ada_embedding.apply(lambda x: cosine_similarity(x, searchvector))



Thanks for the feedback. You can also use

df['similarities'] = df.ada_embedding.apply(lambda x: np.dot(x, searchvector))

Because the lengths of the vectors are normalized to a length of 1, you will always be dividing by 1.

I left the cosine_similarity function in the video so the example matched the ones on the OpenAI site, but the dot product is faster and easier to code.
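A quick sanity check of that claim: once two vectors are normalized to length 1, the cosine similarity and the plain dot product come out identical, because the denominator is 1. A pure-Python sketch:

```python
import math

def normalize(v):
    """Scale a vector to length 1."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

a = normalize([3.0, 4.0])
b = normalize([1.0, 1.0])

dot = sum(x * y for x, y in zip(a, b))
cos = dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# For unit vectors the two measures agree to floating-point precision.
assert abs(dot - cos) < 1e-12
```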


The basic davinci model fine-tuned via the API is completely useless unless you’ve got about a month to spend messing with it. I did, and I’ve since abandoned the whole idea, especially since a simple one-shot prompt using 003 yields much better results (albeit costing the earth).

My final take on these AI use cases for commercial purposes is that it’s just not ready. When they release the ability to create and fine-tune a model around text-davinci-003 (or ChatGPT), all will be good. My advice: save whatever training sets you have and wait for something that doesn’t drive you round the twist.

drive you round the twist

drive you round the twist

drive you round the twist

drive you round the twist

You should reduce the max_tokens hyperparameter when calling the API to remove these multiple responses. Maybe try this OpenAI GPT2 tool or this OpenAI API to see how many tokens your desired responses are. Also, your fine-tuned model will work better when you format your prompts like the ones in your chat.jsonl.
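If the repeats still slip through, a crude client-side fallback (assuming the completions were trained to end with <|endoftext|>, as suggested earlier in the thread) is to cut the raw text at the first stop token:

```python
# Made-up raw output illustrating the repetition problem; in practice this
# string would come from the API response.
raw = "drive you round the twist<|endoftext|>drive you round the twist<|endoftext|>"

# Keep only the text before the first stop token.
answer = raw.split("<|endoftext|>")[0].strip()
print(answer)  # -> drive you round the twist
```

Passing the same token via the API's stop parameter is the cleaner fix; this just trims anything that gets past it.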