Struggling with poor performance on fine-tuned davinci model

Hi community,

I’m having an issue with my fine-tuned davinci model: it performs very poorly on completions for questions that are not related to the training data. For example, if I ask for business ideas from a model that was fine-tuned on therapist conversations, the results are very poor.

I was expecting the model to perform more like text-davinci-003, but it is performing closer to the original davinci model when facing these unrelated prompts.

I’m wondering if I’m missing something or if there are any strategies that I could try to improve the model’s performance on these types of inputs. Any thoughts or suggestions would be greatly appreciated.

Thank you so much.

Here is my model working the way it should work…

Here is the model behaving differently than I expected…

Very much like the davinci base model…

Unlike text-davinci-003, which handles the prompt nicely.

Can you give us an example or two of your training data? How big was your dataset? Did you change any fine-tuning settings?

Hi Paul. Sure!

Here is my training data. It’s not big. Around 250 examples.

I did not change the fine-tuning settings. Here is how I called the API:
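Roughly like this, using the legacy openai-python client (the file name and IDs below are placeholders, not the exact values):

# Hypothetical sketch of a default fine-tune creation with the legacy
# openai-python 0.x client; the training file and IDs are placeholders,
# not the original call.
import openai

training_file = openai.File.create(
    file=open("therapist_conversations.jsonl", "rb"),
    purpose="fine-tune",
)

fine_tune = openai.FineTune.create(
    training_file=training_file["id"],
    model="davinci",  # the base davinci model, with default hyperparameters
)
print(fine_tune["id"])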

Thanks. The training data looks okay, but…

  1. It might not be enough samples. If 250 isn’t working, try 500… then 1,000…

  2. You might also want to use a stop token at the end of the completions (something like <|endoftext|>), so you can use it as a stop sequence during generation later (see the sketch below).
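For example, a minimal sketch of a training record with a prompt separator and a stop token (the separator, text, and file name are just made up for illustration):

# Illustrative JSONL record for the legacy fine-tuning format: the prompt
# ends with a fixed separator and the completion ends with a stop token,
# so generation can be cut off cleanly later. All values here are made up.
import json

record = {
    "prompt": "I have been feeling anxious about work lately. ->",
    "completion": " It sounds like work has been weighing on you."
                  " Can you tell me more about what is causing it? <|endoftext|>",
}

with open("train.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")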

What about the settings when you’re trying to generate, i.e. temperature, frequency_penalty, etc.?
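For reference, a typical call with those knobs might look something like this (the model name and parameter values are placeholders, not a recommendation):

# Hypothetical generation call against a fine-tuned model with the legacy
# openai-python 0.x client; model name and parameter values are placeholders.
import openai

response = openai.Completion.create(
    model="davinci:ft-your-org-2023-01-01",  # placeholder fine-tuned model name
    prompt="I have been feeling anxious about work lately. ->",
    max_tokens=150,
    temperature=0.2,
    frequency_penalty=0.5,
    stop=["<|endoftext|>"],  # matches the stop token used in training
)
print(response["choices"][0]["text"])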

Might just not be enough samples, though.

Hope this helps!

Got it. Thank you for helping out, Paul.

Let’s assume I do not have enough training data for the model to perform as expected, that is, to give the user the right answer as their “therapist”.

I’m wondering if the base model that was fine-tuned is not “text-davinci-003”, but rather the most basic version “davinci”. If that is the case, it may explain the model’s poor performance.

Could you tell me how your fine-tuned models perform with other prompts? Do they exhibit similar difficulties, or is this issue specific to the “therapist” prompt?

I have the same problem with a dataset of more than 500 lines.
OpenAI support advised me to use Embeddings rather than Completions for better results with Q&A.

Everything is explained here: Question answering using embeddings-based search | OpenAI Cookbook

Actually, that cookbook isn’t very easy to understand. You might want to start with this:

https://thoughtblogger.com/openai-embedding-tutorial/

I was struggling with a bug (manipulating arrays) in the code available here: em013 Doing and Embedded Semantic Search - YouTube

It works better using

df['similarities'] = df.ada_embedding.apply(lambda x: np.dot(x, searchvector) / (np.linalg.norm(x) * np.linalg.norm(searchvector)))

than

df['similarities'] = df.ada_embedding.apply(lambda x: cosine_similarity(x, searchvector))

Cheers

Thanks for the feedback. You can also use

df['similarities'] = df.ada_embedding.apply(lambda x: np.dot(x, searchvector))

because the vectors are normalized to a length of 1, so you would always be dividing by 1.

I left the cosine_similarity function in the video so the example matched the examples on the OpenAI site. But the dot product is faster and easier to code.
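If it helps, here is a rough end-to-end sketch of that dot-product search with the legacy openai-python 0.x client (the texts, query, and column names are just examples):

# Rough sketch: embed a few rows with text-embedding-ada-002 and rank them
# against a query using the dot product. ada-002 vectors come back normalized
# to length 1, so the dot product equals cosine similarity here.
import numpy as np
import openai
import pandas as pd

def get_embedding(text):
    resp = openai.Embedding.create(input=[text], model="text-embedding-ada-002")
    return resp["data"][0]["embedding"]

df = pd.DataFrame({"text": ["I feel anxious about work", "How do I sleep better?"]})
df["ada_embedding"] = df.text.apply(get_embedding)

searchvector = get_embedding("trouble sleeping")
df["similarities"] = df.ada_embedding.apply(lambda x: np.dot(x, searchvector))
print(df.sort_values("similarities", ascending=False).head())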

The basic davinci model fine-tuned using the API is completely useless unless you have about a month to spend messing with it. I did, and I’ve since abandoned the whole idea, especially since a simple one-shot prompt using 003 yields much better results (albeit costing the earth).

My final take on these AI use cases for commercial purposes is that they’re just not ready. When they release the ability to create a model and fine-tune it around text-davinci-003 (or ChatGPT), then all will be good. My advice: save whatever training sets you have and wait for something that doesn’t drive you round the twist.

drive you round the twist

drive you round the twist

drive you round the twist

drive you round the twist

You should reduce the max_tokens hyperparameter when calling the API to remove these multiple responses. Maybe try this OpenAI GPT2 tokenizer or this OpenAI API one to see how many tokens your desired responses are. Also, your fine-tuned model will work better when you format your prompts like the ones in your chat.jsonl.
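As a rough sketch of counting tokens locally with tiktoken (the encoding choice and sample text are assumptions, not taken from this thread):

# Rough sketch: count the tokens in a typical desired response so max_tokens
# can be set just above it. r50k_base is the encoding used by the older GPT-3
# base models such as davinci (an assumption worth double-checking).
import tiktoken

encoding = tiktoken.get_encoding("r50k_base")

desired_response = (
    "It sounds like work has been weighing on you."
    " Can you tell me more about what is causing it?"
)

n_tokens = len(encoding.encode(desired_response))
print(f"{n_tokens} tokens; set max_tokens a little above this")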
