Fine-tuning doesn't seem to improve quality for me

For a while, I have been trying to fine-tune the davinci model by providing it a list of about a dozen pieces of information in a .jsonl file. However, when I then test the customized model, it doesn't perform any better than the vanilla text-davinci-003 model. Even when I use the exact same prompt strings that I used during training, it still seems no better off. I also tried comparing the log probabilities of my custom model vs. the vanilla version, and there too I'm not observing any particular improvement. Am I maybe missing some crucial point here? I thought fine-tuning was a way for me to supply a list of key/value pairs to train the model to better associate each key with its value.
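For reference, here is a minimal sketch of what such a key/value training file might look like in the legacy prompt/completion format. The key/value pairs, separator, and stop strings below are illustrative assumptions, not anything from the thread — the point is just that every prompt ends with the same separator and every completion ends with the same stop marker:

```python
import json

# Hypothetical key -> value pairs to "burn in" via fine-tuning.
pairs = {
    "capital of France": "Paris",
    "capital of Japan": "Tokyo",
}
SEPARATOR = "\n\n###\n\n"   # marks the end of every prompt
STOP = " END"               # marks the end of every completion

with open("train.jsonl", "w") as f:
    for key, value in pairs.items():
        record = {
            "prompt": key + SEPARATOR,
            # leading space on the completion is a common convention
            "completion": " " + value + STOP,
        }
        f.write(json.dumps(record) + "\n")
```

Whatever separator and stop strings you pick, they must be reproduced exactly at inference time — which is where the stray-whitespace problem discussed below usually creeps in.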

If you want to “burn in” key-value associations with a fine-tune, you need to increase the epochs from the default 4 to something higher like 16 or more.

But I’m not sure of the use case, and why not just prompt these values? Is it a window size limitation?


I tried to do this, to "burn in", using the --n_epochs 32 parameter to set the epoch count to 32. And yet the final customized model still fails to work properly. When I test the new model by supplying exactly the prompt I trained it on, it gives some random response…

Are you feeding the prompt with the same stop sequence (something like ‘\n\n###\n\n’) that you fine-tuned with? Also, is your temperature low or high?
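To illustrate the point about matching the training format at inference time, here is a sketch of assembling completion-request parameters so that the separator, stop sequence, and a low temperature all line up with the training data. The model id, separator, and stop strings are hypothetical placeholders:

```python
SEPARATOR = "\n\n###\n\n"   # must match the separator used in training
STOP = " END"               # must match the completions' stop marker

def build_request(key: str) -> dict:
    """Build request parameters that mirror the training format
    exactly: same separator, same stop sequence, and temperature 0
    for deterministic recall of the trained value."""
    return {
        "model": "davinci:ft-your-org",  # hypothetical fine-tuned model id
        "prompt": key + SEPARATOR,       # no extra whitespace!
        "stop": STOP,
        "temperature": 0.0,
        "max_tokens": 16,
    }

req = build_request("capital of France")
```

A mismatch as small as one extra newline in the prompt means the model never sees the pattern it was trained on, which is exactly the failure mode described in the next post.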

Actually, it seems like I finally got it to work. My JUnit test had an extra \n in it. Basically, after I removed extraneous carriage-return characters from both my .jsonl file and my JUnit tester, I started getting successful results. But I'm not sure if it was that alone, or a combination of that plus setting the epoch count to 32 as you suggested. I'm now trying it with the default epoch count of 4, to see if maybe my issue the whole time was just some extraneous characters…


Yes, so basically it appears that both are necessary: setting the epochs to a value higher than 4 AND making sure there are no extraneous characters in the training set. I tried keeping the clean data set while reducing the epochs back to 4, and it again stopped producing good results, so your suggestion of upping the epoch count is definitely what's working. Thanks for the suggestion!
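Since stray carriage returns turned out to be half the problem, a small sanity check over the training file can catch them before fine-tuning. This is a sketch, not part of anyone's toolchain; the demo file and field names assume the standard prompt/completion JSONL layout:

```python
import json

def find_carriage_returns(path: str):
    """Return (line_number, field) pairs for any prompt or completion
    containing a stray '\r' -- the kind of invisible character that can
    derail an otherwise-correct fine-tune."""
    hits = []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            rec = json.loads(line)
            for field in ("prompt", "completion"):
                if "\r" in rec.get(field, ""):
                    hits.append((lineno, field))
    return hits

# Demo: one clean record, one with a hidden carriage return.
with open("check.jsonl", "w") as f:
    f.write(json.dumps({"prompt": "a\n\n###\n\n", "completion": " b END"}) + "\n")
    f.write(json.dumps({"prompt": "c\r\n\n###\n\n", "completion": " d END"}) + "\n")
```

Running `find_carriage_returns("check.jsonl")` flags only the second record, pointing at its prompt field.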


More clean data always helps too. Are you using just 12 prompt/completion pairs in the dataset?

The more you go up in epochs, the more you should watch out for overfitting too…
