Yesterday I fine-tuned the Davinci model on 30 samples as a quick test, to refine the outputs I’m getting from my Davinci 002 prompt (which is already pretty good). I’m hoping to get it closer to 100% perfect.
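For reference, the setup looked roughly like this (a minimal sketch: the filename, prompt text, and separators are placeholders for my real data, assuming the legacy JSONL prompt/completion format and the old `openai` CLI):

```python
import json

# One JSON object per line: the legacy prompt/completion fine-tune format.
# The prompt/completion text here is a stand-in for my real instruct-style data.
samples = [
    {
        "prompt": "Summarise the notes below into a client update.\n\nNotes: ...\n\n###\n\n",
        "completion": " Here is this week's update for you. END",
    },
    # ... 29 more pairs like this
]

with open("train.jsonl", "w") as f:
    for s in samples:
        f.write(json.dumps(s) + "\n")

# Then started the job with the legacy CLI:
#   openai api fine_tunes.create -t train.jsonl -m davinci
```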
The input prompt is very much geared towards Davinci Instruct 002’s capabilities; no other model is smart enough to break down the instructions and context provided and write meaningful text.
I was disappointed with the output: after training, the model clearly couldn’t handle the Davinci 002 Instruct prompt and returned gibberish not suitable for production.
Am I barking up the wrong tree trying to fine-tune Davinci with a prompt only a 002 model can handle? It will take some time to grow the sample size from 30 to 200. Is this a waste of time from the outset?
How does your fine-tuned model perform when prompted with an extensive multi-shot example? I think that would be a good test of whether fine-tuning on an adequate dataset would be worthwhile.
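Something along these lines, say (a rough sketch using the pre-1.0 `openai` Python library; the fine-tuned model name and the examples are made up):

```python
import openai  # assumes OPENAI_API_KEY is set in the environment

# A multi-shot prompt: two worked input/output examples, then the new input.
few_shot_prompt = """Input: draft a one-line thank-you note
Output: Thanks so much for your time today.

Input: draft a one-line apology note
Output: Sorry for the mix-up with the schedule.

Input: draft a one-line welcome note
Output:"""

resp = openai.Completion.create(
    model="davinci:ft-your-org-2022-01-01",  # placeholder fine-tuned model name
    prompt=few_shot_prompt,
    max_tokens=60,
    temperature=0.7,
    stop=["\nInput:"],  # stop before it invents the next example
)
print(resp["choices"][0]["text"])
```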
30 samples is way below the minimum requirement of 200, so you will see a drastic difference in performance. Also keep in mind that performance scales with exponentially more samples: 400 samples will be one notch better than 200, and 800 samples one notch better than that. The documentation leads me to believe you start to get diminishing returns at around 1,000 samples.
In short, your small experiment is off by an order of magnitude or so before you can judge how effective your data is.
Hi @jhsmith12345 - yes, it’s pretty good with a zero-shot prompt using Instruct. I just wanted to see if I could squeeze out extra performance by fine-tuning, giving it an input/output pattern to follow to increase consistency in the responses.
I’ll try adding extra examples and experiment with few-shot prompting.
@daveshapautomator Thanks for the advice. I was hoping to give it a few input/output examples so it’s more consistent with the structure of the text, such as always having two line breaks between sentences and signing off with “bye”. Fine-tuning on 30 samples was way worse than one well-crafted prompt with the 002 model.
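For what it’s worth, the pattern I was trying to teach it looked roughly like this (a sketch with made-up text; “END” is an arbitrary stop marker I’m assuming, not anything from the docs):

```python
import json

# Every completion uses two line breaks between sentences and signs off with "bye".
# At inference I'd pass stop=["END"] so generation halts cleanly after the sign-off.
example = {
    "prompt": "Write a short update about the delayed shipment.\n\n###\n\n",
    "completion": " The shipment is running a few days late.\n\nWe expect it by Friday.\n\nbye END",
}

with open("train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```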
I was still expecting it to produce the output I wanted; instead it went off on a mad one, reusing text from the input but going a bit crazy.