There would almost certainly need to be a bit of tinkering on your part to convert the dataset into something appropriate for fine-tuning an OpenAI model.
If you decided you were serious about doing the fine-tuning, it might be worth your while to reach out to the Prometheus team and ask for guidance.
They might be willing to lend a hand, especially if you offered to let them benchmark the model once it is trained.
I believe this is the link to the actual dataset:
The Feedback-Collection dataset contains 99,952 records and 104,203,908 total tokens (by my count), so you would be looking at roughly $834/epoch to fine-tune a model on their data. I didn’t see anywhere in the paper where they mention whether they held out any of the data for testing/validation, or how many epochs they trained for.
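For reference, here is roughly how I arrived at the token count and cost figure. This is a minimal sketch: the Hugging Face path and the `instruction`/`output` column names are assumptions on my part (swap in whatever the actual release uses), and the $0.008 per 1K training tokens is the gpt-3.5-turbo fine-tuning rate, which is what the ~$834 works out to.

```python
import tiktoken
from datasets import load_dataset

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by gpt-3.5-turbo

# Assumed dataset path and column names -- check the actual release.
ds = load_dataset("kaist-ai/Feedback-Collection", split="train")

total_tokens = 0
for record in ds:
    # Both the prompt and the target completion are billed as training tokens.
    total_tokens += len(enc.encode(record["instruction"]))
    total_tokens += len(enc.encode(record["output"]))

price_per_1k = 0.008  # USD per 1K training tokens for gpt-3.5-turbo fine-tuning
print(f"{total_tokens:,} tokens -> ${total_tokens / 1000 * price_per_1k:,.2f} per epoch")
```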
I do wonder if one could “get away with” drastically reducing the data so that you don’t need to train on a 1, 2, 3, 4, and 5 result for each instruction. If each instruction had only one exemplar, could the model “transfer” what it learned about what a 4 looks like from the instructions whose exemplar happened to score a 4? A rough sketch of that kind of thinning is below.
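If you wanted to test that, keeping a single scored response per instruction would be straightforward. The grouping key and record layout here are assumptions on my part, not the Prometheus team’s actual preprocessing:

```python
import random
from collections import defaultdict

def keep_one_exemplar(records, key="orig_instruction", seed=0):
    """Keep one randomly chosen scored response per unique instruction.

    `records` is assumed to be an iterable of dicts where `key` identifies
    the instruction; the real column name in their release may differ.
    """
    rng = random.Random(seed)
    by_instruction = defaultdict(list)
    for rec in records:
        by_instruction[rec[key]].append(rec)
    # Each instruction should have five scored variants (1-5); keep one at random.
    return [rng.choice(group) for group in by_instruction.values()]
```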
Anyway, it may be prohibitively expensive to do this kind of fine-tuning unless you thought babbage-002 would be able to handle this sort of evaluation.
Other Thoughts
The other thing looking at this has highlighted for me is that we really need some kind of option for custom loss functions.
I am not sure how effective it would be, since part of what the Prometheus model does is go through essentially a chain-of-thought process before determining the score. But if you are only looking for a score metric, removing everything but the numeric score value from the response drops the total number of training tokens in an epoch to 91,929,357, saving almost 12% and bringing the per-epoch cost to about $735. At that point, though, I would probably just leave the chain-of-thought fine-tuning in place. (Incidentally, this is a place where a custom loss function would come in handy, since predicting a 4 when the “correct” response is a 5 is much better than predicting a 1.)
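To make that concrete: if you were training a scorer yourself rather than going through the OpenAI fine-tuning endpoint (which doesn’t let you touch the loss), something like a distance-weighted cross-entropy over the five score classes is what I have in mind. This is purely a sketch of the idea, not anything from the Prometheus paper:

```python
import torch
import torch.nn.functional as F

def distance_weighted_score_loss(logits, target):
    """Cross-entropy over score classes 1-5, scaled by how far off the
    predicted score is, so predicting a 4 when the answer is a 5 costs
    less than predicting a 1.

    logits: (batch, 5) unnormalized scores for classes 1..5
    target: (batch,) true scores as 0-indexed class ids (score - 1)
    """
    ce = F.cross_entropy(logits, target, reduction="none")
    predicted = logits.argmax(dim=-1)
    # Weight grows with |predicted score - true score|; an exact match keeps plain CE.
    distance = (predicted - target).abs().float()
    return (ce * (1.0 + distance)).mean()
```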
In the end, it may just be better to use Prometheus as inspiration and come up with your own fine-tuning dataset for your very particular use case.
If you have the resources, filtering the data down to 1/5 of the entries (keeping only one score from each instruction), cutting the response to a single numeric token, and doing a fine-tuning run on babbage-002 might be interesting at just over $7.35/epoch.
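The arithmetic behind that estimate, for what it’s worth (the $0.0004 per 1K training tokens is babbage-002’s fine-tuning rate, which is where the $7.35 comes from):

```python
score_only_tokens = 91_929_357            # full dataset with responses cut to just the score
reduced_tokens = score_only_tokens / 5    # keep one of the five scored variants per instruction

babbage_price_per_1k = 0.0004             # USD per 1K training tokens for babbage-002
print(f"~${reduced_tokens / 1000 * babbage_price_per_1k:.2f} per epoch")  # ~$7.35
```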
Anyway, I had just read this paper a few days ago and your question reminded me of it, so I wanted to pass it along.