To estimate performance, I would use labeled examples, or ‘truth data’, that weren’t used in training. With GPT-3 you don’t know what went into the base model’s training set, so you can’t do this purely, but you can get close when evaluating your own fine-tunes. It pretty much has to be measured empirically, IMO.
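A minimal sketch of that evaluation loop, assuming a hypothetical `classify()` wrapper around whatever model call you’re testing (names and examples here are placeholders, not my actual setup):

```python
# Empirical evaluation on held-out "truth data".
# classify() is a hypothetical stand-in for your fine-tune
# or zero-shot prompt call; swap in your own API wrapper.

def classify(text: str) -> str:
    # Placeholder stub: replace with your model call.
    return "positive" if "great" in text.lower() else "negative"

# Held-out labeled examples NOT used in the fine-tune training set.
truth_data = [
    ("This product is great!", "positive"),
    ("Total waste of money.", "negative"),
]

correct = sum(classify(text) == label for text, label in truth_data)
print(f"accuracy: {correct / len(truth_data):.2%}")
```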
As for zero-shot, it gives you the most flexibility, but realize there is a hit to cost, since your prompts will be larger. On the other hand, a fine-tuned model generally costs more per token (on a smaller prompt), so you should do a cost trade study and see which one makes more sense for your mix of cost, performance, and flexibility.
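The trade study can be back-of-the-envelope arithmetic. The per-token rates and token counts below are hypothetical placeholders; plug in current pricing and your real prompt sizes:

```python
# Per-request cost comparison: big prompt on the base model
# vs. small prompt on a pricier fine-tuned model.
BASE_RATE = 0.02 / 1000   # hypothetical $/token, base model
FT_RATE = 0.12 / 1000     # hypothetical $/token, fine-tuned model

zero_shot_tokens = 600    # instructions + examples + input
fine_tune_tokens = 80     # just the input

print(f"zero-shot:  ${zero_shot_tokens * BASE_RATE:.5f}/request")
print(f"fine-tuned: ${fine_tune_tokens * FT_RATE:.5f}/request")
```

At some prompt size the cheaper per-token rate of the base model stops compensating for the larger prompt, which is why this is worth running with your own numbers.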
Another hyperparameter to worry about, and one that EXPLODES your trade space with zero-shot, is the exact wording of the prompt. You can get drastically different answers if a single word is changed. With experimentation, though, you can make an informed decision about which prompt wording works better … and another annoying thing is that this can vary across versions of the model, even within the same model class such as ‘davinci’. Along these lines, there are also cost differences between ada, babbage, curie, and davinci … but they have performance variations too. Depending on the task and the amount of training data, you can get great performance from the lower models at the lower cost.
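One way to tame that trade space is to score each prompt variant on the same small labeled set and pick the winner. A sketch, where `run_model()` is a hypothetical stand-in for your completion call against a given model tier:

```python
# Prompt-wording sweep: score each variant on identical labeled data.
prompt_variants = [
    "Classify the sentiment of this review as positive or negative:\n{text}\nSentiment:",
    "Is the following review positive or negative?\n{text}\nAnswer:",
]

labeled = [("Loved it.", "positive"), ("Broke after a day.", "negative")]

def run_model(prompt: str) -> str:
    # Placeholder stub: replace with your actual API call.
    return "positive" if "loved" in prompt.lower() else "negative"

for template in prompt_variants:
    hits = sum(
        run_model(template.format(text=text)) == label
        for text, label in labeled
    )
    print(f"{hits}/{len(labeled)} correct: {template.splitlines()[0]!r}")
```

Re-run the same sweep whenever the model version changes, since the best wording can shift underneath you.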
But the good thing is all of the transfer learning that occurs with a fine-tune or zero-shot. In the case of fine-tuning, your training data set can be a lot smaller than what you’d need to train your own model from scratch, and still get good performance … that is the big advantage here. Good training data is hard to come by, so why not leverage GPT-3 and get away with creating less of it for an acceptable answer.
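For reference, GPT-3 fine-tunes take prompt/completion pairs in JSONL; a sketch of what a small training file might look like (the separator and examples are illustrative, check the current fine-tuning docs for the recommended formatting):

```python
# Write a tiny fine-tune training file in prompt/completion JSONL.
# A few hundred rows like this can go a long way, versus the tens of
# thousands you'd likely need training from scratch.
import json

examples = [
    {"prompt": "Loved the battery life. ->", "completion": " positive"},
    {"prompt": "Arrived broken, no refund. ->", "completion": " negative"},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```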
As for training ‘neutral’ … if you need ‘neutral’ as an instantaneous, per-example result, then yeah, it makes sense to train it. In my case I only cared about ‘negative’ and ‘positive’, mapped those to integers between -10 and +10, and further averaged over time. Over time a ‘neutral’ would emerge from the average, but I only cared about the extremes.
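A sketch of that mapping-and-averaging idea, with one interpretive assumption on my part: scaling the ±10 score by a confidence value, so intermediate integers are possible. The window size is arbitrary:

```python
# Map positive/negative to a -10..+10 score and track a rolling mean,
# so "neutral" emerges as an average near zero rather than a trained label.
from collections import deque

window = deque(maxlen=50)  # rolling window; size is arbitrary

def score(label: str, confidence: float = 1.0) -> int:
    # Assumption: scale the extreme by a 0..1 confidence value.
    sign = {"positive": 1, "negative": -1}[label]
    return round(sign * 10 * confidence)

for label in ["positive", "negative", "positive", "negative"]:
    window.append(score(label))
    print(f"rolling mean: {sum(window) / len(window):+.1f}")
```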