As shown above, the run summary reports a classification/accuracy of 0.814. I attempted to verify this by iterating through the validation data and classifying each prompt myself, but the predicted completion only matched the actual completion about 30% of the time. Does anyone have any feedback on how to resolve this? Why is the run summary's classification/accuracy so much higher than the accuracy I observe myself?
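For reference, here is roughly what my verification loop looks like. This is a simplified sketch: it assumes the legacy `openai` Python package (pre-1.0 `Completion` API), a validation file named `valid.jsonl` with `prompt`/`completion` fields, and a placeholder fine-tuned model name.

```python
import json
import openai  # legacy openai-python (< 1.0) Completion API

FINE_TUNED_MODEL = "curie:ft-personal-2023-01-01"  # placeholder for the fine-tuned model name

correct = 0
total = 0
with open("valid.jsonl") as f:  # same validation file passed to the fine-tune job
    for line in f:
        example = json.loads(line)
        prompt = example["prompt"]                 # ends with the separator used in training
        actual = example["completion"].strip()     # the numeric class label

        response = openai.Completion.create(
            model=FINE_TUNED_MODEL,
            prompt=prompt,
            max_tokens=1,      # class labels are short numeric tokens
            temperature=0,     # deterministic: always take the top class
        )
        predicted = response["choices"][0]["text"].strip()

        correct += int(predicted == actual)
        total += 1

print(f"observed accuracy: {correct / total:.3f}")
```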
Thank you for your response! I just tried the base davinci model and it performed even worse (0% accuracy), which makes sense, considering it was never trained on the format. Nonetheless, even setting format aside, that model also performed poorly, mostly predicting a single category.
I forgot to mention that about 95% of the predictions are the same response. I don't quite understand why, since the data isn't heavily concentrated in any one category.
The model was curie and there were 7 classes. All my output classes are numeric categories (1, 2, 3, etc.). Has anyone who has worked with OpenAI multi-class classification and fine-tuning experienced similar issues, or have any other suggestions?
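To be concrete about the format I'm describing, each training/validation example looks roughly like this (the prompt text is purely illustrative; the completion is one of my numeric labels, written with a leading space and following the prompt's fixed separator, per the fine-tuning data-prep guidelines):

```python
# Illustrative JSONL rows, not my actual data
example_rows = [
    {"prompt": "Customer asked about a refund for a damaged item.\n\n###\n\n", "completion": " 3"},
    {"prompt": "User reported the app crashes on startup.\n\n###\n\n", "completion": " 1"},
]
```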