Interpreting fine-tuning run summary

I am fine-tuning a multi-class classification model with 7 classes, and I received the following run summary:

As shown, the classification/accuracy metric is 0.814. I attempted to verify this by iterating through the validation data and classifying each prompt with the fine-tuned model myself. However, the predicted completion only matched the expected label about 30% of the time. Does anyone have feedback on how to resolve this? Why is the run summary's classification/accuracy metric so much higher than the accuracy I observe?
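For reference, my manual check was along these lines (a minimal sketch assuming the legacy `openai` Python library, v0.x; the model name and validation file are placeholders, not my exact values):

```python
import json
import openai  # legacy openai library (v0.x)

correct = 0
total = 0
with open("valid.jsonl") as f:  # placeholder: same validation file used for the fine-tune
    for line in f:
        example = json.loads(line)
        resp = openai.Completion.create(
            model="curie:ft-personal-2022-01-01-00-00-00",  # placeholder fine-tuned model name
            prompt=example["prompt"],  # should end with the same separator used in training
            max_tokens=1,              # each class label should be a single token
            temperature=0,
        )
        predicted = resp["choices"][0]["text"].strip()
        expected = example["completion"].strip()
        correct += int(predicted == expected)
        total += 1

print(f"accuracy: {correct / total:.3f}")
```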

How does it do on davinci-instruct without fine-tuning?

It sounds like the temp might be too high? What was the value?

Thank you for your response! I just tried the base davinci model and it performed even worse (0% accuracy), which makes sense, since it was never trained on my output format. Even setting the format issue aside, the base model still performed poorly, mostly picking a single category.

I forgot to mention that about 95% of the predictions are the same response. I don't quite understand why, since the training data is not heavily concentrated in any particular category.
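To sanity-check that, a quick count of the labels in the training file (filename is a placeholder) is enough:

```python
import json
from collections import Counter

# Count how many training examples fall into each class.
with open("train.jsonl") as f:  # placeholder filename
    labels = [json.loads(line)["completion"].strip() for line in f]

print(Counter(labels))  # e.g. Counter({'3': 120, '1': 115, ...})
```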

Thank you for your response! The temperature is set to 0.

Can you share with us your CLI command?

This is a guess after reading the docs… but are you using single tokens as your output classes? I understand there could be weird behaviour if not.

I’m also wondering if this could be something to do with it being a multi-class problem too.
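One quick way to check, if it helps: run each label (with the leading space the completions are supposed to start with) through the tokenizer. A sketch assuming `tiktoken` and the `r50k_base` encoding used by the original curie/davinci base models:

```python
import tiktoken

# r50k_base is the encoding used by the original GPT-3 base models (curie, davinci).
enc = tiktoken.get_encoding("r50k_base")

for label in ["1", "2", "3", "4", "5", "6", "7"]:
    tokens = enc.encode(" " + label)  # completions conventionally start with a space
    status = "single token" if len(tokens) == 1 else "MULTIPLE tokens"
    print(f"{label!r}: {tokens} -> {status}")
```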

Hey, thanks for your response! Here is my CLI command to create the fine-tune:

The model was curie and there were 7 classes. All of my output classes are numeric categories (1, 2, 3, etc.). Has anyone who has worked with OpenAI multi-class classification fine-tuning run into similar issues, or does anyone have other suggestions?
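For reference, the classification recipe in the legacy CLI docs looks roughly like this (file names are placeholders, not my exact command); the run summary only reports classification/accuracy when --compute_classification_metrics and --classification_n_classes are supplied:

```bash
# Sketch of a legacy openai CLI (v0.x) classification fine-tune; placeholders only.
openai api fine_tunes.create \
  -t train_prepared.jsonl \
  -v valid_prepared.jsonl \
  -m curie \
  --compute_classification_metrics \
  --classification_n_classes 7
```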

Would you mind sharing a few examples of your fine-tune training data?