Interpreting fine-tuning run summary

dr4 · July 12, 2022, 3:20pm

I am fine-tuning a multi-class classification model with 7 classes, and I received the following run summary:

As demonstrated, the classification/accuracy metric is 0.814. I attempted to verify this by iterating through the validation data, and manually classifying each prompt. However, the predicted response only matched the actual response about 30% of the time. Does anyone have any feedback to resolve this issue? Why is the run summary’s classification/accuracy metric so high compared to my own observed metric?

jhsmith12345 · July 12, 2022, 5:00pm

How does it do on DaVinci instruct without fine tuning?

SaturnProductions · July 12, 2022, 6:29pm

It sounds like the temp might be too high? What was the value?

dr4 · July 12, 2022, 7:58pm

Thank you for your response! I just tried using the base davinci model and it performed even worse (0% accuracy), which makes sense—considering that it was never trained on format. Nonetheless, even if format isn’t being considered, this model also performed poorly by mostly picking just a single category

dr4 · July 12, 2022, 8:00pm

I forgot to mention that the predicted response is 95% the same response. I don’t quite understand why, since the data is not too concentrated in any particular category

dr4 · July 12, 2022, 8:00pm

Thank you for your response! The temperature is set at 0.

jordn · July 12, 2022, 9:06pm

Can you share with us your CLI command?

This is a guess after reading the docs… but are you using single tokens as your output classes? I understand there could be weird behaviour if not.

I’m also wondering if this could be something to do with it being a multi-class problem too.

dr4 · July 13, 2022, 9:40pm

Hey, thanks for your response! Here is my CLI command to create the fine-tune:

The model was curie and there were 7 classes. All my output classes are numeric categories (1,2,3,etc.). Has anyone worked with OpenAI multi-class classification and fine-tuning that experienced similar issues, or has any other suggestions?

jhsmith12345 · July 14, 2022, 10:07pm

Would you mind sharing a few examples of your fine-tune training data?

Topic		Replies	Views
Finetuned Classification providing invalid response as classification Prompting	5	957	November 8, 2022
Finetuning experiments (How not to finetune) API fine-tuning	2	765	December 21, 2023
Finetuning and getting F1 scores API	9	2690	May 19, 2023
Ada model fined tuned for classification hets delusional when fed with thousands of records Prompting fine-tuning , classification	7	1057	May 23, 2023
GPT3 finetuning for large text summaries Prompting	8	1337	December 24, 2023

Interpreting fine-tuning run summary

Related topics