Fine-tuning stats show good results but the model fails in practice

100%. I was actually just thinking last night about how scams are going to become much, much more believable. I honestly do fear this. Someone could monitor public registers and use GPT to send out thousands of very believable phishing emails with almost no effort. It’s wonderful to see someone actually attempting to make a difference. I worry, yet do nothing. So thank you.

Good luck in your endeavor.

Here’s a wonderful tutorial on connecting W&B to your training data.
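If you’d rather wire it up in code, something like this should stream the fine-tuning metrics to W&B. It’s only a minimal sketch: the file IDs, base model, and project name are placeholders, and you should double-check the `integrations` parameter shape against the current fine-tuning docs.

```python
# Minimal sketch: attach a Weights & Biases project to an OpenAI fine-tuning
# job so training/validation metrics show up in W&B.
# Assumes the openai>=1.x Python client and that your WANDB_API_KEY has been
# registered with OpenAI. File IDs, model, and project name are placeholders.
from openai import OpenAI

client = OpenAI()

job = client.fine_tuning.jobs.create(
    training_file="file-TRAIN_ID",        # placeholder training file ID
    validation_file="file-VALID_ID",      # placeholder validation file ID
    model="gpt-4o-mini-2024-07-18",
    integrations=[
        {
            "type": "wandb",
            "wandb": {"project": "phishing-classifier"},  # hypothetical project name
        }
    ],
)
print(job.id, job.status)
```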


I still could not figure out the problem here. I have balanced my data now: I have more than 500 prompts for each class and always use an equal number from each class to fine-tune the model. The statistics still look perfect. How is it possible that prompts from the training data that are labeled as clean come back with a 100%-certain phishing verdict? Can anyone explain how the model works and why this is possible?
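For context, this is roughly what I mean by balancing, with made-up field names and label strings rather than my exact schema:

```python
# Sketch: downsample each class to the same size before writing the
# fine-tuning JSONL. The input file, label values ("phishing"/"clean"),
# and record structure are assumptions; adapt to the real dataset.
import json
import random

random.seed(0)

with open("emails_labeled.jsonl") as f:           # hypothetical input file
    records = [json.loads(line) for line in f]

by_label = {}
for r in records:
    by_label.setdefault(r["label"], []).append(r)

n = min(len(v) for v in by_label.values())        # size of the smallest class
balanced = [r for v in by_label.values() for r in random.sample(v, n)]
random.shuffle(balanced)

with open("train.jsonl", "w") as f:
    for r in balanced:
        f.write(json.dumps({
            "messages": [
                {"role": "system", "content": "Classify the email as phishing or clean."},
                {"role": "user", "content": r["text"]},
                {"role": "assistant", "content": r["label"]},
            ]
        }) + "\n")
```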

I have the same issue: OpenAI’s classification metrics show great validation accuracy (close to 1), but when I manually run the fine-tuned model against the same validation dataset, accuracy is closer to 0.6.
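For reference, this is roughly how I run that manual check. Treat it as a sketch: the fine-tuned model name and the JSONL schema are placeholders, and I request logprobs only to see how "certain" each verdict is.

```python
# Sketch: re-run a fine-tuned classifier over the validation JSONL and
# compare its predictions to the reference labels. Model name and file
# are placeholders; assumes the chat-format fine-tuning schema.
import json
from openai import OpenAI

client = OpenAI()
MODEL = "ft:gpt-4o-mini-2024-07-18:org::abc123"   # hypothetical fine-tuned model

correct = total = 0
with open("validation.jsonl") as f:
    for line in f:
        example = json.loads(line)
        prompt_msgs = example["messages"][:-1]      # drop the reference answer
        reference = example["messages"][-1]["content"].strip().lower()

        resp = client.chat.completions.create(
            model=MODEL,
            messages=prompt_msgs,
            temperature=0,
            max_tokens=5,
            logprobs=True,                          # inspect how confident the verdict is
        )
        prediction = resp.choices[0].message.content.strip().lower()

        correct += int(prediction == reference)
        total += 1

print(f"manual accuracy: {correct / total:.3f} over {total} examples")
```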

I’ve had similar situations where the stats are good but everything gets classified as positive.
You might want to try flipping the positive/negative classes: build the model around “is not phishing” instead of “is phishing”. It likely won’t work better, but it might help you understand the problem.
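If you want to try that flip quickly, something like this rewrites the training labels. It assumes a chat-format JSONL with “phishing”/“clean” assistant answers and invents the new system prompt, so adjust it to your actual schema.

```python
# Sketch: reframe the task around "is not phishing", as suggested above.
# Assumes a chat-format JSONL where the last assistant turn holds the label
# ("phishing" / "clean"); file names and the question wording are made up.
import json

with open("train.jsonl") as src, open("train_not_phishing.jsonl", "w") as dst:
    for line in src:
        example = json.loads(line)
        msgs = example["messages"]
        msgs[0]["content"] = "Answer yes if the email is NOT phishing, otherwise answer no."
        old_label = msgs[-1]["content"].strip().lower()
        msgs[-1]["content"] = "yes" if old_label == "clean" else "no"
        dst.write(json.dumps(example) + "\n")
```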