gpt-4o-mini has terrible results in comparison to gpt-4o on a text summarization task?

Here is a question I posted sharing my “messy” dictionary definitions, and how OpenAI’s API using gpt-4o converted them into a nice array of simplified definitions.

It takes this:

*அக்கடி akkaṭi , n. cf. akka + அடி. Difficulty, trouble in a voyage or journey, peril; அலைவு. எனக்கு அக்கடியா யிருக்கிறது. (R.)

And produces this:

{
  "term": "அக்கடி",
  "definitions": [
    "difficulty",
    "trouble",
    "peril"
  ],
  "gloss": "difficulty",
  "role": "noun"
}

As you can see, the results are impressive and high quality.
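
For anyone who wants to reproduce the setup, here is a minimal sketch using the OpenAI Python SDK. The system prompt and field names are illustrative stand-ins, not the exact prompt from the linked question:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

entry = (
    "*அக்கடி akkaṭi , n. cf. akka + அடி. Difficulty, trouble in a voyage "
    "or journey, peril; அலைவு. எனக்கு அக்கடியா யிருக்கிறது. (R.)"
)

response = client.chat.completions.create(
    model="gpt-4o",  # swap in "gpt-4o-mini" to compare quality
    temperature=0,
    response_format={"type": "json_object"},  # request a JSON object reply
    messages=[
        {
            "role": "system",
            "content": (
                "You clean up noisy dictionary entries. Return a JSON object "
                "with keys: term, definitions (array of simple lowercase "
                "English words), gloss (single best definition), and role "
                "(part of speech)."
            ),
        },
        {"role": "user", "content": entry},
    ],
)

print(response.choices[0].message.content)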

However, it’s expensive: about $100 per 20k API calls (roughly half a cent per call) in my case.

People are touting gpt-4o-mini as “equivalent in quality” to gpt-4o at 1/10th the price, which would be great! I could afford $10, but not $100 (multiplied across many more batches of definitions to clean).

But the results with mini were terrible, to say the least.

  • gpt-4o-mini included random punctuation characters like ;-: (I asked for pure lowercase text, which gpt-4o produces every time).
  • gpt-4o-mini included non-English characters in the output (random letters and other meaningless characters).
  • And generally, the summaries were less accurate.

Is this to be expected? How can I take my prompt (in the linked question) and adapt it so mini gets the exact same quality of results? Or why isn’t mini capable of handling this task as well as gpt-4o?
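
In the meantime, one workaround might be to validate mini’s output and retry on failure. A minimal sketch, assuming a call_model(entry) helper that wraps the API call (the regex and retry count are arbitrary, untested choices):

import json
import re

# Accept only lowercase ASCII letters and spaces in each definition.
LOWERCASE = re.compile(r"^[a-z ]+$")

def is_clean(payload: str) -> bool:
    """Return True if the payload is JSON with pure-lowercase definitions."""
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    defs = data.get("definitions", [])
    return bool(defs) and all(
        isinstance(d, str) and LOWERCASE.match(d) for d in defs
    )

def clean_with_retries(call_model, entry: str, attempts: int = 3):
    """Call the model up to `attempts` times; keep the first clean result."""
    for _ in range(attempts):
        result = call_model(entry)
        if is_clean(result):
            return result
    return None  # fall back to gpt-4o (or manual review) for this entry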


Meaningful compression is an act of intelligence. Less capable models would likely not give the same quality of compression.

OpenAI claims that the quality is just above 3.5 turbo and not comparable to gpt-4o yet. Maybe their next update will have quality equivalent to 4o.

I see the same issue. In my case we classify incoming emails and map them to a project and an employee. With gpt-4o this is almost 100% correct; with gpt-4o-mini it is correct in only 30% of the cases. So for tasks that demand real accuracy, this model is not very helpful.


Same here. I’m scoring articles from 0 to 10. With 3.5, results are good enough, and the differences versus gpt-4o don’t justify the difference in cost. But with gpt-4o-mini, I’m getting very optimistic scores: articles that used to score between 0 and 4 (and were therefore discarded) now receive scores above 8.

I also ran some additional tests limiting the temperature (even going to 0), but alas, not much of an effect; it is really bad at classification. In our case it needs to return a projectID (from a list I supplied), and it mostly comes back with GUIDs that are completely wrong, or partially altered from the list I gave it, even at temperature 0.

So really not useful in production.
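
One mitigation worth trying might be to stop asking the model to echo GUIDs at all: give it numbered choices and map the number back to the GUID locally, so the ID can never be invented or partially mangled. A rough sketch with hypothetical names and placeholder IDs:

# Names and IDs below are placeholders, not real project data.
projects = [
    ("guid-0001", "Website relaunch"),
    ("guid-0002", "Payroll migration"),
]

def build_prompt(email_text: str) -> str:
    """Present the projects as numbered choices in the prompt."""
    choices = "\n".join(f"{i}: {name}" for i, (_, name) in enumerate(projects))
    return (
        "Classify this email to exactly one project. "
        "Reply with the number only.\n\n"
        f"Projects:\n{choices}\n\nEmail:\n{email_text}"
    )

def resolve_project_id(model_reply: str):
    """Map the model's numeric reply back to a known GUID, or None."""
    try:
        idx = int(model_reply.strip())
    except ValueError:
        return None
    if 0 <= idx < len(projects):
        return projects[idx][0]  # GUID looked up locally, never generated
    return None  # treat as a failed classification and retry or escalate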


Fine-tuning for gpt-4o-mini has been released, with 2M free daily training tokens. I hope the combination of 4o-level power and fine-tuning can drastically improve the results :crossed_fingers:
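
For reference, starting a job looks roughly like this with the OpenAI Python SDK; the file ID is a placeholder for an uploaded JSONL training file, and the exact snapshot name should be checked against the current docs:

from openai import OpenAI

client = OpenAI()

# Placeholder ID from a prior client.files.create(..., purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file="file-XXXXXXXX",
    model="gpt-4o-mini-2024-07-18",  # mini snapshot that supports fine-tuning
)
print(job.id, job.status)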


Even if we do not need the same performance as GPT-4o, it would be enlightening to learn how GPT-4o mini’s cost-saving computation produces these kinds of summarization errors.

Same here: GPT-4o mini doesn’t come close to GPT-3.5 turbo. I am a free user, and text summarization is hardly usable for me anymore.