gpt-4o-mini has terrible results in comparison to gpt-4o on a text summarization task?

Here is a question I posted sharing my “messy” dictionary definitions, and how OpenAI’s API using gpt-4o converted them into a nice array of simplified definitions.

It takes this:

*அக்கடி akkaṭi , n. cf. akka + அடி. Difficulty, trouble in a voyage or journey, peril; அலைவு. எனக்கு அக்கடியா யிருக்கிறது. (R.)

And produces this:

{
  "term": "அக்கடி",
  "definitions": [
    "difficulty",
    "trouble",
    "peril"
  ],
  "gloss": "difficulty",
  "role": "noun"
}

As you can see, the results are impressive and high quality.
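
For anyone who wants to reproduce the setup, here is a minimal sketch using the OpenAI Python SDK. The system prompt and field names are illustrative stand-ins, not the exact prompt from the linked question:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

entry = (
    "*அக்கடி akkaṭi , n. cf. akka + அடி. Difficulty, trouble in a voyage "
    "or journey, peril; அலைவு. எனக்கு அக்கடியா யிருக்கிறது. (R.)"
)

response = client.chat.completions.create(
    model="gpt-4o",  # swap in "gpt-4o-mini" to compare quality
    temperature=0,
    response_format={"type": "json_object"},  # request a JSON object reply
    messages=[
        {
            "role": "system",
            "content": (
                "You clean up noisy dictionary entries. Return a JSON object "
                "with keys: term, definitions (array of simple lowercase "
                "English words), gloss (single best definition), and role "
                "(part of speech)."
            ),
        },
        {"role": "user", "content": entry},
    ],
)

print(response.choices[0].message.content)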

However, it’s expensive: about $100 per 20k API calls (roughly half a cent per call) in my case.

People are touting gpt-4o-mini as “equivalent in quality” to gpt-4o at 1/10th the price, which would be great! I could afford $10, but not $100 (multiplied across many more batches of definitions to clean).

But the results with mini were terrible, to say the least.

  • gpt-4o-mini included random punctuation characters like ;-: (I asked for pure lowercase text, which gpt-4o produces every time).
  • gpt-4o-mini included non-English characters in the output (random letters and other meaningless characters).
  • And generally, the summaries were less accurate.

Is this to be expected? How can I take my prompt (in the linked question) and adapt it so mini gets the exact same quality of results? Or why isn’t mini capable of handling this task as well as gpt-4o?
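
In the meantime, one workaround might be to validate mini’s output and retry on failure. A minimal sketch, assuming a call_model(entry) helper that wraps the API call (the regex and retry count are arbitrary, untested choices):

import json
import re

# Accept only lowercase ASCII letters and spaces in each definition.
LOWERCASE = re.compile(r"^[a-z ]+$")

def is_clean(payload: str) -> bool:
    """Return True if the payload is JSON with pure-lowercase definitions."""
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    defs = data.get("definitions", [])
    return bool(defs) and all(
        isinstance(d, str) and LOWERCASE.match(d) for d in defs
    )

def clean_with_retries(call_model, entry: str, attempts: int = 3):
    """Call the model up to `attempts` times; keep the first clean result."""
    for _ in range(attempts):
        result = call_model(entry)
        if is_clean(result):
            return result
    return None  # fall back to gpt-4o (or manual review) for this entry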


Meaningful compression is an act of intelligence. Less capable models would likely not give the same quality of compression.

OpenAI claims that the quality is just above 3.5 turbo and not comparable to gpt-4o yet. Maybe their next update will have quality equivalent to 4o.

I see the same issue. In my case we classify incoming emails and map them to a project and an employee. With gpt-4o this is almost 100% correct; with gpt-4o-mini it is correct in only 30% of the cases. So for tasks that demand real accuracy, this model is not very helpful.


Same here. I’m scoring articles from 0 to 10. With 3.5, results are good enough, and the differences versus gpt-4o don’t justify the difference in cost. But with gpt-4o-mini, I’m getting very optimistic scores: articles that used to score between 0 and 4 (and were therefore discarded) now receive scores above 8.

I also ran some additional tests limiting the temperature (even going to 0), but alas, not much of an effect; it is really bad at classification. In our case it needs to return a projectID (from a list I supplied), and it mostly comes back with GUIDs that are completely wrong, or partially altered from the list I gave it, even at temperature 0.

So really not useful in production.
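
One mitigation worth trying might be to stop asking the model to echo GUIDs at all: give it numbered choices and map the number back to the GUID locally, so the ID can never be invented or partially mangled. A rough sketch with hypothetical names and placeholder IDs:

# Names and IDs below are placeholders, not real project data.
projects = [
    ("guid-0001", "Website relaunch"),
    ("guid-0002", "Payroll migration"),
]

def build_prompt(email_text: str) -> str:
    """Present the projects as numbered choices in the prompt."""
    choices = "\n".join(f"{i}: {name}" for i, (_, name) in enumerate(projects))
    return (
        "Classify this email to exactly one project. "
        "Reply with the number only.\n\n"
        f"Projects:\n{choices}\n\nEmail:\n{email_text}"
    )

def resolve_project_id(model_reply: str):
    """Map the model's numeric reply back to a known GUID, or None."""
    try:
        idx = int(model_reply.strip())
    except ValueError:
        return None
    if 0 <= idx < len(projects):
        return projects[idx][0]  # GUID looked up locally, never generated
    return None  # treat as a failed classification and retry or escalate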


Fine-tuning for gpt-4o-mini has been released, with 2M free daily training tokens. I hope the combination of 4o-level power and fine-tuning can drastically improve the results :crossed_fingers:
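
For reference, starting a job looks roughly like this with the OpenAI Python SDK; the file ID is a placeholder for an uploaded JSONL training file, and the exact snapshot name should be checked against the current docs:

from openai import OpenAI

client = OpenAI()

# Placeholder ID from a prior client.files.create(..., purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file="file-XXXXXXXX",
    model="gpt-4o-mini-2024-07-18",  # mini snapshot that supports fine-tuning
)
print(job.id, job.status)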


Even if we do not need the same performance as GPT-4o, it would be enlightening to learn how GPT-4o mini’s cost-saving computation produces these kinds of summarization errors.

Same here: GPT-4o mini doesn’t come close to GPT-3.5 turbo. I am a free user, and text summarization is hardly usable for me anymore.