I'm seeing that gpt-4o-2024-08-06 sometimes performs worse (or, let's say, is less comprehensive) than gpt-4o. Is that by design, or just an anecdotal observation?
When you simply specify “gpt-4o” as the model name, it points to “gpt-4o-2024-05-13”.
Since “gpt-4o-2024-05-13” and “gpt-4o-2024-08-06” are slightly different models, the differences you observe may be due to this.
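If it helps, here is a minimal sketch of pinning the dated snapshot explicitly so you know exactly which model you are comparing (the item text is just an illustration):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Pin the dated snapshot explicitly; "gpt-4o" alone resolves to whatever
# snapshot the alias currently points to.
response = client.chat.completions.create(
    model="gpt-4o-2024-05-13",  # swap in "gpt-4o-2024-08-06" for a side-by-side comparison
    messages=[{"role": "user", "content": "Classify this item: stainless steel whisk"}],
)
print(response.choices[0].message.content)
```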
Yeah, I understand that. I was just curious whether enabling structured outputs and raising the output token limit in gpt-4o-2024-08-06 has taken away some of the reasoning power compared to gpt-4o / gpt-4o-2024-05-13. Has anyone else seen this in dev or production, or is it just an anecdotal observation?
Yes, I am encountering an issue when asking the model to select a category for an item from a predefined list of approximately 50 categories, which includes a “miscellaneous” option. Some items are difficult to match, so I have added instructions for the model to prefer broad matches or, at the very least, choose the “miscellaneous” category instead of returning an error. However, gpt-4o-2024-08-06 consistently fails to follow this instruction, even with a wide range of temperatures (including 0), while gpt-4o-2024-05-13 performs well under the same conditions.
After reimplementing the request to pass the JSON specification as a structured output, rather than describing it in the prompt, the results seem to have improved. I also noticed a typo in the miscellaneous entry of the category list (the name was spelled correctly in the prompt), which might have had some impact as well. Overall, after making these two changes, I'm no longer sure that gpt-4o-2024-08-06 performs worse.
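For reference, the reimplemented request looks roughly like the sketch below. The category names and the item are made up (the real list has ~50 entries), but with strict structured outputs the schema's enum constrains the model to the list, so it cannot return anything outside it:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical, shortened category list; the real one has ~50 entries.
CATEGORIES = ["kitchenware", "electronics", "clothing", "miscellaneous"]

schema = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": CATEGORIES},
    },
    "required": ["category"],
    "additionalProperties": False,
}

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[
        {
            "role": "system",
            "content": "Pick the best category for the item. "
                       "If nothing fits well, choose 'miscellaneous'.",
        },
        {"role": "user", "content": "Item: bamboo drawer organizer"},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "item_category", "strict": True, "schema": schema},
    },
)
print(response.choices[0].message.content)  # e.g. {"category": "miscellaneous"}
```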