Are you only seeing this on 4o?
I just ran a test with three models in the same conversation (I can switch models mid-conversation, which lets me send the entire context window to each model and compare the replies):
These tests were all run over the API on an in-house system, all against “real” models (no -preview models).
Definitely some confirmation regarding 4o’s performance, at least for this specific kind of topic. A few months ago I used 4o extensively for coding (before o1 and o3 were available), and I remember almost never getting replies above ~2.5k tokens, even in extreme scenarios. So I personally haven’t seen much change there, at least in that kind of context.
(Same context window sent to each: a series of around ten message/response pairs.)
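For reference, the test loop is essentially the pattern below. This is a minimal sketch assuming the official OpenAI Python SDK and an `OPENAI_API_KEY` in the environment; the placeholder `history` stands in for my real ~10-pair context window, and the `completion_tokens_details.reasoning_tokens` field is read defensively since it only shows up for reasoning models.

```python
from openai import OpenAI

client = OpenAI()

# Same conversation history replayed against each model
# (placeholder turns; the real one was ~10 message/response pairs).
history = [
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."},
    {"role": "user", "content": "Compare the test results to the blueprint and adjust the code."},
]

for model in ("o1", "gpt-4o", "o3-mini"):
    resp = client.chat.completions.create(model=model, messages=history)
    u = resp.usage
    # Reasoning models report reasoning tokens separately, when available.
    details = getattr(u, "completion_tokens_details", None)
    reasoning = getattr(details, "reasoning_tokens", None) if details else None
    print(f"{model}: prompt={u.prompt_tokens}, completion={u.completion_tokens}, "
          f"total={u.total_tokens}, reasoning={reasoning}")
```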
o1 test:
Request Model: o1
Usage: Prompt: 68841, Completion: 9327, Total: 78168
(reasoning was about 1k and actual response was about 8.25k)
4o test:
Request Model: gpt-4o
Usage: Prompt: 68669, Completion: 1835, Total: 70504
o3-mini test:
Request Model: o3-mini
Usage: Prompt: 68854, Completion: 11210, Total: 80064
(reasoning was about 1.5k and the actual response was about 9.6k)
This, however, was a context window containing a large coding task, with about a dozen full code documents shared along with blueprints and multiple debugging stages (the prompt in this case was a series of ten tests and their results, asking the model to compare the test results to the blueprint and then adjust the code accordingly).
I’ve noticed that o3-mini is really the best at giving longer responses, but o1 is actually better at accurately completing complex multi-document tasks. With 4o, I remember getting longer responses when I provided highly structured input along with instructions to return the output in an equally structured format (roughly like the sketch below). But back when I used 4o for coding, I definitely found I had to restrict my tasks: even with large amounts of input data, the success of the output depended on how specific and non-multi-faceted the task was.
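The kind of “highly structured” instruction I mean looked roughly like this. It’s a reconstructed sketch (reusing `client` and `history` from the snippet above), not my original prompt, and the section names are purely illustrative:

```python
# Illustrative only: a structured-output instruction of the kind that
# used to pull longer responses out of gpt-4o. Section names are made up.
structured_instructions = """
Respond using exactly these sections, in order:
1. SUMMARY - one paragraph restating the task.
2. FILE-BY-FILE CHANGES - one subsection per file, with the full updated code.
3. TEST MAPPING - each of the ten tests, mapped to the change that addresses it.
4. OPEN QUESTIONS - anything ambiguous in the blueprint.
Do not omit a section, even if it is empty.
"""

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=history + [{"role": "user", "content": structured_instructions}],
)
print(resp.choices[0].message.content)
```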