Are you only seeing this on 4o?
I just ran a test with three models in the same conversation (I can switch models mid-conversation, which lets me send the entire context window to each model and compare the replies):
These tests were all run over the API on an in-house system, all against “real” models (no -preview models).
Definitely some confirmation regarding 4o’s performance, at least for this specific kind of topic. A few months ago I used 4o extensively for coding (before o1 and o3 were available), and I remember almost never getting replies above ~2.5k tokens, even in extreme scenarios. So I personally haven’t seen much change there, at least in that kind of context.
(Same context window sent to each: a series of around ten message/response pairs.)
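For reference, the test loop is essentially the pattern below. This is a minimal sketch assuming the official OpenAI Python SDK and an `OPENAI_API_KEY` in the environment; the placeholder `history` stands in for my real ~10-pair context window, and the `completion_tokens_details.reasoning_tokens` field is read defensively since it only shows up for reasoning models.

```python
from openai import OpenAI

client = OpenAI()

# Same conversation history replayed against each model
# (placeholder turns; the real one was ~10 message/response pairs).
history = [
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."},
    {"role": "user", "content": "Compare the test results to the blueprint and adjust the code."},
]

for model in ("o1", "gpt-4o", "o3-mini"):
    resp = client.chat.completions.create(model=model, messages=history)
    u = resp.usage
    # Reasoning models report reasoning tokens separately, when available.
    details = getattr(u, "completion_tokens_details", None)
    reasoning = getattr(details, "reasoning_tokens", None) if details else None
    print(f"{model}: prompt={u.prompt_tokens}, completion={u.completion_tokens}, "
          f"total={u.total_tokens}, reasoning={reasoning}")
```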
o1 test:
Request Model: o1
Usage: Prompt: 68841, Completion: 9327, Total: 78168
(reasoning was about 1k and actual response was about 8.25k)
4o test:
Request Model: gpt-4o
Usage: Prompt: 68669, Completion: 1835, Total: 70504
o3-mini test:
Request Model: o3-mini
Usage: Prompt: 68854, Completion: 11210, Total: 80064
(reasoning was about 1.5k and the actual response was about 9.6k)
This, however, was a context window containing a large coding task, with about a dozen full code documents shared along with blueprints and multiple debugging stages (the prompt in this case was a series of ten tests and their results, asking the model to compare the test results to the blueprint and then adjust the code accordingly).
I’ve noticed that o3-mini is really the best at giving longer responses, but o1 is actually better at accurately completing complex multi-document tasks. With 4o, I remember getting longer responses when I provided highly structured input along with instructions to return the output in an equally structured format (roughly like the sketch below). But back when I used 4o for coding, I definitely found I had to restrict my tasks: even with large amounts of input data, the success of the output depended on how specific and non-multi-faceted the task was.
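The kind of “highly structured” instruction I mean looked roughly like this. It’s a reconstructed sketch (reusing `client` and `history` from the snippet above), not my original prompt, and the section names are purely illustrative:

```python
# Illustrative only: a structured-output instruction of the kind that
# used to pull longer responses out of gpt-4o. Section names are made up.
structured_instructions = """
Respond using exactly these sections, in order:
1. SUMMARY - one paragraph restating the task.
2. FILE-BY-FILE CHANGES - one subsection per file, with the full updated code.
3. TEST MAPPING - each of the ten tests, mapped to the change that addresses it.
4. OPEN QUESTIONS - anything ambiguous in the blueprint.
Do not omit a section, even if it is empty.
"""

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=history + [{"role": "user", "content": structured_instructions}],
)
print(resp.choices[0].message.content)
```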