My Honest Take on GPT-4o vs GPT-4-turbo-2024-04-09 vs GPT-4-1106

After extensively using these three versions of GPT-4, I’ll share my findings.

Firstly, it’s important to highlight some facts:

  • The original GPT-4, released in March 2023, is version 0314. It underwent several updates in 2023, with the latest being GPT-4-1106, launched at the OpenAI DevDay. This version is notable for its knowledge of events up until 2023 and an increased context window of 128k tokens.

  • In 2024, the gpt-4-turbo-2024-04-09 update was released, promising performance gains in reasoning.

  • More recently, the GPT-4o model was introduced with a new architecture, being faster and multimodal.

My perception:

For mathematics and logical reasoning, GPT-4-turbo-2024-04-09 is superior. This has been apparent in my personal tests since its release, compared with GPT-4-1106. GPT-4o, on the other hand, did not maintain the high standard set by GPT-4-turbo-2024-04-09 in these domains. Although GPT-4o seems slightly superior to GPT-4-1106 here, on some questions it enters an endless loop, repeating the same segments over and over.

When it comes to coding, gpt-4-turbo-2024-04-09 again takes the lead, especially on intricate tasks—it’s my go-to model. GPT-4o becomes my choice for generating extensive code, since it’s not lazy. However, for corrections and edits, GPT-4o can be repetitive and generate extraneous code. So when I need a long piece of code, I start with GPT-4o for the bulk of the work and switch to gpt-4-turbo-2024-04-09 when refining is needed.

For translation and writing, GPT-4-1106 is superior. I’ve observed that GPT-4-turbo-2024-04-09 is very sensitive to the frequency_penalty and presence_penalty parameters, and it often includes odd terms or phrases when producing long texts. GPT-4-1106 does not make these mistakes. In translation, GPT-4-1106 tends to produce more fluid, native-sounding sentences, whereas GPT-4-turbo-2024-04-09 is less versatile and sounds less natural. I haven’t extensively tested GPT-4o in this niche yet, but the few texts I asked GPT-4o to generate seemed slightly inferior to GPT-4-1106.

For explaining concepts and learning in general, all three are quite similar, but GPT-4o is the least reliable in terms of hallucinations. Moreover, when a specific detail in the prompt makes all the difference and must be taken into account to override the obvious common-sense answer, GPT-4o has the least capacity to perceive these nuances.

In summary, despite the lmsys rankings and some benchmarks released by OpenAI showing GPT-4o as superior in most cases, I don’t share this perception. Most of the time, I still use earlier versions of GPT-4, including the original 0314, which is less lobotomized and less verbose.

It appears that GPT-4o has been fine-tuned for broader appeal, especially by not being lazy, and this artificially inflates its scores. One of the issues with GPT-4o is that it exhibits certain alarming behaviors that make it seem remarkably unintelligent. It seems to be a model with fewer parameters and less ability to generalize, but trained on more selectively chosen data, which yields good performance in very specific situations yet causes it to struggle when stepping out of its comfort zone.

Caveat: No single model universally reigns supreme now. I’ve seen “inferior” models solve problems that “superior” ones couldn’t crack. Currently, the proliferation of GPT-4s can leave anyone confused, and I hope this will soon end with the arrival of GPT-5.

Thoughts?


What parameters do you use for code generation? (temperature, top_p, frequency_penalty, presence_penalty)

For coding, I typically use temperature = 0, top_p = 1, frequency_penalty = 0, presence_penalty = 0.

I only increase the temperature when the code requires some level of creativity.
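As a minimal sketch, those settings map directly onto an OpenAI Chat Completions request. The model name and prompt below are placeholders, not something from the discussion above; the payload is just shown as a plain dict so the parameter choices are explicit:

```python
# Sampling parameters for code generation, as discussed above:
# deterministic decoding with no sampling truncation or penalties.
payload = {
    "model": "gpt-4-turbo-2024-04-09",   # placeholder; any GPT-4 variant works
    "messages": [
        {"role": "user", "content": "Write a function that parses a CSV line."},
    ],
    "temperature": 0,        # greedy-like decoding; raise only for creative code
    "top_p": 1,              # no nucleus-sampling truncation
    "frequency_penalty": 0,  # don't penalize repeated tokens
    "presence_penalty": 0,   # don't push the model toward new topics
}

# With the official client, this payload would be sent as:
#   from openai import OpenAI
#   client = OpenAI()
#   response = client.chat.completions.create(**payload)
```

Keeping both penalties at 0 matters for code, since penalizing repetition can mangle identifiers that legitimately recur many times.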