GPT-4 vs GPT-4o: which is better?

Hi! Following yesterday’s OpenAI live event, I wanted to compare notes with you on which model version is actually the best: GPT-4 or GPT-4o? I’m not interested in image generation or document/speech analysis. I’m really interested from a hyperparameter point of view: the ability to avoid hallucination, and the understanding and validity of the output. In short, which one is more reliable, in your opinion, for a dedicated customer service chatbot? Thank you very much in advance.


Writing good prompts is a big part of avoiding hallucination: if the model is not given enough context, it may invent things. What issues are you facing?

From my experience so far, GPT-4o feels like it sits somewhere between GPT-4 and GPT-3.5 when it comes to understanding the prompt. It still has a bit of the “hallucination” problem of GPT-3.5, but its responses come across as more human-like. I’m super excited to see what the next version will bring!


I find this really weird. I played around with it a bunch, and in my experience it is very obvious that GPT-4-turbo is a lot better than GPT-4o. Give GPT-4o any logic riddle or tell it to act in a certain way, and it fails way more often. I think it’s maybe more human, but less intelligent and a lot less steerable with system prompts.


Same experience here: GPT-4 Turbo is much better for step-by-step tasks. In general, it understands prompt instructions much better.


What do you expect? It’s so much faster (and I think cheaper too?), that they had to cut corners on the accuracy and reasoning capability, obviously.

imo GPT-4o (overhyped) is a step backward for things that really matter (the “mission”, i.e. better reasoning capability and increased FACTUAL ACCURACY etc.).

EDIT: not that it matters much, but since my harsh/bitter comment here, I did get to play a bit more with GPT4o, including for C++ coding etc. and I’ve been pretty pleased with the generated content. This is of course very anecdotal, but for my use cases it seems pretty much up to par with the older/slower GPT4.


Well, the marketing said it had the same intelligence. I guess that may have confused many people (including me) into thinking that it was as capable as GPT-4.
I agree on the step backward part.

We can tell people this new thing we trained is GPT-4, and charge 10x as much. Branding!

And its lack of attention.

Start an intriguing chat premise:

Then dump an over-trained trigger on the AI:

OpenAI built in their own jailbreak to any closed-domain AI application you might have been considering.


We actually did a quick analysis on classification, data extraction, and reasoning and learned that GPT4o is definitely better & faster.

  • For complex data extraction tasks, where accuracy is key, both models still fall short of the mark.
  • For classification of customer tickets, GPT4o has higher precision than GPT4-Turbo. It also has higher precision than Claude 3 Opus and GPT-4.
  • For reasoning, GPT-4o has improved in tasks like calendar calculations, time and angle calculations, and antonym identification. However, it still struggles with word manipulation, pattern recognition, analogy reasoning, and spatial reasoning.
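For anyone who wants to run a similar precision comparison on their own tickets, here is a minimal sketch. It is not the method used in the linked analysis: the function name, the labels, and the two "models" are all invented for illustration, and precision is computed per label as correct predictions of that label divided by all predictions of that label.

```python
from collections import Counter

def per_class_precision(y_true, y_pred):
    """Precision per label: correct predictions of a label / all predictions of it."""
    predicted = Counter(y_pred)
    correct = Counter(p for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / n for label, n in predicted.items()}

# Hypothetical ticket labels, made up for this example only.
gold    = ["billing", "billing", "bug", "bug", "other"]
model_a = ["billing", "bug",     "bug", "bug", "other"]
model_b = ["billing", "billing", "billing", "bug", "billing"]

for name, preds in [("model_a", model_a), ("model_b", model_b)]:
    print(name, per_class_precision(gold, preds))
```

Labels a model never predicts do not appear in the result, since precision is undefined for them.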

You can read the whole analysis here: GPT-4o vs GPT-4 Turbo


So far I am finding “turbo” better at finding links to websites on specific subjects. “o” seems to be making a lot more mistakes. Just a quick test from my side though.

GPT-4 is still much better for our complex tasks that require careful reading and proper prompt following. GPT-4-Turbo was OK for remedial tasks or “conversation” but we use GPT-3.5-turbo for that.

GPT-4o is very bad compared to GPT-4 and even GPT-4-turbo for our uses, but we switched to GPT-4o anyway because of the price, and we have our scripts filter out the terrible outputs we sometimes receive. Some of the outputs are random strings that have nothing to do with our prompts. Once, 4o randomly gave us information on a Boeing plane’s specs.
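As a rough idea of what such a filter could look like (this is a guess at the general approach, not the scripts described above), the sketch below rejects outputs that share almost no vocabulary with the prompt and retries. The names `looks_on_topic`, `first_acceptable`, and `generate` are hypothetical, and `generate` stands in for whatever function actually calls the model.

```python
import re

def looks_on_topic(prompt: str, output: str, min_overlap: int = 2) -> bool:
    """Crude relevance check: require a few shared non-trivial words."""
    def words(text):
        # Lowercased alphanumeric tokens longer than 3 characters.
        return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}
    return len(words(prompt) & words(output)) >= min_overlap

def first_acceptable(prompt, generate, max_attempts=3):
    """Call generate(prompt) until an output passes the check, else return None."""
    for _ in range(max_attempts):
        output = generate(prompt)
        if looks_on_topic(prompt, output):
            return output
    return None
```

A word-overlap check like this only catches completely unrelated outputs. It will not catch answers that are on topic but wrong.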

Frustrating to see leaps forward in Image reading (4o is GREAT at that) but large steps back in complex analysis or tasks.

One of our simplest benchmarks is whether a model can answer a multiple-choice question of the form “All of the following are TRUE, EXCEPT:” on a semi-complex topic.

GPT-4 fails often, but the rest of the models fail every time.
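A benchmark like this is easy to script. The sketch below is only an illustration: the question is a trivial invented one (real ones would be on a semi-complex topic), and `ask_model` is a hypothetical callable that takes a prompt string and returns the model's reply, so any API client can be plugged in.

```python
def score_except_questions(questions, ask_model):
    """Return the fraction of questions where the model picks the one FALSE option."""
    if not questions:
        raise ValueError("need at least one question")
    correct = 0
    for q in questions:
        options = "\n".join(f"{key}. {text}" for key, text in q["options"].items())
        prompt = f"{q['stem']}\n{options}\nAnswer with a single letter."
        reply = ask_model(prompt).strip().upper()
        # Only the first character of the reply is compared to the answer key.
        if reply[:1] == q["answer"]:
            correct += 1
    return correct / len(questions)

# Invented example question; "C" is the false statement.
QUESTIONS = [
    {
        "stem": "All of the following are TRUE, EXCEPT:",
        "options": {
            "A": "A triangle has three sides.",
            "B": "A week has seven days.",
            "C": "A square has five corners.",
            "D": "An hour has sixty minutes.",
        },
        "answer": "C",
    },
]
```

Running the same question set several times per model gives a failure rate rather than a single pass or fail.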


Some rankings and benchmarks on which they are evaluated:

LMSYS Leaderboard

LLM benchmarks

I wrote a fairly lengthy document that I use with a GPT that I created. I have had really good success with that. Then I fed that document into GPT-4o, and it goes completely off the rails every time. I basically ask it to analyze the information and index it to make it easily readable. It always starts out really well. When it gets halfway to three quarters of the way through, it starts throwing in all kinds of wild information that has absolutely nothing to do with the document. I have tried it more than a hundred times now and it totally destroys it every time, in a bad way. I am really hoping that changes when we are able to use GPT-4o to create GPTs.

Referring to the APIs:

Both GPT4o and GPT4 Turbo are terrible in comparison to GPT-4 for some things, but in other places GPT4o shares the same terrible logic as GPT-4. GPT4o has been a good supplement for most things that do not involve analysis, logic, or strict commands, but I often have to switch to GPT-4 for better responses.

  • Example 1 - GPT4o consistently fails to respond properly to system prompts and commands. I gave it an instruction to catch discrepancies in numbers: if I say the price has increased when it has actually decreased, it should not assume I am right and should correct me. GPT4o ignored the system prompt and proceeded as if I were correct. I switched to GPT4, and GPT4 got it 100% right.
  • Example 2 - If GPT4o and I are talking and I give it a new instruction at the beginning or end of a message, it ignores what I have said and does what it defaults to doing. It often repeats everything I wrote. If I write a paragraph and tell it not to do something, it repeats the same paragraph I gave it, even with no changes, and even after I told it not to edit or repeat what I have said. If I tell it to confirm before doing something, it does it and then asks if that is what I want. GPT 4, by contrast, often immediately gets what I asked for and does it right.

I have been extremely confused by all the hype surrounding GPT4o.

  • How can Claude X be better than GPT-4, while GPT4o, which is worse than GPT4 in many regards, is better than Claude X?
  • How can GPT4o be better when it doesn’t listen to prompts, fails repeated tasks, and in some regards is even more literal than its predecessors?

At times I have found GPT4o faster, but infuriating. It sometimes gives output equal to GPT4, but it is not very good at reasoning and logic.

  • Example 3 - I wrote a lie and added several winks after it. I added a note telling GPT4o that the statement was a lie, and asked whether it could pick up from the context what the winks meant. Neither GPT4 nor GPT4o was able to grasp that logic at first, but with more specific prompting GPT4 got it, while GPT4o was still confused: it said the winks were positive and referenced the statement as if it were the truth, even with the added context.

Because GPT4o is cheaper and sometimes equivalent to GPT4 (which is at times also a box of rocks), I find myself switching between GPT4o and GPT4 for the same types of conversations that require different types of analysis.

If people were able to get better responses with GPT4o then I need more information:

  • Is this model the API version or the Chat version?
  • What parameters (temperature, sampling, and other settings) are being used? (How can I repeat the results?)
  • What prompts and tasks are actually being thrown at it during analysis?
  • Is it better than GPT-4? (GPT4 Turbo is worse than a box of rocks, so being better than GPT4 Turbo but not better than GPT 4 is not the best starting point.)

So far no one has been able to give me this information, and I am left baffled about where all this hype, often from trusted sources, is coming from. It feels like the “Asch conformity experiment”, where even when you know something is not true, the fact that everyone insists it is true pushes you to agree with them. I specifically went on a search just to figure out if I was actually going crazy about how bad GPT-4o actually is.
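One way to make reports comparable would be to post the exact settings alongside the result. The sketch below is just a suggestion for recording them. `RunConfig` and its field names are hypothetical, chosen to mirror the questions in the list above, and the example values are placeholders rather than anyone's real setup.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class RunConfig:
    interface: str       # "api" or "chat"
    model: str           # model name as used in the request
    temperature: float
    top_p: float
    system_prompt: str
    user_prompt: str

    def to_json(self) -> str:
        """Stable JSON form that can be pasted into a post."""
        return json.dumps(asdict(self), sort_keys=True, indent=2)

    def fingerprint(self) -> str:
        """Short hash so two people can check they ran the same setup."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

# Placeholder values for illustration only.
config = RunConfig(
    interface="api",
    model="gpt-4o",
    temperature=0.0,
    top_p=1.0,
    system_prompt="Do not assume the user's numbers are right; flag discrepancies.",
    user_prompt="The price went from 120 to 95, so it increased.",
)
print(config.fingerprint())
print(config.to_json())
```

Matching fingerprints only show that the inputs were the same. Outputs can still differ between runs, because sampling is not fully deterministic.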

