Hi! downstream from yesterday’s OpenAI live, I wanted to compare with you on which model version is actually the best. GPT-4 or GPT-4o? I’m not interested in image generation, or document/speech analysis, I’m really interested from a hyperparameter point of view, ability to avoid hallucination, understanding and validity of output, in short which one is more reliable in your opinion for a dedicated customer service chatbot? Thank you very much in advance
Writing good prompts is a big part of avoid hallucination, if the model is not given enough context it may invent things. What issues are you facing?
From my experience so far, GPT-4 feels like between GPT-4 and GPT-3.5 when it comes to understanding the prompt. It still has a bit of the “hallucination” like GPT-3.5, but its responses come across more human-like. I’m super excited to see what the next version will bring!
I find this really weird. I played around with it a bunch, and it is very obvious, that GPT-4-turbo is a lot better than GPT-4o. Give it any logic riddle or tell it to act in a certain way, and it fails way more. I think it’s maybe more human, but less intelligent and a lot lot less steerable with system prompts.
In my experience
Same experience here, GPT 4 turbo is much better for step by step tasks. In general it understands much better the prompt instructions.
What do you expect? It’s so much faster (and I think cheaper too?), that they had to cut corners on the accuracy and reasoning capability, obviously.
imo GPT-4o (overhyped) is a step backward for things that really matter (the “mission”, i.e. better reasoning capability and increased FACTUAL ACCURACY etc.).
EDIT: not that it matters much, but since my harsh/bitter comment here, I did get to play a bit more with GPT4o, including for C++ coding etc. and I’ve been pretty pleased with the generated content. This is of course very anecdotal, but for my use cases it seems pretty much up to par with the older/slower GPT4.
Well, the marketing said it had the same intelligence, I guess that may confuse many of people (including me) on thinking that it was as capable as GPT 4.
I agree on the set backward part.
We can tell people this new thing we trained is GPT-4, and charge 10x as much. Branding!
And its lack of attention.
Start an intriguing chat premise:
Then dump an over-trained trigger on the AI:
OpenAI built in their own jailbreak to any closed-domain AI application you might have been considering.
We actually did a quick analysis on classification, data extraction, and reasoning and learned that GPT4o is definitely better & faster.
- For complex data extraction tasks, where accuracy is key, both models still fall short of the mark.
- For classification of customer tickets, GPT4o has the best precision compared to GPT4-Turbo. It still has the best precision when compared to Claude 3 Opus and GPT-4.
- For reasoning, GPT-4o has improved in tasks like calendar calculations, time and angle calculations, and antonym identification. However, it still struggles with word manipulation, pattern recognition, analogy reasoning, and spatial reasoning.
You can read the whole analysis here: GPT-4o vs GPT-4 Turbo
So far I am finding “turbo” better at finding links to websites on specific subjects. “o” seems to be making a lot more mistakes. Just a quick test from my side though.
GPT-4 is still much better for our complex tasks that require careful reading and proper prompt following. GPT-4-Turbo was OK for remedial tasks or “conversation” but we use GPT-3.5-turbo for that.
GPT-4o is very bad compared to GPT-4 and even GPT-4-turbo for our uses, but we switched to GPT-4o anyway because of the price and have our scripts filter out the terrible outputs we receive sometimes…some of the outputs are random strings that have nothing to do with our prompts. Once 4o gave us information on a Boeing plane specs randomly.
Frustrating to see leaps forward in Image reading (4o is GREAT at that) but large steps back in complex analysis or tasks.
One of our simplest benchmarks is whether a model can answer a Multiple Choice Question of “All of the following are TRUE, EXCEPT:” on a semi-complex topic.
4 fails often but the rest of the models fail every time.
Some rankings and benchmarks on which they are evaluated
I wrote a fairly lengthy document could I use with a GPT that I created. I have had really good success with that. Then, I feed that document into GPT-4o and it goes completely off the rails every time. I basically ask it to analyze the information and index it to make it easily readable. It always starts out really well. When it gets halfway to 3/4 of the way through, it starts throwing all kinds of wild information in there that has absolutely nothing to do with the document. I have tried it more than a hundred times now and it totally destroys it every time, in a bad way. I am really hoping that changes when we are able to use gpt40 to create gpt’s with.
Referring to the APIs:
Both GPT4o and GPT4 Turbo are terrible in comparison to GPT-4 for some things, but in other places GPT4o shares the same terrible logic as GPT-4. GPT4o has been a good supplement for most things that do not involve dealing with analysis and logic, and strict commands, but I often have to switch to GPT-4 for better responses.
- Example 1 - GPT4o consistently fails to respond properly to system prompts and commands. I gave an instruction for it to catch discrepancies in numbers. If I say the price has increased, but the price has decreased, GPT4o was instructed to not assume I am right and correct me. GPT4o ignored the system prompt. I changed it to GPT4 and GPT4 100% got it right, GPT4o proceeded as if I were correct.
- Example 2 - If GPT4o and I are talking and I give it a new instruction at the end or beginning of a text in a conversation, it ignores what I have said and does what it defaults to do. It often repeats everything I wrote. If I write a paragraph and tell it not to do something, it repeats the same paragraph I gave it to edit, even with no changes, and even after I told it not to edit or repeat what I have said. If I tell it confirm before doing something, it does it and then asks if that is what I want, but GPT 4 often immediately gets what I asked for and does it right.
I have been extremely confused by all the hype surrounding GPT4o.
- How can Claude X be better than GPT-4 but GPT4o which is worse than GPT4 in many regards is better than Claude X.
- How can GPT4o be better when it doesn’t listen to prompts, and fails repeated tasks or in some regards is even more literal than it’s predecessors.
At times I have found GPT4o quicker, faster, and infuriating. It sometimes gives equal output to GPT4, but it is not very good at reasoning and logic.
- Example 3 - I wrote a lie, and added several winks after it. I added a note to tell GPT4o that the statement written was a lie, and asked to see if it could pick up the context of what the winks meant. Neither GPT4 or GPT4o were able to grasp that logic, but with more specific prompting GPT4 got it, and GPT4o was still confused. With GPT4o saying the winks were positive and referencing the statement as if it were the truth, even with the added context.
Because GPT4o is cheaper and sometimes equivalent to GPT4 (which is at times also a box of rocks), I find myself switching between GPT4o and GPT4 for the same types of conversations that require different types of analysis.
If people were able to get better responses with GPT4o then I need more information:
- Is this model the API version or the Chat version
- What parameters -(temperature, sampling, and other settings) are being used? (how can I repeat results)
- What prompts and tasks are actually being thrown at it during analysis
- Is it better than GPT-4 (GPT4 Turbo is worse than a box of rocks, so being better than GPT4 Turbo but not better than GPT 4 is not the best starting point)
So far no one has been able to give me this information and I am left baffled at where all this hype, often, from trusted sources are coming from. It feels like the “Asch conformity experiment”, where even when you know something is not true, the fact that everyone insists it is true pushes you to agree with them. I specifically went on a search just to figure out if I was actually going crazy with how bad GPT-4o actually is.
I second all that Y4ZM said, and I’m going to add my own (anecdotal) account.
I use both the API and the chat all day long, every day. I have done so for over a year, so I would say I am very adept at prompting.
The API is used in a CAT tool I have developed and of which I am also a user.
The chat I use all day long for either solving one-off coding problems or for translation support.
I feel that GPT4o is marginally better in its translation choices both in the API and on the chat, and every new model since 3.5 has made progress in this respect, to varying degrees.
But for help with coding (on the chat), GPT4o is incredibly bad. I should say infuriatingly bad. I strive to be more stubborn than a computer, so whenever I need some help I start off with GPT4o.
Often I run into a dead end with GPT4o; then I downshift to GPT4, start from scratch, and I’m able to solve the problem within 5 minutes - and not because I had already eliminated a bunch of alternatives with 4o, but simply because 4 is much more apt at building on a train of thought.
Interestingly (still in coding), sometimes GPT4 is having trouble so I downshift to GPT3.5, and I find that 3.5’s suggestions are much more helpful. Even when in the end I can’t find a solution that would be reasonable in the real world, I find that GPT3.5 suggestions are much more to the point and insightful than either of the other two.
GPT4o flat out ignores many of my instructions, refuses to change track (if I say “let’s try a different approach…”), repeats ad nauseum suggestions I have repeatedly told it don’t work, and is a lot worse than GPT4 at taking into account the history of the conversation.
The new “memory” feature in GPT4o seems to be just an extension of the user settings introduced with GPT4. And GPT4 and 4o are equally bad at abiding by those instructions - they will use them at the beginning of a conversation, but soon start ignoring them. I told 4o to remember that “following these instructions is more important than providing a good answer” and it seemed to save that into the memory, and then went right back to ignoring them.
But, like I said in the beginning, all this is anecdotal. I have neither the time nor the inclination to gather data and document all this. Nor do I think there would be any value in doing so - I’m sure the developers are aware of and addressing these issues for the upcoming releases.
I just wanted to vent. Thanks for listening.
I agree with the above. GPT4-turbo is wayyyy better for my use case.
4o is a major problem, needs to be removed. It’s still locking everyone into it also, and every new chat is a forced 4o. So when you run out you can’t even switch to 3.5 or 4 because it says “this chat uses tools”, but you can’t even switch so it doesn’t before it runs out so… scam, I call scam here. Yesterday I was using 4o and was able to switch to 3.5 and make a new chat in 3.5, today I can’t. They need to stop messing with things and making people very mad.
certainly wont be relying on OpenAI for anything in the future, they can’t even keep options open for people. ClosedAI is more like it. Wondering if Microsoft pulling their strings is starting to show in their business practices, certainly looks that way.
i’m Still evaluating four point. However, notably, it would seem that I had already gotten. my GPT 4 turbo. to effectively drop in to a retrained niche. So as I start up the new commercial plan, it’s like going back to teaching a child from the beginning again. and the 4O is not necessarily. that much better on some of the tasks than I’m trying to get done. however. It is interesting in the way that they implemented the program. Because basically. Previous chats. get a new opportunity to be upgraded for more extensive capabilities I definitely like that part a whole lot. I did look at your page and your items seem to be quite thorough for a test sequence. but many of the items seem to be sticking to standards that don’t necessarily go in areas that I require. seemingly I’m all by myself working with certain specialized data That effectively requires me to put some rather extensive. prompts data entry to effectively. critique. a train in with the models to actually seem to have a chance to work right. and in many respects I have to significantly upgrade the memory specifications of GPT to be able to handle the data and arguments that I’m working with. I have yet to see many people explain much about how to manage the memory settings. respective to GPT 4 however, does seem that the 4 O GPT will effectively conform to the same memory upgrade standards that I’ve been using I would certainly be interested in anybody else that has any understanding. of various methods to do this in a short manner,
I tested it with college class level documents. Then did extensive tests (questions, choices) on the material. I took all notes, lectures, images from a law class and crammed them in a large pdf. I mainly used GPT-4 architecture vs GPT4o (plus membership). GPT-4 architecture is similar to GPT-4 Turbo. GPT-4o is for high interaction rates that compromise a bit of precision. GPT-4 architecture rarely hallucinates, while GPT4o seems to have more of these moments.
You also have to understand you’re now talking to a different “brain”, different neural networks. It deducts differently. So getting used to communicating to it (tuning) is also part of the challenge. Both engines seem to interpret my “customized” settings differently as did GPT4 Turbo vs CustomGPTs (plus membership) when fed the same instructions. I do a lot of things with GPT4 architecture, like programming, and law documents. At times more than 40hours a week. It’s very efficient, regardless, everything must be double checked.
In this simple test GPT4o fails, vs GPT4 architecture.
Does GPT Turbo and GPT4o use different neural networks?
GPT4o:
Yes, GPT-4 and GPT-4 Turbo use different neural networks. GPT-4 Turbo is designed to be cheaper and faster than GPT-4, although the exact differences in architecture and operation are proprietary and not disclosed by OpenAI. Both models aim to provide high-quality language generation but with different optimization focuses.
Does GPT Turbo and GPT4o use different neural networks?
GPT4 architecture:
“Yes, GPT Turbo and GPT-4o use different neural networks. GPT Turbo is optimized for speed and lower resource usage, making it more suitable for applications requiring fast responses, while maintaining a high level of language understanding and generation capabilities. GPT-4o (often referred to as GPT-4 Optimal) provides more detailed and nuanced responses, suitable for more complex tasks requiring deeper understanding. Each model is tailored for different use cases based on performance and complexity needs.”