GPT-4o vs. gpt-4-turbo-2024-04-09, gpt-4o loses

Hi,

We recently switched from gpt-4-turbo-2024-04-09 to gpt-4o, but the prompt that works perfectly well with gpt-4-turbo-2024-04-09 doesn’t work with gpt-4o. It simply doesn’t follow the instructions properly.
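
For readers following along, the “switch” here is just the model string in the chat completions call; a minimal sketch with the OpenAI Python SDK (the system and user prompts below are placeholders, not the actual ones):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def run_prompt(model: str) -> str:
    """Send the same (placeholder) prompt to whichever model is being tested."""
    response = client.chat.completions.create(
        model=model,  # "gpt-4-turbo-2024-04-09" before the switch, "gpt-4o" after
        messages=[
            {"role": "system", "content": "Follow the instructions below exactly."},
            {"role": "user", "content": "<the prompt that worked on gpt-4-turbo-2024-04-09>"},
        ],
        temperature=0,  # keep sampling noise out of the comparison
    )
    return response.choices[0].message.content

print(run_prompt("gpt-4-turbo-2024-04-09"))
print(run_prompt("gpt-4o"))
```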

On the other hand, gpt-4-turbo-2024-04-09 is stable and works perfectly well.

Anyone else experiencing similar issues?

M.

2 Likes

gpt-4o will likely be better for most things most people do most of the time, which of course means there will be some things gpt-4-turbo will be better at.

Your application might fall into this category.

But even if it doesn’t, and gpt-4o is more capable for your use case, you’ll almost certainly need to tweak your instructions for the new model, as it is an entirely different model from gpt-4-turbo, with a different architecture and trained on different data.

4 Likes

It’s better at some stuff and not at others. It’s definitely better at coding. But I agree with you, there should be more transparency on exactly what the models are better at. The GPT-4 models are incredibly expensive for anyone running a business that burns through a lot of tokens; $5 / 1M is way too much. I’d really try to get 3.5 to work as well as possible.

2 Likes

Yes, I have found the same. It seems much more prone to answering what it thinks I am asking than being very careful to follow my precise instructions.

3 Likes

I also came to this conclusion when I just switched the model. But after I rewrote my prompts, it performs better and more stably than gpt-4-turbo.

I guess you will have to do the same thing. It might be a little painful, but it’s definitely worth it for the better TTFT (time to first token), TPS (tokens per second), and half the cost.

3 Likes

I actually rewrote some prompts. Tried them in the Playground and they looked good. Tried them in the real application: good, then bad, then good again, and so on. Not stable.

But I’d appreciate any tips regarding how to transform the prompts to work well with gpt-4o.

Yes, the new model may have an entirely new architecture and be trained on different data. The problem is: I’m not. I still have the same “prompt engineering” logic. I think the OpenAI team should guide us on how to transform our prompts to work well with the new models (frankly, it would be much better if they were backward compatible).

1 Like

It’s unlikely even they know. There is essentially an infinite number of possible inputs to the models, and depending on how the behavior changes from one version to the next, some users may prefer the results they get from the new model with the same prompt, while others may prefer the results from the older model.

It simply isn’t feasible to produce a comprehensive model transition guide for everyone’s bespoke prompts.

As far as “backwards compatible” goes… I’m not even entirely sure what that would mean in this context. That the new models always produce the same output as the old models?

If the old model gives better results, why not just use the old model? It’s still there. Or use both models depending on which suits your needs in the moment.
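
A minimal sketch of that per-request choice, assuming the OpenAI Python SDK; the routing rule and task names here are made up just to show the idea:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical routing rule: keep the older model for the task it already
# handles well, and send everything else to gpt-4o.
def pick_model(task: str) -> str:
    if task == "strict-instruction-following":
        return "gpt-4-turbo-2024-04-09"
    return "gpt-4o"

def ask(task: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model=pick_model(task),
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(ask("strict-instruction-following", "Normalize this text: <input>"))
```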

1 Like

You may be right. I’m not an AI expert. What I’m saying may not be possible at all.

Regarding backward compatibility: what I meant was that the same prompt shouldn’t produce “worse” outputs than before, but again, that may not be possible.

Then you’d need to quantify how often, if ever, the model can produce worse outputs.

Imagine we can strictly quantify a model’s strength.

Then say they have a new model that is ten-times better than the old one at everything other than writing salamander-themed haikus, where it’s only half as good.

Should they not release the new model because it produces worse outputs for that narrow use case?

I would argue they should release such a model.

OpenAI, with very few exceptions, knows everything their GPT models have been and are prompted for. They are building models which they hope are generally better overall, but which are especially better for most of the things most of their users want to do most of the time.

Unfortunately, that sometimes means if you really need or want to do something not very many people are doing, a newer model might not be as strong as an older one for that particular thing you want to do…

Sometimes, the goal might be a model that meets most people’s needs but is much smaller and more efficient, so it can do 90% of what they need for half the cost.

The end-goal is better models for everyone, but there are many competing needs and motivations, so the path there is unlikely to be strictly increasing all the time for everyone.

2 Likes

My observation is that the new model tries to be quick and compromises quality because of that. I tweaked the prompt and got mixed results.

I can’t quantify my experience properly; all I can do is go back to using the old model and draw attention to “my problem” by mentioning it here.

Oh boy, not me using 3.5, the most stable model, with one-shot prompting from a gpt-4o response to get 3.5 to sound like 4o at a tenth of the cost (rough sketch below)…

Man…

I hope no one steals my idea and also saves themselves loads of money
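
For anyone who wants to try it, a rough sketch of the idea with the OpenAI Python SDK; the saved example pair is hypothetical, and you would substitute a real gpt-4o response you were happy with:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical exemplar: one input you once sent to gpt-4o and the answer it
# gave, saved and reused as a one-shot example in every gpt-3.5-turbo call.
EXAMPLE_INPUT = "Summarize this report in three bullet points: <report text>"
EXAMPLE_GPT4O_OUTPUT = "<the gpt-4o answer you want 3.5 to imitate>"

def answer_like_4o(user_input: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer in the same style and structure as the example."},
            # The one-shot pair: show 3.5 exactly how gpt-4o answered once.
            {"role": "user", "content": EXAMPLE_INPUT},
            {"role": "assistant", "content": EXAMPLE_GPT4O_OUTPUT},
            # The real request.
            {"role": "user", "content": user_input},
        ],
    )
    return response.choices[0].message.content
```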

2 Likes

The one that is 1/4 the price doesn’t tell us what the heck it is writing the haiku about.

### gpt-4o

Fiery skin, cool stream,
Silent watcher of the night,
Nature’s gentle flame.

In mossy green glade,
Tiny feet tread ancient paths,
Wisdom in each step.

Emerald eyes gleam,
Mystery in twilight’s hush,
Life’s quiet keeper.

### gpt-4

In damp, darkened woods,
Salamander lies unseen,
Life beneath the green.

Fire in its essence,
Salamander, flame’s dancer,
In heat, its presence.

Slimy, sleek, and quick,
Through the moss, a trail it picks,
Salamander’s trick.

Now that’s interesting :slight_smile: As an old Chinese proverb says: “a person who gives an example of ‘using 3.5, the most stable model, with one-shot prompting from a gpt-4o response to get 3.5 to sound like 4o at a tenth of the cost’ is a person who deserves a golden ticket to heaven.”

Did some quick research. Man, if this works, you definitely deserve a golden ticket to the heavens :smiley:

If it works for you, report back to us. And then spread the gospel of 3.5 further.

I will say, if you’re looking for a “chat buddy” experience, 4o will probably always be better. But for my use case (summarization of data), this seems to have worked.

In my use case (text normalization) it may work even better, but let me try it and I will post the results here. Thanks @tventura94!

1 Like

Definitely do; I’ll be waiting to hear back. It would be interesting if my hunch is correct.

Can you give an example of how you rewrote a prompt to work better with 4o? Thanks!!

4o is absolute trash for coding. All I do is code with GPT, and I’ve been doing it since GPT-3. I told 4o to create an AHK v2 script to open 4 copies of VLC and place them full screen on each of my monitors. It couldn’t even get past the fact that it was v2; it gave me v1 code. I linked it to the documentation, told it it was for v2, sent it the error message, and it just couldn’t do it. Not only that, it keeps making assumptions while handing me the entire solution in a long-winded, inefficient, over-explained way. It also won’t listen to my direct commands: I told it to just talk to me and answer my question, and it tried to give me the code again.

I just went back to GPT-4, gave it the same prompt, and it gave me the code PERFECTLY.

1 Like