GPT-4o exhibits inferior memory compared to gpt-4-1106-preview

I have been experimenting with GPT-4o and have run into a rather frustrating issue. I use a virtual assistant tool powered by GPT; the current version is based on gpt-4-1106-preview. This assistant, which operates as an agent, relies on various tools to resolve user queries.

However, when evaluating GPT-4o to check response times (a notable selling point of this version for my use case), I confirmed that latency is indeed lower than with the previous model. Nevertheless, I ran into errors whenever the assistant tried to answer questions using tools that generate long text sequences.
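For anyone wanting to reproduce the latency comparison, a simple wall-clock timer around the completion call is enough. This is a sketch: `time_completion` and `fake_create` are hypothetical names, and the stub stands in for `client.chat.completions.create` from the official `openai` package so the example runs without an API key.

```python
import time

def time_completion(create_fn, model, messages):
    """Measure wall-clock latency of a single chat completion.

    `create_fn` stands in for client.chat.completions.create from the
    official openai package; any callable with the same shape works.
    """
    start = time.perf_counter()
    response = create_fn(model=model, messages=messages)
    elapsed = time.perf_counter() - start
    return response, elapsed

# Stubbed backend so the sketch runs offline.
def fake_create(model, messages):
    return {"model": model, "content": "ok"}

_, latency = time_completion(fake_create, "gpt-4o",
                             [{"role": "user", "content": "hi"}])
print(f"{latency:.4f}s")
```

Running the same prompt several times against each model and comparing the averages is the least-effort way to confirm the latency difference described above.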

To function correctly, agents must follow very specific instructions; if these are not followed to the letter, failures occur. With GPT-4o, I observed that when handling long tool outputs it forgets the key instruction for returning the agent's final response. As a result, it generates a response, but because it does not use the indicated method, the agent errors out and enters a loop of executing the tool, obtaining text, and failing to respond, until it reaches the iteration limit.
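The failure mode described above, where the model keeps triggering the tool because it never returns its answer in the required final-answer format, can be sketched as a plain loop with an iteration cap. All names here are hypothetical; agent frameworks implement the same guard with a setting along the lines of `max_iterations`.

```python
MAX_ITERATIONS = 5  # hypothetical cap on agent steps

def run_agent(call_model, call_tool, question):
    """Minimal agent loop: ask the model, run tools, stop on a final answer.

    `call_model` returns either ("final", answer_text) or ("tool", tool_input).
    A model that never emits the "final" marker exhausts the loop, which is
    exactly the behavior reported with GPT-4o on long tool outputs.
    """
    observation = None
    for _ in range(MAX_ITERATIONS):
        kind, payload = call_model(question, observation)
        if kind == "final":
            return payload
        observation = call_tool(payload)  # long tool output fed back in
    raise RuntimeError("iteration limit reached without a final answer")

# Stand-in for a model that ignores the final-answer instruction:
broken_model = lambda q, obs: ("tool", q)
try:
    run_agent(broken_model, lambda x: "very long tool output...", "question")
except RuntimeError as exc:
    print(exc)  # iteration limit reached without a final answer
```

A model that does follow the instruction, i.e. one returning `("final", ...)` on the first call, exits the loop immediately, which is what the poster sees with gpt-4-1106-preview.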

I found that when the text sequences are shorter, or when I simplify the instructions, the agent follows the orders correctly. This surprises me, as I never encountered this type of issue with gpt-4-1106-preview, which has worked flawlessly for as long as I have used it.

This leads me to suspect an information-retention (or memory) problem, which seems worse in GPT-4o than in the previous model (gpt-4-1106-preview).

Other possible explanations are that GPT-4o requires a different set of instructions to operate as an agent, or that my configuration is not correct (I use the defaults, but with temperature set to 0 so that the instructions are followed as strictly as possible).
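For reference, the configuration described, defaults everywhere except temperature pinned to 0, would look like this with the official `openai` Python client. The system prompt and user message are placeholders, and the actual network call is commented out since it needs an API key.

```python
# Request parameters as they would be passed to the official openai
# Python client; prompt contents are placeholders.
request = {
    "model": "gpt-4o",
    # Temperature 0 makes sampling near-deterministic; note it does not
    # by itself guarantee that instructions are followed.
    "temperature": 0,
    "messages": [
        {"role": "system", "content": "You are an agent. <instructions here>"},
        {"role": "user", "content": "<user query>"},
    ],
}

# With an API key configured, the call would be:
# from openai import OpenAI
# client = OpenAI()
# response = client.chat.completions.create(**request)

print(request["temperature"])  # 0
```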

Finally, from a commercial perspective, if the problem really is memory, why was the model given a 128k context window? That does not seem to make sense.


Given that GPT-4o has a different underlying architecture than the other GPT-4 models, it is fair to assume that amendments to the instructions are required to achieve outputs similar to before.

I do agree with your observation, which is consistent with what others have reported and with my own experience so far: instruction following is occasionally a challenge with this model. See also this thread: GPT-4o vs. gpt-4-turbo-2024-04-09, gpt-4o loses - #10 by elmstedt

OpenAI itself has noted that the model may underperform GPT-4 Turbo in some cases and that they are still gathering feedback on when specifically that happens:

As measured on traditional benchmarks, GPT-4o achieves GPT-4 Turbo-level performance on text, reasoning, and coding intelligence, while setting new high watermarks on multilingual, audio, and vision capabilities.

Through our testing and iteration with the model, we have observed several limitations that exist across all of the model’s modalities. We would love feedback to help identify tasks where GPT-4 Turbo still outperforms GPT-4o, so we can continue to improve the model.


Thanks for your answer. I guess we are all still experimenting with this new version of GPT.
