I have been experimenting with GPT-4o and have run into a frustrating issue. I use a virtual assistant tool powered by GPT; the current version is based on gpt-4-1106-preview. This assistant operates as an agent and relies on various tools to resolve user queries.
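For context, the tools are standard function-calling definitions. A simplified sketch of one of them is below; the name `search_knowledge_base` and its parameters are placeholders for illustration, not my real tools:

```python
# One of the assistant's tools, in the standard Chat Completions tool format.
# The function name and parameters are placeholders, not my actual tool.
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_knowledge_base",
            "description": "Look up reference text to answer a user query.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search terms"},
                },
                "required": ["query"],
            },
        },
    }
]
```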
However, when evaluating GPT-4o to check response times (lower latency being one of this model's headline features, and the main reason I want to adopt it), I confirmed that latency is indeed lower than with the previous model. Nevertheless, I ran into errors whenever the assistant tried to answer questions using tools that generate long text sequences.
To function correctly, the agent follows very specific instructions, and failures occur if these are not followed to the letter. With GPT-4o, I observed that when a tool returns long text, the model forgets the key instruction that tells it how to deliver the agent's final response. It still generates an answer, but because it does not use the indicated method, the agent errors out and enters a loop of executing the tool, obtaining text, and failing to respond, until it reaches the iteration limit.
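To make the failure mode concrete, here is a stripped-down version of the loop my agent runs. This is a minimal sketch, not my actual framework code: `run_tool`, `run_agent`, and `MAX_ITERATIONS` are names I made up for illustration, and the cap value is arbitrary.

```python
from openai import OpenAI

client = OpenAI()
MAX_ITERATIONS = 10  # my framework's iteration cap; the exact value is illustrative


def run_tool(tool_call):
    """Placeholder: dispatch to the real tool and return its text output."""
    raise NotImplementedError


def run_agent(messages, tools, model="gpt-4o"):
    for _ in range(MAX_ITERATIONS):
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            tools=tools,
            temperature=0,
        )
        msg = response.choices[0].message
        if not msg.tool_calls:
            # The model followed the instructions and returned a final answer.
            return msg.content
        # The model requested a tool: run it and feed the text result back.
        messages.append(msg)
        for tc in msg.tool_calls:
            messages.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": run_tool(tc),  # can be a very long string
            })
    # With gpt-4o and long tool outputs, this is where the agent ends up:
    raise RuntimeError("Iteration limit reached without a final answer")
```

With gpt-4-1106-preview the loop exits through the `return` on the first or second pass; with GPT-4o and long tool outputs it keeps requesting the tool until the cap is hit.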
I discovered that when the text sequences are shorter, or when I simplify the instructions, the agent follows the instructions correctly. This surprises me, as I have never encountered this type of issue with gpt-4-1106-preview, which has worked flawlessly for as long as I have used it.
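As a temporary workaround, capping the tool output before handing it back to the model avoids the loop. The cutoff below is a value I picked by trial and error, not an official threshold:

```python
MAX_TOOL_OUTPUT_CHARS = 4000  # illustrative cutoff; the safe value likely varies


def cap_output(text: str) -> str:
    # Truncate long tool results: with shorter sequences, GPT-4o keeps
    # following the final-answer instruction and the agent behaves correctly.
    return text[:MAX_TOOL_OUTPUT_CHARS]
```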
This leads me to suspect an information-retention (or memory) problem, which appears to be weaker in GPT-4o than in the previous model (gpt-4-1106-preview).
Other possible explanations are that GPT-4o requires a different set of instructions to operate as an agent, or that my configuration is wrong (I use the default configuration, except for setting the temperature to 0 so that the instructions are followed as strictly as possible).
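For reference, the only setting I change from the defaults is the temperature; everything else in the call below is stock, and the `messages` and `tools` variables stand in for my real prompt and tool definitions:

```python
response = client.chat.completions.create(
    model="gpt-4o",     # with "gpt-4-1106-preview" the same setup works fine
    messages=messages,  # system prompt with the agent instructions + history
    tools=tools,
    temperature=0,      # pinned to 0 so instructions are followed deterministically
    # top_p, frequency_penalty, etc. are left at their defaults
)
```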
Finally, from a commercial perspective, if the problem really is memory, why give the model a 128k context window it cannot reliably attend to? That does not seem to make sense.