Strange Agent Behaviour With Tool Calling
September 4, 2024

Hello,

I am running experiments for a research paper, and since this morning tool calling has become unreliable in my gpt-4o (and gpt-4o-mini) calls. Users participating in my experiment reported that the app has stopped functioning. I checked, and since today the LLM agent no longer understands the stopping criteria of my agent that uses retrieval tools. This happens with both gpt-4o and gpt-4o-mini. Importantly, neither the code nor the retrieval tools were changed, so I suspect something changed on the model side.

I checked OpenAI's update pages, but there doesn't seem to have been a version update for the models today.

Has anyone encountered similar problems today?

Yes, if you search my recent posts, I've experienced the same over the past week or so: tool calling is no longer stable or reliable. This is with Assistants/Streaming, and it shows up more often with gpt-4o-mini than with gpt-4o.

Facing the same issue with our APIs… 😢

I'm using JSON mode (response_format: {type: "json_object"}) with the JavaScript openai SDK 4.57.1 and openai.beta.chat.completions.runTools.

In my case, model "gpt-4o" (gpt-4o-2024-05-13) now calls all my tools on the first request.
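If the immediate symptom is the model firing every tool at once, one mitigation that might be worth trying is disabling parallel tool calls. Here is a minimal sketch, assuming your SDK version passes the documented parallel_tool_calls parameter through runTools; I haven't verified that it avoids this particular regression:

// Sketch: ask the model to issue at most one tool call per turn.
// parallel_tool_calls is a documented Chat Completions parameter;
// whether it prevents the everything-at-once behaviour is untested.
const runner = openai.beta.chat.completions.runTools({
    model: "gpt-4o",
    messages: chatSession.history.toObject(),
    tools: [getUserLocationTool, getWeatherTool],
    parallel_tool_calls: false,
})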

I switched to gpt-4o-2024-08-06:

  • A first user request that needs a function works.
  • A second user request that needs a function now fails with 400 Invalid 'messages[2].tool_calls': empty array. Expected an array with minimum length 1, but got an empty array instead.

What I had to change was to exclude from my history messages the empty tool_calls array received on a ChatCompletion where finish_reason !== "tool_calls":

const runner = openai.beta.chat.completions.runTools({
    stream: false,
    messages: chatSession.history.toObject(),
    model: "gpt-4o-2024-08-06",
    temperature: 0.2,
    max_tokens: 1500,
    n: 1,
    tools: [getUserLocationTool, getWeatherTool],
    response_format: {type: "json_object"},
}).on("message", (message) => {
    // Keep tool result messages in the conversation history.
    if (message.role === "tool") {
        chatSession.history.push(message)
    }
}).on("chatCompletion", (chatCompletion) => {
    const message = chatCompletion.choices[0].message

    // Drop SDK-added fields the API does not accept back as input.
    delete message.refusal
    delete message.parsed

    // Workaround: strip the empty tool_calls array, which the API now
    // rejects with a 400 when it is sent back as part of the history.
    if (chatCompletion.choices[0].finish_reason !== "tool_calls" && message.tool_calls && !message.tool_calls.length) {
        delete message.tool_calls
    }

    chatSession.history.push(message)
})

const finalContent = await runner.finalContent()

await chatSession.save()

try {
    return JSON.parse(finalContent)
} catch (e) {
    console.log("Unable to parse finalContent")
}

I tried the model gpt-4o-2024-05-13 as you did, and it seemed to work as before. Once in a while I get "output: multi_tool_use.parallel is not a valid tool, try one of [MY_TOOLS].". It seems that multi_tool_use.parallel is something internal the model uses, so I guess I just have to wait for a fix to be deployed. If I use gpt-4 or gpt-4-turbo, everything works as it did before, but the omni versions have been crashing since yesterday.
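In the meantime, a possible client-side workaround is to unpack the hallucinated multi_tool_use.parallel call into the real tool calls before dispatching them. A minimal sketch, assuming the argument shape ({ tool_uses: [{ recipient_name, parameters }] }) that community reports describe; this is not a documented contract:

// Sketch: split a hallucinated "multi_tool_use.parallel" call into
// individual tool calls. The argument shape below is an assumption
// based on community reports, not a documented API.
function unpackMultiToolUse(toolCall) {
    if (toolCall.function.name !== "multi_tool_use.parallel") {
        return [toolCall]
    }
    const { tool_uses = [] } = JSON.parse(toolCall.function.arguments)
    return tool_uses.map((use, i) => ({
        id: `${toolCall.id}_${i}`, // synthetic ids for the split-out calls
        type: "function",
        function: {
            // reports show real tool names prefixed with "functions."
            name: use.recipient_name.replace(/^functions\./, ""),
            arguments: JSON.stringify(use.parameters ?? {}),
        },
    }))
}

You would run each assistant message's tool_calls through this before executing the tools, then return one tool result per synthetic id.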