Function calling looping uncontrollably and calling unnecessarily

Trying out a simple prompt (ours is more complex, but this is the bare bones and replicates the bug behavior):

You are a helpful assistant.

Respond with a json object containing the following keys:
- message: the message to the client
- done: a boolean set to true only if the user confirms they have no further questions

I am able to get completions shaped as requested and everything works as expected. If I add a get_weather tool to either model, the tool is ALWAYS called. Even when I'm not asking about weather at all, the weather tool is called. If I respond from the tool that weather is unavailable, it will simply call the tool again.

My expectation would be that the model should respond with an answer to the query rather than requesting weather data when there is nothing in the message history requesting weather data.

Am I just doing something wrong here? This is the BAREBONES setup in playground and I am not able to get it to behave as expected.
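
For reference, here is roughly the same setup as a plain API call outside the playground. The get_weather definition and the user question are just placeholders to illustrate; any tool seems to trigger the same behavior:

from openai import OpenAI

client = OpenAI()

# Placeholder tool definition; the specific tool doesn't matter, it gets called regardless.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

system_prompt = (
    "You are a helpful assistant.\n\n"
    "Respond with a json object containing the following keys:\n"
    "- message: the message to the client\n"
    "- done: a boolean set to true only if the user confirms they have no further questions"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "What are your support hours?"},  # nothing to do with weather
    ],
    tools=tools,
    response_format={"type": "json_object"},  # combining this with tools is what triggers the looping
)

# tool_calls is populated even though the question has nothing to do with weather
print(response.choices[0].message.tool_calls)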

Thoughts?

You must have selected both the function and a response format of type json_object. Change it to text and you shouldn't have that issue.

are you saying it's not possible to have tools in a completion request and expect the output as a json value? It currently works this way in our production codebase; however, there are times where there's infinite looping of function calling.

update: i just set the response type to text, kept everything else the same…and…lo and behold, it works.

So…should I not set the response type to json if I want a json response? It still seemed to generate the json as described in the prompt despite text being set.

Do you know how this interacts with structured outputs? Suppose I turn my prompt schema into a json_schema; will the problem with function calling persist?

Sorry if I'm asking a lot of questions, but I'm days into this and covering my bases.

If you're testing structured outputs in the playground then you'll need to provide an actual schema for it by selecting json_schema as the response format, then inputting the schema in the window that pops up.

Introducing Structured Outputs in the API | OpenAI

You can check the API docs for more info on the different response format types.
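
For the message/done shape from your prompt, the response_format would look something like this (the schema name is just a placeholder, and client/messages are whatever you already use in your chat.completions call):

response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "client_reply",  # placeholder name
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "message": {
                    "type": "string",
                    "description": "the message to the client",
                },
                "done": {
                    "type": "boolean",
                    "description": "true only if the user confirms they have no further questions",
                },
            },
            "required": ["message", "done"],
            "additionalProperties": False,
        },
    },
}

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",  # structured outputs need a model that supports json_schema
    messages=messages,
    response_format=response_format,
)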

yep, doing that now, but having an issue with it: when it does call the tool, responding to the tool seems to cause an unknown error:

An error occurred. Either the engine you requested does not exist or there was another issue processing your request. If this issue persists please contact us through our help center at https://help.openai.com.

any thoughts on that?

also, I'm noticing that despite asking for a JSON response, I'm getting text back from the model on almost every response. This was the reason why we set the response format to JSON. Why does the combination of "json_object" with JSON format in the prompt yield endless looping over functions?

Have the exact same issue. A lot of my apps are designed this way and it's suddenly stopped working. Switching to structured outputs will take time as I've specified most of these json schemas as plain-text structures. It used to work perfectly till like 2 weeks ago.

deactivate memory, change browsers, or clear your cache

No, this is a real problem that arose with the introduction of structured outputs and the "json_schema" response format.
Using "json_object" wasn't a problem until recently, even on GPT-4o, but now using this type of JSON output with function calls is totally incompatible.
We've had runaway usage as a result of looping parallel calls (thousands of dollars), and even when working around this, the GPT-4o models behave erratically and can call the same function several times in a row with the same parameters, even though the model is receiving the correct result.
If you go further and question the model after removing its ability to call the function, its explanation is that it needs to check again and again whether the function gives a consistent result.
I reported this bug to OpenAI 2 days ago.

Have you found a solution to this? I have noticed from a previous poster's suggestion that changing back to text for the response type has unblocked us, and I haven't seen the issues with tool calls looping. The model does sometimes return plain text when I am asking for JSON, so that is less reliable. It might be tunable with prompt engineering…

We are still experimenting for now, as our engineering cost to move to structured outputs is similarly high. Our occurrence rate for these infinite loops is fairly low and has mostly occurred in local testing. We hope to get ahead of this and find a solution before it becomes more widespread.

Yes, in fact, switching back to text is the only valid solution.
The problem can be reproduced in a minute from the OpenAI Playground.
All you have to do is select the "json_object" format and add a function (the weather example, for instance, is already set up).
With that setup, the GPT-4o family simply keeps calling the function.
What's most disturbing is when you switch back to text format during the conversation: the model can reply again, doesn't understand why it felt compelled to call the function, and explains that it must be a bug.
So I switched back to text on my production platform…

You would never want to use tool calling and JSON mode together in the same inference, and the fact that playground allows this is a bug in playground. Here is when to use each:

Json mode: when you don't know (or care about) the structure of your data, but you do know you need it back as a json object. Example: you pass in a blob of text headers and you're using the LLM to give you back a headers dict that you could use for http requests.

Tool calling: you do know the structure AND you're legitimately using the LLM to call tools OR you need more flexible extraction than structured outputs can provide (datetime, defaults, etc).

Structured outputs: You do know the structure and you have no need for further LLM generation from the direct response to it (tool calling).

Now, there are some creative combinations that can occur such as: LLM tool call → tool response → structured output response. But you would still never supply the model with a JSON schema and then force it into json mode.

TLDR: Json mode when structure is undefined. Tool calling or structured outputs when it is. Never used together.
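
A rough sketch of that tool call → tool response → structured output combination with the Python SDK, where messages, tools, response_format, and run_tool are stand-ins for your own conversation history, tool definitions, json_schema format, and tool-execution code:

from openai import OpenAI

client = OpenAI()

# First call: let the model decide whether to call tools; leave response_format at its text default.
first = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,  # conversation so far
    tools=tools,        # e.g. a get_weather definition
)

msg = first.choices[0].message
if msg.tool_calls:
    messages.append(msg)  # keep the assistant turn that contains the tool calls
    for call in msg.tool_calls:
        messages.append(
            {
                "role": "tool",
                "tool_call_id": call.id,
                "content": run_tool(call),  # stand-in for whatever actually executes the tool
            }
        )

# Second call: with the tool results in the history, ask for the final answer as a
# structured output instead of forcing json_object mode alongside the tools.
final = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=messages,
    response_format=response_format,  # a json_schema response format for your output shape
)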

Preface: Use gpt-4o-2024-08-06 specifically for the json_schema response format (structured outputs)

Functions + json_schema is giving errors, against expectation, with chat completions playground code.

Documentation: Structured Outputs is available in two forms in the OpenAI API:

  1. When using function calling
  2. When using a json_schema response format

Documentation doesn't spell out that these are exclusive, separate uses.

I made a function requiring two arrays about the same entity-extraction items, where the AI must call the function to find out each item's class instead of making its own determination (is "tomato" a vegetable or a fruit according to us?).

The AI somewhat unexpectedly employs parallel tool calls to split the entities it asks about. I type up a response for each function call because I don't actually have answering code.
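
A sketch of the kind of definition I mean (the names and descriptions here are made up for illustration):

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_item_classes",  # made-up name
            "description": "Look up the class of each extracted item instead of deciding it yourself.",
            "parameters": {
                "type": "object",
                "properties": {
                    "items": {
                        "type": "array",
                        "items": {"type": "string"},
                        "description": "The extracted entity names, e.g. 'tomato'",
                    },
                    "item_contexts": {
                        "type": "array",
                        "items": {"type": "string"},
                        "description": "The text each item was extracted from, in the same order",
                    },
                },
                "required": ["items", "item_contexts"],
                "additionalProperties": False,
            },
        },
    }
]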

(playground preset at this point)

  • text: responds
  • json_object: responds (after placing a "json" in prompt)
  • json_schema: gives a UI error "NetworkError when attempting to fetch resource."

Actual network error 500:
{
  "error": {
    "message": "The server had an error processing your request. Sorry about that! You can retry your request, or contact us through our help center at help.openai.com if you keep seeing this error. (Please include the request ID req_49e9d8106e89eb6bbebbe24c9aaf215d in your email.)",
    "type": "server_error",
    "param": null,
    "code": null
  }
}

Since sending to the API is not blocked, and a reason is not returned, I assume there's a structure validation conflict - like the response mode turning on function strictness, and the JSON output not being parsable by the tool-recipient determiner (or the opposite).


Also annoying in playground: there's a switch to go between json and plain text, but it seems to add and remove whitespace that I prohibit, making what was actually received unclear.

I'm not defending the state of the playground, but you are demonstrating an issue with the playground itself and not the API. Your scenario works in a notebook with the API. I would suggest prototyping stuff like this in your own notebook instead of playground.

You seem extremely knowledgeable on when and how to use the various config options in the completions api and have given advice that I don't think I've seen anywhere in the API docs. Do you have a resource you'd recommend?

Our basic use case is the completions endpoint as a chatbot wrapped by a thin API layer to achieve some data fetching/context augmentation in the course of a visitor conversation via tool calling. Our current structure: we pass the available tools and the transcript history to the completions endpoint with every request, request a completion, respond to any intermediate tool calls where applicable, and keep "running context" like visitor info etc. in the completion response (hence JSON output).

Not asking you to solve our use case, but if you have any thoughts or a good resource to understand how best to architect this that you could share, that would be swell.

Can you help me understand the "running context"? Are you trying to store a conversation rollup in a JSON object in the context window? Are you using the LLM to do this? Are you expecting the LLM to call a tool AND provide this running context in the same API call? Are you able to give me a better idea on the sequence and workflow?

So essentially we might keep in the json response a running summary of key data from the transcript, such as a visitor's contact info, an order number, whether the conversation can/should be marked as completed, something like that. The expectation is more or less realtime (or fairly quick round trips), so we want to limit loops to the LLM.

A conversation might look like this:

User: hey, can you tell me where my package is?

Assistant (JSON): { message: 'Sure, what is your tracking number and name?', name: '', tracking: '', done: false }

User: yeah, it's Mike, and tracking is 12345

Assistant (TOOL CALL): { tool_call: get_order_by_tracking, tracking: 12345 }

Tool (TOOL RESPONSE): shipped out on 3/3, arrival expected by 3/12

Assistant (JSON): { message: 'Your package was mailed on 3/3 and should arrive by 3/12. Feel free to reach out again if you haven't received your package by then. Can I help you with something else?', name: 'Mike', tracking: '12345', done: false }

User: no, that's it, thanks

Assistant: { message: 'It was a pleasure helping you today!', name: 'Mike', tracking: '12345', done: true }

The done field basically tells us we can close the thread out and run analytics against it; the other stuff is used by the intermediate message-processing functions.

This is a contrived example, but a glimpse into our usage.

This is probably best answered with a code snippet. Please let me know if you have any questions.

import json

import pydantic
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


class PrivateState(pydantic.BaseModel):
    """This keeps track of the current (private) state of the conversation for the backend."""

    is_customer_query_resolved: bool = pydantic.Field(
        description="Only mark this field as True if the customer has explicitly indicated that "
        "their query has been resolved and there is nothing else that you can assist them with."
    )
    tracking_numbers: list[str] | None = pydantic.Field(
        description="The tracking numbers IF any are indicated in the chat."
    )


class ResponseFramework(pydantic.BaseModel):
    """Use this to reply to customers and manage the conversation"""

    response_to_customer: str = pydantic.Field(
        description="When answering queries, always ask if there are other "
        "ways to help within the context of the conversation."
    )
    private_state: PrivateState


messages = [
    {
        "role":"system",
        "content": "You are a package tracking assistant. Only consider the user's query resolved after "
        "asking them if there's anything else you can assist them with and they explicitly reply "
        "indicating that there is nothing else you can help them with."
        },
    {
        "role": "user",
        "content": "hey, can you tell me where my package is? My tracking number is H1234",
    },
    {
        "role": "assistant",
        "tool_calls": [
            {
                "id": "1",
                "type": "function",
                "function": {
                    "name": "get_order_by_tracking",
                    "arguments": json.dumps({"tracking": "H1234"}),
                },
            }
        ],
    },
    {
        "role": "tool",
        "tool_call_id": "1",
        "content": json.dumps(
            {
                "shipped_on": "2024-01-01",
                "expected_delivery": "2024-01-03",
            }
        ),
    },
]

r = client.beta.chat.completions.parse(
    model="gpt-4o-mini", messages=messages, response_format=ResponseFramework
)
p = r.choices[0].message.parsed
print(json.dumps(p.model_dump(), indent=2))


# {
#   "response_to_customer": "Your package with tracking number H1234 was shipped on January 1, 2024, and is expected to be delivered by January 3, 2024. Is there anything else I can assist you with?",
#   "private_state": {
#     "is_customer_query_resolved": false,
#     "tracking_numbers": [
#       "H1234"
#     ]
#   }
# }

Yes, that's why I frame the problem in my first sentence as being within the playground. Whack "get code".

But using the share, and switching up the output, you can see that the disclaimed methods of "solution" actually work fine, and it is the AI model quality and input context that is creating more responses that start with a function-call token rather than a decision to write to the user.