New models are incapable of proper function calling

The models gpt-4-1106-preview and gpt-3.5-turbo-1106 simply cannot comprehend how function/tool calling works. They don’t follow any system or user prompt about when to use the functions, and they ignore what you write in each function’s description. They always call a function whether it is needed or not, and they don’t even care which function it is. In auto mode they call one or more functions as long as there is a function to call :D. They even call functions named “never_call_this_function”, with made-up parameters. gpt-3.5-turbo-16k was a genius compared to these new models: it correctly identified what the functions did and intelligently decided when to use which. Why does OpenAI ignore this?

I noticed this too. I tried to create a text adventure with function calling, and it would create a new player after every message and roll a lot of dice for no reason. My web browsing assistant works nicely, though, since calling the web search function all the time is the intended behavior lol.

Hey jas313. I’ve spent a huge amount of time working with GPT-3 and GPT-4 to get tabletop RPGs to work. I’m the author of DungeonGod-AGI (github).

An approach I’ve taken is to carefully and very simply explain the sequence of the game and the high-level rules, then give the AI a SINGLE function to call, do_action(action, …) (I originally called it do_turn()), along with a table of actions and arguments.

My table was formatted:
Explore Actions:
“look”, <target> - Looks at the target, which should be the proper name of a character, monster, or item.
“pickup”, <item>, <optional_qty> - Picks up an item. An optional quantity can be provided.

For some reason, having a single function to call to take a turn seems to keep the AI from becoming confused by a large set of functions. This approach seems to work very well.
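In the current tools format, that single-function approach might be sketched like this. The schema below is my own illustration (only the action names come from the table above), not the actual DungeonGod code:

```python
# Hypothetical sketch: one do_action tool instead of many separate tools.
# The action table itself lives in the system prompt; the schema only
# constrains the shape of a turn.
tools = [{
    "type": "function",
    "function": {
        "name": "do_action",
        "description": "Take one game turn. Valid actions and their "
                       "arguments are listed in the system prompt.",
        "parameters": {
            "type": "object",
            "properties": {
                "action": {
                    "type": "string",
                    "enum": ["look", "pickup"],
                    "description": "The action to perform this turn.",
                },
                "args": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Positional arguments, e.g. a target "
                                   "name and an optional quantity.",
                },
            },
            "required": ["action"],
        },
    },
}]
```

Funneling every turn through one schema means the model only ever has to decide *which action string* to emit, rather than which of many tools to invoke.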

Also, you’re welcome to use my code (MIT) or contribute if you’d like. DungeonGod is a MAME-like project to implement the core of any set of tabletop RPGs so they can run in any context of AI applications. I’ll be doing a major update to the git repo soon with GPTs and Actions support.


That sounds interesting, I will go have a look. It would be really cool to have an AI run a game as the world could be different in each adventure.

Thank you for sharing your approach to the problem. We also thought about a similar structure, but it is really not a good workaround. First, because it simply doubles your token consumption for a problem that didn’t exist with previous models: we would make one request to determine which function should be called, then make the same request again, adding the details of that function and forcing it to be called. And second, because it simply didn’t work either.
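For reference, the two-request workaround described above can be sketched like this. Names are placeholders, and this only builds the payload for the second request, which forces the already-chosen function via tool_choice:

```python
# Sketch of the second request in the two-step workaround: after a first
# request decides which function to use, the conversation is resent with
# that function forced. The model name is an illustrative placeholder.
def forced_call_payload(messages, tools, chosen_name):
    """Build a chat-completions payload that forces one specific function."""
    return {
        "model": "gpt-4-1106-preview",
        "messages": messages,
        "tools": tools,
        "tool_choice": {"type": "function", "function": {"name": chosen_name}},
    }
```

Because the full message history and tool list are sent twice, the token cost roughly doubles, which is the complaint above.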

The models behave almost as if they were manually programmed: if a function exists → call it, no matter what. It doesn’t matter what you write in the function description, and it doesn’t matter what you write in the system prompt. It is simply broken. As a side note, we don’t even have a large set of functions; this behavior occurs with just 2 available functions.

So, in the DungeonGod example, imagine you shared the function you describe, which has look and pickup options, told the model “Hi, I want to play a game”, and the model executed:

{ "look": "Look around to find a way to play a new game." }

This is the kind of problem we are having.

I can speak from long frustrating experience that little things in your system prompt matter. Here are some suggestions based on my own experience:

  1. Keeping the system prompt clear and concise is key. The AI can become confused by too much information.
  2. Focusing on a concise list of functions with very clear distinctions between each helps the model know why to call one vs. another.
  3. Using examples helps in many circumstances.
  4. Using evocative terms to connect the meaning of functions to the model’s world knowledge helps. Sometimes adding a single term can fix a perplexity issue.

However, none of these strategies alone is perfect. The best way to ensure you get the behavior you want is simply to iterate, and the most fruitful strategy is building infrastructure that shortens iteration time so you can exhaustively test prompting changes. I’ve found that you ultimately reach a point where it suddenly just “works” and the AI seems to understand.

DungeonGod (the version in the repo) works on GPT-3-turbo. It originally didn’t come close to working on GPT-3, and it was only through an intense amount of effort on fixing the prompting that it now does. So it “can” be done.


That sounds interesting, I will go have a look. It would be really cool to have an AI run a game as the world could be different in each adventure.

DungeonGod-AGI is very much not that. It presents traditional D&D modules where the AI takes on the role of the Dungeon Master. The game is still very much structured around the content in the modules, but the AI does do a lot of improvisation just like a human Dungeon Master would.

It would be fun to have AI-generated worlds, but so far my experimentation with that has not produced great results. The planning and architecting needed for really cohesive and meaningful content is not quite within the capabilities of this round of AI. Perhaps with some additional infrastructure, but that’s not a goal of my project right now.

ben, thank you for your insights. But our point is that we have already been through everything you mentioned: we fine-tuned our system prompts, function names, descriptions, variable names, the most suitable types, etc. And the system works perfectly with gpt-3.5-turbo-16k.

So, it is not about us not knowing how to approach the problem. It is, in fact, quite the opposite. We have mastered how things are done with GPT technology, and the latest models are simply useless for function calling, and we are wondering when OpenAI will address this issue.

I noticed that when starting out with the 3.5-turbo 1106 model (the one capable of retrieval), with a clean pasted prompt, everything worked just fine regarding function calling. BUT whenever you change the model (even if you changed it only once and saved), it breaks the function calling and retrieval capability. It seems that once I change the model, the retrieval capabilities turn off, so the model can’t get important info and starts making up random function calls to compensate.

You should use tool calls, not function calls, with the 1106 models. They are different.

They are the same, providing identical specification language to the AI, unless you specify or invoke parallel tools.
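As a sketch of why they are the same underneath: both parameters wrap an identical JSON-schema function definition; tools just adds a type wrapper and pairs with tool_choice instead of function_call. The weather function here is purely illustrative:

```python
# One JSON-schema function definition, expressed both ways.
get_weather = {
    "name": "get_weather",
    "description": "Look up the weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# Legacy shape: `functions` plus `function_call`.
legacy_kwargs = {"functions": [get_weather], "function_call": "auto"}

# Newer shape: `tools` plus `tool_choice`; the schema itself is unchanged.
tools_kwargs = {
    "tools": [{"type": "function", "function": get_weather}],
    "tool_choice": "auto",
}
```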

Happy this thread showed up, as no matter what I do in terms of system prompt or function definitions, every single function is called (sometimes twice!), and the calls usually don’t follow the defined scenes at all.

Quite frustrating as I could have sworn this worked better over the summer.

UPDATE: function calling with GPT-3.5 Turbo just doesn’t work at all for me anymore. The wrong functions are called, and when the right ones are, the parameters are incomplete. The exact same requests with GPT-4 Turbo (1106-preview) deliver much better results, just slightly slower. (Hope this helps!)

I got it to not do multiple calls by reducing max_tokens.
I agree that GPT-4 Turbo seems to be working better, especially with the negative/inhibitory hints and instructions.

I’m having issues with Assistant Function Calling (Tools). With large data sets, it calls the correct functions at the right times, but with the wrong arguments. It makes up data, when I explicitly ask it to only pass data from my data set (i.e. customer numbers to call an external API). I have tried with GPT-4 and preview. Are you all saying that gpt-3.5-turbo-16k works better with function calling?

The place to avoid AI fabrications is in the function definition.

Provide a description for each parameter. A suitable description:

customer_number: number
This customer ID number must come directly from the customer_lookup function. Without the AI seeing the ID number returned by customer_lookup, this entire function cannot be called.
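Expressed in the tools JSON format, a parameter description along those lines might look like this (the function name and surrounding schema are my own illustration):

```python
# Hypothetical tool definition: the parameter description tells the model
# where the value must come from, which discourages fabricated IDs.
call_customer_api = {
    "type": "function",
    "function": {
        "name": "call_customer_api",
        "description": "Call the external API for one customer. Only call "
                       "this after customer_lookup has returned an ID.",
        "parameters": {
            "type": "object",
            "properties": {
                "customer_number": {
                    "type": "number",
                    "description": (
                        "This customer ID number must come directly from the "
                        "customer_lookup function. Without the AI seeing the "
                        "ID number returned by customer_lookup, this entire "
                        "function cannot be called."
                    ),
                },
            },
            "required": ["customer_number"],
        },
    },
}
```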

I’ve provided names, types, and descriptions for each of my parameters, as well as a function description, and all the above in the instructions for the assistant. The assistant continues to call the functions with incorrect arguments. It will supply empty arrays when I’ve set them as required. It will supply boolean arguments for parameters I’ve defined as strings. Most frustrating, it supplies the fabricated ids, as I mentioned previously.

Also, the assistant tends to split larger numbers of ids into smaller batches and call the function repeatedly, instead of passing them all at once, which is what I’d prefer.

Is there anything else I can do to mitigate?

The generation of AI language by random sampling of probabilities continues into the text generation of functions.

You can try API parameters top_p and temperature of 0.01 (edit: on real chat completions) and see if this immediately curtails the symptoms, especially the inappropriate data types.
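On chat completions that just means adding both sampling parameters to the request. A minimal sketch (model and message are placeholders; these parameters were not exposed on the Assistants API at the time of this thread):

```python
# Near-deterministic sampling: both temperature and top_p pinned low.
# This builds only the request payload, not the API call itself.
payload = {
    "model": "gpt-3.5-turbo-0613",
    "messages": [{"role": "user", "content": "Look up customer records."}],
    "temperature": 0.01,
    "top_p": 0.01,
}
```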

1106 should be avoided; you can choose from the quality of -0613 or the affordability of -0125 models, specified by name.

Do you mean gpt-4-0613? I haven’t found much difference in the different models. Also, you mentioned temperature and top_p, but per the documentation these are not yet available via the assistants api. Am I wrong?

Yes, or gpt-3.5-turbo-0613 over its similar newer tool-call cheap input cousins.

-0613 models can’t be used if you have the Assistants retrieval tool enabled, because retrieval needs the parallel tool calling only trained into the post-November “gpt-4-turbo” models, at 1/3 the price. And it seems you desire receiving parallel calls.

And you are correct: Assistants lacks several of the parameter controls (and expense controls) that you’d want.

We don’t have retrieval enabled. What do you mean “Receiving parallel”? Do you mean “Retrieving parallel”?

I don’t need tool calling in parallel, and I don’t need retrieval. Thus, I would want the gpt-4-0613 model?