New models are incapable of proper function calling

The models gpt-4-1106-preview and gpt-3.5-turbo-1106 simply cannot comprehend how function/tool calling works. They don’t follow any system or user prompt about when to use the functions. They ignore what you write in each function’s description. They always call a function whether it is needed or not, and don’t even care which function it is. In auto mode they call one or more functions as long as there is a function to call :D. They even call functions named “never_call_this_function”, with made-up parameters. gpt-3.5-turbo-16k was a genius compared to these new models: it correctly identifies what the functions do and intelligently decides when to use which. Why is OpenAI ignoring this?

I noticed this too. I tried to create a text adventure with function calling, and it would create a new player after every message and roll a lot of dice for no reason. My web-browsing assistant works nicely though, since calling the web search function all the time is the intended behavior lol.

Hey jas313. I’ve spent a huge amount of time working with GPT-3 and GPT-4 to get tabletop RPGs to work. I’m the author of DungeonGod-AGI (github).

An approach I’ve taken is to carefully and very simply explain the sequence of the game and the high-level rules, then give the AI a SINGLE function to call, named do_action(action, …) (I originally called it do_turn()), along with a table of actions and arguments.

My table was formatted:
Explore Actions:
“look”, <target> - Looks at the target, which should be the proper name of a character, monster, or item.
“pickup”, <item>, <optional_qty> - Picks up an item. An optional quantity can be provided.

For some reason, having a single function to call to take a turn seems to keep the AI from becoming confused by a large set of functions. This seems to work very well.
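For readers unfamiliar with the pattern, here is a rough sketch of what a single do_action tool definition in the OpenAI tools format could look like. The parameter names, the enum values, and the embedded action table are illustrative reconstructions, not the actual DungeonGod code:

```python
# Hypothetical sketch of the single-function approach described above: instead
# of one tool per verb, a single do_action tool takes the verb as an enum
# argument, and the action table lives in the tool's description.

ACTION_TABLE = """Explore Actions:
"look", <target> - Looks at the target, which should be the proper name of a character, monster, or item.
"pickup", <item>, <optional_qty> - Picks up an item. An optional quantity can be provided.
"""

def build_do_action_tool() -> dict:
    """Build an OpenAI-style tool definition for a single do_action function."""
    return {
        "type": "function",
        "function": {
            "name": "do_action",
            "description": "Take the player's next turn. " + ACTION_TABLE,
            "parameters": {
                "type": "object",
                "properties": {
                    "action": {
                        "type": "string",
                        "enum": ["look", "pickup"],
                        "description": "Which action from the table to perform.",
                    },
                    "target": {
                        "type": "string",
                        "description": "Proper name of the character, monster, or item.",
                    },
                    "qty": {
                        "type": "integer",
                        "description": "Optional quantity, used by pickup.",
                    },
                },
                "required": ["action", "target"],
            },
        },
    }
```

The model then only ever has one function to consider, and the choice between verbs becomes an argument value rather than a tool selection.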

Also, you’re welcome to use my code (MIT) or contribute if you’d like. DungeonGod is a MAME-like project to implement the core of any set of tabletop RPGs so they run in any context of AI applications. I’ll be doing a major update to the git repo soon with GPTs and Actions support.


That sounds interesting, I will go have a look. It would be really cool to have an AI run a game as the world could be different in each adventure.

Thank you for sharing your approach to the problem. We also thought about a similar structure, but it is really not a good workaround. First, because it simply doubles your token consumption for a problem that didn’t exist with previous models: we would make one request to determine which function should be called, then make the same request again, adding that function’s details and forcing the model to call it. And second, because it simply didn’t work either.
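To make the doubled-cost point concrete, the two-pass workaround described above can be sketched roughly like this (the model name and routing logic are illustrative, not OpenAI's recommended pattern): pass 1 lets the model decide which function, if any, to use; pass 2 repeats the same request but forces that function via tool_choice. The full message history and tool schemas are sent both times, which is where the token consumption doubles.

```python
# Hedged sketch of the two-pass workaround; details are assumptions.

def first_pass_request(messages: list, tools: list) -> dict:
    """Pass 1: ask the model to pick a function, or answer without one."""
    return {
        "model": "gpt-4-1106-preview",
        "messages": messages,
        "tools": tools,
        "tool_choice": "auto",  # let the model pick (or decline)
    }

def second_pass_request(messages: list, tools: list, chosen_name: str) -> dict:
    """Pass 2: the same request again, but forcing the function from pass 1."""
    request = first_pass_request(messages, tools)
    request["tool_choice"] = {
        "type": "function",
        "function": {"name": chosen_name},
    }
    return request
```

Every turn that results in a function call pays for two nearly identical requests, which is why this is a workaround rather than a fix.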

The models behave almost as if they were hard-coded: if a function exists → call it, no matter what. It doesn’t matter what you write in the function description, and it doesn’t matter what you write in the system prompt. It is simply broken. As a side note, we don’t even have a large set of functions; this behavior occurs with just 2 available functions.

So, in the example of DungeonGod, imagine you shared the function you describe, with its look and pickup options, told the model “Hi, I want to play a game”, and the model executed:

{ “look” : “Look around to find a way to play a new game.”}

This is the kind of problem we are having.

I can say from long, frustrating experience that little things in your system prompt matter. Here are some suggestions based on my own experience:

  1. Keeping the system prompt clear and concise is key. The AI can become confused by too much information.
  2. Focusing on a concise list of functions with very clear distinctions between them helps the model know why to call one vs. another.
  3. Using examples helps in many circumstances.
  4. Using evocative terms to connect the meaning of functions to the model’s world knowledge helps. Sometimes adding a single term can fix a perplexity issue.
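As an illustration of points 2–4, here is a hedged sketch of a single tool definition; the function name, description wording, and example are all made up, not taken from DungeonGod or any real project:

```python
# Illustrative only: name, wording, and example are assumptions.

def roll_dice_tool() -> dict:
    """A tool description applying suggestions 2-4 above: a clear
    distinction (including when NOT to call it), an inline example, and
    the evocative term "dice check" to connect the function to the
    model's world knowledge of tabletop RPGs."""
    return {
        "type": "function",
        "function": {
            "name": "roll_dice",
            "description": (
                "Roll dice ONLY when the rules call for a dice check, such "
                "as an attack roll or a saving throw. Never roll dice just "
                "to narrate. Example: the player swings a sword -> call "
                "roll_dice with spec '1d20'."
            ),
            "parameters": {
                "type": "object",
                "properties": {
                    "spec": {
                        "type": "string",
                        "description": "Dice specification, e.g. '1d20' or '2d6+3'.",
                    }
                },
                "required": ["spec"],
            },
        },
    }
```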

However, none of these strategies is perfect on its own. The best way to get the behavior you want is simply to iterate, and building infrastructure that shortens your iteration time, so you can exhaustively test changes in prompting, is the most fruitful strategy. I’ve found that you ultimately reach a point where it suddenly just “works” and the AI seems to understand.
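The iteration infrastructure described above can start as something very small, e.g. a regression suite of expected tool calls that you re-run after every prompt tweak. A minimal sketch, with all names and cases hypothetical:

```python
# Minimal prompt-regression harness sketch (names and cases are hypothetical).
# Each case pairs a user message with the tool call we expect the model to
# make (or None if it should answer without calling anything).

def run_suite(get_tool_call, cases):
    """Return (passed, failures) for a list of (message, expected_tool) cases.

    get_tool_call is any callable that sends a message to the model and
    returns the name of the tool it called, or None.
    """
    failures = []
    for user_message, expected in cases:
        actual = get_tool_call(user_message)
        if actual != expected:
            failures.append((user_message, expected, actual))
    return len(cases) - len(failures), failures
```

With a stub in place of the model you can sanity-check the harness itself, then swap in a real API call and re-run the whole suite after each prompting change instead of eyeballing a single chat.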

DungeonGod (the version in the repo) works on GPT-3-turbo. It originally didn’t come close to working on GPT-3, and it was only through an intense amount of effort on the prompting that it now does. So it “can” be done.


That sounds interesting, I will go have a look. It would be really cool to have an AI run a game as the world could be different in each adventure.

DungeonGod-AGI is very much not that. It presents traditional D&D modules where the AI takes on the role of the Dungeon Master. The game is still very much structured around the content in the modules, but the AI does do a lot of improvisation just like a human Dungeon Master would.

It would be fun to have AI-generated worlds, but so far my experimentation with that has not produced great results. The planning and architecting needed for really cohesive and meaningful content is not quite within the capabilities of this generation of AI. Perhaps with some additional infrastructure, but that’s not a goal of my project right now.

Ben, thank you for your insights. But our point is, we have been through everything you mentioned and fine-tuned our system prompts, function names, descriptions, variable names, the most suitable types, etc. And the system works perfectly with gpt-3.5-turbo-16k.

So it is not that we don’t know how to approach the problem. It is, in fact, quite the opposite: we have mastered how things are done with GPT technology, the latest models are simply useless for function calling, and we are wondering when OpenAI will address this issue.