Issue with Assistants API (GPT-4.1) not consistently calling functions

Hi everyone,
I’m building a chat-based app with a virtual car mechanic that helps users schedule appointments and estimate repair costs.

I’m using OpenAI’s Assistants API with the GPT-4.1 model, and I’ve defined two functions (tool definitions sketched below):

  • estimate_repair – called when the user asks about the cost of a specific repair,
  • get_car_details – called when the user provides the make, model, and year of the car, to retrieve more detailed vehicle data for better cost estimation.
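
For reference, the tool definitions look roughly like this (a simplified sketch — the parameter names and descriptions here are illustrative, not my exact production schema):

```python
# Simplified sketch of the two tool definitions passed to the assistant.
# Parameter names/descriptions are illustrative, not the exact production schema.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_car_details",
            "description": "Look up vehicle data once the user has provided make, model and (ideally) year.",
            "parameters": {
                "type": "object",
                "properties": {
                    "make": {"type": "string", "description": "Car manufacturer, e.g. Ford"},
                    "model": {"type": "string", "description": "Model name, e.g. Fiesta"},
                    "year": {"type": "integer", "description": "Model year, e.g. 2015"},
                },
                "required": ["make", "model"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "estimate_repair",
            "description": "Estimate the cost of a specific repair for the vehicle that has already been looked up.",
            "parameters": {
                "type": "object",
                "properties": {
                    "repair": {"type": "string", "description": "Repair requested, e.g. brake pad replacement"},
                },
                "required": ["repair"],
            },
        },
    },
]
```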

The problem:
When a user sends a message like:
“How much will it cost to replace brake pads on a 2015 Ford Fiesta?”,
the assistant should:

  1. recognize that there’s enough info to call get_car_details,
  2. call that function first,
  3. then proceed to estimate_repair once the details are retrieved (my tool-handling loop is sketched below).
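
The run handling on my side is the standard requires_action loop, roughly like this (simplified sketch assuming the current Python SDK’s create_and_poll / submit_tool_outputs_and_poll helpers; execute_tool, thread and assistant stand in for my own setup and dispatch code):

```python
from openai import OpenAI

client = OpenAI()

# `assistant` and `thread` are assumed to have been created earlier via
# client.beta.assistants.create(...) and client.beta.threads.create(...).
run = client.beta.threads.runs.create_and_poll(thread_id=thread.id, assistant_id=assistant.id)

while run.status == "requires_action":
    tool_outputs = []
    for call in run.required_action.submit_tool_outputs.tool_calls:
        # call.function.name is "get_car_details" or "estimate_repair";
        # call.function.arguments is a JSON string of the model's arguments.
        result = execute_tool(call.function.name, call.function.arguments)  # my own dispatch (placeholder)
        tool_outputs.append({"tool_call_id": call.id, "output": result})
    # Submitting outputs lets the run continue, so a get_car_details call can be
    # followed by an estimate_repair call within the same run.
    run = client.beta.threads.runs.submit_tool_outputs_and_poll(
        thread_id=thread.id, run_id=run.id, tool_outputs=tool_outputs
    )
```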

However, in about 60% of the cases, the assistant just replies with something like “I’ll now retrieve your vehicle data” but doesn’t actually call any function. The remaining 40% of the time it works correctly.

I’m wondering:

  • Is this a known issue with GPT-4.1 and function calling?
  • Could it be caused by system prompt configuration or how the functions are defined?
  • Any ideas on improving consistency or workarounds?

Thanks in advance for any insights or suggestions!

5 Likes

It can be tricky to get these things working really consistently.

The full 4.1 model is usually quite good with function calling, though.

Out of interest, why do you have two separate functions?

Why not have estimate_repair with params:

  • car_model
  • maintenance_activity (or some such)

Because, to me, that example prompt you provided really should be targeting one function.

If you need a better interpretation of “car model”, you could perhaps do that internally within estimate_repair, and you could do some state management within the function to record the car model for the session (sketched below).
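
Something like this is what I have in mind (just a rough sketch — parse_car_description and lookup_estimate are hypothetical helpers, and the session state could live wherever suits your app):

```python
# Rough sketch: one tool instead of two, with the car interpretation and
# per-session state handled inside the app rather than by the model.
session_state: dict[str, dict] = {}  # e.g. thread/conversation id -> last known car details

estimate_repair_tool = {
    "type": "function",
    "function": {
        "name": "estimate_repair",
        "description": "Estimate the cost of a repair. Pass whatever the user said about their car verbatim.",
        "parameters": {
            "type": "object",
            "properties": {
                "car_model": {"type": "string", "description": "Free-text car description, e.g. '2015 Ford Fiesta'"},
                "maintenance_activity": {"type": "string", "description": "Requested repair, e.g. 'brake pad replacement'"},
            },
            "required": ["maintenance_activity"],
        },
    },
}

def estimate_repair(thread_id: str, maintenance_activity: str, car_model: str | None = None) -> str:
    # Interpret/normalise the car description here and remember it for the session,
    # so the model only ever has one function to choose.
    if car_model:
        session_state[thread_id] = parse_car_description(car_model)  # hypothetical helper
    car = session_state.get(thread_id)
    return lookup_estimate(car, maintenance_activity)  # hypothetical helper
```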

If you create too many functions that have inter-dependencies you are likely to make the system more fragile.

If you are lucky, sometimes the model dispatches multiple functions “intelligently”, but as you’ve discovered, this doesn’t always work.

Another approach would be to proactively seek the car model information from the get-go using the system prompt, so the conversation starts out like:

“Hello, how can I help you? What car does your query relate to?”

Then you can cover off that element separately.

Then when they ask “How much will it cost to replace brake pads” the LLM has a simpler problem to solve and so will be more reliable.
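
For example, something along these lines in the assistant’s instructions (illustrative wording only):

```python
# Illustrative wording only — tune for your own assistant's instructions.
SYSTEM_PROMPT = """You are a virtual car mechanic.
At the start of the conversation, ask which car (make, model, year) the query relates to
before discussing any costs. Once the car is known, call estimate_repair to price a specific repair.
Never say you are retrieving data without actually calling a function."""
```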

2 Likes

To your question: I’ve split the logic into two functions (get_car_details and estimate_repair) intentionally, because I want to support dynamic refinement of vehicle details during the conversation.

Here’s the rationale:
Users often provide partial vehicle data (e.g., just “Ford Fiesta”).
In that case, get_car_details fetches basic vehicle info (e.g., base trims, popular configurations).
Later in the chat, the user might clarify that they have a 1.6L engine instead of 2.0L, or mention aftermarket modifications, or that their model has a sports package — all of which could significantly affect the repair estimate.
By decoupling the retrieval of car details from the repair estimation logic, I can:

  • return estimated results with varying levels of confidence based on the completeness of the data,
  • prompt the user to refine the data as needed before the repair cost is generated,
  • allow for more natural back-and-forth, especially when users update or correct details after the assistant has already responded (rough data-flow sketch below).
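
To make that concrete, the shape of data I’m aiming for looks roughly like this (field names are illustrative, not my exact payloads):

```python
# Sketch of the intended data flow (field names are illustrative).
car_details = {
    "make": "Ford",
    "model": "Fiesta",
    "year": None,              # unknown so far -> lower confidence
    "engine": None,            # user may refine this later (e.g. "1.6L")
    "candidate_trims": ["Trend", "Titanium", "ST"],
}

estimate = {
    "repair": "brake pad replacement",
    "price_range": [120, 260],              # wide range while details are incomplete
    "confidence": "low",                    # rises as the user refines the vehicle data
    "missing_details": ["year", "engine"],  # used to prompt the user for refinement
}
```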

Appreciate your insights — this kind of feedback is very helpful!

2 Likes

Yeah that’s fair.

So I’d try to pursue refinement of the car details proactively, by modifying the system prompt to urge the LLM to encourage that.

Then you increase the chances that the user will separate the concerns, and the function choice becomes a simpler prospect.

1 Like

I’ve been through the same problem. But from what I see, your assistant’s tasks aren’t complex enough to run into such issues, so I think you should focus on improving your prompt.

Try to separate the purpose of each function in your prompt by describing their use cases individually. One major thing your prompt should include is examples: one-shot or even few-shot learning is a powerful approach that you should definitely leverage in your prompt.
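
For instance, a prompt fragment along these lines (illustrative wording):

```python
# Illustrative prompt fragment: per-function use cases plus a one-shot example.
PROMPT_FRAGMENT = """
Use get_car_details whenever the user provides vehicle information (make, model, year).
Use estimate_repair only after the vehicle details have been retrieved.

Example:
User: "How much to replace brake pads on a 2015 Ford Fiesta?"
Assistant: first call get_car_details(make="Ford", model="Fiesta", year=2015),
then call estimate_repair(repair="brake pad replacement").
Never say you will retrieve data without actually calling the function.
"""
```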

P.S. You can DM me your prompt if you’re still having issues after trying a better one.

1 Like

I’ve encountered the same problem. When I switch to the GPT-4o model, the tools work normally, but when I use GPT-4.1, it doesn’t call tools no matter what. This is so strange.

2 Likes

I find the smaller 4.1 models very unreliable with function calling, but no function calling sounds weird. Try to make your prompts (including function descriptions) very obvious. 4o is probably more forgiving.

I have a chatbot using functions with 4.1 (full cream version) and it works pretty reliably. If I change it to 4.1 mini or nano, it starts to mess up though.