Hi,
I have question about tool calling and consistency when using different models. I noticed some models like GPT-4o almost always call the provided (single) tool multiple times depending on the user prompt and other models like GPT-4.1 or GPT-5 Mini often times calls it once, but depending on the user/system prompt, it can call the same tool twice in the same response.. Also tried parallel_tool_calls but this flag then unfairly prevents calls for other legitimate functions..
Some examples using GPT-4o
Notice even though the tool properties are informed as an array GPT-4o decides to make two separate function call to the same function. My expectation was to receive a single function tool call with all items in the arguments.
Request:
{
"model": "gpt-4o",
"temperature": 0,
"store": false,
"messages": [
{
"role": "system",
"content": "You are a helpful general purpose assistant. Use the jiraSearchTool to provide comprehensive answers."
},
{
"role": "user",
"content": "I want to learn about TEST-123 and I have some questions about TEST-456"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "jiraSearchTool",
"description": "Performs search through Atlassian Jira API to retrieve information for the provided Jira issue IDs.",
"parameters": {
"$schema": "https://json-schema.org/draft/2020-12/schema",
"type": "object",
"properties": {
"jiraIssueIds": {
"description": "List of Jira issue IDs provided in the user prompt",
"type": "array",
"items": {
"type": "string"
}
}
},
"required": [
"jiraIssueIds"
],
"additionalProperties": false
},
"strict": true
}
}
],
"tool_choice": "auto"
}
Response:
{
"id": "chatcmpl-Cbnj4tXGBL99YQN0y9ABgPxdFRKEl",
"object": "chat.completion",
"created": 1763125318,
"model": "gpt-4o-2024-08-06",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": null,
"tool_calls": [
{
"id": "call_qNNJA3iqSIO89p7XADV9c90o",
"type": "function",
"function": {
"name": "jiraSearchTool",
"arguments": "{\"jiraIssueIds\": [\"TEST-123\"]}"
}
},
{
"id": "call_q1cjrAfIMCqtEup5r0U80ffi",
"type": "function",
"function": {
"name": "jiraSearchTool",
"arguments": "{\"jiraIssueIds\": [\"TEST-456\"]}"
}
}
],
"refusal": null,
"annotations": []
},
"logprobs": null,
"finish_reason": "tool_calls"
}
]
}
This might be expected in some cases, but it can quickly get tricky on subsequent requests the the order of messages from assistant > tool > assistant > tool > needs to be preserved by putting messages with correct tool id one after another..
Here is another response, sometimes model decides to make a single call. This is what I expect, but GPT-4o is quite stubborn and does not consistently return a single tool call, even when add new instructions to make only one tool call for any given function and consolidate arguments in that..
{
"id": "chatcmpl-CbnqhxrMJF4IyMQIOOxCrFe2Qe1O8",
"object": "chat.completion",
"created": 1763125791,
"model": "gpt-4o-2024-08-06",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": null,
"tool_calls": [
{
"id": "call_WH0SX8p8Ta4nAIRZttzb64Hs",
"type": "function",
"function": {
"name": "jiraSearchTool",
"arguments": "{\"jiraIssueIds\": [\"TEST-123\", \"TEST-456\"]}"
}
}
],
"refusal": null,
"annotations": []
},
"logprobs": null,
"finish_reason": "tool_calls"
}
],
"usage": {
"prompt_tokens": 122,
"completion_tokens": 41,
"total_tokens": 163,
"prompt_tokens_details": {
"cached_tokens": 0,
"audio_tokens": 0
},
"completion_tokens_details": {
"reasoning_tokens": 0,
"audio_tokens": 0,
"accepted_prediction_tokens": 0,
"rejected_prediction_tokens": 0
}
},
"service_tier": "default",
"system_fingerprint": "fp_b1442291a8"
}
Now trying GPT-4.1
Request: I had to still instruct to make single tool call for any given function.
{
"model": "gpt-4.1",
"temperature": 0,
"store": false,
"messages": [
{
"role": "system",
"content": "You are a helpful general purpose assistant. Use the jiraSearchTool to provide comprehensive answers. You have to make only one tool call for any given function and consolidate arguments in that call."
},
{
"role": "user",
"content": "I want to learn about the Jira issue TEST-123 and I have some questions about TEST-456"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "jiraSearchTool",
"description": "Performs search through Atlassian Jira API to retrieve information for the provided Jira issue IDs.",
"parameters": {
"$schema": "https://json-schema.org/draft/2020-12/schema",
"type": "object",
"properties": {
"jiraIssueIds": {
"description": "List of Jira issue IDs provided in the user prompt",
"type": "array",
"items": {
"type": "string"
}
}
},
"required": [
"jiraIssueIds"
],
"additionalProperties": false
},
"strict": true
}
}
],
"tool_choice": "auto"
}
Response:
{
"id": "chatcmpl-Cbo0K2F8ZsmpW9i9JuFFM899wUU9N",
"object": "chat.completion",
"created": 1763126388,
"model": "gpt-4.1-2025-04-14",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": null,
"tool_calls": [
{
"id": "call_CIemwNhEPDlXMb6NiafCFuIt",
"type": "function",
"function": {
"name": "jiraSearchTool",
"arguments": "{\"jiraIssueIds\":[\"TEST-123\",\"TEST-456\"]}"
}
}
],
"refusal": null,
"annotations": []
},
"logprobs": null,
"finish_reason": "tool_calls"
}
]
}
My questions are:
_ What do you think of this situation?
_ What is your go-to model when making tool calls?
_ Is this a prompting issue and should I enforce/instruct more in the system prompt?
_ Does it make sense to use another, more consistent model to make the tool decisions and then use any other model to produce the final answer?