How can I implement CoT reasoning before tool calling using the Chat Completions API?

I came across an example in the “GPT-4.1 Prompting Guide”, which recommends “Prompting-Induced Planning & Chain-of-Thought” techniques. The SWE-bench Verified example there uses the Responses API and demonstrates how the model can reason step by step before invoking a tool.

Here’s a snippet from that example:

# client (an OpenAI() instance), SYS_PROMPT_SWEBENCH and python_bash_patch_tool are defined earlier in the guide's example
response = client.responses.create(
    instructions=SYS_PROMPT_SWEBENCH,
    model="gpt-4.1-2025-04-14",
    tools=[python_bash_patch_tool],
    input=f"Please answer the following question:\nBug: Typerror..."
)

response.to_dict()["output"]

And here’s the output:

[{'id': 'msg_67fe92df26ac819182ffafce9ff4e4fc07c7e06242e51f8b',
  'content': [{'annotations': [],
    'text': "Thank you for the report, but “Typerror” is too vague for me to start debugging right away.\n\n**To make progress, I need to:**\n1. Find the exact error message text (e.g. `'TypeError: ...'`).\n2. Find which file and which line/function/class the error occurred in.\n3. Figure out what triggered the error (test file, usage, reproduction steps).\n4. Find the root cause and details.\n\n**Next steps:**\n- Investigate error/log/test output files for a Python `TypeError` message.\n- Examine the relevant code sections for problematic type usage.\n- If possible, reproduce the bug locally.\n\n**Plan:**\n- First, I will search for test files and log output in the `/testbed` directory that may contain the full error message and stack trace.\n\nLet’s start by listing the contents of the `/testbed` directory to look for clues.",
    'type': 'output_text'}],
  'role': 'assistant',
  'status': 'completed',
  'type': 'message'},
 {'arguments': '{"input":"!ls -l /testbed"}',
  'call_id': 'call_frnxyJgKi5TsBem0nR9Zuzdw',
  'name': 'python',
  'type': 'function_call',
  'id': 'fc_67fe92e3da7081918fc18d5c96dddc1c07c7e06242e51f8b',
  'status': 'completed'}]

There is output_text before function_call!

How can I achieve similar behavior using the Chat Completions API instead of the Responses API? Are there any best practices or prompt patterns to encourage the model to do some reasoning before calling tools?

Another question: in the Chat Completions API, can a response include both content and a function_call at the same time? If not, and the first response only returns content, how does the Responses API know whether to send that to the user as a final answer or to make another completion call to get the actual tool call?
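
For concreteness, this is the check I have in mind; a minimal sketch using the standard Chat Completions response fields:

message = response.choices[0].message
print(message.content)      # any preamble / reasoning text, or None
print(message.tool_calls)   # list of tool calls, or None if the model did not call a tool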


Here is a comparison of the outputs from the two APIs using the same prompt:

import json  # needed for the json.dumps calls below

response = client.responses.create(
    instructions=SYS_PROMPT_SWEBENCH,
    model="gpt-4.1-2025-04-14",
    tools=[python_bash_patch_tool],
    input=f"Please answer the following question:\nBug: Typerror...",
    temperature=0.0,
)

print(json.dumps(response.to_dict()["output"], indent=2))
[
  {
    "id": "msg_685a9621cebc819d9729f778295afaf90c8bb82f11b4d9b1",
    "content": [
      {
        "annotations": [],
        "text": "Let's begin by clarifying and expanding on the problem statement:\n\n## Step 1: Deeply Understand the Problem\n\nThe only information provided is:  \n**Bug: Typerror...**\n\nThis is not enough detail to know:\n- Where the error occurs\n- What the full error message is\n- What code is involved\n\nHowever, a \"TypeError\" in Python usually means that a function or operation was applied to an object of an inappropriate type (e.g., adding a string to an integer, calling a function with the wrong number or type of arguments, etc.).\n\n## Step 2: Codebase Investigation\n\nSince the error is a TypeError, and the only clue is \"Typerror...\", I need to:\n- Search for \"TypeError\" in the codebase and/or test output.\n- Check for any recent test failures or error logs that mention TypeError.\n- Review the test suite to see if any tests are failing with a TypeError.\n\n### Plan:\n1. List the files in the `/testbed` directory to get an overview.\n2. Check for a test runner or test files (e.g., `run_tests.py`, `tests/`).\n3. Run the test suite to see if any TypeError is reported.\n4. If a TypeError is found, read the full error message and traceback to identify the location and cause.\n5. Investigate the relevant code section.\n\nLet's start by listing the files in the `/testbed` directory.",
        "type": "output_text"
      }
    ],
    "role": "assistant",
    "status": "completed",
    "type": "message"
  },
  {
    "arguments": "{\"input\":\"!ls /testbed\"}",
    "call_id": "call_6dDcCLcAPUbFVCrDwfiT35qr",
    "name": "python",
    "type": "function_call",
    "id": "fc_685a962b5bf0819d8acf41c128d399400c8bb82f11b4d9b1",
    "status": "completed"
  }
]
response = client.chat.completions.create(
    model="gpt-4.1-2025-04-14",
    messages=[
        {"role": "system", "content": SYS_PROMPT_SWEBENCH},
        {"role": "user", "content": "Please answer the following question:\nBug: Typerror..."}
    ],
    tools=[{"type": "function", "function": python_bash_patch_tool}],
    temperature=0.0,
)

print(json.dumps(response.choices[0].message.to_dict(), indent=2))
{
  "content": "It looks like your message is incomplete. You mentioned a \"Typerror\" bug, but didn't provide the full error message or any context about where it occurs, what code is involved, or what you were trying to do.\n\nTo help you fix the bug, I need more information. Please provide:\n- The full error message (including the stack trace, if possible)\n- The code snippet or file where the error occurs\n- A description of what you were trying to do when the error happened\n\nOnce you provide these details, I can investigate the issue and guide you through a solution!",
  "refusal": null,
  "role": "assistant",
  "annotations": []
}
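
There is no tool_calls entry in that Chat Completions message; it can be confirmed directly (a quick check on the same response object as above):

message = response.choices[0].message
print(message.tool_calls)  # None: the model asked for clarification instead of calling the tool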

I have a similar issue when Responses returns multiple messages plus a function_call. See my thread for details.

My guess is that the output list is prioritized, so the most relevant answer is the first one.

Here is a thought experiment for you. Which of these is more likely for the AI to write?

Case 1

(tool call) get_weather({"location": "San Diego"})
“I invoked the weather tool, I’m now waiting for it to return results so I can tell you the weather”

Case 2

“I’ll use my weather tool to retrieve the conditions and forecast for you, stand by…”
(tool call) get_weather({"location": "San Diego"})


In fact it is the latter that is sensible. Calling a function is terminal; the AI must await the results in order to continue logically.

Models can write preamble text and then call a tool, also as a chain of thought; they just aren’t trained to do this unless they are a newer reasoning model. You have to write a function description that mandates the announcement, or that requires thought to be given to whether the tool is appropriate and whether all the needed information is present. The instructions must then tell the AI to proceed automatically to sending the tool call, without awaiting user input.
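
A sketch of how that could look over the Chat Completions API; the tool schema and exact wording here are illustrative, not an official pattern:

from openai import OpenAI

client = OpenAI()

# Illustrative tool whose description mandates an announced plan before the call
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": (
            "Look up current weather for a location. Before calling this tool, "
            "first write a short plan to the user explaining why the tool is needed "
            "and confirming you have the location; then call the tool immediately, "
            "without waiting for the user to reply."
        ),
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}

response = client.chat.completions.create(
    model="gpt-4.1-2025-04-14",
    messages=[
        {"role": "system", "content": "Announce your plan in plain text, then proceed straight to the tool call without awaiting user input."},
        {"role": "user", "content": "What's the weather like in San Diego?"},
    ],
    tools=[get_weather_tool],
)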

System messages are less effective at inducing “tool thought”, but you can still provide a thinking container in the instructions, where tool use can be deliberated.


Case 3

<reasoning>The user has asked about their weather, but didn’t specify a location. I see that the get_weather() function requires a location I don’t know, but get_forecast() only requires a date range. Therefore the latter might be informed by the user’s location internally - sounds like a good plan.</reasoning>
“hang in there while I check the forecast for you…”
(tool call) get_forecast({"date_range": 5})
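
To try something like Case 3 over Chat Completions, the thinking container can go in the system message. The wording is illustrative, and get_forecast_tool is a hypothetical schema analogous to get_weather_tool above:

THINKING_INSTRUCTIONS = """Before any tool call:
1. Deliberate inside <reasoning>...</reasoning> about which tool fits and whether you have the information it needs.
2. Write one short sentence to the user about what you are doing.
3. Call the tool. Do not wait for the user to confirm."""

response = client.chat.completions.create(
    model="gpt-4.1-2025-04-14",
    messages=[
        {"role": "system", "content": THINKING_INSTRUCTIONS},
        {"role": "user", "content": "What's my weather looking like?"},
    ],
    tools=[get_forecast_tool],  # hypothetical tool schema, not defined here
)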


The stream of events (encrypted reasoning, reasoning summary, output text, tool call and arguments) arrives in the order it is generated.

I believe the output list is in chronological order.
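
One quick way to check is to walk the output list in order (a sketch, reusing the Responses call from earlier in the thread):

for item in response.to_dict()["output"]:
    # Items appear in the order they were generated:
    # the assistant message (preamble text) first, then the function_call.
    print(item["type"])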

I’m wondering what OpenAI does differently in the Responses API that allows GPT-4.1 to write preamble text and then call a tool. When I use the same GPT-4.1 model and the same prompt via the Chat Completions API, the model only generates the preamble text—but doesn’t actually include the tool call in the response.

Does the Responses API add extra prompting behind the scenes? Or does it parse the response and decide whether to call the model again to generate the tool call?