GPT-5 Reasoning Effort Impact on Agent Performance

I’m building an agent on top of the Agents SDK with gpt-5 as my underlying model.

I’m wondering if anyone else is doing the same and has found any evidence of performance impacts from switching reasoning_effort from the default “medium” to “low” or even “minimal”. If not personal experience, have you seen any good research on this?

Our agent is a knowledge work assistant that is primarily used to gather and synthesize data from a bunch of sources. The agent has a handful of high-level tools it uses to retrieve context.

I’ve been experimenting with the effects of changing the reasoning_effort param from “medium” to “low”.

The primary motivation for this is that when I look at my agent traces, very little of the latency is due to tool calls. It’s almost all the model sitting and thinking before or after tool calls.
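For context, the change itself is just a model setting. Here is a minimal sketch of how I’m flipping it in the Python Agents SDK (the tool is a stub, and the exact ModelSettings / Reasoning fields may vary by SDK version, so check yours):

```python
# A minimal sketch of switching reasoning_effort in the Python Agents SDK.
# The tool is a stub for the real context-gathering tools.
from agents import Agent, ModelSettings, Runner, function_tool
from openai.types.shared import Reasoning


@function_tool
def search_knowledge_base(query: str) -> str:
    """Stub for the real retrieval tools the agent calls."""
    return f"(results for {query!r} would go here)"


agent = Agent(
    name="knowledge-assistant",
    instructions="Gather context with your tools, then synthesize an answer.",
    model="gpt-5",
    # The only change between experiment runs: "medium" (default) vs "low" vs "minimal".
    model_settings=ModelSettings(reasoning=Reasoning(effort="low")),
    tools=[search_knowledge_base],
)

result = Runner.run_sync(agent, "Summarize what we know about account X's renewal risk.")
print(result.final_output)
```

Everything else stays identical between runs; only the effort value changes.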

The experiments have been interesting. Some findings so far:

  • On high-complexity tasks requiring lots of context gathering and synthesis, latency drops significantly when turning reasoning down. Quality does not seem to degrade much, if at all.
  • For less complex questions, latency does drop with lower reasoning, but not by a very meaningful amount.
  • Sure enough, low reasoning burns far fewer reasoning tokens even when the model takes the exact same tool-calling path (a simple way to measure this is sketched below).
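For anyone who wants to reproduce the latency / token comparison, you don’t need the full agent loop; a bare Responses API call shows the same gap. A rough sketch (the prompt is a placeholder, and the reasoning-token count is read from usage.output_tokens_details, assuming your openai SDK version exposes it):

```python
# Compare latency and reasoning-token usage across effort levels with a bare
# Responses API call. The prompt is a placeholder for a real task.
import time
from openai import OpenAI

client = OpenAI()
PROMPT = "Compare the renewal terms across these three contracts: ..."  # placeholder task

for effort in ("minimal", "low", "medium"):
    start = time.perf_counter()
    resp = client.responses.create(
        model="gpt-5",
        reasoning={"effort": effort},
        input=PROMPT,
    )
    elapsed = time.perf_counter() - start
    details = resp.usage.output_tokens_details  # includes reasoning_tokens
    print(f"{effort:>7}: {elapsed:5.1f}s, "
          f"{details.reasoning_tokens} reasoning tokens, "
          f"{resp.usage.output_tokens} total output tokens")
```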

Based on these results I started to think, great, I can get away with low reasoning effort. However:

  • Sometimes with low reasoning, the agent misses a critical step / tool call that ultimately causes it to produce an incorrect final response. An example: it does one batch of tool calls and gets partial information. On medium reasoning the model would do another batch and attempt to clarify ambiguities. On low it does not.

My experiments have been small in scale so far. If anyone else has insights here I’d love to hear them.

If your agent is performing simple tasks (extracting simple text fields or sentiment, chatting about a support ticket, etc.) then reasoning effort low is probably fine, and will be much, much faster. Set reasoning higher when you want the model to think deeply about a question, for example when you want it to look at an entire tech support chat and judge whether the agent was effective at applying training materials correctly. To some extent you can set reasoning low and gather information quickly, then raise the reasoning level to make sure things are on track. You can even do this in parallel so the user is experiencing an interactive discussion but there is a smart supervisor in the background. If your use case is not a chat-style agent, you can still use a low reasoning effort to quickly determine whether or not a long think is needed. If so, call the agent again with reasoning cranked up.
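Here is a rough sketch of that low-then-escalate pattern, assuming the Responses API; the ESCALATE convention and the prompts are made up for illustration, not a tested recipe:

```python
# Triage-then-escalate: a cheap low-effort pass answers easy questions and
# flags hard ones, which get a second pass at high effort.
from openai import OpenAI

client = OpenAI()


def answer(question: str) -> str:
    # First pass: low effort, fast. Ask the model to flag questions that need deeper work.
    triage = client.responses.create(
        model="gpt-5",
        reasoning={"effort": "low"},
        input=(
            "Answer the question if it is straightforward. If it needs deeper "
            "analysis or more context gathering, reply with exactly ESCALATE.\n\n"
            + question
        ),
    )
    draft = triage.output_text.strip()
    if draft != "ESCALATE":
        return draft

    # Second pass: only the hard cases pay the high-effort latency cost.
    deep = client.responses.create(
        model="gpt-5",
        reasoning={"effort": "high"},
        input=question,
    )
    return deep.output_text


print(answer("Which of our Q3 support escalations violated the SLA, and why?"))
```

In a chat-style agent you could kick off the high-effort call in the background instead of blocking on it, which is the parallel supervisor idea above.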

I agree with all of this just based on my experiments and what I’ve read elsewhere.

But I’m curious if there’s data to demonstrate this, i.e. how much does switching reasoning effort by a step impact the number of tool calls, end-to-end accuracy, etc.?

Great point… data vs. anecdotes! I do not have recorded data at the granularity of reasoning levels… it’s multivariate, since you would also have to track max tokens, temperature, etc. to plot this out. OpenAI may have it… Useful experiments would not be hard, but could take some time and cost, especially if there’s any kind of temperature setting > 0. I spend $ on reliability experiments (i.e. can I get model X to produce result Y >> 99% of the time) or on comparing models so I can tell a customer that model X “is better than” model Y at task Z. Post back if you find reasoning comparison data anywhere, please!
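To make the shape of the experiment concrete, it is basically just a loop. A sketch like the one below would do it, where the dataset, the stub tool, and the substring grader are placeholders, and the item.type == "tool_call_item" check is how I believe the Agents SDK labels tool-call run items (verify against your version):

```python
# Run each reasoning effort N times over a small labelled set and record
# pass rate, tool-call count, and latency. Dataset, tool, and grader are stubs.
import time
from agents import Agent, ModelSettings, Runner, function_tool
from openai.types.shared import Reasoning


@function_tool
def lookup(query: str) -> str:
    """Stub retrieval tool; swap in the real context-gathering tools."""
    return f"(results for {query!r})"


DATASET = [("Which region had the highest churn last quarter?", "EMEA")]  # placeholder eval set
N_TRIALS = 5

for effort in ("low", "medium"):
    agent = Agent(
        name="knowledge-assistant",
        instructions="Gather context with your tools, then answer.",
        model="gpt-5",
        model_settings=ModelSettings(reasoning=Reasoning(effort=effort)),
        tools=[lookup],
    )
    passes = tool_calls = 0
    latency = 0.0
    for question, expected in DATASET:
        for _ in range(N_TRIALS):
            start = time.perf_counter()
            result = Runner.run_sync(agent, question)
            latency += time.perf_counter() - start
            tool_calls += sum(1 for item in result.new_items if item.type == "tool_call_item")
            passes += expected.lower() in result.final_output.lower()
    total = len(DATASET) * N_TRIALS
    print(f"{effort}: pass {passes}/{total}, avg tool calls {tool_calls / total:.1f}, "
          f"avg latency {latency / total:.1f}s")
```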