OpenAI web search token limit issue

I have the following API call:

```python
response = openai_client.responses.create(
    model="gpt-4o-mini",
    input="Your input prompt here",
    tools=[
        {
            "type": "web_search_preview",
            "search_context_size": "high"
        }
    ]
)
```

It mostly works, but on occasion I get the following (error) response:

```json
{
  "id": "resp_67e158973b208191bc42b115727c0aa20e1a648aff9c28ee",
  "created_at": 1742821527.0,
  "error": null,
  "incomplete_details": {"reason": "max_output_tokens"},
  "instructions": null,
  "metadata": {},
  "model": "gpt-4o-mini-2024-07-18",
  "object": "response",
  "output": [
    {
      "id": "ws_67e15897c464819194bcb16b1a31cbdb0e1a648aff9c28ee",
      "status": "completed",
      "type": "web_search_call"
    }
  ],
  "parallel_tool_calls": true,
  "temperature": 1.0,
  "tool_choice": "auto",
  "tools": [
    {
      "type": "web_search_preview",
      "search_context_size": "high",
      "user_location": {"type": "approximate", "city": null, "country": "US", "region": null, "timezone": null}
    }
  ],
  "top_p": 1.0,
  "max_output_tokens": null,
  "previous_response_id": null,
  "reasoning": {"effort": null, "generate_summary": null},
  "status": "incomplete",
  "text": {"format": {"type": "text"}},
  "truncation": "auto",
  "usage": {
    "input_tokens": 372,
    "input_tokens_details": {"cached_tokens": 0},
    "output_tokens": 16384,
    "output_tokens_details": {"reasoning_tokens": 0},
    "total_tokens": 16756
  },
  "user": null,
  "_request_id": "req_1905918e6fdb561e37fcc310e6cbe5b4"
}
```

The response comes back with status "incomplete" and reason "max_output_tokens", even though I never set max_output_tokens (it shows as null), so there seems to be no way to limit or control the output. How should I resolve this, or is this a bug?
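For reference, the truncation can be detected programmatically from the response's `status` and `incomplete_details` fields. A minimal sketch, assuming the response has been converted to a plain dict (abbreviated here from the JSON above):

```python
import json

# Abbreviated version of the incomplete response shown above.
raw = """
{
  "id": "resp_67e158973b208191bc42b115727c0aa20e1a648aff9c28ee",
  "status": "incomplete",
  "incomplete_details": {"reason": "max_output_tokens"},
  "usage": {"input_tokens": 372, "output_tokens": 16384, "total_tokens": 16756}
}
"""

def truncation_reason(response: dict):
    """Return the incompleteness reason, or None if the response completed."""
    if response.get("status") != "incomplete":
        return None
    details = response.get("incomplete_details") or {}
    return details.get("reason")

resp = json.loads(raw)
print(truncation_reason(resp))  # max_output_tokens
```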


You are not asking for JSON with structured output, so the usual failure mode — the AI filling a JSON object with garbage and never closing it before max_output_tokens is reached — is not the issue here.

However, you are asking for an internal tool to be used. Once the AI has switched to addressing a tool recipient, the generated language is not shown to you, so you can't see that, while writing the tool-call arguments, the model has gone off the rails, filling the search query with nonsense up to the maximum length.

A mini model, a high search_context_size full of distracting results, and unrestrained top_p: that is a formula for bad sequences and repetitive patterns.

The Responses endpoint also offers no frequency_penalty parameter that could break up the repetition.

OK, so what should I tell it to do then? Something simple like: "Give me a summary of what you find on the internet with the following query…"?

You can make the model more "reliable": pass "top_p": 0.5 as a parameter. That excludes low-certainty tokens from being sampled.
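Applied to the original call, that advice would look roughly like this. A sketch, not a confirmed fix: the max_output_tokens value and the lowered search_context_size are my own additions to cap spend and reduce distraction, not something established in this thread.

```python
# Same request with sampling constrained; model and prompt are placeholders.
params = {
    "model": "gpt-4o-mini",
    "input": "Your input prompt here",
    "top_p": 0.5,               # exclude low-certainty tokens from sampling
    "max_output_tokens": 2048,  # explicit cap on tokens you are willing to spend
    "tools": [
        {
            "type": "web_search_preview",
            "search_context_size": "medium",  # "high" pulls in more distracting context
        }
    ],
}

# response = openai_client.responses.create(**params)
```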

The problem may be that the tool call is specified internally using JSON mode or strict enforcement, and once the model enters it and starts emitting tabs or newlines, as it is known to do, you get an unbounded repeating pattern.

There also seems to be an internal tool-call iterator — otherwise the AI would not be able to follow links — so there are multiple ways the AI could exceed max_output_tokens, which is the maximum amount you are willing to spend. gpt-4o-mini, for example, has been observed to call developer tools over and over without end.
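Since the runaway generation happens inside the tool call where you can't intervene, one defensive pattern (my own sketch, not an official remedy) is to check the response status and retry once with a lower top_p to suppress the repetition:

```python
def create_with_fallback(client, **params):
    """Call responses.create; if the output was truncated (status
    "incomplete"), retry once with top_p lowered to 0.5 — a heuristic
    to break runaway repetitive sequences, not a guaranteed fix."""
    response = client.responses.create(**params)
    if response.status == "incomplete":
        response = client.responses.create(**{**params, "top_p": 0.5})
    return response
```

Usage is the same as a plain `responses.create` call, e.g. `create_with_fallback(openai_client, model="gpt-4o-mini", input="...", tools=[...])`.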

I set the temperature to 0 and was also more specific with my prompt, and now it does not seem to use so many tokens. Thanks.