Assistants API v2: max_prompt_tokens gets exceeded, barely, consistently

Today I have had failed runs that end as incomplete with "max_prompt_tokens", where the limit gets exceeded, just barely, but consistently, independent of where I set the limit (I tried values between 25000 and 28000 for max_prompt_tokens). See below. Any clues as to what's going on? It feels like there is a miscalculation of tokens on the OpenAI end.

{
  "id": "run_REDACTED",
  "object": "thread.run",
  "created_at": 1716564198,
  "assistant_id": "asst_REDACTED",
  "thread_id": "thread_REDACTED",
  "status": "incomplete",
  "started_at": 1716564199,
  "expires_at": null,
  "cancelled_at": null,
  "failed_at": null,
  "completed_at": 1716564204,
  "required_action": null,
  "last_error": null,
  "model": "gpt-4o",
  "instructions": "REDACTED",
  "tools": [
    {
      "type": "file_search"
    }
  ],
  "tool_resources": {},
  "metadata": {},
  "temperature": 1.0,
  "top_p": 1.0,
  "max_completion_tokens": 3000,
  "max_prompt_tokens": 26000,
  "truncation_strategy": {
    "type": "auto",
    "last_messages": null
  },
  "incomplete_details": {
    "reason": "max_prompt_tokens"
  },
  "usage": {
    "prompt_tokens": 25964,
    "completion_tokens": 88,
    "total_tokens": 26052
  },
  "response_format": "auto",
  "tool_choice": "auto"
}

The max_prompt_tokens setting doesn't tell the run how to operate. It tells the run when to produce an error.

You provided no limit on the length of the thread conversation via truncation_strategy, which offers only limited control. Then you have file_search enabled, where the AI will call a tool and get back up to 20 chunks of 800 tokens each, plus overlap. So the error you asked for blocks the run, leaving it incomplete.
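
If the retrieval results are what push you over, one option is to cap how many chunks file_search may return. A minimal sketch with the Python SDK; the assistant ID is a placeholder from your run above, and the particular max_num_results value is just an illustration:

from openai import OpenAI

client = OpenAI()

# Cap how many retrieved chunks file_search can inject into the prompt.
# With the defaults (up to 20 chunks of roughly 800 tokens each, plus overlap),
# retrieval alone can eat most of a ~26k max_prompt_tokens budget.
assistant = client.beta.assistants.update(
    "asst_REDACTED",  # placeholder assistant ID
    tools=[
        {
            "type": "file_search",
            "file_search": {"max_num_results": 5},  # fewer chunks, smaller prompt
        }
    ],
)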

From the API docs on truncation_strategy:

"The truncation strategy to use for the thread. The default is auto. […] When set to auto, messages in the middle of the thread will be dropped to fit the context length of the model, max_prompt_tokens. "

I have not set truncation strategy, so it defaults to auto and should honor max_prompt_tokens. Am I misunderstanding?

When I change max_prompt_tokens, the tokens used for the prompt do change with it. It just barely overshoots the limit every time.

The problem persists. Any ideas are welcome.

The max_prompt_tokens setting is only there to produce an error and block the run when the input exceeds your limit.

If you want to actually control the length of past chat, which is what drives the usage up, you must use truncation_strategy.
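
For example, something like this (a rough sketch with the Python SDK; the thread and assistant IDs are placeholders and the message count is arbitrary):

from openai import OpenAI

client = OpenAI()

# Explicitly cap how much past conversation is carried into each run,
# instead of letting max_prompt_tokens cut the run off as "incomplete".
run = client.beta.threads.runs.create(
    thread_id="thread_REDACTED",    # placeholder
    assistant_id="asst_REDACTED",   # placeholder
    truncation_strategy={
        "type": "last_messages",
        "last_messages": 6,         # only the 6 most recent messages are sent
    },
    max_completion_tokens=3000,
)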

OpenAI basically broke Assistants by making the tier-1 per-minute rate limit even lower than what an assistant with documents can use in a single run. Rather than rewarding this crude behavior, which does not even let you send 1/3 of the model context, by paying $50+ to raise your tier, I would suggest building on Chat Completions. That gives you full control over the amount sent to the model on every call.
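
With Chat Completions you decide exactly which messages go out on each call, for example by trimming old turns yourself before sending. A rough sketch; the trim_history helper and the 20000-token budget are made up for illustration:

from openai import OpenAI
import tiktoken

client = OpenAI()
enc = tiktoken.get_encoding("o200k_base")  # tokenizer used by gpt-4o

def trim_history(messages, budget=20000):
    # Drop the oldest non-system messages until a rough token count fits the budget.
    while len(messages) > 2 and sum(len(enc.encode(m["content"])) for m in messages) > budget:
        messages.pop(1)  # index 0 is the system message; keep it
    return messages

history = [
    {"role": "system", "content": "You are a helpful assistant."},
    # ... earlier user/assistant turns go here ...
    {"role": "user", "content": "Latest question here."},
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=trim_history(history),
    max_tokens=3000,
)
print(response.choices[0].message.content)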

I must have read the documentation a billion times and never realized that was the case. Thought for sure it was used as a control parameter, not as a means to trigger an error.