Hi, I’ve got a File Search assistant connected to a 3 MB vector store containing ~94 PDFs. The assistant works fine in the playground, where a typical query uses ~16k input tokens and ~500 output tokens. Tier 1 says it has a limit of 30,000 tokens/minute. When I try to run a query against the assistant via the API, the run comes back with status `incomplete` and reason "max_tokens", no matter what I set the max_tokens parameter to when creating the run. Any tips would be greatly appreciated. I assume it’s something dumb and crucial I’m missing.
You aren’t allowed to directly set the actual max_tokens that each internal model call uses as the assistant runs iteratively. Instead, there is a supervisory token budget that aborts the run once it is exceeded.
Find out what it is now: when you invoke a run with the values you are using, inspect the run object you immediately get back. Its metadata includes the token parameters being used as the cutoff point for that run (which I moved up to the top of this example return):
```json
{
  "temperature": 1.0,
  "top_p": 1.0,
  "max_prompt_tokens": 1000,
  "max_completion_tokens": 1000,
  "truncation_strategy": {
    "type": "auto",
    "last_messages": null
  },
  "id": "run_abc123",
  "object": "thread.run",
  "created_at": 1698107661,
  "assistant_id": "asst_abc123",
  "thread_id": "thread_abc123",
  "status": "completed",
  "started_at": 1699073476,
  "expires_at": null,
  "cancelled_at": null,
  "failed_at": null,
  "completed_at": 1699073498,
  "last_error": null,
  "model": "gpt-4o",
  "instructions": null,
  "tools": [{"type": "file_search"}, {"type": "code_interpreter"}],
  "metadata": {},
  "incomplete_details": null,
  "usage": {
    "prompt_tokens": 123,
    "completion_tokens": 456,
    "total_tokens": 579
  },
  "response_format": "auto",
  "tool_choice": "auto",
  "parallel_tool_calls": true
}
```
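As a minimal sketch (treating a run object shaped like the example above as a plain dict), here is how you might check whether a run was cut off by its token budget rather than finishing normally. The helper function is my own, not part of the SDK:

```python
def explain_stop(run: dict) -> str:
    """Return a short note on why a run stopped, based on its status
    and incomplete_details fields."""
    if run["status"] != "incomplete":
        return "run finished normally"
    details = run.get("incomplete_details") or {}
    reason = details.get("reason", "unknown")
    if reason in ("max_prompt_tokens", "max_completion_tokens"):
        # The supervisory budget aborted the run mid-flight.
        return (f"aborted by the {reason} budget "
                f"(prompt={run['max_prompt_tokens']}, "
                f"completion={run['max_completion_tokens']})")
    return f"incomplete for another reason: {reason}"

# Example: a run cut off by its prompt-token budget.
aborted = {
    "status": "incomplete",
    "incomplete_details": {"reason": "max_prompt_tokens"},
    "max_prompt_tokens": 1000,
    "max_completion_tokens": 1000,
}
print(explain_stop(aborted))
```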
These are the ones you can crank way up to prevent an abort:

    "max_prompt_tokens": 1000,
    "max_completion_tokens": 1000,

Max prompt and max completion shut the assistant off if they are exceeded, which prevents the response from being returned. Set them high enough that only a runaway model would ever trigger them.
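A sketch of raising those budgets when creating a run. The parameter names match the Assistants API run-creation endpoint, but the helper, its defaults, and the placeholder IDs are my own:

```python
def run_params(assistant_id: str,
               max_prompt_tokens: int = 50_000,
               max_completion_tokens: int = 4_000) -> dict:
    """Build kwargs for client.beta.threads.runs.create with generous
    supervisory budgets so the run isn't aborted mid-flight."""
    return {
        "assistant_id": assistant_id,
        "max_prompt_tokens": max_prompt_tokens,
        "max_completion_tokens": max_completion_tokens,
        # "auto" lets the API drop old thread messages rather than
        # failing when the conversation outgrows the context window.
        "truncation_strategy": {"type": "auto"},
    }

params = run_params("asst_abc123")
# run = client.beta.threads.runs.create(thread_id=thread.id, **params)
```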
Tier 1 has quite low token limits. Just two internal calls before a response, each carrying that ~15k of information from the vector store, can overrun your per-minute limit, forcing you to prepay OpenAI more than you’d actually use just to get a functional assistant.
To counter that, besides lowering the message history carried along via truncation_strategy, you can use the newer file_search options to limit how much is injected each time the tool is used: a cap on the number of chunks returned, and a relevance score threshold for keeping non-related results out.
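For example, a file_search tool definition that caps the injected chunks and filters by relevance might look like this. The field names follow the Assistants v2 file_search tool spec, but the helper, its defaults, and the exact threshold are assumptions you should tune for your own corpus:

```python
def file_search_tool(max_results: int = 8,
                     score_threshold: float = 0.5) -> dict:
    """A file_search tool entry that limits how many chunks are
    injected per search and drops low-relevance matches."""
    return {
        "type": "file_search",
        "file_search": {
            "max_num_results": max_results,  # fewer chunks -> fewer prompt tokens
            "ranking_options": {
                "ranker": "auto",
                "score_threshold": score_threshold,  # 0.0-1.0; higher = stricter
            },
        },
    }

# Pass in tools=[...] when creating or updating the assistant:
# client.beta.assistants.update(assistant_id, tools=[file_search_tool()])
```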
Thanks for taking the time and for your detailed response. I will try again with the insight you’ve provided. I really appreciate it.
Gary
Gary Leydon
Director of Educational Technology
Center for Medical Education
203-737-6408
Great post. Would rate it FAQ worthy.