The maximum number of prompt tokens that may be used over the course of the run. The run will make a best effort to use only the number of prompt tokens specified, across multiple turns of the run. If the run exceeds the number of prompt tokens specified, the run will end with status incomplete. See incomplete_details for more info.
What exactly does this mean?
A: only include the last X number of messages in the thread until max_prompt_tokens is reached
B: limit the size of a conversation summary that is included in the prompt
max_prompt_tokens was very confusing to me at first too, but keep digging in. From my understanding, this is the maximum number of tokens that will be sent as your prompt, and it includes your current prompt, the past messages, and overhead such as instructions and behind-the-scenes tokens. So if you set it to, say, 50k, and your current thread is already at 200k (say you've been having a long conversation), you obviously can't send 200k tokens in a prompt, so the run has to truncate the thread down using the truncation strategy (the default is auto) until your current prompt tokens, all the other helper tokens, and the info from the past (truncated) messages add up to no more than 50k tokens. There are different truncation strategies you can use to decide which previous messages get sent along with your current prompt to the model. Hope that helps with the concept a little. And max_completion_tokens is the maximum the model will respond with, fyi.
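To make the truncation idea concrete, here is a hypothetical illustration (not OpenAI library code) of how a "keep the most recent messages" truncation might fit a long thread under a budget. The function name, the overhead figure, and the per-message token counts are all made up for the example:

```python
# Hypothetical sketch: drop oldest messages so the prompt fits the budget.
def truncate_to_budget(message_token_counts, overhead_tokens, max_prompt_tokens):
    """Keep the most recent messages whose tokens, plus fixed overhead
    (instructions, tool definitions, etc.), fit within max_prompt_tokens."""
    budget = max_prompt_tokens - overhead_tokens
    kept = []
    total = 0
    # Walk from newest to oldest, keeping messages while they still fit.
    for count in reversed(message_token_counts):
        if total + count > budget:
            break
        kept.append(count)
        total += count
    kept.reverse()  # restore oldest-first order
    return kept, total + overhead_tokens

# A long conversation, oldest message first, ~200k tokens total.
history = [40_000, 60_000, 50_000, 30_000, 15_000, 5_000]
kept, prompt_total = truncate_to_budget(history,
                                        overhead_tokens=2_000,
                                        max_prompt_tokens=50_000)
print(kept)          # only the most recent messages that fit
print(prompt_total)  # total prompt size, at most 50,000
```

The real API's auto strategy is more sophisticated than this, but the shape of the problem is the same: everything sent to the model per call has to fit the stated budget.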
What it means is that the cumulative input to the AI model across all iterative calls during a run is totaled, and if your threshold is exceeded, the run ends with status incomplete and you get nothing back.
This can simply keep an AI from going off the rails and serving you up a massive bill, for example if it kept trying to run the same failing Python script over and over, or kept searching vector stores because it doesn't believe there are no manatee facts to return like you said were uploaded, with no ability to break out of the pattern.
As a safety feature it should be set high, so you are protected from excess usage but not cut off from results that were still in progress on your dime.
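In practice, these limits are passed when creating a run. A sketch of the request body for the Assistants API run-creation endpoint, with placeholder values (the assistant ID and the specific numbers here are illustrative, not recommendations):

```json
{
  "assistant_id": "asst_placeholder",
  "max_prompt_tokens": 50000,
  "max_completion_tokens": 4096,
  "truncation_strategy": {
    "type": "last_messages",
    "last_messages": 10
  }
}
```

Setting truncation_strategy to last_messages makes the truncation explicit (only the most recent N messages are sent), rather than leaving it to the default auto behavior.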