I have occasionally read that assigning a high value to max_tokens has performance implications.
Can anyone with expert knowledge confirm or correct this anecdote?
Note: I am generating an accurate pre-count of request_tokens (using tiktoken).
What I’m trying to ascertain is whether it makes sense to set max_tokens = model_token_limit - request_tokens or to estimate a lower value, if doing so will improve API response times.
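For concreteness, the "use everything that is left" option can be sketched as below. This is only an illustration: the function name, the `reserve` parameter (a small safety margin in case the pre-count is slightly off), and the 4096 limit are assumptions, not anything taken from the API.

```python
def remaining_completion_budget(model_token_limit: int,
                                request_tokens: int,
                                reserve: int = 0) -> int:
    """Return the largest max_tokens that still fits in the context window.

    model_token_limit: total context size of the model (assumed known).
    request_tokens:    pre-counted prompt tokens (e.g. from tiktoken).
    reserve:           hypothetical safety margin, defaults to none.
    """
    budget = model_token_limit - request_tokens - reserve
    # Never return a negative value; 0 means the prompt already fills the window.
    return max(budget, 0)


# Illustrative numbers only: a 4096-token model and a 1000-token prompt.
max_tokens = remaining_completion_budget(4096, 1000)
```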
There might be more delay if you demand longer responses (that's what max_tokens is for, anyway).
In my product, users are able to set their own max_tokens, and I calculate the final max_tokens with your formula: max_tokens = model_token_limit - request_tokens.
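Roughly, combining a user-chosen value with that formula could look like the sketch below. It assumes the user's setting is treated as an upper bound that gets clamped to whatever room is left. The function name and the choice to raise an error when the prompt leaves no room are illustrative, not a description of any particular product.

```python
from typing import Optional


def final_max_tokens(user_max_tokens: Optional[int],
                     model_token_limit: int,
                     request_tokens: int) -> int:
    """Clamp a user-chosen max_tokens to the room left in the context window."""
    available = model_token_limit - request_tokens
    if available <= 0:
        # Assumed policy: refuse rather than send a request that cannot fit.
        raise ValueError("prompt already fills the model's context window")
    if user_max_tokens is None:
        # No user preference: fall back to everything that is left.
        return available
    # Respect the user's cap, but never exceed the available room
    # and never go below one token.
    return max(1, min(user_max_tokens, available))
```

With a 4096-token model and a 1000-token prompt, a user setting of 500 stays at 500, while a setting of 8000 is reduced to 3096.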