Your rate-limit usage is calculated as the maximum of max_tokens and the estimated token count based on the character count of your request. Try to set max_tokens as close to your expected response size as possible.
Or, for the chat endpoint, don't send the max_tokens parameter at all. Then tokens you don't actually use aren't counted against your rate limit, and the entire non-input context length remains available for generating a response.
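To make that concrete, here's a minimal sketch of the counting logic as described above. The `max()` formula, the roughly-4-characters-per-token estimate, and the function name `estimated_rate_limit_cost` are my own illustrative assumptions, not a documented contract:

```python
import math

CHARS_PER_TOKEN = 4  # rough rule-of-thumb estimate, not an exact tokenizer


def estimated_rate_limit_cost(prompt: str, max_tokens: int | None) -> int:
    """Tokens a request likely counts against your rate limit (assumed model)."""
    input_estimate = math.ceil(len(prompt) / CHARS_PER_TOKEN)
    if max_tokens is None:
        # Without max_tokens, only the character-based estimate applies.
        return input_estimate
    # With max_tokens set, the limiter takes the larger of the two,
    # so a generous max_tokens inflates the cost of every request.
    return max(input_estimate, max_tokens)


# A short prompt with max_tokens=4000 is charged as 4000 tokens, which is
# how "random" 429s can appear while you seem far below your usage limits.
print(estimated_rate_limit_cost("Summarize this paragraph.", 4000))  # 4000
print(estimated_rate_limit_cost("Summarize this paragraph.", None))  # 7
```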
This answer should be pinned to the homepage in the largest font available. Thank you @_j for finally explaining why I was getting random 429s when I was nowhere near any usage limits.
In my case, I was receiving 429s for longer prompts, even though max_tokens was the same for all requests.
I find the way tokens are counted towards the limit super confusing, and even ChatGPT was not able to point me in the right direction!