First: the model context window is 128k tokens, and you then have to subtract your max_completion_tokens from that to find the maximum input space remaining. So keep inputs below the point where the API would simply return a model error - don't send 150k tokens.
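As a sanity check, you can count tokens locally before sending. A minimal sketch, assuming tiktoken's o200k_base encoding (the one GPT-4o uses) and ignoring the few tokens of per-message overhead:

```python
import tiktoken

CONTEXT_WINDOW = 128_000       # GPT-4o context limit
MAX_COMPLETION_TOKENS = 4_000  # whatever you pass to the API

enc = tiktoken.get_encoding("o200k_base")

def fits_in_context(prompt: str) -> bool:
    """True if the prompt leaves room for max_completion_tokens of output."""
    input_budget = CONTEXT_WINDOW - MAX_COMPLETION_TOKENS
    return len(enc.encode(prompt)) <= input_budget
```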
It is your usage tier that sets your TPM (tokens per minute). You effectively have a "bucket" of tokens that refills at the specified rate. For GPT-4o, the TPM is 450,000 at tier 2 and 800,000 at tier 3.
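OpenAI doesn't publish the exact limiter algorithm, but the usual mental model is a token bucket. A rough sketch under that assumption, with continuous refill at TPM/60 tokens per second:

```python
import time

class TokenBucket:
    """Rough model of a TPM limiter: a bucket that refills continuously."""

    def __init__(self, tpm: int):
        self.capacity = float(tpm)
        self.tokens = float(tpm)
        self.last = time.monotonic()

    def try_spend(self, n: int) -> bool:
        """Deduct n tokens if available; otherwise the request would be blocked."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.capacity / 60)
        self.last = now
        if n <= self.tokens:
            self.tokens -= n
            return True
        return False

bucket = TokenBucket(tpm=450_000)  # tier-2 GPT-4o
```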
One might think that cached input wouldn't count against your rate as much, since you are receiving the discount and OpenAI is doing less computation - but this is not the case. The rate limiter is a front end that can block requests before anything is processed by an AI model, including before any cache lookup would happen.
This does have the benefit of making the limits predictable - you don't have to guess whether you'll get a cache hit based on prompt length or cache expiration. The normal token-counting rate limit is applied either way.
To demonstrate, I make a chatbot. For cache activation, I prime it with some tokens to be cached - the first act of a Shakespeare play, perhaps? Then I make it send each of my inputs twice.
Unfortunately, although I know it to be true that cache hits make no difference to the remaining rate - the rate should decrease identically - at my high tier the "remaining tokens" counter refills so fast (resetting almost immediately) that subsequent calls show the same remaining tokens, just as they would if I wrote the code with a method to break the cache.
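For reference, here is roughly how such a test can be written with the openai Python SDK, reading the rate-limit headers via with_raw_response. The priming file name is illustrative; I'm also assuming the primed prompt clears the 1024-token minimum that caching requires:

```python
from openai import OpenAI

client = OpenAI()
priming = open("cymbeline_act1.txt").read()  # illustrative: ~10k tokens of play text

def ask(question: str):
    raw = client.chat.completions.with_raw_response.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": priming},
            {"role": "user", "content": question},
        ],
        max_completion_tokens=500,
    )
    print("x-ratelimit-remaining-tokens:",
          raw.headers.get("x-ratelimit-remaining-tokens"))
    chat = raw.parse()  # the normal ChatCompletion object
    print("Usage info:", chat.usage.model_dump_json())
    return chat

# Identical requests: the second should report cached_tokens > 0 in usage.
ask("In act one, which character has the most monologues?")
ask("In act one, which character has the most monologues?")
```

The transcript below shows the result: identical remaining-token headers on both calls, even though the second call reports 9,856 cached tokens.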
prompt> In act one, which character has the most monologues?
--- First API Request ---
x-ratelimit-remaining-tokens: 149990728
Usage info (first call): {"prompt_tokens": 9935,"completion_tokens": 111,"total_tokens": 10046,"prompt_tokens_details": {"cached_tokens": 0,"audio_tokens": 0},"completion_tokens_details": {"reasoning_tokens": 0,"audio_tokens": 0,"accepted_prediction_tokens": 0,"rejected_prediction_tokens": 0}}
--- Second API Request ---
x-ratelimit-remaining-tokens: 149990728
Usage info (second call): {"prompt_tokens": 9935,"completion_tokens": 105,"total_tokens": 10040,"prompt_tokens_details": {"cached_tokens": 9856,"audio_tokens": 0},"completion_tokens_details": {"reasoning_tokens": 0,"audio_tokens": 0,"accepted_prediction_tokens": 0,"rejected_prediction_tokens": 0}}
assistant> In Act One of "Cymbeline," the character with the most monologues is likely Imogen. She expresses her feelings and desires in several lengthy speeches, particularly in her interactions with Posthumus and her reflections on her situation regarding her father and her marriage. Imogen's lines convey her emotional depth and complexities surrounding love, loyalty, and her father's anger. Other characters, such as Cymbeline and the Queen, also have significant speeches, but Imogen's monologues are notably prominent in this act.
prompt>
Besides upgrading your tier, you can slow down your requests. The TPM isn't divided into discrete minutes - the bucket refills continuously - so delaying in proportion to the tokens you just sent will keep you under the limit.
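A sketch of that pacing, assuming tier 2's 450,000 TPM and the total_tokens figure from the usage object shown above:

```python
import time

TPM = 450_000  # tier-2 GPT-4o limit

def pace(total_tokens: int) -> None:
    """Sleep long enough for the bucket to refill what this call spent."""
    time.sleep(total_tokens / (TPM / 60))
```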