Does prompt caching reduce TPM?

Hey everyone! I’m building a chatbot that uses gpt-4o-2024-08-06 to provide suggestions and answers to the user, based on data I feed it in the system prompt. This data is around 100-150k tokens.

While testing, I keep getting rate-limited on TPM. On the first request I see that cached_tokens is around 100% of the tokens sent, but on subsequent requests I get the TPM rate-limit error.

Is there something I’m doing wrong?

First: the model’s context window is 128k tokens, and you have to subtract your max_completion_tokens from that to get the maximum input space remaining. For example, with max_completion_tokens set to 4,096, you have roughly 124k tokens of input space. So keep inputs below what would simply return a model error from the API - not 150k tokens.
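If you want to verify your input size before sending, here’s a quick sketch using the tiktoken library (the data file name is hypothetical):

```python
# Count tokens locally before sending, using tiktoken.
# Requires a recent tiktoken release that knows the gpt-4o encoding.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")  # resolves to o200k_base

with open("match_data.txt") as f:  # hypothetical data file
    system_prompt = f.read()

n_tokens = len(enc.encode(system_prompt))
print(f"System prompt: {n_tokens} tokens")
# n_tokens + max_completion_tokens must stay under the 128k context window
```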

Your TPM limit comes from your usage tier. You effectively have a “bucket” that refills at the specified rate. For GPT-4o, the TPM is 450,000 at tier 2 and 800,000 at tier 3 (at tier 2, that works out to a refill rate of 7,500 tokens per second).

While one might think the “rate” wouldn’t count against you as much when you receive the caching discount and OpenAI is doing less computation, this is not the case. The rate limiter is a front end that can block requests before anything is processed by an AI model, including any cache lookup that would otherwise happen.

This does have the benefit of making the limits predictable: you don’t need to guess whether you’ll get a cache hit based on length or expiration. It’s still just the normal token-counting rate limit that is applied.


To demonstrate, I make a chatbot. For cache activation, I prime it on some tokens worth caching: the first act of a Shakespeare play, perhaps? Then I make it send my input twice.
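Here is a minimal sketch of that demo, assuming the openai Python SDK (v1.x); the play’s file name is hypothetical. The with_raw_response wrapper exposes the rate-limit headers alongside the parsed completion:

```python
# Send the same large prompt twice and inspect rate-limit headers plus usage.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("cymbeline_act_one.txt") as f:  # hypothetical file, ~10k tokens
    act_one = f.read()  # must exceed 1024 tokens to be eligible for caching

messages = [
    {"role": "system", "content": "Answer questions about this text:\n\n" + act_one},
    {"role": "user", "content": "In act one, which character has the most monologues?"},
]

for label in ("First", "Second"):
    raw = client.chat.completions.with_raw_response.create(
        model="gpt-4o-2024-08-06",
        messages=messages,
    )
    print(f"--- {label} API Request ---")
    print("x-ratelimit-remaining-tokens:",
          raw.headers.get("x-ratelimit-remaining-tokens"))
    print("Usage info:", raw.parse().usage.model_dump_json())
```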

Unfortunately, although I know it to be true that cache hits make no difference to the remaining rate, and that the rate should decrease identically, at my high tier the “remaining tokens” bucket refills so fast (to the point of being reset almost immediately) that subsequent calls show the same remaining tokens - just as if I wrote the code with a method to break the cache.

prompt> In act one, which character has the most monologues?

--- First API Request ---
x-ratelimit-remaining-tokens: 149990728
Usage info (first call): {"prompt_tokens": 9935,"completion_tokens": 111,"total_tokens": 10046,"prompt_tokens_details": {"cached_tokens": 0,"audio_tokens": 0},"completion_tokens_details": {"reasoning_tokens": 0,"audio_tokens": 0,"accepted_prediction_tokens": 0,"rejected_prediction_tokens": 0}}

--- Second API Request ---
x-ratelimit-remaining-tokens: 149990728
Usage info (second call): {"prompt_tokens": 9935,"completion_tokens": 105,"total_tokens": 10040,"prompt_tokens_details": {"cached_tokens": 9856,"audio_tokens": 0},"completion_tokens_details": {"reasoning_tokens": 0,"audio_tokens": 0,"accepted_prediction_tokens": 0,"rejected_prediction_tokens": 0}}

assistant> In Act One of "Cymbeline," the character with the most monologues is likely Imogen. She expresses her feelings and desires in several lengthy speeches, particularly in her interactions with Posthumus and her reflections on her situation regarding her father and her marriage. Imogen's lines convey her emotional depth and complexities surrounding love, loyalty, and her father's anger. Other characters, such as Cymbeline and the Queen, also have significant speeches, but Imogen's monologues are notably prominent in this act.
prompt> 

Besides upgrading your tier, you can slow down your requests. The TPM bucket isn’t divided into discrete minutes; it refills continuously, so delaying long enough for the bucket to recover what you just sent will help.
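For example, a rough pacing sketch under a tier-2 limit (the token count here is illustrative):

```python
# Sleep long enough for the bucket to refill what the last call consumed.
import time

TPM = 450_000          # tier-2 limit for GPT-4o
tokens_used = 110_000  # e.g. a 100k-token system prompt plus overhead

refill_seconds = tokens_used / (TPM / 60)  # about 14.7 s at tier 2
time.sleep(refill_seconds)
```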

I see what you’re saying and thank you for such a detailed response!

The thing is, it’s not a good user experience to have the user wait, especially if they are paying. What other techniques are there?

I saw a recommendation to store and query embeddings, but I’m not sure whether it can help in my case. Each row of my Excel file consists of 70 columns containing data and statistics for soccer matches. Could that help?

If you are discussing providing a service to users: Tier 4, which gives a two-million-TPM rate limit, requires having paid OpenAI $250 total, with 14+ days elapsed since your first successful payment at the time of that latest prepayment. Consider that OpenAI can have a single ChatGPT user paying them $200 per month, to put things in perspective.

But you are correct: using a retrieval database or search service to provide the AI only what is relevant to the user input is far more cost-efficient than “read this book every time, just to possibly answer one question from it”.

Semantic search doesn’t work well on raw data such as Excel statistics turned into CSV, so you might need more advanced tools that give the AI the ability to write its own queries.
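As one possible shape for that, here’s a sketch using tool calling to let the model write SQL against the stats, assuming the spreadsheet has been loaded into a SQLite table named matches (the database file and column description are hypothetical):

```python
# Let the model write its own query, run it, and answer from the result.
import json
import sqlite3
from openai import OpenAI

client = OpenAI()
db = sqlite3.connect("soccer_stats.db")  # hypothetical DB built from the Excel file

tools = [{
    "type": "function",
    "function": {
        "name": "run_sql",
        "description": "Run a read-only SQL query against the matches table.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [
    {"role": "system", "content": "Answer using the matches table. Columns: ..."},
    {"role": "user", "content": "Which team scored the most goals at home?"},
]

first = client.chat.completions.create(
    model="gpt-4o-2024-08-06", messages=messages, tools=tools
)
call = first.choices[0].message.tool_calls[0]  # assumes the model chose the tool
rows = db.execute(json.loads(call.function.arguments)["query"]).fetchall()

# Feed the query result back so the model can answer in natural language.
messages.append(first.choices[0].message)
messages.append({"role": "tool", "tool_call_id": call.id, "content": json.dumps(rows)})
final = client.chat.completions.create(model="gpt-4o-2024-08-06", messages=messages)
print(final.choices[0].message.content)
```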

Any suggestions on how to do that?