Cache not caching more than 1024 tokens (expected: increments of 128 tokens)

I’m using the gpt-4o model with a long system prompt.

According to the gpt-4o tokenizer, my prompt length is 1181 tokens.

Since the docs say that caching is done in increments of 128 tokens, I’d expect 1024+128=1152 tokens to be cached. However, I consistently get only 1024 tokens cached.

Why is that?
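
For reference, here is the arithmetic behind my expectation, as a rough sketch (assuming, per my reading of the docs, that nothing is cached below 1024 tokens and cache hits then grow in 128-token increments):

def expected_cached_tokens(prompt_tokens: int) -> int:
    # Assumption from the docs: no caching below 1024 tokens, then the
    # cached prefix is the prompt length rounded down to a 128-token boundary.
    if prompt_tokens < 1024:
        return 0
    return (prompt_tokens // 128) * 128

print(expected_cached_tokens(1181))  # 1152 - yet the API keeps reporting 1024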

Code:

system_prompt = "- You are an email marketing expert. - Your goal is to create email content per user's request. - To do that, use email.json given below as follows: 1. Set 'subject' with a content fitting user's request, constructing it such that it leads to high engagement. 2. Use 'blocks' as a guide for available email building blocks, choosing freely those fitting user's request (you can use the same block multiple times), and follow strictly the JSON output format.- Notes regarding 'fields': 1. All 'fields' must be filled with value. 2. As for the 'subject' and 'content' fields: a. Their language should match the one speficied by the user's for the email. Otherwise, use the user's detected language. b. Use personalization when you find fitting, using the placeholder [[first_name]]. 3. The 'type' field should always have a plain string value. in cases the json defines it as a string array (e.g. the 'socialLink' block),              you should select the one plain string value out of this array that fits the user request. 4. The 'total_blocks' fields determine the total number of blocks you should include. You must follow the user given amounts, unless value of -1 is chosen, which means it's up to you to determine. Note: No need to echo this field back in your response. 5. The 'requires_group' field indicates that the block type must be nested under a group block type (see below). 6. For 'url' typed fields, use the url provided by the user, but if they don't provide one: - For 'image' url fields, unless already containing other value, use https://placehold.co/WxH/png as placeholder, setting W and H dimensions as you see fit. - For 'linkUrl' url fields, unless already containing other value, use https://google.com as a placeholder. 7. For 'address' typed fields, use the address provided by the user, but if they don't provide one: - If the detected user language is Hebrew, use: 'רחוב הרצל 1, תל אביב'. Else, use '1 Herzl st, TLV'. - Otherwise, for any address provided by the user, format it in the same way given above. - Notes regarding 'attributes': 1. The 'layout' field is defined in the system prompt as an array just to provide you with its options. However, if you use it in your response, return it as a string, e.g. 'layout': 'v'. - Notes regarding blocks: - 1. The 'group' block is special: it nests blocks of other types, such that: a. The group's blocks can only be of a certain block type. An example of such nested 'blocks': [{ 'type':'bullet', ... , 'fields':{'content':'a'}, {'type':'bullet', ... , 'fields':{'content':'b'}}]. b. The group's 'attributes' field must contain a 'layout', where 'h' is horizontal, 'v' is vertical, and 'hv' or 'vh' are a mix. - 2. The 'other' block is used to represent blocks otherwise not represented. This type is only used by the 'user', so just echo it if received, but don't generate such blocks yourself. - Your responses must be iterative, so that with each additional 'user' prompt, you take the previous 'assistant' response and modify it according to the user request. - The first 'assistant' message contains the current email contents, so base your first response on it, unless otherwise requested by the user. - Output should be JSON only. 
- Here's email.json: {'subject':'string','blocks':[{'type':'header','id':'guid','total_blocks':-1,'fields':{'title':'string'}},{'type':'paragraph','id':'guid','total_blocks':-1,'fields':{'content':'string'}},{'type':'image','id':'guid','total_blocks':-1,'fields':{'image':'url','title':'string'}},{'type':'actionButton','id':'guid','total_blocks':-1,'fields':{'title':'string'}},{'type':'article','id':'guid','total_blocks':-1,'fields':{'title':'string','content':'string','image':'url'}},{'type':'event','id':'guid','total_blocks':-1,'fields':{'title':'string','content':'string','location':'address','startDate':'DD/MM/YYYY HH:MM','endDate':'DD/MM/YYYY HH:MM','linkTitle':'string','linkUrl':'url','showDates':true}},{'type':'product','id':'guid','total_blocks':-1,'fields':{'title':'string','content':'string','image':'url','price':'string','linkTitle':'string','linkUrl':'url'}},{'type':'bullet','id':'guid','requires_group':true,'fields':{'content':'string'}},{'type':'socialLink','id':'guid','requires_group':true,'fields':{'type':['facebook','youtube','linkedin','instagram','whatsapp','x','tiktok','telegram','website'],'url':'url'}},{'type':'logo','id':'guid','fields':{'title':'string','website':'url','image':'url'}},{'type':'divider','id':'guid','total_blocks':-1},{'type':'spacer','id':'guid','total_blocks':-1},{'type':'group','id':'guid','blocks':['bullet','article','product','socialLink'],'attributes':{'layout':['h','v','hv','vh']}},{'type':'other','id':'guid'}]}"

user_prompt = "Create a mail for marketing my Yoga class starting next week"

from openai import OpenAI

client = OpenAI()
conversation = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=conversation,
    temperature=1,
    response_format={ "type": "json_object" },
)

I experience the same behavior:

  1. I had to increase the size of my initial prompt to 1200 tokens before caching started to work; the next round-trip to gpt-4o-2024-08-06 then returned a chat completion object with "prompt_tokens_details": {"cached_tokens": 1024}. It looks like the gpt-4o tokenizer and gpt-4o-2024-08-06 use different encodings, although according to the docs they should both use the same o200k_base encoding (see the token-counting sketch after this list).

  2. Documentation says:

Cached prefixes generally remain active for 5 to 10 minutes of inactivity. However, during off-peak periods, caches may persist for up to one hour.

In most cases the prompt is indeed cached, and I can see in the logs that even new sessions started within 10 minutes reuse the cache. But sometimes the initial prompt within a session is not cached, and a round-trip 30 seconds later still doesn't pick up the cache. I don't understand why.
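
Here is the token-counting sketch mentioned in point 1, assuming tiktoken is installed and that system_prompt stands in for your own prompt text; both gpt-4o and gpt-4o-2024-08-06 are documented to use o200k_base:

import tiktoken

enc = tiktoken.get_encoding("o200k_base")
raw_tokens = len(enc.encode(system_prompt))
print(raw_tokens)

Note that the API's prompt_tokens will come back somewhat higher than this raw count, because the chat message format adds a few wrapper tokens per message (roles, separators), so some gap between the tokenizer page and the billed prompt_tokens is expected even with identical encodings.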

Seems to be working for me. Typing at normal speed here.

Caching is based on the input tokens of the previous turn, not on its generation, which was never processed as input.

prompt tokens: 136 - not cached in following turn
prompt tokens: 1081 - 1024 cached in the following turn.

As for the initial concern, I guess we have to try that out also…

Previously 1185 tokens of input: 1152 cached.
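
A quick way to check that for yourself, as a rough sketch (assuming the Python SDK, a recent version that exposes usage.prompt_tokens_details, and system_prompt standing in for your own long prefix):

from openai import OpenAI

client = OpenAI()

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Turn one question"},
]
first = client.chat.completions.create(model="gpt-4o", messages=messages)
print("turn 1 prompt:", first.usage.prompt_tokens,
      "cached:", first.usage.prompt_tokens_details.cached_tokens)

# The second turn reuses the same prefix, so its cached_tokens should be
# roughly turn 1's prompt_tokens rounded down to a 128-token boundary.
messages += [
    {"role": "assistant", "content": first.choices[0].message.content},
    {"role": "user", "content": "Turn two question"},
]
second = client.chat.completions.create(model="gpt-4o", messages=messages)
print("turn 2 prompt:", second.usage.prompt_tokens,
      "cached:", second.usage.prompt_tokens_details.cached_tokens)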


I’m working with the OpenAI API via the official .NET package; does that make any difference?

The messages array is constructed this way:

  var messages = new List<ChatMessage>()
  {
    ChatMessage.CreateSystemMessage(prompt)
  };
  messages.AddRange(conversationHistory);
  if (ragContext.Any())
  {
    messages.Add(ChatMessage.CreateSystemMessage(ragContext));
  }
  • prompt is a constant 1200 tokens long (measured with the gpt-4o tokenizer), so that prompt caching kicks in from the start of the conversation
  • conversationHistory is a list of user questions and assistant answers
  • ragContext is the result of the vector search

@_j I will be grateful for your insights or suggestions if any on how to improve this workflow.

Capture the full response from your initial request and read the usage at the top level of the response, or enable it as the last streaming chunk, as I just demonstrated (Python):

if 'usage' in chunk and not chunk.get('choices'):
    state.usage = chunk['usage']
    return
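
If you are calling the Python SDK directly, here is a rough sketch of both ways to read it, assuming a recent SDK version where usage carries prompt_tokens_details and conversation is your messages list:

from openai import OpenAI

client = OpenAI()

# Non-streaming: usage sits at the top level of the response object.
response = client.chat.completions.create(model="gpt-4o", messages=conversation)
print(response.usage.prompt_tokens,
      response.usage.prompt_tokens_details.cached_tokens)

# Streaming: request usage via stream_options; the final chunk then arrives
# with an empty choices list and only the usage block filled in.
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=conversation,
    stream=True,
    stream_options={"include_usage": True},
)
for chunk in stream:
    if not chunk.choices and chunk.usage is not None:
        print(chunk.usage.prompt_tokens_details.cached_tokens)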

You’ll obtain the usage:

Or you scroll up in your Python REPL console, copy the exact same request to the prompt line to send it again - and get different token accounting and no cache, and say WTF.

Try a third time, there’s a cache:

(The AI reply to pasting 200 lines of console was “It sounds like you’re having an interesting discussion about OpenAI’s recent updates…”)

So a curious person could actually program repeatability instead of pasting nonsense to the bot, and see if there is a fluke and where it arises. Send 60 different requests, then repeat them spread throughout an hour, to find that expiration. To no end, really, other than “we didn’t guarantee it”.

In your particular request, it seems like you have:
system: {prompt} - 1200 tokens
(turns)
system: “document”

You can send the system message alone and see what usage it yields via the API. That will be the repeated part, which is guaranteed to have the same initial token sequence if all goes to plan.
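
For example, a throwaway loop like this makes the check repeatable instead of eyeballing a console; a rough sketch, again assuming the Python SDK exposes prompt_tokens_details and that system_prompt is your constant prefix:

import time
from openai import OpenAI

client = OpenAI()

for i in range(10):
    r = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "system", "content": system_prompt}],
    )
    u = r.usage
    print(i, "prompt:", u.prompt_tokens,
          "cached:", u.prompt_tokens_details.cached_tokens)
    time.sleep(3)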

This is exactly what I do to monitor caching.

So I’m not the only one who experiences this strange behavior. It’s as if a different encoding comes into play.

@_j I have sent the system message alone to gpt-4o-2024-08-06 ten times:


  1. ASSISTANT 20:01:27

{"usage": {"total_tokens": 1268, "prompt_tokens": 1244, "completion_tokens": 24, "prompt_tokens_details": {"audio_tokens": 0, "cached_tokens": 0}, "completion_tokens_details": {"audio_tokens": 0, "reasoning_tokens": 0, "accepted_prediction_tokens": 0, "rejected_prediction_tokens": 0}}}

  2. ASSISTANT 20:01:31

{"usage": {"total_tokens": 1268, "prompt_tokens": 1244, "completion_tokens": 24, "prompt_tokens_details": {"audio_tokens": 0, "cached_tokens": 1024}, "completion_tokens_details": {"audio_tokens": 0, "reasoning_tokens": 0, "accepted_prediction_tokens": 0, "rejected_prediction_tokens": 0}}}

  3. ASSISTANT 20:01:34

{"usage": {"total_tokens": 1268, "prompt_tokens": 1244, "completion_tokens": 24, "prompt_tokens_details": {"audio_tokens": 0, "cached_tokens": 1024}, "completion_tokens_details": {"audio_tokens": 0, "reasoning_tokens": 0, "accepted_prediction_tokens": 0, "rejected_prediction_tokens": 0}}}

  4. ASSISTANT 20:01:37

{"usage": {"total_tokens": 1295, "prompt_tokens": 1244, "completion_tokens": 51, "prompt_tokens_details": {"audio_tokens": 0, "cached_tokens": 1024}, "completion_tokens_details": {"audio_tokens": 0, "reasoning_tokens": 0, "accepted_prediction_tokens": 0, "rejected_prediction_tokens": 0}}}

  5. ASSISTANT 20:01:41

{"usage": {"total_tokens": 1271, "prompt_tokens": 1244, "completion_tokens": 27, "prompt_tokens_details": {"audio_tokens": 0, "cached_tokens": 1024}, "completion_tokens_details": {"audio_tokens": 0, "reasoning_tokens": 0, "accepted_prediction_tokens": 0, "rejected_prediction_tokens": 0}}}

  6. ASSISTANT 20:02:25

{"usage": {"total_tokens": 1268, "prompt_tokens": 1244, "completion_tokens": 24, "prompt_tokens_details": {"audio_tokens": 0, "cached_tokens": 0}, "completion_tokens_details": {"audio_tokens": 0, "reasoning_tokens": 0, "accepted_prediction_tokens": 0, "rejected_prediction_tokens": 0}}}

  7. ASSISTANT 20:02:28

{"usage": {"total_tokens": 1293, "prompt_tokens": 1244, "completion_tokens": 49, "prompt_tokens_details": {"audio_tokens": 0, "cached_tokens": 1024}, "completion_tokens_details": {"audio_tokens": 0, "reasoning_tokens": 0, "accepted_prediction_tokens": 0, "rejected_prediction_tokens": 0}}}

  8. ASSISTANT 20:02:34

{"usage": {"total_tokens": 1295, "prompt_tokens": 1244, "completion_tokens": 51, "prompt_tokens_details": {"audio_tokens": 0, "cached_tokens": 1024}, "completion_tokens_details": {"audio_tokens": 0, "reasoning_tokens": 0, "accepted_prediction_tokens": 0, "rejected_prediction_tokens": 0}}}

  9. ASSISTANT 20:02:37

{"usage": {"total_tokens": 1295, "prompt_tokens": 1244, "completion_tokens": 51, "prompt_tokens_details": {"audio_tokens": 0, "cached_tokens": 1024}, "completion_tokens_details": {"audio_tokens": 0, "reasoning_tokens": 0, "accepted_prediction_tokens": 0, "rejected_prediction_tokens": 0}}}

  10. ASSISTANT 20:02:40

{"usage": {"total_tokens": 1277, "prompt_tokens": 1244, "completion_tokens": 33, "prompt_tokens_details": {"audio_tokens": 0, "cached_tokens": 1024}, "completion_tokens_details": {"audio_tokens": 0, "reasoning_tokens": 0, "accepted_prediction_tokens": 0, "rejected_prediction_tokens": 0}}}


So, prompt token counting is consistent but token caching is not (the 6th round-trip consumed zero cached tokens).