Prompt caching (automatic!)

Many apps pass the same context to the model over and over again, for instance a codebase, product information, a bunch of function calls, or a long multi-turn conversation history. Starting today, we will give you a 50% discount on input tokens the model has seen recently, no action required. Caching kicks in at 1024 tokens and is available on our latest snapshots of GPT-4o, GPT-4o mini, o1-preview, and o1-mini.
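The discount applies to the matching prefix of the prompt, so it pays to put the static part (system instructions, tool definitions) first and the per-request content last. A minimal sketch of what that looks like with the Python SDK, assuming a static system prompt well over 1024 tokens (the helper name is just illustrative):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Static prefix goes first so it can be cached; per-request content goes last
STATIC_SYSTEM_PROMPT = "..."  # assume this is well over 1024 tokens

def ask(question):
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": STATIC_SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    usage = completion.usage
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) if details else 0
    print(f"Cached {cached} of {usage.prompt_tokens} prompt tokens")
    return completion.choices[0].message.content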

Read more in our docs.

18 Likes

By far the most immediately useful feature you launched today. Love that it’s automatic.

4 Likes

thank you so much for the community post on this!!

I’ve just tried using it, but unfortunately it didn’t work. Is this something that will be rolled out later today, or perhaps over the course of the week? I used gpt-4o for my tests.

Edit: I’ve also noticed the Python API hasn’t been updated in 6 hours, so maybe this is still rolling out. I keep getting Cached Tokens: 0 / 3872.

Here is the code in case anyone wants to check it out:

import time
from openai import OpenAI
import tiktoken

# Initialize the OpenAI client with the provided API key
client = OpenAI(api_key="do-not-commit-to-github-like-this-otherwise-your-api-key-will-leak")

# Function to get the precise token count using tiktoken
def precise_token_count(text, model="gpt-4o"):
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

# Static system instructions to ensure caching threshold is met (above 1024 tokens)
system_instructions = """
You are a helpful assistant. You provide informative, concise answers. Additionally, you are programmed to deliver responses in a polite, professional tone. Your purpose is to assist users with a variety of queries, ranging from simple to complex. To ensure clarity, please structure your answers in a step-by-step format when applicable. Moreover, if a user asks a question that can be solved with a specific formula, ensure to include it in your response. When presented with general questions, attempt to offer examples or analogies to make concepts easier to grasp.

The goal of your interaction is not only to provide correct answers but to do so in a way that makes the user feel comfortable and confident in the responses received. Your language should be friendly yet formal. Always consider context and provide relevant additional details that could help the user achieve their goals.

If a user poses a technical question, particularly one related to programming or mathematics, ensure to break down the solution into understandable parts. For math, show each step clearly, and for programming, provide commented code when appropriate. Your mission is not just to give an answer but to offer an explanation that empowers the user with deeper understanding.

Furthermore, when responding to complex inquiries, avoid overwhelming the user with too much information at once. Instead, provide bite-sized responses that can be easily digested. If a user requires further explanation, invite them to ask for clarification or additional details.

You can also offer additional resources or links if the user would like to read more about the topic you are discussing. By doing so, you act as both an immediate solution provider and a guide for deeper exploration.

Please keep all responses under 1000 characters unless the user specifically asks for more. In the case where an answer exceeds this limit, ensure to mention it at the end of your response, offering a brief summary upfront, followed by details. Your tone should remain friendly, engaging, and informative at all times.
""" * 10  # Repeat to ensure it's well over 1024 tokens

# Precise token count of the system message using tiktoken
system_tokens_precise = precise_token_count(system_instructions)
print(f"Precise token count for system instructions: {system_tokens_precise}\n")

def test_prompt_caching(dynamic_content, iteration):
    # Estimate the token count of the dynamic content
    dynamic_tokens_precise = precise_token_count(dynamic_content)
    total_tokens_precise = system_tokens_precise + dynamic_tokens_precise
    print(f"=== Iteration {iteration} ===")
    print(f"User input: {dynamic_content}")
    print(f"Precise total tokens: {total_tokens_precise}")
    
    # Print system instructions for verification
    print("\nSystem Instructions (trimmed):")
    print(system_instructions[:500])  # Print the first 500 characters of the system message for clarity
    
    # Start timer to measure latency
    start_time = time.time()

    # Make the API request
    completion = client.chat.completions.create(
        model="gpt-4o",  # Use the gpt-4o model as specified
        messages=[
            {
                "role": "system",
                "content": system_instructions  # Static system message for caching
            },
            {
                "role": "user",
                "content": dynamic_content  # Dynamic user input (varying part)
            }
        ]
    )
    
    # Measure response time
    response_time = time.time() - start_time

    # Get usage information, including cached tokens
    usage_data = completion.usage

    # Check and print usage details
    if hasattr(usage_data, 'prompt_tokens'):
        print(f"Prompt Tokens: {usage_data.prompt_tokens}")
    else:
        print("Prompt tokens data not available.")

    if hasattr(usage_data, 'completion_tokens'):
        print(f"Completion Tokens: {usage_data.completion_tokens}")
    else:
        print("Completion tokens data not available.")

    if hasattr(usage_data, 'prompt_tokens_details') and hasattr(usage_data.prompt_tokens_details, 'cached_tokens'):
        cached_tokens = usage_data.prompt_tokens_details.cached_tokens
        print(f"Cached Tokens: {cached_tokens} / {usage_data.prompt_tokens}")
    else:
        print("Cached tokens data not available or caching not applied.")

    # Print response
    print(f"Response: {completion.choices[0].message.content}")
    print(f"Response Time: {response_time:.4f} seconds\n")

# Test with multiple dynamic inputs to see caching behavior
previous_messages = "Hi there! This is test number 1. "  # Base user message
for i in range(2, 7):
    # Vary the user message while keeping the static system prompt the same
    dynamic_content = previous_messages + f"Hi there! This is test number {i}."
    
    # Run the caching test
    test_prompt_caching(dynamic_content, i)
    
    # Wait a bit between requests to simulate different requests
    time.sleep(2)

1 Like

I think this is a great idea! It really makes “Prompt Engineering” more engineering-ish. There are so many nuanced and exacting, yet regularized tasks that this will make accessible to small businesses.

Will the discount stack with a batch job if there are cacheable prompts in the batch?

1 Like

It’s pretty awesome.

It’ll be even more awesome once a session grows past 1024 tokens; then multi-turn chat sessions will come with ~50% off input tokens.
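A rough sketch of how that plays out: each turn resends the earlier messages as an unchanged prefix, so once the history crosses the 1024-token minimum most of the prompt should be served from the cache (model name and helper are illustrative):

from openai import OpenAI

client = OpenAI()

messages = [{"role": "system", "content": "You are a helpful assistant."}]

def send(user_text):
    # Everything already in `messages` is an unchanged prefix of the next request,
    # so once it exceeds 1024 tokens it becomes eligible for caching
    messages.append({"role": "user", "content": user_text})
    completion = client.chat.completions.create(model="gpt-4o", messages=messages)
    reply = completion.choices[0].message
    messages.append({"role": "assistant", "content": reply.content})
    details = completion.usage.prompt_tokens_details
    print("Cached:", getattr(details, "cached_tokens", 0), "/", completion.usage.prompt_tokens)
    return reply.content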

I believe there is an “automated prompt engineering” entry coming to the cookbook soon.

1 Like

It’s working, although the 5 minutes are not guaranteed - it’s more like 5 seconds(!). More often than not, after a few seconds the cache is cold already.

Or maybe it’s a CDN issue, and we get fresh machines 50% of the time.

o1-mini seems the most predictable at the moment with an ~80% hit rate, with the cache being hit regularly. gpt-4o and gpt-4o-mini have a ~50% hit rate for me (sending exactly the same request of ~1500 tokens).
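If anyone wants to reproduce the hit-rate numbers, a quick-and-dirty sketch that just repeats an identical request and counts how often cached_tokens comes back non-zero (the placeholder prompt and sample size are arbitrary):

import time
from openai import OpenAI

client = OpenAI()

long_prompt = "..."  # any static prompt comfortably over 1024 tokens

hits = 0
n = 20
for _ in range(n):
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": long_prompt}],
    )
    details = completion.usage.prompt_tokens_details
    if getattr(details, "cached_tokens", 0) > 0:
        hits += 1
    time.sleep(2)  # identical requests, spaced a couple of seconds apart

print(f"Hit rate: {hits}/{n}")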

From the docs:

Discounting for Prompt Caching is not available on the Batch API

1 Like

Ahhh! Thanks. That’s exactly what I expected!

Interesting, thanks for sharing this! I’ll give it another try tomorrow.

Hi @jeffsharris !

According to the guide, the prompt cache is shared within the same organization. So it is not tied to a single API key? I.e. if we have a bunch of API keys (e.g. projects) under the same organization, and two projects use the same structured outputs schema but on different data, will they technically hit the cache (assuming the minimum token and eviction-timing conditions are met)?

Also: how is the “prefix” derived? Is it the first X tokens, and what is X?

2 Likes

Prompt caching is a great addition to the API. However, what I would really love is an option to control the cache myself over the API. That would let me create a private, persistent cache: a conversation starting point for my AI assistants / agents that encapsulates complex and large prompts.

I’m assuming they can’t really support that from a cost perspective. They’re basically having to load your prompt into the model’s KV cache on one machine and then direct future model calls to that same machine. That’s why they say it’s best effort. If they start getting a ton of large requests, they’ll need to start evicting caches to free up space.

I’m actually a bit surprised they elected to make this automatic versus opt-in. I say that because I do a ton of RAG requests (thousands) that are larger than 1k tokens, and it’s highly unlikely I’ll make the same request again, so there’s no point in caching them.

The automatic nature of things means that a prompt that might actually benefit from caching is at risk of being evicted before mine, which I can guarantee will never be called again. Doesn’t make a lot of sense…

I’m going to be honest… I love all of the innovation and this feature is great but it, like a lot of the previous features, feels a bit tacked on and not very well thought out.

I see. Well, Anthropic allows developers to manage their private prompt caches. I think that it is still an interesting feature so I was hoping to see something similar in OpenAI’s API.

Anthropic took the opt-in approach to prompt caching, which dramatically changes your design choices given the reduced load that comes with it…

And if you look at Anthropic’s pricing (I just did), they have to charge you 25% more for the initial prompt cache write, given that they’re reserving memory for you. You’ll earn this back over the course of future reads, but you do have to pay more upfront…
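Rough back-of-the-envelope on that, assuming the multipliers on Anthropic’s pricing page at the time (cache writes at 1.25x the base input price, cache reads at roughly 0.1x; treat both numbers as assumptions):

# Break-even sketch for explicit (Anthropic-style) prompt caching, per cached token
base = 1.00         # relative cost of a normal, uncached input token
cache_write = 1.25  # assumed: first request pays a 25% premium to write the cache
cache_read = 0.10   # assumed: later requests read cached tokens at ~10% of base

def cached_cost(n_requests):
    # One cache write, then (n_requests - 1) cache reads
    return cache_write + (n_requests - 1) * cache_read

for n in (1, 2, 5):
    print(n, round(cached_cost(n), 2), "vs", n * base, "uncached")
# At n=1 you pay 1.25 vs 1.0; by n=2 you're already ahead (1.35 vs 2.0)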

1 Like