thank you so much for the community post on this!!
I’ve just tried using it, but unfortunately it didn’t work. Is this something that will be added later today, or rolled out over the coming week? I used gpt-4o for my tests.
Edit: I’ve also noticed the Python API hasn’t been updated in 6 hours, so maybe this is still rolling out. I keep getting Cached Tokens: 0 / 3872.
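For anyone else debugging this: based on my reading of the announcement (caching kicks in once the prompt prefix reaches 1024 tokens and then grows in 128-token increments — an assumption about the rules, not official API behavior), you can estimate the largest prefix that could show up as cached like this:

```python
def cacheable_tokens(prompt_tokens: int) -> int:
    """Estimate the largest cacheable prefix for a prompt of the given size.

    Assumption (from the announcement, not the API reference): caching
    requires at least 1024 prompt tokens, and cached prefixes grow in
    128-token increments beyond that.
    """
    if prompt_tokens < 1024:
        return 0  # below the minimum, nothing is cached
    return prompt_tokens - (prompt_tokens % 128)  # round down to a 128-token boundary

print(cacheable_tokens(3872))  # with 3872 prompt tokens, up to 3840 could be cached
```

So if the feature were active for my request, I’d expect something closer to `Cached Tokens: 3840 / 3872` rather than 0.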
Here is the code in case anyone wants to check it out:
import time

from openai import OpenAI
import tiktoken

# Initialize the OpenAI client with the provided API key
client = OpenAI(api_key="do-not-commit-to-github-like-this-otherwise-your-api-key-will-leak")

# Function to get the precise token count using tiktoken
def precise_token_count(text, model="gpt-4o"):
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))
# Static system instructions to ensure caching threshold is met (above 1024 tokens)
system_instructions = """
You are a helpful assistant. You provide informative, concise answers. Additionally, you are programmed to deliver responses in a polite, professional tone. Your purpose is to assist users with a variety of queries, ranging from simple to complex. To ensure clarity, please structure your answers in a step-by-step format when applicable. Moreover, if a user asks a question that can be solved with a specific formula, ensure to include it in your response. When presented with general questions, attempt to offer examples or analogies to make concepts easier to grasp.
The goal of your interaction is not only to provide correct answers but to do so in a way that makes the user feel comfortable and confident in the responses received. Your language should be friendly yet formal. Always consider context and provide relevant additional details that could help the user achieve their goals.
If a user poses a technical question, particularly one related to programming or mathematics, ensure to break down the solution into understandable parts. For math, show each step clearly, and for programming, provide commented code when appropriate. Your mission is not just to give an answer but to offer an explanation that empowers the user with deeper understanding.
Furthermore, when responding to complex inquiries, avoid overwhelming the user with too much information at once. Instead, provide bite-sized responses that can be easily digested. If a user requires further explanation, invite them to ask for clarification or additional details.
You can also offer additional resources or links if the user would like to read more about the topic you are discussing. By doing so, you act as both an immediate solution provider and a guide for deeper exploration.
Please keep all responses under 1000 characters unless the user specifically asks for more. In the case where an answer exceeds this limit, ensure to mention it at the end of your response, offering a brief summary upfront, followed by details. Your tone should remain friendly, engaging, and informative at all times.
""" * 10 # Repeat to ensure it's well over 1024 tokens
# Precise token count of the system message using tiktoken
system_tokens_precise = precise_token_count(system_instructions)
print(f"Precise token count for system instructions: {system_tokens_precise}\n")
def test_prompt_caching(dynamic_content, iteration):
    # Estimate the token count of the dynamic content
    dynamic_tokens_precise = precise_token_count(dynamic_content)
    total_tokens_precise = system_tokens_precise + dynamic_tokens_precise

    print(f"=== Iteration {iteration} ===")
    print(f"User input: {dynamic_content}")
    print(f"Precise total tokens: {total_tokens_precise}")

    # Print the first 500 characters of the system message for verification
    print("\nSystem Instructions (trimmed):")
    print(system_instructions[:500])

    # Start timer to measure latency
    start_time = time.time()
    # Make the API request
    completion = client.chat.completions.create(
        model="gpt-4o",  # Use the gpt-4o model as specified
        messages=[
            {
                "role": "system",
                "content": system_instructions  # Static system message for caching
            },
            {
                "role": "user",
                "content": dynamic_content  # Dynamic user input (varying part)
            }
        ]
    )
    # Measure response time
    response_time = time.time() - start_time

    # Get usage information, including cached tokens
    usage_data = completion.usage

    # Check and print usage details
    if hasattr(usage_data, 'prompt_tokens'):
        print(f"Prompt Tokens: {usage_data.prompt_tokens}")
    else:
        print("Prompt tokens data not available.")

    if hasattr(usage_data, 'completion_tokens'):
        print(f"Completion Tokens: {usage_data.completion_tokens}")
    else:
        print("Completion tokens data not available.")

    if hasattr(usage_data, 'prompt_tokens_details') and hasattr(usage_data.prompt_tokens_details, 'cached_tokens'):
        cached_tokens = usage_data.prompt_tokens_details.cached_tokens
        print(f"Cached Tokens: {cached_tokens} / {usage_data.prompt_tokens}")
    else:
        print("Cached tokens data not available or caching not applied.")

    # Print response
    print(f"Response: {completion.choices[0].message.content}")
    print(f"Response Time: {response_time:.4f} seconds\n")
# Test with multiple dynamic inputs to see caching behavior
previous_messages = "Hi there! This is test number 1. "  # Base user message
for i in range(2, 7):
    # Grow the dynamic content by keeping the previous messages and appending a new one
    dynamic_content = previous_messages + f"Hi there! This is test number {i}."
    previous_messages = dynamic_content + " "

    # Run the caching test
    test_prompt_caching(dynamic_content, i)

    # Wait a bit between requests
    time.sleep(2)