Regarding the Issue of Half-Priced Prompt Caching

I have seven different prompt modules, each consisting of about 700 tokens. When interacting, I combine these modules.
The minimum combination uses one module each time; combinations of two or more modules will exceed 1,024 tokens. The combinations could be something like “1+2” + specific question, “3+5” + specific question, “2+5” + specific question, or “5” + specific question. For this interaction method, I expect around 10,000 interactions. Could this kind of scenario qualify for half-priced prompt input?

Hello @hongyhbs,

To clarify the “half-price scenario,” if you’re referring to prompt caching, it applies when the cached prefix—meaning the constant portion—reaches 1024 tokens or more. In this situation, you’ll be charged 50% for the cached tokens and the regular rate for any additional tokens.
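For illustration, here is a minimal sketch of that billing math in Python; the per-million rate is a placeholder, not an actual published price:

```python
# Minimal sketch of the billing math described above; the per-million
# rate is a placeholder, not an actual price.
def input_cost(total_input_tokens: int, cached_tokens: int,
               rate_per_million: float = 2.50) -> float:
    """Cached tokens are billed at 50% of the input rate, the rest at the full rate."""
    uncached = total_input_tokens - cached_tokens
    cost = uncached * rate_per_million + cached_tokens * rate_per_million * 0.5
    return cost / 1_000_000

# Example: a 2,500-token prompt where the first 1,408 tokens hit the cache.
print(input_cost(2_500, 1_408))
```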

If your use case is asynchronous, consider using the Batch API. This option will incur only 50% of the cost for the total token count per successfully completed request, regardless of size.
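If that fits, a Batch API submission looks roughly like the sketch below; the model name and the contents of `requests.jsonl` are placeholders for your own requests:

```python
# Sketch of a Batch API submission with the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()

# requests.jsonl holds one JSON request per line, e.g.
# {"custom_id": "req-1", "method": "POST", "url": "/v1/chat/completions",
#  "body": {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "1+2 ... question"}]}}
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)  # poll client.batches.retrieve(batch.id) until it completes
```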


You can approach this algorithmically to estimate how much of your input may produce a cache hit and earn the discount.

The first “algorithm” –

  • why would you not use the Batch API, where, if you can wait overnight, the entire bill is reduced to 50%, not just the portion of the input that happens to match and hit a cache before it expires.

It seems you would not get a cache hit unless a request repeats one of your “module” prefixes exactly the same way, with no more than about five minutes between requests, which is the minimum time the documentation says a cache entry should persist.

To optimize the caching potential, you would want to sort the requests and run them within a short period of time:

1+2
1+2
1+3
1+3
1+3
1+4
1+5
1+5

etc.

Single modules below the minimum cacheable length would get no discount at all if the question varies, but they could still be sorted and run separately from this.
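As a rough sketch of that sorting idea (module texts, combinations, and questions below are made up for illustration):

```python
# Group requests by identical module prefix so repeats of the same prefix
# run close together, within the cache's few-minute lifetime.
from itertools import groupby

# Hypothetical module texts (stand-ins for the ~700-token modules).
MODULES = {"1": "<module 1 text>", "2": "<module 2 text>", "3": "<module 3 text>"}

# Each pending request: (module combination, specific question)
pending = [
    (("1", "3"), "question A"),
    (("1", "2"), "question B"),
    (("1", "2"), "question C"),
    (("1", "3"), "question D"),
]

# Sort so identical prefixes sit next to each other, then run each group
# back to back to maximize prefix cache hits.
pending.sort(key=lambda r: r[0])
for combo, group in groupby(pending, key=lambda r: r[0]):
    for _, question in group:
        prompt = "\n".join(MODULES[m] for m in combo) + "\n" + question
        print(prompt[:40], "...")  # send the request here instead of printing
```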


Thank you for your reply. This system is a complex multi-turn interaction system. To get the final required information for a single query, it takes 9 rounds of interaction. The current round is the 7th, and there are still 2 rounds left that depend on the information returned from the 7th round. In this situation, I’m not sure how to implement an asynchronous approach.
I don’t have any programming training; I describe the logic and AI helps me complete the entire system. I must say that the emergence of AI is a great thing. Thanks.

Thank you for your suggestion; I think it’s a viable option under the current mechanism, though if the system doesn’t recognize the shared prefix between “3+4” and “3+5”, I won’t get the half price even if module 3 exceeds 1,024 tokens. In reality, my prompts consist of three parts: N1+N2 (uncertain, but combined they will definitely exceed 1,024 tokens) + a specific question (uncertain) + reasoning-process guidance (1,100 tokens).

Anything that is in common will be cached, in steps of 128 tokens after the initial 1024.

So if your “module 3” is, say, 1,111 tokens, then “3 + anything” will get a 1,024-token cache hit from a recent previous use of “3 + anything”.
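A minimal sketch of that arithmetic, under my reading of the 1,024-token minimum plus 128-token increments:

```python
def cacheable_prefix(common_prefix_tokens: int) -> int:
    """Tokens eligible for the cache discount, assuming a 1,024-token
    minimum and 128-token increments beyond that."""
    if common_prefix_tokens < 1024:
        return 0
    return 1024 + ((common_prefix_tokens - 1024) // 128) * 128

print(cacheable_prefix(700))   # 0    (a single ~700-token module is too short)
print(cacheable_prefix(1111))  # 1024 (matches the "module 3" example above)
print(cacheable_prefix(1500))  # 1408
```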
