Regarding the Issue of Half-Priced Prompt Caching

I have seven different prompt modules, each consisting of about 700 tokens. When interacting, I combine these modules.
The minimum combination uses one module each time; combinations of two or more modules will exceed 1,024 tokens. The combinations could be something like “1+2” + specific question, “3+5” + specific question, “2+5” + specific question, or “5” + specific question. For this interaction method, I expect around 10,000 interactions. Could this kind of scenario qualify for half-priced prompt input?

Hello @hongyhbs,

To clarify the “half-price scenario,” if you’re referring to prompt caching, it applies when the cached prefix—meaning the constant portion—reaches 1024 tokens or more. In this situation, you’ll be charged 50% for the cached tokens and the regular rate for any additional tokens.
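For illustration, here is a minimal sketch of that billing math in Python; the per-million rate is a placeholder, not an actual published price:

```python
# Minimal sketch of the billing math described above; the per-million
# rate is a placeholder, not an actual price.
def input_cost(total_input_tokens: int, cached_tokens: int,
               rate_per_million: float = 2.50) -> float:
    """Cached tokens are billed at 50% of the input rate, the rest at the full rate."""
    uncached = total_input_tokens - cached_tokens
    cost = uncached * rate_per_million + cached_tokens * rate_per_million * 0.5
    return cost / 1_000_000

# Example: a 2,500-token prompt where the first 1,408 tokens hit the cache.
print(input_cost(2_500, 1_408))
```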

If your use case is asynchronous, consider using the Batch API. This option will incur only 50% of the cost for the total token count per successfully completed request, regardless of size.
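If that fits, a Batch API submission looks roughly like the sketch below; the model name and the contents of `requests.jsonl` are placeholders for your own requests:

```python
# Sketch of a Batch API submission with the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()

# requests.jsonl holds one JSON request per line, e.g.
# {"custom_id": "req-1", "method": "POST", "url": "/v1/chat/completions",
#  "body": {"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "1+2 ... question"}]}}
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)  # poll client.batches.retrieve(batch.id) until it completes
```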


You can approach this algorithmically to estimate how much of your input may produce a cache hit and earn the discount.

The first “algorithm” –

  • why would you not use the Batch API, where, if you can wait overnight, the entire bill is reduced to 50%, not just the portion of the input that happens to match and hit a cache before it expires.

It seems you would not get a cache hit unless a request repeats one of your “module” prefixes exactly the same way, with no more than about five minutes between requests, which is the minimum time the documentation says a cache entry should persist.

To optimize the caching potential, you would want to sort the requests and run them within a short period of time:

1+2
1+2
1+3
1+3
1+3
1+4
1+5
1+5

etc.

Single modules below the minimum cacheable length would get no discount at all if the question varies, but they could still be sorted and run separately from this.
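As a rough sketch of that sorting idea (module texts, combinations, and questions below are made up for illustration):

```python
# Group requests by identical module prefix so repeats of the same prefix
# run close together, within the cache's few-minute lifetime.
from itertools import groupby

# Hypothetical module texts (stand-ins for the ~700-token modules).
MODULES = {"1": "<module 1 text>", "2": "<module 2 text>", "3": "<module 3 text>"}

# Each pending request: (module combination, specific question)
pending = [
    (("1", "3"), "question A"),
    (("1", "2"), "question B"),
    (("1", "2"), "question C"),
    (("1", "3"), "question D"),
]

# Sort so identical prefixes sit next to each other, then run each group
# back to back to maximize prefix cache hits.
pending.sort(key=lambda r: r[0])
for combo, group in groupby(pending, key=lambda r: r[0]):
    for _, question in group:
        prompt = "\n".join(MODULES[m] for m in combo) + "\n" + question
        print(prompt[:40], "...")  # send the request here instead of printing
```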


Thank you for your reply. This system is a complex multi-turn interaction system. To get the final required information for a single query, it takes 9 rounds of interaction. The current round is the 7th, and there are still 2 rounds left that depend on the information returned from the 7th round. In this situation, I’m not sure how to implement an asynchronous approach.
I don’t have any programming training; I describe the logic and AI helps me complete the entire system. I must say that the emergence of AI is a great thing. Thanks.

Thank you for your suggestion; I think it’s a viable option under the current mechanism, though if the system doesn’t recognize the shared prefix between “3+4” and “3+5”, I won’t get the half price even if module 3 exceeds 1,024 tokens. In reality, my prompts consist of three parts: N1+N2 (uncertain, but combined they will definitely exceed 1,024 tokens) + a specific question (uncertain) + reasoning-process guidance (1,100 tokens).

Anything that is in common will be cached, in steps of 128 tokens after the initial 1024.

So if your “module 3” is, say, 1,111 tokens, then “3 + anything” will get a 1,024-token cache hit from a recent previous use of “3 + anything”.
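A minimal sketch of that arithmetic, under my reading of the 1,024-token minimum plus 128-token increments:

```python
def cacheable_prefix(common_prefix_tokens: int) -> int:
    """Tokens eligible for the cache discount, assuming a 1,024-token
    minimum and 128-token increments beyond that."""
    if common_prefix_tokens < 1024:
        return 0
    return 1024 + ((common_prefix_tokens - 1024) // 128) * 128

print(cacheable_prefix(700))   # 0    (a single ~700-token module is too short)
print(cacheable_prefix(1111))  # 1024 (matches the "module 3" example above)
print(cacheable_prefix(1500))  # 1408
```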
