Do you cache your API results?

Since each request is pretty expensive and can be long latency, do you find that you commonly cache API results? If so, what do you use to cache?

Or do you find that in general every request is unique and it’s not worth it?


I’m trying to at the moment, since I racked up a 10 USD cost in a single day…
Some of the requests were repetitive, so yes, for me it would totally be worth it.


I think you might be interested in this library we wrote for this:

pip install rmmbr
from rmmbr import cloud_cache

n_called = 0

@cloud_cache(
    "https://rmmbr.net",
    "your-service-token",
    "some name for the cache",
    60 * 60 * 24, # TTL is one day.
    "your-encryption-key",
)
async def f(x: int):
  global n_called  # the counter lives at module level, so "global" rather than "nonlocal"
  n_called += 1
  return x

await f(3)  # run these inside an async context, e.g. via asyncio.run()
await f(3)
# n_called is 1 here: the second call was served from the cache

We built a string-match and semantic-match cache solution for our tool recently - ⭐ Reducing LLM Costs & Latency with Semantic Cache

Even the semantic cache has been quite accurate, especially in RAG and Q&A use cases, where we are seeing 20% cache hits consistently.

So, you’re taking the LLM prompt text and doing an embedding retrieval on it against a centralised database of replies… am I wrong in assuming that you “could” just insert another LLM at the end of that pipeline and extract customers from OpenAI?

I may be getting the wrong end of the stick here, but it looks like you want to be a data arbitration layer that returns what you deem to be a suitable answer, while the end user thinks it’s coming from an OpenAI LLM?

We don’t control the prompts OR the outputs. We hash the whole message body with SHA-256 and run our cache lookups against that hash.
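
As a rough illustration of that exact-match layer, here is a minimal sketch in Python: hash a deterministic serialization of the request body with SHA-256 and use the digest as the cache key. The helper names, the in-memory dict, and the OpenAI client usage are my own assumptions for the example, not their actual implementation.

# Minimal exact-match cache sketch (hypothetical helpers, in-memory store).
import hashlib
import json

from openai import OpenAI

client = OpenAI()
_cache: dict[str, str] = {}

def hash_request(model: str, messages: list[dict]) -> str:
    # Serialize deterministically so identical requests produce the same key.
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_completion(model: str, messages: list[dict]) -> str:
    key = hash_request(model, messages)
    if key in _cache:
        return _cache[key]  # cache hit: skip the API call entirely
    response = client.chat.completions.create(model=model, messages=messages)
    answer = response.choices[0].message.content
    _cache[key] = answer
    return answer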

If a prompt’s output has been cached, we just return it without doing anything else on our side. For the semantic cache, we do a vector search with similarity ranking and return the cached output only if the confidence is >95%.
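
And a similarly hedged sketch of the semantic-cache path: embed the prompt, compare it against previously stored prompts, and reuse the cached answer only above a 0.95 cosine-similarity threshold. The embedding model and the in-memory list are placeholder choices for the example; a real deployment would use a vector database.

# Minimal semantic-cache sketch (placeholder embedding model, in-memory store).
import math

from openai import OpenAI

client = OpenAI()
semantic_cache: list[tuple[list[float], str]] = []  # (prompt embedding, cached answer)

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def semantic_lookup(prompt: str, threshold: float = 0.95) -> str | None:
    # Return the best-matching cached answer, but only above the threshold.
    query = embed(prompt)
    best_score, best_answer = 0.0, None
    for vec, answer in semantic_cache:
        score = cosine(query, vec)
        if score > best_score:
            best_score, best_answer = score, answer
    return best_answer if best_score >= threshold else None

def semantic_store(prompt: str, answer: str) -> None:
    semantic_cache.append((embed(prompt), answer))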

What we’re doing serves as a middle layer between your app and your LLM provider and adds production capabilities on top of it: caching, but also retries, load balancing, tracing, etc.