Do you cache your API results?

Since each request is pretty expensive and can be long latency, do you find that you commonly cache API results? If so, what do you use to cache?

Or do you find that in general every request is unique and it’s not worth it?


I’m trying to at the moment, since I racked up a 10 USD cost in a single day…
Some of the requests were repetitive, so yes, for me it would totally be worth it.


I think you might be interested in this library we wrote for this:

pip install rmmbr
from rmmbr import cloud_cache

n_called = 0

@cloud_cache(
    "https://rmmbr.net",
    "your-service-token",
    "some name for the cache",
    60 * 60 * 24, # TTL is one day.
    "your-encryption-key",
)
async def f(x: int):
  global n_called  # the counter lives at module level, so "global" rather than "nonlocal"
  n_called += 1
  return x

await f(3)  # run these inside an async context, e.g. via asyncio.run()
await f(3)
# n_called is 1 here: the second call was served from the cache

We built a string-match and semantic-match cache solution for our tool recently - ⭐ Reducing LLM Costs & Latency with Semantic Cache

Even the semantic cache has been quite accurate, especially in RAG and Q&A use cases, where we are seeing 20% cache hits consistently.

So, you’re taking the LLM prompt text and doing an embedding retrieval on it against a centralised database of replies… am I wrong in assuming that you “could” just insert another LLM at the end of that pipeline and extract customers from OpenAI?

I may be getting the wrong end of the stick here, but it looks like you want to be a data arbitration layer that returns what you deem to be a suitable answer, while the end user thinks it’s coming from an OpenAI LLM?

We don’t control the prompts OR the outputs. We hash the whole message body with SHA-256 and run our cache lookups against that hash.
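
As a rough illustration of that exact-match layer, here is a minimal sketch in Python: hash a deterministic serialization of the request body with SHA-256 and use the digest as the cache key. The helper names, the in-memory dict, and the OpenAI client usage are my own assumptions for the example, not their actual implementation.

# Minimal exact-match cache sketch (hypothetical helpers, in-memory store).
import hashlib
import json

from openai import OpenAI

client = OpenAI()
_cache: dict[str, str] = {}

def hash_request(model: str, messages: list[dict]) -> str:
    # Serialize deterministically so identical requests produce the same key.
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_completion(model: str, messages: list[dict]) -> str:
    key = hash_request(model, messages)
    if key in _cache:
        return _cache[key]  # cache hit: skip the API call entirely
    response = client.chat.completions.create(model=model, messages=messages)
    answer = response.choices[0].message.content
    _cache[key] = answer
    return answer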

If a prompt’s output has been cached, we just return it without doing anything else on our side. For the semantic cache, we do a vector search with similarity ranking and return the cached output only if the confidence is >95%.
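
And a similarly hedged sketch of the semantic-cache path: embed the prompt, compare it against previously stored prompts, and reuse the cached answer only above a 0.95 cosine-similarity threshold. The embedding model and the in-memory list are placeholder choices for the example; a real deployment would use a vector database.

# Minimal semantic-cache sketch (placeholder embedding model, in-memory store).
import math

from openai import OpenAI

client = OpenAI()
semantic_cache: list[tuple[list[float], str]] = []  # (prompt embedding, cached answer)

def embed(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def semantic_lookup(prompt: str, threshold: float = 0.95) -> str | None:
    # Return the best-matching cached answer, but only above the threshold.
    query = embed(prompt)
    best_score, best_answer = 0.0, None
    for vec, answer in semantic_cache:
        score = cosine(query, vec)
        if score > best_score:
            best_score, best_answer = score, answer
    return best_answer if best_score >= threshold else None

def semantic_store(prompt: str, answer: str) -> None:
    semantic_cache.append((embed(prompt), answer))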

What we’re doing serves as a middle layer between your app and your LLM provider and adds production capabilities on top of it: caching, but also retries, load balancing, tracing, etc.