Caching representations

Is there a way in the API to cache representations of (part of) a prompt, so that we don’t need to run GPT-3 on the entire set of examples for every new input? If not, could someone look into implementing this? I would be happy to implement it myself, e.g., by adding an option to return the network representation of a string and an option to input a representation cache when generating.

This is a super common use case, given the few-shot learning paradigm. Not re-processing the “training” tokens for every inference step would cut costs by 10-20x, and it would also massively reduce the environmental cost of unnecessary computation. This would be INCREDIBLY helpful for such a small change!

3 Likes

I believe the finetune api has the answers you’re looking for.

Thanks, that’s something I considered, but it would be much better to have the cached representations from language modeling with the original weights. Fine-tuning would require ~100x more labeled examples, and it looks like fine-tuning davinci isn’t available.

To be clear, this doesn’t require changing anything about the model; it’s just a small API change for a very common use case!

1 Like

Davinci is too slow and costly to run in production, IMO. Fine-tuning works with as few as 10 examples, and no caching mechanism exists to serve such a use case: the cached tokens would still occupy memory on the GPU, and that cost has to be shared by you.
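
To put a rough number on that last point, here is a back-of-envelope sketch, assuming a GPT-3-175B-scale model with the architecture reported in the GPT-3 paper (96 layers, d_model = 12288) and fp16 keys/values; OpenAI’s actual serving setup is unknown:

```python
# Back-of-envelope: GPU memory a cached prompt would occupy.
# Assumptions (not confirmed by OpenAI): GPT-3-175B-scale model with 96 layers and
# d_model = 12288 (per the GPT-3 paper), keys/values stored in fp16.
n_layers = 96
d_model = 12288
bytes_per_scalar = 2  # fp16
kv_bytes_per_token = 2 * n_layers * d_model * bytes_per_scalar  # keys + values, every layer

prompt_tokens = 1500  # a typical few-shot prompt
print(f"{kv_bytes_per_token / 2**20:.1f} MiB per token")                              # ~4.5 MiB
print(f"{prompt_tokens * kv_bytes_per_token / 2**30:.1f} GiB for the cached prompt")  # ~6.6 GiB
```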

1 Like

Big bump! This would be a simple way for both OpenAI and its API users to save money and compute. Ideally we would pay only for the one-time inference over a prompt/prefix, plus its storage, and could then optionally re-input it for any future API calls. As a result, users wouldn’t be afraid to write big prompts containing, e.g., examples of the task they want the model to solve.

If not, could someone look into implementing this? . . . by adding an option to return the network representation of a string and an option to input a representation cache when generating.

HuggingFace models already have an interface which enables this. As the documentation notes, the keys and values (big nested lists of floats) for each attention block in a transformer constitute the “representation” that is cached and can be re-inputted.
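
Concretely, in the transformers library the cache is exposed through `past_key_values`: run the shared prompt once, keep the returned keys/values, and feed only the new tokens on later calls. A minimal sketch with GPT-2 (the prompt text is just a made-up few-shot example):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Process the shared few-shot prompt exactly once and keep its per-layer keys/values.
prompt = "Review: great movie!\nSentiment: positive\nReview: boring plot.\nSentiment:"
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    cache = model(prompt_ids, use_cache=True).past_key_values

# Each new input only pays for its own tokens; the prompt is never re-processed.
new_ids = tokenizer(" negative\nReview: loved it.\nSentiment:", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(new_ids, past_key_values=cache, use_cache=True)
next_token_logits = out.logits[:, -1, :]  # distribution over the next token
```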

I used this caching for GPT-2 in my classification package and verified that it saves time (and probably quite a bit of compute) in this experiment. According to this testing module, caching and no caching give the same logits (up to 1e-4 absolute tolerance and 1e-5 relative tolerance, which are pretty standard for neural network hidden states).
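
Not the actual testing module, but a self-contained check in the same spirit (with those tolerances) looks roughly like this:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt_ids = tokenizer("Review: great movie!\nSentiment:", return_tensors="pt").input_ids
new_ids = tokenizer(" positive\nReview: dull.\nSentiment:", return_tensors="pt").input_ids

with torch.no_grad():
    # No caching: run prompt + continuation in a single forward pass.
    full_logits = model(torch.cat([prompt_ids, new_ids], dim=1)).logits
    no_cache_logits = full_logits[:, prompt_ids.shape[1]:, :]

    # Caching: run the prompt once, then only the continuation against the cache.
    cache = model(prompt_ids, use_cache=True).past_key_values
    cached_logits = model(new_ids, past_key_values=cache).logits

assert torch.allclose(no_cache_logits, cached_logits, atol=1e-4, rtol=1e-5)
```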

One problem is that the cached prompt representation will become outdated as OpenAI keeps releasing newer models. But new models seem to be released over the course of months, so the savings should still be pretty significant even if you have to refresh the cache occasionally.

We recently built a semantic cache that partially addresses this by storing OpenAI outputs in a vector DB and retrieving them for semantically similar requests, though it doesn’t cache the model’s internal representations.
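
For illustration, a minimal in-memory sketch of the idea; `embed()` is a hypothetical stand-in for a real embedding model, and a production setup would use a vector database rather than a Python list:

```python
import hashlib
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for an embedding model: a deterministic unit vector per text."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).standard_normal(768)
    return v / np.linalg.norm(v)

# Each entry pairs a request embedding with the LLM output produced for that request.
cache: list[tuple[np.ndarray, str]] = []

def lookup(request: str, threshold: float = 0.9) -> str | None:
    """Return a cached output if a semantically similar request has been seen before."""
    q = embed(request)
    for vec, output in cache:
        if float(q @ vec) >= threshold:  # cosine similarity (vectors are unit length)
            return output
    return None  # cache miss: call the model, then store() the new result

def store(request: str, output: str) -> None:
    cache.append((embed(request), output))
```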

For Q&A and RAG use cases, we already see an average cache hit rate of around 20%.