Caching representations

Is there a way in the API to cache representations of (part of) a prompt, so that we don’t need to run GPT-3 on the entire set of examples for every new input? If not, could someone look into implementing this? I would be happy to implement it myself, e.g., by adding an option to return the network representation of a string and an option to input a representation cache when generating.

This is a super common use case, given the few-shot learning paradigm. Not re-processing the “training” tokens for every inference step would cut costs by 10-20x, and it would also massively reduce the environmental cost of unnecessary computation. This would be INCREDIBLY helpful for such a small change!

3 Likes

I believe the finetune api has the answers you’re looking for.

Thanks, that’s something I considered, but it would be much better to have the cached representations from language modeling with the original weights. Fine-tuning would require ~100x more labeled examples, and it looks like fine-tuning davinci isn’t available.

To be clear, this doesn’t require changing anything about the model; it’s just a small API change for a very common use case!

1 Like

Davinci is too slow and costly to run in production, IMO. Fine-tuning works with as few as 10 examples, and no caching mechanism exists to serve such a use case: the cached tokens would still occupy memory on the GPU, and that cost has to be shared by you.
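
To put a rough number on that last point, here is a back-of-envelope sketch, assuming a GPT-3-175B-scale model with the architecture reported in the GPT-3 paper (96 layers, d_model = 12288) and fp16 keys/values; OpenAI’s actual serving setup is unknown:

```python
# Back-of-envelope: GPU memory a cached prompt would occupy.
# Assumptions (not confirmed by OpenAI): GPT-3-175B-scale model with 96 layers and
# d_model = 12288 (per the GPT-3 paper), keys/values stored in fp16.
n_layers = 96
d_model = 12288
bytes_per_scalar = 2  # fp16
kv_bytes_per_token = 2 * n_layers * d_model * bytes_per_scalar  # keys + values, every layer

prompt_tokens = 1500  # a typical few-shot prompt
print(f"{kv_bytes_per_token / 2**20:.1f} MiB per token")                              # ~4.5 MiB
print(f"{prompt_tokens * kv_bytes_per_token / 2**30:.1f} GiB for the cached prompt")  # ~6.6 GiB
```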

1 Like

Big bump! This would be a simple way for both OpenAI and its API users to save money and compute. Ideally we would pay only for the one-time inference over a prompt/prefix, plus its storage, and could then optionally re-input it for any future API calls. As a result, users wouldn’t be afraid to write big prompts containing, e.g., examples of the task they want the model to solve.

If not, could someone look into implementing this? . . . by adding an option to return the network representation of a string and an option to input a representation cache when generating.

HuggingFace models already have an interface which enables this. As the documentation notes, the keys and values (big nested lists of floats) for each attention block in a transformer constitute the “representation” that is cached and can be re-inputted.
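
Concretely, in the transformers library the cache is exposed through `past_key_values`: run the shared prompt once, keep the returned keys/values, and feed only the new tokens on later calls. A minimal sketch with GPT-2 (the prompt text is just a made-up few-shot example):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Process the shared few-shot prompt exactly once and keep its per-layer keys/values.
prompt = "Review: great movie!\nSentiment: positive\nReview: boring plot.\nSentiment:"
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    cache = model(prompt_ids, use_cache=True).past_key_values

# Each new input only pays for its own tokens; the prompt is never re-processed.
new_ids = tokenizer(" negative\nReview: loved it.\nSentiment:", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(new_ids, past_key_values=cache, use_cache=True)
next_token_logits = out.logits[:, -1, :]  # distribution over the next token
```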

I used this caching for GPT-2 in my classification package and verified that it saves time (and probably quite a bit of compute) in this experiment. According to this testing module, caching and no caching give the same logits (up to 1e-4 absolute tolerance and 1e-5 relative tolerance, which are pretty standard for neural network hidden states).
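
Not the actual testing module, but a self-contained check in the same spirit (with those tolerances) looks roughly like this:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt_ids = tokenizer("Review: great movie!\nSentiment:", return_tensors="pt").input_ids
new_ids = tokenizer(" positive\nReview: dull.\nSentiment:", return_tensors="pt").input_ids

with torch.no_grad():
    # No caching: run prompt + continuation in a single forward pass.
    full_logits = model(torch.cat([prompt_ids, new_ids], dim=1)).logits
    no_cache_logits = full_logits[:, prompt_ids.shape[1]:, :]

    # Caching: run the prompt once, then only the continuation against the cache.
    cache = model(prompt_ids, use_cache=True).past_key_values
    cached_logits = model(new_ids, past_key_values=cache).logits

assert torch.allclose(no_cache_logits, cached_logits, atol=1e-4, rtol=1e-5)
```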

One problem is that the cached prompt representation will become outdated as OpenAI keeps releasing newer models. But new models seem to be released over the course of months, so the savings should still be pretty significant even if you have to refresh the cache occasionally.

We recently built a semantic cache that partially addresses this by storing OpenAI outputs in a vector DB and retrieving them for semantically similar requests, though it doesn’t cache the model’s internal representations.
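
For illustration, a minimal in-memory sketch of the idea; `embed()` is a hypothetical stand-in for a real embedding model, and a production setup would use a vector database rather than a Python list:

```python
import hashlib
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for an embedding model: a deterministic unit vector per text."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).standard_normal(768)
    return v / np.linalg.norm(v)

# Each entry pairs a request embedding with the LLM output produced for that request.
cache: list[tuple[np.ndarray, str]] = []

def lookup(request: str, threshold: float = 0.9) -> str | None:
    """Return a cached output if a semantically similar request has been seen before."""
    q = embed(request)
    for vec, output in cache:
        if float(q @ vec) >= threshold:  # cosine similarity (vectors are unit length)
            return output
    return None  # cache miss: call the model, then store() the new result

def store(request: str, output: str) -> None:
    cache.append((embed(request), output))
```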

For Q&A and RAG use cases, we already see an average cache hit rate of around 20%.