Caching representations

Is there a way in the API to cache representations of (part of) a prompt, so that we don’t need to run GPT-3 on the entire set of examples for every new input? If not, could someone look into implementing this? I would be happy to implement it myself, e.g. by adding an option to return the network representation of a string and an option to pass a representation cache in when generating.
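For concreteness, here is a minimal sketch of the kind of API extension I have in mind. The `return_state` / `state` parameters and the `state_id` field are purely illustrative names, nothing like them exists in the current API:

```python
import os
import requests

API_URL = "https://api.openai.com/v1/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}

FEW_SHOT_PREFIX = (
    "Translate English to French:\n\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
)

# Step 1 (hypothetical): process the shared few-shot prefix once and get back
# a handle to its cached internal representation.
resp = requests.post(API_URL, headers=HEADERS, json={
    "model": "davinci",
    "prompt": FEW_SHOT_PREFIX,
    "max_tokens": 0,
    "return_state": True,           # hypothetical flag, not a real parameter
})
state_id = resp.json()["state_id"]  # hypothetical field

# Step 2 (hypothetical): reuse the cached prefix for every new query, so only
# the new tokens need to be run through the model.
for query in ["cheese =>", "plush giraffe =>"]:
    resp = requests.post(API_URL, headers=HEADERS, json={
        "model": "davinci",
        "state": state_id,          # hypothetical parameter
        "prompt": query,
        "max_tokens": 16,
    })
    print(resp.json()["choices"][0]["text"])
```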

This is a super common use case, given the few-shot learning paradigm. It could cut costs by 10-20x by not re-processing the “training” tokens for every inference step, and it would also massively reduce the environmental cost of unnecessary computation. This would be INCREDIBLY helpful for such a small change!
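To make the magnitude concrete, here is the rough arithmetic behind that 10-20x figure (the token counts are illustrative assumptions, not measurements):

```python
# Illustrative token counts for a few-shot request (assumed, not measured).
few_shot_prompt_tokens = 1900   # fixed "training" examples resent with every call
query_tokens = 100              # the part that actually changes per request

tokens_without_cache = few_shot_prompt_tokens + query_tokens  # 2000 processed today
tokens_with_cache = query_tokens                              # 100 if the prefix were cached

print(tokens_without_cache / tokens_with_cache)  # 20.0x fewer tokens processed
```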

I believe the fine-tune API has the answers you’re looking for.
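For reference, the flow is roughly: prepare a JSONL file of prompt/completion pairs, then launch a fine-tune job on it. The snippet below sketches that with the standard data format; the exact upload/launch command depends on your version of the openai tooling, so treat it as approximate:

```python
import json

# The fine-tune API expects newline-delimited JSON with "prompt"/"completion" pairs.
examples = [
    {"prompt": "sea otter =>", "completion": " loutre de mer"},
    {"prompt": "cheese =>", "completion": " fromage"},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")

# Then launch the job, e.g. with the CLI (command may vary by tool version):
#   openai api fine_tunes.create -t train.jsonl -m curie
```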

Thanks, that’s something I considered, but it would be much better to have the cached representations from language modeling with the original weights. Fine-tuning would require ~100x more labeled examples, and it looks like fine-tuning davinci isn’t available.

To be clear, it doesn’t require changing anything about the model; it’s just a small API change for a very common use case!

Davinci is too slow/costly to run in production imo, fine-tuning works with as few as 10 examples, and no caching mechanism currently exists to serve such a use case. The cached tokens would still occupy memory on the GPU, and that cost would have to be passed on to you.
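To put a rough number on that memory cost: using the architecture figures published in the GPT-3 paper for the largest model (96 layers, d_model = 12288), a back-of-the-envelope estimate of the key/value cache a prompt would pin on the GPU looks like this (the fp16 precision and the prompt length are assumptions):

```python
# Rough estimate of GPU memory needed to keep a prompt's key/value activations
# cached between requests. Layer/width figures are from the GPT-3 paper (175B
# model); fp16 storage and the 1000-token prompt length are assumptions.

n_layers = 96          # transformer layers in the 175B model
d_model = 12288        # hidden dimension
bytes_per_value = 2    # assuming fp16

# Each token stores one key and one value vector of size d_model in every layer.
kv_bytes_per_token = 2 * n_layers * d_model * bytes_per_value

prompt_tokens = 1000   # e.g. a long few-shot prompt
cache_gb = kv_bytes_per_token * prompt_tokens / 1e9

print(f"{kv_bytes_per_token / 1e6:.1f} MB per token")                     # ~4.7 MB
print(f"{cache_gb:.1f} GB held per cached {prompt_tokens}-token prompt")  # ~4.7 GB
```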