Options for caching the same prompt across thousands of requests?

Hi all, what are the options for caching or reusing the same prompt across many thousands of calls, each with a slightly different context, to the models?

I'm doing a categorisation project matching thousands of headlines against a set of categories with keywords. The whole list of categories plus keywords is about 2,500 tokens, but each headline is of course only in the 20-70 token range.

While so many tokens are obviously good for OpenAI's income statement, from both a coding and a general efficiency point of view, this seems like it could be improved upon. How do we store or cache prompts on the LLM end? Chat completions? Assistants? Is there anything there for that? Or has that not been built yet, so we have to send the whole thing every time?

There is no caching of prompts or anything of the sort. API calls are stateless.

That said, if you are looking for cost-efficient ways for categorization tasks, then an embeddings-based approach might be a good solution for you.

To achieve that, you could create embeddings of either a description of each category or of example headlines, with the associated category stored as metadata. To classify a new headline, you would convert it into a vector embedding, perform a similarity search against the prepared embeddings, and assign the category of the closest match to the new headline.
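A minimal sketch of that flow. The category names and the `embed` callable are assumptions for illustration: in practice `embed` would call an embeddings API (for example OpenAI's embeddings endpoint), but it is passed in here so the nearest-neighbour logic stands on its own.

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def build_index(categories, embed):
    # Embed each category description once, up front.
    # categories: {category_name: description_text}
    return {name: embed(desc) for name, desc in categories.items()}

def classify(headline, index, embed):
    # Embed the new headline and return the category whose
    # prepared embedding is most similar to it.
    vec = embed(headline)
    return max(index, key=lambda name: cosine_similarity(index[name], vec))
```

The key cost saving is that the 2,500-token category list is embedded only once in `build_index`; each of the thousands of headlines then costs a single short embedding call plus a cheap local similarity search.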


Hmmm, interesting. An all-embeddings approach or a mix of embeddings and chat completions? Do we know if embeddings are particularly better or worse contextually than the different GPT models?

I don't have line of sight into your data, but technically you should be able to do an all-embeddings approach. If you choose your representative embeddings dataset well, you can obtain very precise results at a small fraction of the cost of a regular GPT model, and with faster processing time.


Thank you, definitely an option to think about.
