Options for caching the same prompt across thousands of requests?

Hi all, what are the options for caching or reusing the same prompt across many thousands of calls, each with a slightly different context, to the models?

I'm doing a categorisation project matching thousands of headlines against a set of categories with keywords. The whole list of categories plus keywords is about 2,500 tokens, but each headline is of course only in the 20-70 token range.

While so many tokens are obviously good for OpenAI's income statement, from both a coding and a general efficiency point of view, this seems like it could be improved upon. How do we store or cache prompts on the LLM end? Chat completions? Assistants? Is there anything there for that? Or has that not been built yet, so we have to send the whole thing every time?

There is no caching of prompts or anything of the sort. API calls are stateless.

That said, if you are looking for cost-efficient ways for categorization tasks, then an embeddings-based approach might be a good solution for you.

To achieve that, you could create embeddings of either a description of each category or of example headlines, with the associated category stored as metadata. To classify a new headline, you would convert it into a vector embedding, perform a similarity search against the prepared embeddings, and assign the category of the closest match to the new headline.
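A minimal sketch of that flow. The category names and the `embed` callable are assumptions for illustration: in practice `embed` would call an embeddings API (for example OpenAI's embeddings endpoint), but it is passed in here so the nearest-neighbour logic stands on its own.

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def build_index(categories, embed):
    # Embed each category description once, up front.
    # categories: {category_name: description_text}
    return {name: embed(desc) for name, desc in categories.items()}

def classify(headline, index, embed):
    # Embed the new headline and return the category whose
    # prepared embedding is most similar to it.
    vec = embed(headline)
    return max(index, key=lambda name: cosine_similarity(index[name], vec))
```

The key cost saving is that the 2,500-token category list is embedded only once in `build_index`; each of the thousands of headlines then costs a single short embedding call plus a cheap local similarity search.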


Hmmm, interesting. An all-embeddings approach or a mix of embeddings and chat completions? Do we know if embeddings are particularly better or worse contextually than the different GPT models?

I don't have line of sight into your data, but technically you should be able to do an all-embeddings approach. If you choose your representative embeddings dataset well, you can obtain very precise results at a small fraction of the cost of a regular GPT model, and with faster processing time.


Thank you, definitely an option to think about.
