The folks at PromptMule.com offer a hosted cache-as-a-service that supports OpenAI API calls, and they currently have a free trial. Check it out.
I’d say it depends on the kind of data you want to cache.
Let’s say you use the model to translate a user’s request into something your backend understands.
Then you don’t want to cache on a per-phrase-to-response basis, and you probably don’t want the cached data to look like this:
“question of the user” : “some command for the backend - #1”
“another question of the user with the same intent” : “the same command for the backend but with #2”
but rather like this:
“question of the user” : “some command for the backend - #1”
“another question of the user with the same intent” : “same command for the backend - #1”
But for other use cases you might just as well store, say, a similarity check based on a keyword-density matrix or, even better, the Jaccard similarity coefficient.
It also depends on whether you want to find something based on keywords, want to rank the candidate responses, and so on. There are so many options and no one-size-fits-all solution, like “hey, why don’t you just connect this vector DB” or “why don’t you use a hybrid of vector DB + Elasticsearch + MongoDB + RDBMS + graph DB”. It really depends on your data.
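To make the Jaccard idea concrete, here is a minimal Python sketch of a cache lookup that treats two questions as the same request when their word sets overlap enough. The function names, the 0.6 threshold, and the example command string are all assumptions for illustration, not anything a particular library prescribes.

```python
import re
from typing import Dict, Optional, Set


def tokens(text: str) -> Set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient |A & B| / |A | B| over the two token sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty inputs are treated as identical
    return len(ta & tb) / len(ta | tb)


def lookup(cache: Dict[str, str], question: str,
           threshold: float = 0.6) -> Optional[str]:
    """Return the cached command of the most similar stored question,
    or None if nothing reaches the (assumed) similarity threshold."""
    best_key = max(cache, key=lambda k: jaccard(k, question), default=None)
    if best_key is not None and jaccard(best_key, question) >= threshold:
        return cache[best_key]
    return None


# Hypothetical cache entry: one phrasing mapped to one backend command.
cache = {"turn on the kitchen lights": "lights.on(room='kitchen')"}

hit = lookup(cache, "please turn on the kitchen lights")   # similar wording
miss = lookup(cache, "what is the weather today")          # unrelated
```

With this setup the rephrased question resolves to the same stored command, while the unrelated one falls through to the model. A linear scan like this is only reasonable for a small cache; the threshold would need tuning against real data.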
https://chat.openai.com/share/4959547c-42c4-421e-878b-6ec345213bcb
Wow, I am really impressed that ChatGPT picked up on my humor there. Many people can’t without an emoji.
Curious to learn more about your use case. How high a throughput is the system you are building?
Thank you for the detail, Jochen. I agree there is nuance to how you build the cache and use it, depending on the use case the app is implementing. My use case is fairly simple: I’m building a demonstration of a copywriting tool that will assist writers with ads, taglines, etc. So I expect there will be some repetition, but that nuance is key to the design. I will consider this as I look closer at it.
Jay, this is a fairly low-throughput system. I do not expect more than about 100 events per minute. If every event carried a full 32k tokens at 12 bytes per token, that would translate to roughly 640 KB/s (about 5 Mbps) per user. (100 epm / 60 ≈ 1.67 eps × 12 bytes per token × 32k tokens ≈ 640 KB/s)
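The estimate is just a product of three assumed figures, so it is easy to re-run with different inputs. A minimal Python sketch, where the function name and reading 32k as 32,000 tokens are assumptions for illustration:

```python
def bytes_per_second(events_per_minute: float,
                     bytes_per_token: float,
                     tokens_per_event: float) -> float:
    """Average payload rate if every event carries tokens_per_event tokens."""
    events_per_second = events_per_minute / 60
    return events_per_second * bytes_per_token * tokens_per_event


# Figures quoted in the post: 100 events/min, 12 bytes/token, 32k tokens/event.
rate = bytes_per_second(100, 12, 32_000)
print(f"{rate / 1000:.0f} kB/s, {rate * 8 / 1e6:.2f} Mbit/s")
```

Since most events will not fill a 32k context, the tokens-per-event input is the one worth varying first.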