We have been using the OpenAI GPT-4 model for a week now to generate structured lesson plans from text data passed as prompt tokens. Every request has to include the same huge prompt, which consumes a large number of tokens each time and quickly burns through our tokens-per-request limit; response times are also over 5 minutes. Is there any way to cache the first prompt's properties, such as the large text data, and tell OpenAI to use that cache when forming the lesson plans? The goal is to shorten request times and lower token consumption so we can retrieve more completion tokens per day.
No, the model is stateless, so all of the information must be passed on each request. There is currently no way around that.
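To make the stateless point concrete, here is a minimal sketch (helper names are illustrative, not from any SDK) of what the client has to do: it keeps the conversation history itself and resends the entire thing with every call, so the payload only ever grows.

```python
# The API has no memory between calls: the client owns the history and
# must resend all of it, plus the system instructions, on every request.

def build_payload(system_prompt, history, new_user_message):
    """Assemble the full `messages` list that must be sent on EVERY call."""
    return (
        [{"role": "system", "content": system_prompt}]
        + history
        + [{"role": "user", "content": new_user_message}]
    )

history = []
payload = build_payload("You write lesson plans.", history, "Plan a unit on fractions.")
# After the model replies, both sides of the turn go back into history,
# so the next payload is strictly larger than this one:
history += [payload[-1], {"role": "assistant", "content": "...lesson plan..."}]
next_payload = build_payload("You write lesson plans.", history, "Now add a quiz.")
assert len(next_payload) == len(payload) + 2  # context only ever grows
```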
Hi @Foxabilo ,
Thanks for the prompt response. Is there any other model that supports our requirement? Kindly guide us on optimising our token usage and response times.
You are on the right path if you are managing your own conversation history.
However, there is no avoiding giving the AI its permanent system instructions on every API call (short of very clever techniques, e.g. several high-quality chat turns also "train" the AI a bit, so you can attempt to substitute shorter instructions).
Likewise, the AI cannot repeatedly answer questions about your data without seeing all of that data again.
Preface: don’t use “assistants” if you want any hope of controlling costs
AI summarization and document preparation may help you reduce the actual token count that persists each call.
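A hedged sketch of that "prepare the document once" idea: summarize the big source text a single time, cache the result, and send only the compact summary with later requests. `summarize` below stands in for one expensive model call; the function and cache names are hypothetical, not part of any API.

```python
# Summarize the large source text once, then reuse the cached summary in
# place of the full document on subsequent requests.

_summary_cache = {}

def get_condensed_context(doc_id, full_text, summarize):
    """Return a cached summary, invoking `summarize` only on the first call."""
    if doc_id not in _summary_cache:
        _summary_cache[doc_id] = summarize(full_text)  # one-time token cost
    return _summary_cache[doc_id]

calls = []
def fake_summarize(text):
    calls.append(text)
    return text[:20]  # stand-in for a model-written summary

s1 = get_condensed_context("unit-1", "A very long topic document ...", fake_summarize)
s2 = get_condensed_context("unit-1", "A very long topic document ...", fake_summarize)
assert s1 == s2 and len(calls) == 1  # summarized once, reused afterwards
```

The savings come from every later prompt carrying `len(summary)` tokens instead of the full document, at the cost of one extra summarization call up front.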
Then, if you “chat”:
First: count the tokens of each message and response. Adding this metadata is extremely useful for managing and calculating costs.
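One concrete way to do that counting: Chat Completions responses include a `usage` object with `prompt_tokens`, `completion_tokens`, and `total_tokens`. A small ledger that accumulates those per conversation makes cost tracking straightforward (the usage dicts below are hand-made examples, not captured API output).

```python
# Accumulate the `usage` figures returned with each API response so you
# can see exactly where the tokens are going per conversation.

def record_usage(ledger, usage):
    """Add one response's token counts into a running per-conversation total."""
    for key in ("prompt_tokens", "completion_tokens", "total_tokens"):
        ledger[key] = ledger.get(key, 0) + usage[key]
    return ledger

ledger = {}
record_usage(ledger, {"prompt_tokens": 4200, "completion_tokens": 350, "total_tokens": 4550})
record_usage(ledger, {"prompt_tokens": 4600, "completion_tokens": 400, "total_tokens": 5000})
assert ledger["total_tokens"] == 9550
```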
Then consider how many past turns are actually necessary for the application.
If you need extended chat to sustain the illusion of long memory, an occasional summary by a second AI, or even a semantic database for retrieving old chat turns, can be employed.
You can reduce either the number of user inputs or the number of assistant outputs. For chat, what the user has been typing might be more necessary for understanding the topic than long AI responses. For coding, this may be reversed.
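The trimming ideas above can be sketched in a few lines, assuming the usual messages-list format: always keep the system instructions, then only the most recent turns. (This simple window treats user and assistant messages alike; the asymmetric variant just described would filter by role instead.)

```python
# Drop old turns before each request, but never drop the system message.

def trim_history(messages, max_recent=6):
    """Keep the system message plus the last `max_recent` non-system messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_recent:]

msgs = [{"role": "system", "content": "You write lesson plans."}]
for i in range(10):
    msgs.append({"role": "user", "content": f"question {i}"})
    msgs.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(msgs, max_recent=4)
assert len(trimmed) == 5                      # 1 system + 4 recent messages
assert trimmed[0]["role"] == "system"
assert trimmed[-1]["content"] == "answer 9"   # newest turns survive
```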
The transmission of data and the loading of input don't take that much time: in streaming mode an AI can start responding in under 2 seconds. However, processing that input to prepare the AI for answering also consumes computation resources on every call; you don't get an AI model assigned exclusively to you.
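For reference, consuming a streamed response (`stream=True`) looks roughly like this: each streamed chunk carries a `delta` that may or may not contain a piece of `content`, and concatenating the pieces yields the full reply while the first tokens are already displayable. The chunk dicts below are hand-made examples in the Chat Completions streaming shape, not captured API output.

```python
# Assemble a streamed reply from its delta chunks; in a real client you
# would display each piece as it arrives instead of waiting for the end.

def collect_stream(chunks):
    parts = []
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            parts.append(delta["content"])  # or print(piece, end="") for live output
    return "".join(parts)

fake_chunks = [
    {"choices": [{"delta": {"role": "assistant"}}]},          # first chunk: role only
    {"choices": [{"delta": {"content": "Lesson "}}]},
    {"choices": [{"delta": {"content": "plan..."}}]},
    {"choices": [{"delta": {}, "finish_reason": "stop"}]},    # final chunk: empty delta
]
assert collect_stream(fake_chunks) == "Lesson plan..."
```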
Currently there are no large language models with zero cost memory as you have described. They all require the full set of instructions to be included every time.
I am curious: how are you using the API to create the lesson plans? I would expect you just feed it the topic data plus a specialized system prompt, and it generates the lesson plan without needing any additional context, so each lesson plan is one API call. I am guessing the topic data is what is consuming the most tokens. But is it?
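The single-call pattern I have in mind would be shaped something like this (function and prompt wording are illustrative, not from the original poster's code): one system message with the standing instructions, one user message carrying the topic data, no conversation history at all.

```python
# One self-contained request per lesson plan: standing instructions in the
# system message, the (large) topic data in the user message.

def make_lesson_plan_request(topic_data):
    return {
        "model": "gpt-4",
        "messages": [
            {"role": "system",
             "content": "You generate structured lesson plans from topic data."},
            {"role": "user", "content": topic_data},
        ],
    }

req = make_lesson_plan_request("Grade 5 science: the water cycle ...")
assert req["messages"][0]["role"] == "system"
assert "water cycle" in req["messages"][1]["content"]
# With a stateless API, the topic data dominates the prompt-token count
# and must be included in every such request.
```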
Hello all, sorry for the late response,
In my own tests, a subsequent request has indeed returned a relevant response the second time without the full content being sent. However, this is not yet assured, so I cannot give you a confirmed answer. Thanks, however, for your responses.
@supershaneski the topic information is what is consuming the tokens, and it needs to be sent with every request. But as I said above, subsequent requests are still returning relevant responses without the topic content attached to the request tokens. Lately, though, I have been getting responses in the 2-5 minute range depending on the context of the request rather than its content. I am opening another thread for that topic, however.