Saving API cost in back-and-forth conversational chatbot



I have built a virtual assistant chatbot for casual conversation. Every time the user utters a statement, it is appended to the prompt and the GPT-3 API is asked to provide the bot's response.

However, as the conversation progresses, the cost of generating each response grows linearly with the history, so the total cost of an n-turn conversation grows as O(n^2). This method is not viable for longer conversations, say 5 minutes. I could shorten the conversation by summarizing the history, but I am wondering if there is a native way to do this. With every utterance I am passing the same prompt GPT-3 has already seen; it would be more efficient to save that state and continue by sending only the new text.
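To make the cost growth concrete, here is a minimal sketch of the arithmetic, assuming a hypothetical flat average of 50 tokens per turn (real token counts vary per utterance):

```python
# Sketch of why naively resending the full history costs O(n^2) tokens in total.
TOKENS_PER_TURN = 50  # assumed average per user/bot turn; purely illustrative

def tokens_for_turn(n: int) -> int:
    """Tokens sent on turn n: the entire history plus the new utterance."""
    return n * TOKENS_PER_TURN

def total_tokens(turns: int) -> int:
    """Cumulative tokens billed over the whole conversation."""
    return sum(tokens_for_turn(n) for n in range(1, turns + 1))

# Each turn's cost grows linearly, but the running total grows quadratically:
# total_tokens(turns) == TOKENS_PER_TURN * turns * (turns + 1) / 2
print(tokens_for_turn(10))  # 500 tokens for turn 10 alone
print(total_tokens(10))     # 2750 tokens billed across all 10 turns
```

So a 10-turn chat is billed roughly five times the tokens of its final turn, and the ratio keeps growing with conversation length.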

Welcome to the community!

I’m no dev, but I do agree that, since embeddings are so much more efficient, somehow caching and embedding the history might be a way to save resources?

Also, you might get better responses to posts like this in the #feedback category

From my experience thus far, creating a chatbot via the API with fine-tuning and embeddings is a bit of a hill to climb. For a start, you only have, at best, the Davinci base model to work with; you can’t fine-tune on top of the 003 model (which is the base model fine-tuned with gahoomabytes of info).

For my application I am in the same boat. The best approach I’ve found is to summarise the conversation on my side rather than repeat it verbatim. A lengthy thread where John wants to know the price of milk, I summarise into a one-shot prompt. You can achieve this either by coding it yourself or by making a side call to the AI, asking it to summarise the chat in n sentences. Then pass the summary as a one-shot prompt. Is it perfect? No. Does it save on tokens? Yes.
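The "side call to summarise" step could be sketched roughly like this. This is only an illustration, not the poster's actual code: it uses the legacy `openai.Completion.create` completions endpoint, and the model name, token limits, and prompt wording are all my assumptions.

```python
def build_summary_prompt(history: str, n_sentences: int = 3) -> str:
    """Build the prompt for the side summarisation call (wording is illustrative)."""
    return (
        f"Summarise the following chat in {n_sentences} sentences, "
        f"keeping names and key facts:\n\n{history}\n\nSummary:"
    )

def summarize(history: str) -> str:
    """Ask the model to compress the running history into a short summary."""
    import openai  # imported lazily so the sketch loads without the package installed
    resp = openai.Completion.create(
        model="text-davinci-003",   # assumed model choice
        prompt=build_summary_prompt(history),
        max_tokens=150,
        temperature=0,
    )
    return resp["choices"][0]["text"].strip()

def next_bot_reply(summary: str, new_user_message: str) -> str:
    """One-shot prompt: the summary stands in for the verbatim history."""
    import openai
    prompt = (
        f"Conversation so far (summary): {summary}\n"
        f"User: {new_user_message}\nBot:"
    )
    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=150,
    )
    return resp["choices"][0]["text"].strip()
```

The token saving comes from the one-shot prompt being a fixed, short summary instead of the ever-growing transcript; the trade-off is that details the summary drops are lost to the bot.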

Good luck

Are you adding the summary in the system message? If not, how are you providing it and how are you finding that working in terms of the AI following the conversation?