Efficient stateful completion chatbot

In order to build a chatbot, the only recommended approach (as far as I can tell) is to feed the entire conversation history into the Completion API prompt.

However, this means for a chat with N tokens, I will be submitting ~N^2 tokens over the life of the chat as the history gets duplicated in each request (though this is limited by the size of the context window).

Is there any way to avoid this issue, e.g. by having stateful conversations?


To estimate the worst-case cost, take a hypothetical long-lived chat of 1M tokens using Davinci with the full 4000-token context window, 10 tokens per message, at $0.02/1K tokens. Instead of costing

(1M tokens) * ($0.02 / 1000 tokens) = $20

It would cost

(100K requests) * (4K tokens/request) * ($0.02 / 1K tokens) = $8000

Which is a 400x increase in cost.
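The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the same worst-case estimate: the function name `chat_cost` and its parameters are made up for illustration, and it assumes every request is billed for a completely full context window.

```python
def chat_cost(total_tokens, tokens_per_message, context_window, price_per_1k):
    """Return (ideal_cost, worst_case_cost) in dollars.

    ideal: every token is submitted exactly once.
    worst case: every message triggers a request with a full context window.
    """
    ideal = total_tokens * price_per_1k / 1000
    requests = total_tokens // tokens_per_message
    worst = requests * context_window * price_per_1k / 1000
    return ideal, worst


# Numbers from the post: 1M tokens, 10 tokens per message,
# 4000-token window, $0.02 per 1K tokens.
ideal, worst = chat_cost(1_000_000, 10, 4000, 0.02)
print(f"ideal ${ideal:.2f}, worst case ${worst:.2f}, ratio {worst / ideal:.0f}x")
```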

You don’t need to put the whole context in the conversation each time a request/message is sent. You can allow the user to toggle it or control it yourself.

I call it the remembrance level: it sets how many chat pairs (a user message plus the bot's reply) to keep in memory. Somewhere in the 3-10 range is a good number. Use the FIFO principle to decide which pairs to drop.

Thanks @Xerxes - but isn’t that just tossing out the rest of the chat history? To use your terminology, I’m basically asking about the use case where you want to maximize remembrance level, so remembrance level ~= model context window.

To have a full conversation, where GPT-3 follows the context and doesn’t miss details, you can go for a remembrance level of 8-16.

At a remembrance level of 16, that means the context holds up to 32 messages, which is a lot. Now you obviously don’t want the bill you mentioned above. You also need to keep the 4000-token limit in mind.

Ideally, keep it at 10. Every time the user sends a new message, flush the oldest message pair and add the new one in its place, hence FIFO.
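A minimal sketch of the FIFO idea, assuming the history is stored as (user message, bot reply) pairs. The class name `ChatMemory`, the `User:`/`Bot:` prompt layout, and the method names are all illustrative choices, not anything prescribed by the API.

```python
from collections import deque


class ChatMemory:
    """Keeps only the most recent `remembrance_level` message pairs."""

    def __init__(self, remembrance_level=10):
        # A deque with maxlen drops the oldest pair automatically (FIFO).
        self.pairs = deque(maxlen=remembrance_level)

    def add_pair(self, user_msg, bot_msg):
        self.pairs.append((user_msg, bot_msg))

    def build_prompt(self, new_user_msg):
        # Remembered pairs first, then the new message awaiting a reply.
        lines = []
        for user_msg, bot_msg in self.pairs:
            lines.append(f"User: {user_msg}")
            lines.append(f"Bot: {bot_msg}")
        lines.append(f"User: {new_user_msg}")
        lines.append("Bot:")
        return "\n".join(lines)


memory = ChatMemory(remembrance_level=2)
memory.add_pair("Hi", "Hello!")
memory.add_pair("What's your name?", "I'm a bot.")
memory.add_pair("Nice to meet you", "Likewise.")  # pushes out the first pair
print(memory.build_prompt("What did I ask you first?"))
```

Making `remembrance_level` a constructor argument is what lets it be toggled per user.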

You never want it to behave like the playground, where the conversation simply maxes out the token limit. If you don’t want to use the R level, then go by the number of tokens in the chat context instead: when it gets near the 4000-token limit, flush the chats.
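A sketch of the token-based alternative, under a couple of assumptions: `estimate_tokens` is a crude stand-in (about four characters per token) for a real tokenizer, and instead of flushing everything at once this variant drops the oldest messages until the history fits. The `reserve` parameter, which leaves room for the model's reply, is also an illustrative addition.

```python
def estimate_tokens(text):
    # Rough assumption: ~4 characters per token. Swap in a real tokenizer
    # if you need accurate counts.
    return max(1, len(text) // 4)


def trim_to_budget(messages, budget=4000, reserve=500, count=estimate_tokens):
    """Drop the oldest messages until the rest fit in budget - reserve tokens."""
    kept = list(messages)
    total = sum(count(m) for m in kept)
    while kept and total > budget - reserve:
        total -= count(kept.pop(0))  # oldest message goes first
    return kept


history = ["x" * 400 for _ in range(5)]  # five messages, ~100 tokens each
print(len(trim_to_budget(history, budget=350, reserve=50)))
```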

But going by tokens like that would cost a lot more. I’ve built a chatbot as a service that is used by a lot of people, and using the R level solved this problem. Moreover, the user can toggle it, so there’s flexibility.

I haven’t done this with chat bots necessarily, but I’ve run into this problem with some production apps I’ve built. A solution that works pretty well is to keep a running “summary” of that chat history in addition to the last N messages.

The summary essentially just keeps key pieces of information around, since most of the chat is just conversational and not worth keeping anyways and a waste of tokens.

If you want to go really fancy, only feed back the last few messages, but do some sort of cheaper search of all the message history to see which messages might be relevant to the current one.
For instance you could feed in the most recent three messages in addition to the three past messages with the highest cosine similarity to the current one.
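Here is a rough sketch of that selection step. To stay self-contained it uses a toy bag-of-words vector in place of real embeddings, so `embed` is purely a placeholder assumption; in practice you would substitute whatever embedding model you have. `select_context` and its parameters are likewise hypothetical names.

```python
import math
from collections import Counter


def embed(text):
    # Placeholder "embedding": word counts. Replace with a real embedding model.
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_context(history, current, n_recent=3, n_similar=3):
    """Return the most similar older messages plus the most recent ones."""
    recent_start = max(0, len(history) - n_recent)
    older = list(enumerate(history[:recent_start]))
    query = embed(current)
    ranked = sorted(older, key=lambda p: cosine(embed(p[1]), query), reverse=True)
    top = sorted(ranked[:n_similar])  # put the picks back in chat order
    return [msg for _, msg in top] + history[recent_start:]


history = [
    "my dog is named Rex",
    "the weather is nice",
    "I like pizza",
    "hello",
    "how are you",
    "fine thanks",
]
print(select_context(history, "what is my dog named", n_recent=3, n_similar=1))
```

Only the messages older than the recent window are searched, so nothing is included twice.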

How do I get that summary? Via GPT-3, probably?
If so, what prompts should I use to get the details that are worth keeping?

After, say, every 5 exchanges, you submit a prompt saying “summarize this chat with the most relevant information: {the chat}” or something similar. Then just keep rolling those summaries into each other as the chat goes on. If the chat gets too long, you can make it summarize the last x summaries, etc. You just have to make it keep identifying the critical parts of the chat so that only the relevant information stays alive.
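The rolling-summary loop might look something like the sketch below. It assumes you pass in your own `summarize` function that sends a prompt to the model and returns the completion text; the class name `RollingSummary`, the `every` parameter, and the prompt layout are illustrative, and the stub summarizer at the bottom exists only so the example runs without an API call.

```python
class RollingSummary:
    """Folds every `every` exchanges into a single running summary."""

    def __init__(self, summarize, every=5):
        self.summarize = summarize  # callable: prompt text -> summary text
        self.every = every
        self.summary = ""
        self.pending = []  # exchanges not yet folded into the summary

    def add_exchange(self, user_msg, bot_msg):
        self.pending.append(f"User: {user_msg}\nBot: {bot_msg}")
        if len(self.pending) >= self.every:
            # Include the previous summary so earlier details roll forward.
            parts = [f"Summary so far: {self.summary}"] if self.summary else []
            chat = "\n".join(parts + self.pending)
            prompt = f"Summarize this chat with the most relevant information: {chat}"
            self.summary = self.summarize(prompt)
            self.pending = []

    def context(self):
        # What gets prepended to the next request: summary + recent exchanges.
        parts = []
        if self.summary:
            parts.append(f"Summary of earlier chat: {self.summary}")
        parts.extend(self.pending)
        return "\n".join(parts)


def stub_summarize(prompt):
    # Stand-in for a real model call.
    return f"(summary of {len(prompt)} characters of chat)"


chat = RollingSummary(stub_summarize, every=2)
chat.add_exchange("My name is Sam", "Nice to meet you, Sam.")
chat.add_exchange("I live in Oslo", "Oslo sounds lovely.")
chat.add_exchange("What's the weather like?", "I can't check, sorry.")
print(chat.context())
```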