There are some considerations. A) Include the entire conversation up to the token limit. This can be expensive, and needlessly so as most people will not need the language model to refer this far back in the chat history. B) Include only the last X messages in the history. Even this can be further optimized as most of this history won’t need to be referenced either.
What schema or trade offs have you come up with for balancing the amount of conversation history to send the ChatCompletions.create endpoint? Any ideas for shortening the GPT-side of the conversation?
At first thought, I think embeddings can help?
make a summary of the conversation every X messages, include the summary in the prompt
Wow that’s smart @nunodonato! I store each message up to a character count. If by adding the new message, the total characters are exceeded, then I remove the last message until we are back below the character count.
yeah you can do that, but you may be losing important details of the conversation. It really depends on the purpose of the conversation though.
also, keeping the prompt at its max size sounds like a good idea to keep paying as much as possible
As messages come in, convert and store their embeddings. Then for each prompt, give the most recent n characters of messages, and also embed current input and compare with old embeddings to get most similar. Fill remaining prompt space with associated stored text.
Can someone expand this idea please, step by step - why exactly are embeddings helpful here?
in Infinite Memory Chatgpt - a Hugging Face Space by differentai
i store messages as embeddings and build the history message according to similar previous messages plus 3 last messages
a production scenario would count tokens to optimally fit the prompt to the maximum size
Curious how did the various methods mentioned about turn out to everyone? We are trying to solve this too