There are a few options to consider:

A) Include the entire conversation up to the token limit. This can be needlessly expensive, since the model rarely needs to refer back that far in the chat history.

B) Include only the last X messages in the history. Even this can be optimized further, since most of that history won't need to be referenced either.
What schemes or trade-offs have you come up with for balancing the amount of conversation history to send to the ChatCompletions.create endpoint? Any ideas for shortening the GPT side of the conversation?
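For concreteness, here's a minimal sketch of option B (a sliding window over the last N messages), assuming the standard chat message dict format; `MAX_MESSAGES` and `build_messages` are illustrative names, not anything from the API:

```python
MAX_MESSAGES = 20  # illustrative window size, not a recommendation

def build_messages(system_prompt, history, new_user_message):
    """Send the system prompt plus a sliding window of recent messages."""
    window = history[-MAX_MESSAGES:]  # drop everything older than the window
    return (
        [{"role": "system", "content": system_prompt}]
        + window
        + [{"role": "user", "content": new_user_message}]
    )
```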
Wow, that's smart @nunodonato! I store each message up to a total character count. If adding the new message would exceed that total, I remove the oldest message until the history is back under the limit.
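Roughly what that looks like in Python; `MAX_CHARS` and `trim_history` are hypothetical names, and the budget value is just an illustration:

```python
MAX_CHARS = 8000  # illustrative budget; tune to your model's context size

def trim_history(history, new_message):
    """Append the new message, then drop oldest messages until under budget."""
    history = history + [new_message]
    total = sum(len(m["content"]) for m in history)
    while total > MAX_CHARS and len(history) > 1:
        removed = history.pop(0)  # remove the oldest message first
        total -= len(removed["content"])
    return history
```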
As messages come in, compute and store their embeddings. Then for each prompt, include the most recent n characters of messages, embed the current input, and compare it against the stored embeddings to find the most similar older messages. Fill the remaining prompt space with their associated text.
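A rough sketch of that flow using the openai Python client and numpy for cosine similarity; `select_context`, `recent_n`, and `top_k` are illustrative names, and the embedding model is just one possible choice:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text):
    """Embed one string; the model here is an example, swap in your own."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_context(stored, query, recent_n=5, top_k=3):
    """stored: list of (text, embedding) tuples in chronological order."""
    recent = [text for text, _ in stored[-recent_n:]]
    older = stored[:-recent_n]
    q = embed(query)
    # rank older messages by similarity to the current input
    ranked = sorted(older, key=lambda item: cosine(q, item[1]), reverse=True)
    similar = [text for text, _ in ranked[:top_k]]
    return similar + recent  # retrieved context first, then recent turns
```

This keeps the recent turns verbatim for conversational continuity while pulling in only the older messages that are actually relevant to the current input.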