Implement conversation summary buffer

Our product currently uses the pruning method, which simply trims the oldest messages, the same way the Assistants API does. I'm not sure which method ChatGPT uses.
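For context, the pruning approach we use looks roughly like this: drop the oldest messages until the estimated token count fits the budget. This is just an illustrative sketch; the 4-characters-per-token heuristic and the function names are my own assumptions, and a real implementation would use a proper tokenizer.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # A real system would use the model's tokenizer instead.
    return max(1, len(text) // 4)

def prune(messages: list[dict], budget: int) -> list[dict]:
    """Keep the most recent messages whose total estimated tokens fit the budget."""
    kept: list[dict] = []
    total = 0
    # Walk backwards from the newest message, stopping once the budget is hit.
    for msg in reversed(messages):
        cost = estimate_tokens(msg["content"])
        if total + cost > budget:
            break
        kept.append(msg)
        total += cost
    return list(reversed(kept))  # restore chronological order
```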

The problem is that summarization is performed dynamically to keep the token count within the limit. Do you have any ideas about how to design the database schema, or what other technologies to use, to persist the conversation history while supporting efficient writes and reads? For example, storing the whole conversation history in MySQL or Postgres as JSON involves marshalling and unmarshalling; will that be a performance concern in a high-concurrency environment?
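One common alternative to a single JSON blob is one row per message, so appends become cheap INSERTs and reads become an indexed range scan, with no re-marshalling of the whole history on every turn. Here is a minimal sketch of that idea; the table and column names are illustrative, and SQLite stands in for MySQL/Postgres:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE messages (
        id           INTEGER PRIMARY KEY AUTOINCREMENT,
        conversation TEXT NOT NULL,
        role         TEXT NOT NULL,
        content      TEXT NOT NULL,
        created_at   TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")
# Composite index so loading one conversation is an index range scan.
conn.execute("CREATE INDEX idx_conv ON messages (conversation, id)")

def append(conversation: str, role: str, content: str) -> None:
    # Appending a turn touches only one small row, not the whole history.
    conn.execute(
        "INSERT INTO messages (conversation, role, content) VALUES (?, ?, ?)",
        (conversation, role, content),
    )

def load(conversation: str, limit: int = 50) -> list[tuple[str, str]]:
    # Fetch the most recent `limit` messages, returned oldest-first.
    rows = conn.execute(
        "SELECT role, content FROM messages WHERE conversation = ? "
        "ORDER BY id DESC LIMIT ?",
        (conversation, limit),
    ).fetchall()
    return rows[::-1]
```

A summary-buffer variant could add a `summaries` table keyed by conversation, rewritten only when old rows get compacted, so the hot path stays append-only.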

Hey there!

So, vector databases do quite well in these scenarios.

Something I picked up along my own journey when working with high concurrency is simply this: optimize for read-only actions while significantly reducing the number of write actions.

Oftentimes, writing to a DB can easily end up being a FIFO situation per task, which can create bottlenecks when you're frequently writing to the database. Read actions, though, are commonly intended to be high-concurrency actions. Meaning, it's fine when all different kinds of functions want to read from the database at the same time, but it's not fine when a bunch of functions are trying to write to it at the same time.
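One way to act on that advice is to buffer appends in memory and flush them in one batched transaction, so many small INSERTs collapse into a single write. A rough sketch, assuming a simple `messages` table (all names here are illustrative, and SQLite stands in for the real DB):

```python
import sqlite3
from threading import Lock

class BufferedWriter:
    """Collects message appends and writes them in one batched transaction."""

    def __init__(self, conn: sqlite3.Connection, flush_every: int = 10):
        self.conn = conn
        self.flush_every = flush_every
        self.buffer: list[tuple[str, str, str]] = []
        self.lock = Lock()  # protects the buffer under concurrent appends

    def append(self, conversation: str, role: str, content: str) -> None:
        with self.lock:
            self.buffer.append((conversation, role, content))
            if len(self.buffer) >= self.flush_every:
                self._flush_locked()

    def flush(self) -> None:
        with self.lock:
            self._flush_locked()

    def _flush_locked(self) -> None:
        if not self.buffer:
            return
        with self.conn:  # one transaction for the whole batch
            self.conn.executemany(
                "INSERT INTO messages (conversation, role, content) "
                "VALUES (?, ?, ?)",
                self.buffer,
            )
        self.buffer.clear()

# Example setup:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (conversation TEXT, role TEXT, content TEXT)")
writer = BufferedWriter(conn, flush_every=2)
```

The trade-off is a small durability window: buffered messages are lost on a crash before flush, so you'd tune `flush_every` (or add a timer) to taste.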


Hey, thanks for the reply. Are you suggesting that a solution is to use a vector DB to store the pruned conversation history and retrieve only the relevant pieces to insert into the prompt? In other words, we wouldn't need a summary of the whole pruned conversation history in the prompt.


Correct! That is the major benefit of RAG: you can "prune" anything down to whatever chunks you want and embed them, so you can retrieve the relevant chunks as represented by their embeddings.
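The retrieval step described above can be sketched like this: embed the pruned chunks, then pull back only the ones most similar to the current query instead of putting a summary of everything into the prompt. Note the bag-of-words "embedding" here is a toy stand-in for a real embedding model, and the linear scan stands in for a vector DB's index:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in embedding: bag-of-words term counts. A real system would call
    # an embedding model and store dense vectors in a vector DB.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Return the k chunks most similar to the query; a vector DB would
    # replace this linear scan with an approximate-nearest-neighbor lookup.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]
```

Only the retrieved chunks go into the prompt, which keeps the token count bounded regardless of how long the full history grows.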
