Hello everyone,
I’m currently working on building a conversation bot and would love your insights on how to manage context effectively. As we all know, LLMs have a fixed context limit, and simply dumping all past conversations into each request isn’t sustainable, since it will eventually exceed the token limit. From my understanding, when the limit is reached, OpenAI’s API removes messages starting from the top so that recent exchanges are preserved, but that isn’t always ideal.
As a beginner, I’m exploring different ways to handle context more efficiently and would greatly appreciate guidance on this topic. Here’s the approach I’ve come up with so far:
- Initial Approach – Dumping Everything:
Start by sending all past conversations with each request. This works fine at first but will obviously hit the token limit as the conversation grows.
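To make sure I'm describing this clearly, here's a minimal sketch of what I mean by the naive approach (the model name and helper are just illustrative; the commented-out call is roughly what I do with the OpenAI SDK):

```python
# Naive approach: keep every message and resend the full history each turn.
history = [{"role": "system", "content": "You are a helpful assistant."}]

def build_request(history, user_text):
    """Append the new user turn and return the full message list to send."""
    history.append({"role": "user", "content": user_text})
    return list(history)

# The actual API call would look something like:
# from openai import OpenAI
# client = OpenAI()
# reply = client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=build_request(history, "Hi!"),
# )

messages = build_request(history, "Hi!")
```

The problem is obvious: `history` only ever grows, so the request size grows every turn.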
- Threshold-Based Summarization:
When the token count approaches the limit, summarize the entire conversation history and send this shortened context along with new messages. Repeat this every time the threshold is reached, applying the same logic to the updated history.
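Here's roughly how I picture that step. The ~4-characters-per-token estimate is just a rough assumption (in practice I'd probably use tiktoken), and `summarize` is a stand-in for an actual LLM summarization call:

```python
# Threshold-based summarization sketch.
def estimate_tokens(messages):
    # Rough heuristic: ~4 characters per token (an assumption, not exact).
    return sum(len(m["content"]) for m in messages) // 4

def maybe_compact(history, threshold, summarize):
    """If the history nears the limit, replace it with a single summary turn."""
    if estimate_tokens(history) < threshold:
        return history
    summary = summarize(history)
    # Keep the summary as a system message so the model treats it as context.
    return [{"role": "system", "content": f"Conversation so far: {summary}"}]

# Toy summarizer for illustration; the real one would call the API.
fake_summarize = lambda msgs: "user greeted the bot"
history = [{"role": "user", "content": "x" * 400}]
compacted = maybe_compact(history, threshold=50, summarize=fake_summarize)
```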
- Using a Secondary LLM for Refinement:
When the conversation becomes too long for simple condensing to fit within the limit, pass the entire context to a separate instance of GPT-4o-mini. This instance refines the context by identifying and removing relatively unimportant parts and producing a more concise summary, which is then sent back and used for further interactions.
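Something like this is what I have in mind for handing the history to the second model (the prompt wording and helper names are my own; only the use of GPT-4o-mini is fixed in my plan):

```python
# Sketch of building the request for a secondary "refiner" model.
REFINE_PROMPT = (
    "Rewrite the following conversation history as concisely as possible. "
    "Drop exchanges that are unlikely to matter for future turns."
)

def build_refine_request(history):
    # Flatten the history into a plain transcript for the refiner to read.
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in history)
    return [
        {"role": "system", "content": REFINE_PROMPT},
        {"role": "user", "content": transcript},
    ]

# The refined context would then come back from something like:
# refined = client.chat.completions.create(
#     model="gpt-4o-mini", messages=build_refine_request(history)
# ).choices[0].message.content

request = build_refine_request([{"role": "user", "content": "hello"}])
```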
- Ongoing Refinement as Needed:
If even this refinement is no longer sufficient, instruct the GPT-4o-mini instance to drop messages from the top of the conversation history, or simply let OpenAI truncate it.
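If I did the truncation myself rather than leaving it to OpenAI, I imagine it would look like this (again using the rough chars/4 token estimate as an assumption, and keeping the system prompt intact):

```python
# Last-resort truncation sketch: drop the oldest non-system messages until
# the estimated size fits the budget.
def truncate_from_top(history, budget_tokens):
    kept = list(history)
    while kept and sum(len(m["content"]) for m in kept) // 4 > budget_tokens:
        # Preserve a leading system prompt if present; drop the next oldest.
        idx = 1 if kept[0]["role"] == "system" else 0
        if idx >= len(kept):
            break
        kept.pop(idx)
    return kept

history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "a" * 200},
    {"role": "user", "content": "latest question"},
]
trimmed = truncate_from_top(history, budget_tokens=20)
```

This keeps the most recent exchanges, which matches what I believe OpenAI's own truncation does, but at least it is under my control.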
I realize there are probably more refined strategies or tools to handle context better, especially with complex conversational requirements. I’d love to hear about other methods, tools, or frameworks you use to manage this challenge effectively. As someone new to this field, I’m open to all suggestions.
Thank you in advance for sharing your expertise!
Disclaimer: This post was edited by Generative model for clarity and grammar.
Regards
Prabhas Kumar