Exceeding token limit while maintaining context

I am using OpenAI’s API to run GPT-3.5 as a chatbot via Python. Because I am making API calls, I understand that to maintain context I need to pass the history of interactions with each request.

I do that by declaring an empty list.

prompt_messages = []

I pass in a system message, as well as prompt templates in the beginning. This is what it looks like:

from time import sleep
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# Function to call the OpenAI API
def process_gpt(file_prompt, system_msg):

    # Creates a list of dicts, declaring system message and file prompt
    prompt_messages.append({"role": "system", "content": system_msg})
    prompt_messages.append({"role": "user", "content": file_prompt})

    # Sending the list of dicts to chatgpt
    completion = client.chat.completions.create(model="gpt-3.5-turbo", messages=prompt_messages)

    # Getting the result
    nlp_results = completion.choices[0].message.content
    nlp_dict = {"role": "assistant", "content": nlp_results} # Turning the result into a dict

    # appending the result to the list of dicts
    prompt_messages.append(nlp_dict)

    sleep(8)
    return nlp_results

And to interact with the chatbot, I use this function, which appends to the list at each input and resubmits the entire list:

def bot(prompt):
  prompt_entry = {"role": "user", "content": prompt} # Turning our question into a dict
  prompt_messages.append(prompt_entry) # appending the question to our list of dicts

  # sending to chatgpt
  completion = client.chat.completions.create(model="gpt-3.5-turbo", messages=prompt_messages)

  # getting the answer from the model
  nlp_results = completion.choices[0].message.content
  print(nlp_results)

  # converting the answer to a dict
  nlp_dict = {"role": "assistant", "content": nlp_results}
  prompt_messages.append(nlp_dict) # appending to list of dicts

The problem here is that, as the chat continues, this grows larger and larger and at some point, I simply exceed the token limit.

This is obviously not how ChatGPT does it. I feel like my approach is highly inefficient but where am I going wrong?


Did you ever figure this out? I would like to know too.

An empty list was declared:

prompt_messages = []

Then, when invoking the function process_gpt() to send to the API initially (perhaps to make it say hello automatically), two list items are added, the equivalent of:

prompt_messages = [
  {"role": "system", "content": system_msg},
  {"role": "user", "content": file_prompt}
]

and then another message is added to the list, with the AI response, giving the list a total of three entries:

prompt_messages = [
  {"role": "system", "content": system_msg},
  {"role": "user", "content": file_prompt},
  {"role": "assistant", "content": nlp_results}
]

Meaning that we can use list indexes to access them, showing in shell:

>>> prompt_messages[0]
{"role": "system", "content": system_msg}

After employing a different function, bot(), to send in the same manner, another user message and AI result are added:

prompt_messages = [
  {"role": "system", "content": system_msg},
  {"role": "user", "content": file_prompt},
  {"role": "assistant", "content": nlp_results}
  {"role": "user", "content": prompt},
  {"role": "assistant", "content": nlp_results}
]

and so on.

That’s dandy and all, but what if we want to cut off some old messages? Then the system message disappears too, unless we first remove or copy it, or do some extra bookkeeping. Or what if a user input failed to get a result? The function has already appended it as though it happened.
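
A minimal illustration of that pitfall, with made-up message contents: a plain tail slice of the combined list silently drops the system message once the history grows.

prompt_messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "first question"},
    {"role": "assistant", "content": "first answer"},
    {"role": "user", "content": "second question"},
    {"role": "assistant", "content": "second answer"},
]

truncated = prompt_messages[-3:]  # keep only the last three messages
print([m["role"] for m in truncated])
# ['assistant', 'user', 'assistant'] - the "system" role is gone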


Therefore it makes a lot more sense to have the persistent system prompt separate from a history of messages, and a history separate from the user input that is provisional. Then a single API accessing method can be used.

Let’s define the system message globally, and give the past chat a better name than “prompt_messages”, since it also contains replies, and even function calls:

import openai  # use the whole library, for error types, other types, etc.

system = [{"role": "system", "content": 
    "You are ChatExpert, a large language model AI assistant"}]
chat = []

Now, how about a single function that takes the user input, instead of two functions that are almost the same? You will notice that where I send the messages below, I distinctly merge the separate lists: the system message, the past chat limited with the slice chat[-10:] (a start position measured from the end, so only the most recent turns are sent), and then the newest input.

def chat_with_ai(user_input):
    user = [{"role": "user", "content": user_input}]
    client = openai.Client()
    try:
        response = client.chat.completions.with_raw_response.create(
            messages=system + chat[-10:] + user,  # chat is limited
            model="gpt-3.5-turbo", max_tokens=256,
            temperature=0.5, top_p=0.5, stream=True)
        reply = ""
        for chunk in response.parse():
            if not chunk.choices[0].finish_reason:
                word = chunk.choices[0].delta.content or ""
                reply += word
                print(word, end="")
                # here you'd collect tool or function call chunks
        chat.append({"role": "user", "content": user_input})
        chat.append({"role": "assistant", "content": reply})
        return True
    except Exception as e:
        print(f"\nAn error occurred: {e}")

So we have a simple solution that keeps the whole chat, but only sends the most recent part of it when the conversation grows long. You could change the number of turns sent at any time. But it doesn’t check whether the messages it sends add up to more tokens than the AI model can handle.

Streaming is used, and the raw-response wrapper also gives access to the HTTP response headers (useful for rate-limit information). Where the function prints, that output could instead be sent to a UI or client app.
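
A quick sketch of the headers part, separate from the chatbot above; a non-streaming call keeps it short, and the header name shown is an assumption about what OpenAI currently returns.

import openai

client = openai.Client()
raw = client.chat.completions.with_raw_response.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=16)
# the raw wrapper exposes the HTTP headers before the body is parsed
print(raw.headers.get("x-ratelimit-remaining-requests"))
completion = raw.parse()  # the usual ChatCompletion object
print(completion.choices[0].message.content)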

Adding the new user input and assistant output to the chat history is only done upon success. Let’s use that success status return creatively in our main chatbot loop (not pictured in the original post) so you can even retry a failed request.

first_success = chat_with_ai("Welcome the user to the chatbot")
last_input = ""
while True:
    user_input = input("\nPrompt: ")
    if user_input == "exit":
        break
    if user_input == "Y" and last_input:
        user_input = last_input
    success = chat_with_ai(user_input)
    if not success:
        print("Enter only 'Y' to resend, or type a new prompt")
    last_input = user_input  # remembered even on failure, so 'Y' can retry it

The original poster wanted the AI to say something first, so a single line before the loop does that (or reports an error before anything is typed).

If there were a user manual, it would tell you that typing exit as your prompt will exit the loop and the program, and that entering only an upper-case Y will resend the previous input at any time (the AI only sees you sending the same thing twice if the previous request succeeded).


Improvements for you (since this isn’t the “learn to code” forum).

Where the chat[-10:] is now, a call to a function like chat_to_send(chat, budget, turns) would allow better, or changeable, chat-limit settings to be applied on each turn.

You can have a much smarter function that prepares the portion of the total chat history you send; a sketch follows the list below.

  • It can count the tokens of each message with the tiktoken library, account for an overhead of about 4 extra tokens per message, and store that metadata along with the message.
  • It can work to a token budget of the maximum tokens you want to send, also considering the user input, the system message, and any reservation for the response, adding chat history turns from further back until the budget would be exceeded.
  • It can ensure user inputs stay paired with their assistant outputs when tailing the chat.
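
A minimal sketch of such a helper, assuming the function name chat_to_send and a flat 4-token-per-message overhead; a fuller version would also subtract the system message, the new input, and a response reservation from the budget.

import tiktoken

def chat_to_send(chat, budget, model="gpt-3.5-turbo"):
    # Return the most recent chat messages that fit within a token budget.
    enc = tiktoken.encoding_for_model(model)
    selected = []
    total = 0
    # walk backwards from the newest message, adding turns until the budget is hit
    for message in reversed(chat):
        cost = len(enc.encode(message["content"])) + 4  # ~4 tokens of overhead each
        if total + cost > budget:
            break
        selected.insert(0, message)
        total += cost
    # drop a leading assistant message so the slice starts with a user turn
    if selected and selected[0]["role"] == "assistant":
        selected.pop(0)
    return selected

The call in chat_with_ai() would then become messages=system + chat_to_send(chat, 2000) + user, where 2000 is whatever budget you choose.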

I construct the system/user/assistant messages as normal, but before sending to the model I use tiktoken (pip install tiktoken) to count the total tokens. If the token count exceeds some amount X, I pop the first pair of user/assistant messages, repeating until the token limit is no longer exceeded.

I could be wrong, but the token limit provided by OpenAI includes both input and output, so I leave a window of around 1500-2000 tokens for output, since that is the maximum output I typically expect with how I use my chatbot.

So if I’m working with gpt-4, I set the context window at around 6000 tokens (the limit for gpt-4 is 8192), leaving 2192 tokens for output. Meaning, if the context before trimming is at 10000 tokens, the first user/assistant pair gets popped, and popping continues until those 10000 tokens are cut down to less than 6000.
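
A rough sketch of that trimming loop, assuming the system message sits at index 0 and using an approximate per-message token count; the function names are mine, not a definitive implementation.

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

def count_tokens(messages):
    # approximate: message contents plus a few tokens of per-message overhead
    return sum(len(enc.encode(m["content"])) + 4 for m in messages)

def trim_history(messages, context_limit=6000):
    # pop the oldest user/assistant pair (indexes 1 and 2, right after the
    # system message) until the total fits the chosen context window
    while count_tokens(messages) > context_limit and len(messages) > 3:
        del messages[1:3]
    return messages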

I experimented with various things, like always keeping the first pair of user/assistant messages (to give the model the “best” context), and providing truncation messages like “this part of the chat history has been cut to save space”. I have also thought of other methods, like using gpt-3.5 or 4-turbo to summarize user/assistant messages, but in the end I just stuck with my original method of popping the oldest pair of user and assistant messages.
