How to eliminate useless content in the Assistant response

I have written a GPT Assistant app, and it works well as a first cut. Now I’m dealing with cost issues. One thing I want to do is eliminate useless content in the Assistant response: I need only the last line in the messages array, but the Assistant seems to send, along with that last line, all of the lines from previous prompts and responses since the Assistant was created. To deal with this costly problem I am using the truncation_strategy parameter in the run object. Here is the code:

run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id,
    truncation_strategy={
        "type": "last_messages",
        "last_messages": 55,
    },
)
I would like to use "last_messages": 1 in the code, but when I do, the Assistant hallucinates wildly. Is there a way to prevent this while keeping last_messages small?

How do you expect the assistant to be coherent if you are refusing to give it any information?

Imagine you and I are having a conversation via text, but no matter how much we type, the other person only sees the last sentence we wrote.

The conversation would quickly go off the rails.

"Imagine you and I are having a conversation via text, but no matter how much we type, the other person only sees the last sentence we wrote.

The conversation would quickly go off the rails."

Great way to state the problem! But it’s the bot who needs context, so why is the bot sending the context to me when the bot already has it?

Am I being charged for all the tokens sent to me or for only the tokens in new messages sent to me?

You are charged as input tokens for any context sent to the model and output tokens for anything new the model generates.
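To make that split concrete, here is a minimal sketch of how a run's bill breaks down. The per-token prices below are made-up placeholders, not real OpenAI rates; the point is only that context resent to the model counts as input tokens, while only new generation counts as output tokens.

```python
# Hypothetical per-token prices (placeholders, NOT real OpenAI rates).
PRICE_PER_INPUT_TOKEN = 0.00001
PRICE_PER_OUTPUT_TOKEN = 0.00003

def run_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """prompt_tokens = all context sent to the model (including the
    resent history); completion_tokens = only what the model generates."""
    return (prompt_tokens * PRICE_PER_INPUT_TOKEN
            + completion_tokens * PRICE_PER_OUTPUT_TOKEN)

# Example: a run that sent 4,000 tokens of context and got 200 tokens back.
print(run_cost(4000, 200))
```

A completed run reports the real numbers in its usage field, so you can plug those in to see how much of each bill is just replayed context.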

Does the truncation_strategy parameter limit the context sent to the GPT Assistant?

I keep the number of last messages as large as possible.

If all the messages to and from the GPT Assistant API are kept on the OpenAI server, why do they have to be sent back to me with every message from the API in order to establish context for my messages?

I’m not certain what your specific concern is: is it an issue of needing to maintain the context yourself and repeatedly transmitting it, or is it an issue of being billed for the context tokens each time they are sent?

The answer to both is basically, because that’s just how it works.

The models are stateless, so if the context isn’t there the model doesn’t know about it. They send it back to you to facilitate the conversational, back-and-forth nature of chat completions and because they decided that was the best procedure for the Assistant platform.

It doesn’t cost you anything to get the context delivered back to you; that’s just static text. But if you don’t send all the context back, the model won’t know what you’re talking about and will hallucinate constantly.
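The stateless back-and-forth can be sketched without any API at all. This is a toy illustration (the helper name and canned reply are made up): each call's payload must carry the entire prior history, because the model remembers nothing between calls.

```python
# The model call is stateless, so every request carries the full history.
history = [{"role": "system", "content": "You are a helpful assistant."}]

def send_turn(user_text: str) -> list:
    """Append the user message and return the full payload a stateless
    model call would have to receive (hypothetical helper, canned reply)."""
    history.append({"role": "user", "content": user_text})
    payload = list(history)  # every prior message rides along as input tokens
    history.append({"role": "assistant", "content": "(model reply)"})
    return payload

first = send_turn("What is the capital of France?")
second = send_turn("And its population?")
# The second payload contains the entire first exchange as well.
```

Drop the earlier messages from the payload and "its population" becomes unanswerable, which is exactly the hallucination the original poster is seeing with last_messages set to 1.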

You do get charged for all the context you send back because, again, the models are stateless. Every time the model wants to generate another token, it needs to process every token that came before. That’s why the cost and time to use LLMs have, historically, scaled quadratically with context length, e.g. 10x the context = 100x the number of computations.
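The quadratic claim is easy to verify with a back-of-the-envelope count (constants and implementation details omitted): if each new token attends over every token before it, the total work across a context of length n is roughly 1 + 2 + ... + n, which grows like n squared.

```python
def attention_ops(context_len: int) -> int:
    """Rough count of attention comparisons to process a context of
    length n: token i attends to all i tokens up to and including it,
    so the total is 1 + 2 + ... + n = n * (n + 1) // 2."""
    return context_len * (context_len + 1) // 2

small = attention_ops(100)
big = attention_ops(1000)  # 10x the context
print(big / small)         # close to 100x the work
```

The ratio comes out just under 100 because of the lower-order n term, but for large contexts the 10x-context-means-100x-compute rule of thumb holds.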


Why should we get charged, since that context material is already on the OpenAI server? We know it’s up there because, when the user retrieves the assistant and the thread, the assistant begins the discussion with awareness of past discussions.

Because every time you want to generate a new token, the model must perform computations on every token that came before it.

The computational complexity of a transformer-based LLM scales with context-length-squared.

Regardless of where the context is stored at the start of a message generation, the computation still must be done, and that is what you are paying for: computation on tokens, not transmission of tokens.