I hope this message finds you well. I am reaching out for guidance regarding an issue I’ve encountered with the management of context tokens in assistant threads.
I have observed that the “context tokens” count in my assistant threads is very high, which drives up costs significantly. This seems to be caused by the accumulation of older, unnecessary messages in the threads.
Is there a way to selectively delete old messages from these threads? My goal is to remove redundant or irrelevant content to streamline the conversation and reduce the context tokens being processed repeatedly.
Alternatives sought: I am also open to any alternative suggestions or best practices for managing context tokens more efficiently. If there are strategies or tools within (or outside) the OpenAI framework that could help, I would greatly appreciate learning about them.
They recently updated the documentation to say they are “exploring” ways to avoid the potential for >$1/message costs.
Currently, the Assistant will include the maximum number of messages that fit in the context length. We plan to explore the ability for you to control the input / output token count beyond the model you select, as well as the ability to automatically generate summaries of the previous messages and pass that as context. If your use case requires a more advanced level of control, you can manually generate summaries and control context with the Chat Completion API.
Why they introduced this alongside GPT-4 with a 128k context window will never make sense to me.
Alternatively, if you like the Assistants framework (which to me is conceptually amazing), you can start with GPT-4 and then “downgrade” to a GPT-3.5 model, effectively reducing your max token count. I may be misremembering, but I believe at 16k the max is $0.01 or $0.02 per message, and that’s plenty of context to carry a conversation.
For my use case, I use GPT-4 initially to set a bunch of user variables (via function calling and instructions). Once they’re captured, it’s switched to GPT-3.5. It took a little adjusting of the instructions, but the difference isn’t even noticeable.
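The switch itself is just an assistant update; a minimal sketch with the Python SDK, where the assistant ID and exact model name are placeholders:

from openai import OpenAI

client = OpenAI()

# once the user variables are captured, move the assistant to a cheaper model
client.beta.assistants.update("asst_XXXX", model="gpt-3.5-turbo-1106")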
Hey @yako, right now we don’t offer an option to do this; the assistant will try to keep as many messages in context as it can and then just drop old messages as new ones are added and the context window fills up.
The context on why this is the starting point: ChatGPT uses a similar heuristic. We know this is not perfect and are exploring other options; many folks have asked for this, so stay tuned!
Good call. Looks like there’s an undocumented/beta API endpoint.
import requests

# delete a message
def delete_message_gpt(thread_id, message_id):
    # have to hit this one directly, as it's unpublished/beta
    url = f"https://api.openai.com/v1/threads/{thread_id}/messages/{message_id}"
    headers = {
        "Authorization": f"Bearer {app.config['OPENAI_API_KEY']}",  # key lives in the Flask config
        "Content-Type": "application/json",
        "OpenAI-Beta": "assistants=v1",
    }
    response = requests.delete(url, headers=headers)
    if response.status_code != 200:
        raise Exception(f"Failed to delete message: {response.status_code} {response.text}")
Worth noting, the ability to delete messages from a thread is a useful user feature in ChatGPT. So it makes sense to have an endpoint so you can delete user messages and retry, or delete assistant messages and do a new run.
I noticed that you mentioned a beta/undocumented endpoint for deleting messages in an Assistants API thread. Have you had a chance to test this endpoint? If so, did it work as expected?
Hi … it does work, but it gave me some unexpected results, like the assistant responding to messages I thought were deleted. I didn’t spend much more time on making it work, as I didn’t want a production system depending on an undocumented endpoint.
My suspicion is that if you delete a message in the middle of the thread, you have to make sure to delete from the end of the thread all the way back to the message you want to delete, in the proper order so there are no orphan message nodes. I just never got around to trying that.
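Untested, but the sketch I had in mind looks something like this, walking newest-first via the list endpoint until the target message has been removed (the list route is documented; the delete route is still the unpublished one above):

import requests

def delete_back_to(thread_id, target_message_id, api_key):
    headers = {
        "Authorization": f"Bearer {api_key}",
        "OpenAI-Beta": "assistants=v1",
    }
    base = f"https://api.openai.com/v1/threads/{thread_id}/messages"
    # the list endpoint returns messages newest-first by default
    messages = requests.get(base, headers=headers).json()["data"]
    for msg in messages:
        requests.delete(f"{base}/{msg['id']}", headers=headers)
        if msg["id"] == target_message_id:
            break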
It seems to work, but it leaves behind the function calls and their outputs, including code_interpreter responses. However much you try to delete those by their IDs, nothing happens.
Has anyone else tried this undocumented beta endpoint to delete messages? I want to give it a try; I’ll do anything to reduce my context tokens.
Hi @yako, have you tried creating a new thread for each new question/message addressed to the assistant?
@logankilpatrick could you please advise whether there are any limitations on the number of threads per assistant?
I am planning to use that approach (new message, new thread). It will also let me send as many simultaneous requests/messages as I need, without waiting for a run to complete. In my case, I have long commands and short messages/questions that are unrelated to each other. To reduce costs, I provide my commands to the assistant once and then ask the assistant to analyze the user’s message.
By the way, thank you for your question. It helped me understand that each run of the assistant counts all previous messages in the thread. Reducing cost/input tokens was the reason I switched from chat to the assistant, hoping to eliminate the repetition of commands.
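As a sketch of that pattern with the Python SDK (the assistant ID is a placeholder and error handling is omitted):

import time
from openai import OpenAI

client = OpenAI()

def ask(question, assistant_id="asst_XXXX"):
    # fresh thread per question, so no earlier messages are billed as input
    thread = client.beta.threads.create(
        messages=[{"role": "user", "content": question}]
    )
    run = client.beta.threads.runs.create(
        thread_id=thread.id, assistant_id=assistant_id
    )
    while run.status in ("queued", "in_progress"):
        time.sleep(1)
        run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)
    # messages come back newest-first; the first is the assistant's reply
    return client.beta.threads.messages.list(thread_id=thread.id).data[0].content[0].text.value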
I was thinking about something similar too. That way all the chat history is lost, though, so if you need a multi-turn conversation (as in my use case) this could be a problem. But I think it’s the only way to limit the tokens used for each run (and the cost): today 3 questions (THREE!!) produced 50,000 tokens (€0.50), which is unsustainable.
Hey all! Steve from the OpenAI dev team here. We’re working on designing usage controls for thread runs in the assistants API, and I want to provide a preview of the proposed change and get your feedback.
What we’re proposing is to add two new parameters to endpoints that create a run:
POST /v1/threads/{thread_id}/runs
POST /v1/threads/runs
we would add an optional field, token_control, to the payload, which looks like this:
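{
  "token_control": {
    "max_run_prompt_tokens": 2000,
    "max_run_completion_tokens": 500
  }
}

(the values above are just illustrative)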
The idea is to internally limit the number of tokens used on each step of the run and make a best effort to keep overall token usage within the limits specified.
Let us know what you think of this idea and whether it will work for your use cases!
Nice, that’s what we need. I’m developing an assistant to help users in my system, but with the current token/charge rules it would be impossible to make it available to them; it would consume the whole company’s revenue.
One question, Steve: I use the paid ChatGPT application, where we pay about 20 USD for the whole month with lots of long-context threads. With the API/Assistants the cost seems much, much higher; is there any effort at OpenAI to make it more affordable, so we can use assistants in many other ways?
Thanks for sharing this update on the proposed changes to the assistants API. I appreciate the effort the team is putting into enhancing usage controls for thread runs.
Regarding the addition of max_run_prompt_tokens and max_run_completion_tokens, I have a couple of thoughts:
If users are able to provide a maximum token count, it implies that the system will need to somehow select specific tokens from the context pool (from retrieval) to provide to the Assistant. It would be helpful to understand more about how this selection process will work and how it ensures relevant and meaningful interactions.
Additionally, is there any consideration for incorporating a scoring mechanism, akin to a similarity score, that could aid in selecting tokens based on their relevance to the context or the user’s query? This could potentially enhance the effectiveness of the interactions by ensuring that the tokens provided align closely with the user’s intent or the context of the conversation.
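For illustration (just my own sketch, not anything OpenAI has described), such a selection could rank prior messages by embedding similarity to the new query and keep the highest-scoring ones that fit within the prompt-token budget:

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def select_messages(messages, query_embedding, max_prompt_tokens):
    # messages: list of {"text", "embedding", "tokens"} dicts (hypothetical shape)
    ranked = sorted(messages, key=lambda m: cosine(m["embedding"], query_embedding), reverse=True)
    selected, budget = [], max_prompt_tokens
    for m in ranked:
        if m["tokens"] <= budget:
            selected.append(m)
            budget -= m["tokens"]
    return selected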
I believe these considerations could further refine the proposed changes and provide users with even more control and flexibility over their thread runs.
When I tried to use this yesterday, I was gently admonished by the OpenAI folks to not use undocumented endpoints. Maybe I used it incorrectly. Has anyone got this to work and is it still working?
I’m wondering about similar functionality. In my case, I want the knowledge retrieval option of the Assistants API, but I don’t want a long context; in fact, I just want single responses (like you might get from Completions). I’d wondered about putting all my queries in a loop and deleting/creating a thread for each one, but that seems like it would introduce a ton of overhead.
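Thread deletion at least is documented, so the loop itself wouldn’t have to depend on anything unpublished; something like:

from openai import OpenAI

client = OpenAI()

for query in queries:  # hypothetical list of single-shot questions
    thread = client.beta.threads.create(
        messages=[{"role": "user", "content": query}]
    )
    # ... run the assistant and read the single response ...
    client.beta.threads.delete(thread.id)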
Exception: Failed to delete message: 401 {
  "error": {
    "message": "You've made a request to an admin-only URL. If you're an OpenAI employee, please request the necessary permissions. If not, it's our mistake that you're trying to access this URL -- please let us know how you found it at support@openai.com.",
    "type": "invalid_request_error",
    "param": null,
    "code": "missing_scope"
  }
}