I hope this message finds you well. I am reaching out for guidance regarding an issue I’ve encountered with the management of context tokens in assistant threads.
I have observed that the “context tokens” count in my assistant threads is very high, which drives up costs significantly. This seems to be caused by the accumulation of older, unnecessary messages in the threads.
Is there a way to selectively delete old messages from these threads? My goal is to remove redundant or irrelevant content to streamline the conversation and reduce the context tokens being processed repeatedly.
Alternatives sought: I am also open to any alternative suggestions or best practices for managing context tokens more efficiently. If there are strategies or tools within (or outside) the OpenAI framework that could help, I would greatly appreciate learning about them.
They recently updated the documentation to say they are “exploring” ways to avoid the potential for >$1/message costs.
Currently, the Assistant will include the maximum number of messages that fit in the context length. We plan to explore the ability for you to control the input / output token count beyond the model you select, as well as the ability to automatically generate summaries of the previous messages and pass that as context. If your use case requires a more advanced level of control, you can manually generate summaries and control context with the Chat Completion API.
Why they introduced this alongside GPT-4 with a 128k context window will never make sense to me.
Alternatively, if you like the Assistants framework (which to me is conceptually amazing), you can start with GPT-4 and then “downgrade” to a GPT-3.5 model, effectively reducing your max token count. I may be misremembering, but I believe at 16k the max is $0.01 or $0.02 per message, and that’s plenty of context to carry a conversation.
For my use case, I use GPT-4 initially to set a bunch of user variables (via function calling and instructions). Once they’re captured, it’s switched to GPT-3.5. It took a little adjusting of the instructions, but the difference isn’t even noticeable.
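The switch itself is just an assistant update; a minimal sketch with the Python SDK, where the assistant ID and exact model name are placeholders:

from openai import OpenAI

client = OpenAI()

# once the user variables are captured, move the assistant to a cheaper model
client.beta.assistants.update("asst_XXXX", model="gpt-3.5-turbo-1106")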
Hey @yako, right now we don’t offer an option to do this; the assistant will try to keep as many messages in context as it can and then just drop old messages as new ones are added and the context window fills up.
The context on why this is the starting point: ChatGPT uses a similar heuristic. We know this is not perfect and are exploring other options; many folks have asked for this, so stay tuned!
Good call. Looks like there’s an undocumented/beta API endpoint.
import requests

# delete a message
def delete_message_gpt(thread_id, message_id):
    # have to hit this one directly, as it's unpublished/beta
    url = f"https://api.openai.com/v1/threads/{thread_id}/messages/{message_id}"
    headers = {
        "Authorization": f"Bearer {app.config['OPENAI_API_KEY']}",  # key lives in the Flask config
        "Content-Type": "application/json",
        "OpenAI-Beta": "assistants=v1",
    }
    response = requests.delete(url, headers=headers)
    if response.status_code != 200:
        raise Exception(f"Failed to delete message: {response.status_code} {response.text}")
Worth noting, the ability to delete messages from a thread is a useful user feature in ChatGPT. So it makes sense to have an endpoint so you can delete user messages and retry, or delete assistant messages and do a new run.
I noticed that you mentioned a beta/undocumented endpoint for deleting messages in an Assistants API thread. Have you had a chance to test this endpoint? If so, did it work as expected?
Hi … it does work, but it gave me some unexpected results, like the assistant responding to messages I thought were deleted. I didn’t spend much more time on making it work, as I didn’t want a production system depending on an undocumented endpoint.
My suspicion is that if you delete a message in the middle of the thread, you have to make sure to delete from the end of the thread all the way back to the message you want to delete, in the proper order so there are no orphan message nodes. I just never got around to trying that.
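Untested, but the sketch I had in mind looks something like this, walking newest-first via the list endpoint until the target message has been removed (the list route is documented; the delete route is still the unpublished one above):

import requests

def delete_back_to(thread_id, target_message_id, api_key):
    headers = {
        "Authorization": f"Bearer {api_key}",
        "OpenAI-Beta": "assistants=v1",
    }
    base = f"https://api.openai.com/v1/threads/{thread_id}/messages"
    # the list endpoint returns messages newest-first by default
    messages = requests.get(base, headers=headers).json()["data"]
    for msg in messages:
        requests.delete(f"{base}/{msg['id']}", headers=headers)
        if msg["id"] == target_message_id:
            break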
It seems to work, but it leaves behind the function calls and their outputs, including code_interpreter responses. However much you try to delete those by their IDs, nothing happens.
Has anyone else tried this undocumented beta endpoint to delete messages? I want to give it a try; I’ll do anything to reduce my context tokens.
Hi @yako, have you tried creating a new thread for each new question/message addressed to the assistant?
@logankilpatrick could you please advise whether there are any limitations on the number of threads per assistant?
I am planning to use that approach (new message, new thread). It will also let me send as many simultaneous requests/messages as I need, without waiting for a run to complete. In my case, I have long commands and short messages/questions that are unrelated to each other. To reduce costs, I provide my commands to the assistant once and then ask the assistant to analyze the user’s message.
By the way, thank you for your question. It helped me understand that each run of the assistant counts all previous messages in the thread. Reducing cost/input tokens was the reason I switched from chat to the assistant, hoping to eliminate the repetition of commands.
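As a sketch of that pattern with the Python SDK (the assistant ID is a placeholder and error handling is omitted):

import time
from openai import OpenAI

client = OpenAI()

def ask(question, assistant_id="asst_XXXX"):
    # fresh thread per question, so no earlier messages are billed as input
    thread = client.beta.threads.create(
        messages=[{"role": "user", "content": question}]
    )
    run = client.beta.threads.runs.create(
        thread_id=thread.id, assistant_id=assistant_id
    )
    while run.status in ("queued", "in_progress"):
        time.sleep(1)
        run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)
    # messages come back newest-first; the first is the assistant's reply
    return client.beta.threads.messages.list(thread_id=thread.id).data[0].content[0].text.value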
I was thinking about something similar too. That way all the chat history is lost, though, so if you need a multi-turn conversation (as in my use case) this could be a problem. But I think it’s the only way to limit the tokens used for each run (and the cost): today 3 questions (THREE!!) produced 50,000 tokens (€0.50), which is unsustainable.
Hey all! Steve from the OpenAI dev team here. We’re working on designing usage controls for thread runs in the assistants API, and I want to provide a preview of the proposed change and get your feedback.
What we’re proposing is to add two new parameters to endpoints that create a run:
POST /v1/threads/{thread_id}/runs
POST /v1/threads/runs
we would add an optional field, token_control, to the payload, which looks like this:
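{
  "token_control": {
    "max_run_prompt_tokens": 2000,
    "max_run_completion_tokens": 500
  }
}

(the values above are just illustrative)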
The idea is to internally limit the number of tokens used on each step of the run and make a best effort to keep overall token usage within the limits specified.
Let us know what you think of this idea and whether it will work for your use cases!
Nice, that’s what we need. I’m developing an assistant to help users in my system, but with the current token/charge rules it would be impossible to make it available to them; it would consume the whole company’s revenue.
One question, Steve: I use the paid ChatGPT application, where we pay about 20 USD for the whole month with lots of long-context threads. With the API/Assistants the cost seems much, much higher; is there any effort at OpenAI to make it more affordable, so we can use assistants in many other ways?
Thanks for sharing this update on the proposed changes to the assistants API. I appreciate the effort the team is putting into enhancing usage controls for thread runs.
Regarding the addition of max_run_prompt_tokens and max_run_completion_tokens, I have a couple of thoughts:
If users are able to provide a maximum token count, it implies that the system will need to somehow select specific tokens from the context pool (from retrieval) to provide to the Assistant. It would be helpful to understand more about how this selection process will work and how it ensures relevant and meaningful interactions.
Additionally, is there any consideration for incorporating a scoring mechanism, akin to a similarity score, that could aid in selecting tokens based on their relevance to the context or the user’s query? This could potentially enhance the effectiveness of the interactions by ensuring that the tokens provided align closely with the user’s intent or the context of the conversation.
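For illustration (just my own sketch, not anything OpenAI has described), such a selection could rank prior messages by embedding similarity to the new query and keep the highest-scoring ones that fit within the prompt-token budget:

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def select_messages(messages, query_embedding, max_prompt_tokens):
    # messages: list of {"text", "embedding", "tokens"} dicts (hypothetical shape)
    ranked = sorted(messages, key=lambda m: cosine(m["embedding"], query_embedding), reverse=True)
    selected, budget = [], max_prompt_tokens
    for m in ranked:
        if m["tokens"] <= budget:
            selected.append(m)
            budget -= m["tokens"]
    return selected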
I believe these considerations could further refine the proposed changes and provide users with even more control and flexibility over their thread runs.
When I tried to use this yesterday, I was gently admonished by the OpenAI folks to not use undocumented endpoints. Maybe I used it incorrectly. Has anyone got this to work and is it still working?
I’m wondering about similar functionality. In my case, I want the knowledge retrieval option of the Assistants API, but I don’t want a long context; in fact, I just want single responses (like you might get from Completions). I’d wondered about putting all my queries in a loop and deleting/creating a thread for each one, but that seems like it would introduce a ton of overhead.
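Thread deletion at least is documented, so the loop itself wouldn’t have to depend on anything unpublished; something like:

from openai import OpenAI

client = OpenAI()

for query in queries:  # hypothetical list of single-shot questions
    thread = client.beta.threads.create(
        messages=[{"role": "user", "content": query}]
    )
    # ... run the assistant and read the single response ...
    client.beta.threads.delete(thread.id)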
Exception: Failed to delete message: 401 {
  "error": {
    "message": "You've made a request to an admin-only URL. If you're an OpenAI employee, please request the necessary permissions. If not, it's our mistake that you're trying to access this URL -- please let us know how you found it at support@openai.com.",
    "type": "invalid_request_error",
    "param": null,
    "code": "missing_scope"
  }
}