That would eliminate the need to use httpx routines for…
The “Token Tamer” function operates as an automated control mechanism that manages and reins in excessive API token usage within an AI assistant through several key features:
Token Tagging: Each message is permanently tagged with metadata that includes token usage information and contextual subtopics of the ongoing conversation. This tagging system allows for efficient tracking and identification of the tokens associated with each interaction.
Threshold of Termination: The function sets a predetermined threshold for message deletion, either by the number of conversation turns or the total tokens consumed. Once this threshold is reached, the Token Tamer initiates the process of deletion.
Priority Specification: Users have the flexibility to specify the priority of retaining certain messages over others. This includes the ability to prioritize keeping user messages or AI-generated responses, either as a percentage or based on their context and importance within the conversation.
Overall, the Token Tamer function acts as a sophisticated regulator, employing metadata tagging, threshold management, and customizable priority settings to efficiently manage token usage and prevent excessive consumption within AI models.
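A minimal client-side sketch of those mechanics might look like the following in Python (every name here is hypothetical and invented purely for illustration; nothing is an OpenAI API feature):

# Hypothetical client-side sketch of the "Token Tamer" idea; all names are
# invented for illustration and are not part of any API.
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaggedMessage:
    role: str                  # "user" or "assistant"
    content: str
    tokens: int                # token count recorded when the message was added
    subtopic: str = ""         # contextual subtopic tag for the conversation
    keep_priority: int = 0     # higher values survive pruning longer


@dataclass
class TokenTamer:
    max_turns: int = 20             # deletion threshold by conversation turns
    max_total_tokens: int = 6000    # deletion threshold by accumulated tokens
    messages: List[TaggedMessage] = field(default_factory=list)

    def add(self, msg: TaggedMessage) -> None:
        self.messages.append(msg)
        self._prune()

    def _prune(self) -> None:
        # Once either threshold is crossed, delete the lowest-priority,
        # oldest messages until the conversation fits again.
        while (len(self.messages) > self.max_turns
               or sum(m.tokens for m in self.messages) > self.max_total_tokens):
            victim = min(range(len(self.messages)),
                         key=lambda i: (self.messages[i].keep_priority, i))
            del self.messages[victim]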
Better would be the forethought of (max_turns, max_input_tokens, semantic_lookup_percent) threshold parameters on a thread or assistant, so that passing a limited conversation to the AI model happens automatically.
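For example (the context_control object below is purely hypothetical and does not exist in the Assistants API; only the three parameter names come from the suggestion above), thread creation could accept something like:

# Purely hypothetical: these fields do not exist in the Assistants API today.
import os
import httpx

api_key = os.environ["OPENAI_API_KEY"]

payload = {
    "context_control": {                # hypothetical container object
        "max_turns": 30,                # turns of history passed to the model
        "max_input_tokens": 8000,       # prompt tokens assembled from the thread
        "semantic_lookup_percent": 40,  # share of context reserved for retrieval
    },
}
resp = httpx.post(
    "https://api.openai.com/v1/threads",
    headers={
        "Authorization": f"Bearer {api_key}",
        "OpenAI-Beta": "assistants=v1",
    },
    json=payload,
)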
I’m not sure that’s fair. Beta is beta, and there’s a lot of evolution that could happen here. I’m hopeful that — as seems pretty natural — (many) new features will be added to provide control over how context is constructed from thread history.
(Though, honestly, if it stays as is, you’re kinda right. That would be sad.)
Hey all! Steve from the OpenAI dev team here. We’re working on designing usage controls for thread runs in the assistants API, and I want to provide a preview of the proposed change and get your feedback.
What we’re proposing is to add two new parameters to endpoints that create a run:
POST /v1/threads/{thread_id}/runs
POST /v1/threads/runs
We would add an optional field, token_control, to the payload, which would look like this:
The idea is to internally limit the number of tokens used on each step of the run and make a best effort to keep overall token usage within the limits specified.
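For illustration only (the inner field names are placeholders, not a committed design; only the token_control field name itself is part of the proposal), a run-creation request might carry something like:

# Placeholder illustration; the inner keys are assumptions, not a committed design.
run_request = {
    "assistant_id": "asst_abc123",
    "token_control": {
        "max_prompt_tokens": 4000,       # assumed: cap on input tokens per step
        "max_completion_tokens": 1000,   # assumed: cap on generated tokens per step
    },
}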
Let us know what you think of this idea and whether it will work for your use cases!
What you describe doesn’t ultimately control usage to a specification. It sounds like it could simply throw an error or truncate output.
API developers can handle programming. If assistants are ever to be useful for production and products, you must target skilled developers. The trick is to make all specifications, or objects containing sets of specifications, optional, and to allow incrementally building up control beyond the default values.
The below also isn’t fleshed out (and can’t be, as we are given a black box at this point), but it gives an idea of where API developers may want controls:
{
  run_budget {  # threshold disables all tools, forcing user output
    max_steps - number of internal iterations to allow
    total_completion_tokens - total tokens of all internal steps to allow
    total_tool_tokens - total accumulated tool input context from iterations
  }
  run_limits {  # thresholds immediately terminate the run with an error if reached
    total_completion_tokens
    total_input_tokens
  }
  step_context_budget {
    max_input - limits context as if the model had a smaller input capability
    max_tokens - truncates the output of all internal generations
  }
  retrieval_injection {
    max_tokens - token count at which automatic knowledge injection stops
    similarity_threshold - semantic threshold to block irrelevant passages
  }
  retrieval_browser {
    search_max_return_items
    search_max_tokens
    search_similarity_threshold
    click_max_tokens
  }
  tool_context {
    tool_max - truncates what the AI loads from python calls or tool returns
  }
  conversation_context {
    max_tokens
    max_turns
    favor_conversation - 0-100 importance of maintaining chat vs. retrieval
  }
}
…and then expose temperature and top_p. Or, for function-calling AI in general, even an immediate sampling override triggered by the production of a send-to-tool-recipient token.
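As a sketch of the “everything optional, build up incrementally” principle (the class and field names below simply mirror the outline above and are hypothetical, not any real SDK):

# Hypothetical sketch of optional, incrementally built controls; names mirror
# the outline above and are not part of any real SDK.
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class RunBudget:
    max_steps: Optional[int] = None
    total_completion_tokens: Optional[int] = None
    total_tool_tokens: Optional[int] = None


@dataclass
class ConversationContext:
    max_tokens: Optional[int] = None
    max_turns: Optional[int] = None
    favor_conversation: int = 50    # default: balance chat history and retrieval


@dataclass
class RunControls:
    run_budget: Optional[RunBudget] = None
    conversation_context: Optional[ConversationContext] = None

    def to_payload(self) -> dict:
        # Send only what the developer explicitly set; anything omitted
        # falls back to server-side defaults.
        def strip_none(value):
            if isinstance(value, dict):
                return {k: strip_none(v) for k, v in value.items() if v is not None}
            return value
        return strip_none(asdict(self))


# Start with a single control and layer more on later:
controls = RunControls(conversation_context=ConversationContext(max_turns=12))
print(controls.to_payload())
# -> {'conversation_context': {'max_turns': 12, 'favor_conversation': 50}}

Each remaining block from the outline (run_limits, step_context_budget, retrieval_injection, retrieval_browser, tool_context) would slot in as another optional sub-object with the same opt-in semantics.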