That would eliminate the need to use httpx routines for…
The “Token Tamer” function operates as an automated control mechanism that manages and reins in excessive API token usage within an AI assistant through several key features:
Token Tagging: Each message is permanently tagged with metadata that includes token usage information and contextual subtopics of the ongoing conversation. This tagging system allows for efficient tracking and identification of the tokens associated with each interaction.
Threshold of Termination: The function sets a predetermined threshold for message deletion, either by the number of conversation turns or the total tokens consumed. Once this threshold is reached, the Token Tamer initiates the process of deletion.
Priority Specification: Users have the flexibility to specify the priority of retaining certain messages over others. This includes the ability to prioritize keeping user messages or AI-generated responses, either as a percentage or based on their context and importance within the conversation.
Overall, the Token Tamer function acts as a sophisticated regulator, employing metadata tagging, threshold management, and customizable priority settings to efficiently manage token usage and prevent excessive consumption within AI models.
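A minimal client-side sketch of those mechanics might look like the following in Python (every name here is hypothetical and invented purely for illustration; nothing is an OpenAI API feature):

# Hypothetical client-side sketch of the "Token Tamer" idea; all names are
# invented for illustration and are not part of any API.
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaggedMessage:
    role: str                  # "user" or "assistant"
    content: str
    tokens: int                # token count recorded when the message was added
    subtopic: str = ""         # contextual subtopic tag for the conversation
    keep_priority: int = 0     # higher values survive pruning longer


@dataclass
class TokenTamer:
    max_turns: int = 20             # deletion threshold by conversation turns
    max_total_tokens: int = 6000    # deletion threshold by accumulated tokens
    messages: List[TaggedMessage] = field(default_factory=list)

    def add(self, msg: TaggedMessage) -> None:
        self.messages.append(msg)
        self._prune()

    def _prune(self) -> None:
        # Once either threshold is crossed, delete the lowest-priority,
        # oldest messages until the conversation fits again.
        while (len(self.messages) > self.max_turns
               or sum(m.tokens for m in self.messages) > self.max_total_tokens):
            victim = min(range(len(self.messages)),
                         key=lambda i: (self.messages[i].keep_priority, i))
            del self.messages[victim]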
Better would be the forethought of (max_turns, max_input_tokens, semantic_lookup_percent) threshold parameters on a thread or assistant, so that passing a limited conversation to the AI model happens automatically.
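For example (the context_control object below is purely hypothetical and does not exist in the Assistants API; only the three parameter names come from the suggestion above), thread creation could accept something like:

# Purely hypothetical: these fields do not exist in the Assistants API today.
import os
import httpx

api_key = os.environ["OPENAI_API_KEY"]

payload = {
    "context_control": {                # hypothetical container object
        "max_turns": 30,                # turns of history passed to the model
        "max_input_tokens": 8000,       # prompt tokens assembled from the thread
        "semantic_lookup_percent": 40,  # share of context reserved for retrieval
    },
}
resp = httpx.post(
    "https://api.openai.com/v1/threads",
    headers={
        "Authorization": f"Bearer {api_key}",
        "OpenAI-Beta": "assistants=v1",
    },
    json=payload,
)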
I’m not sure that’s fair. Beta is beta, and there’s a lot of evolution that could happen here. I’m hopeful that — as seems pretty natural — (many) new features will be added to provide control over how context is constructed from thread history.
(Though, honestly, if it stays as is, you’re kinda right. That would be sad.)
Hey all! Steve from the OpenAI dev team here. We’re working on designing usage controls for thread runs in the assistants API, and I want to provide a preview of the proposed change and get your feedback.
What we’re proposing is to add two new parameters to endpoints that create a run:
POST /v1/threads/{thread_id}/runs
POST /v1/threads/runs
We would add an optional field, token_control, to the payload, which would look like this:
The idea is to internally limit the number of tokens used on each step of the run and make a best effort to keep overall token usage within the limits specified.
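For illustration only (the inner field names are placeholders, not a committed design; only the token_control field name itself is part of the proposal), a run-creation request might carry something like:

# Placeholder illustration; the inner keys are assumptions, not a committed design.
run_request = {
    "assistant_id": "asst_abc123",
    "token_control": {
        "max_prompt_tokens": 4000,       # assumed: cap on input tokens per step
        "max_completion_tokens": 1000,   # assumed: cap on generated tokens per step
    },
}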
Let us know what you think of this idea and whether it will work for your use cases!
What you describe doesn’t ultimately control usage to a specification. It sounds like it could simply throw an error or truncate output.
API developers can handle programming. If assistants are ever to be useful for production and products, you must target skilled developers. The trick is to make all specifications, or objects containing sets of specifications, optional, and to allow incrementally building up control beyond the default values.
The below also isn’t fleshed out (and can’t be, as we are given a black box at this point), but it gives an idea of where API developers may want controls:
{
  run_budget {  # threshold disables all tools, forcing user output
    max_steps - number of internal iterations to allow
    total_completion_tokens - total tokens of all internal steps to allow
    total_tool_tokens - total accumulated tool input context from iterations
  }
  run_limits {  # thresholds immediately terminate the run with an error if reached
    total_completion_tokens
    total_input_tokens
  }
  step_context_budget {
    max_input - limits context as if the model had a smaller input capability
    max_tokens - truncates the output of all internal generations
  }
  retrieval_injection {
    max_tokens - token count at which automatic knowledge injection stops
    similarity_threshold - semantic threshold to block irrelevant passages
  }
  retrieval_browser {
    search_max_return_items
    search_max_tokens
    search_similarity_threshold
    click_max_tokens
  }
  tool_context {
    tool_max - truncates what the AI loads from python calls or tool returns
  }
  conversation_context {
    max_tokens
    max_turns
    favor_conversation - 0-100 importance of maintaining chat vs. retrieval
  }
}
…and then expose temperature and top_p. Or, for function-calling AI in general, even an immediate sampling override triggered by the production of a send-to-tool-recipient token.
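As a sketch of the “everything optional, build up incrementally” principle (the class and field names below simply mirror the outline above and are hypothetical, not any real SDK):

# Hypothetical sketch of optional, incrementally built controls; names mirror
# the outline above and are not part of any real SDK.
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class RunBudget:
    max_steps: Optional[int] = None
    total_completion_tokens: Optional[int] = None
    total_tool_tokens: Optional[int] = None


@dataclass
class ConversationContext:
    max_tokens: Optional[int] = None
    max_turns: Optional[int] = None
    favor_conversation: int = 50    # default: balance chat history and retrieval


@dataclass
class RunControls:
    run_budget: Optional[RunBudget] = None
    conversation_context: Optional[ConversationContext] = None

    def to_payload(self) -> dict:
        # Send only what the developer explicitly set; anything omitted
        # falls back to server-side defaults.
        def strip_none(value):
            if isinstance(value, dict):
                return {k: strip_none(v) for k, v in value.items() if v is not None}
            return value
        return strip_none(asdict(self))


# Start with a single control and layer more on later:
controls = RunControls(conversation_context=ConversationContext(max_turns=12))
print(controls.to_payload())
# -> {'conversation_context': {'max_turns': 12, 'favor_conversation': 50}}

Each remaining block from the outline (run_limits, step_context_budget, retrieval_injection, retrieval_browser, tool_context) would slot in as another optional sub-object with the same opt-in semantics.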