Currently I’m testing the Assistants API and I have a question. As I understand it, the instructions given to an Assistant are sent as context, and those instructions are counted as context tokens. Am I right?
The language around model usage was changed a bit on the new usage page in November, with the release of Assistants (and that page now shows less detailed information).
Just as the terms “GPT” and “Assistant” were already used for other purposes, our ability to refer to things clearly is now further muddied by the new use of “context” to refer to yet another thing.
| Usage page term | Completions API object |
| --- | --- |
| Context tokens | `prompt_tokens` |
| Generated tokens | `completion_tokens` |
That alone should help clarify the multiple terms for the same thing.
Assistants also have non-transparent usage: for example, the entire “thread” conversation must be sent to an AI model as input for each question. That input, plus the new output the AI writes, can be repeated multiple times and grow as the assistant also makes multiple internal calls to retrieval, the code interpreter, or the developer’s own API tool functions, all to finally form one response to be read by the end user.
Uses input tokens:
- Your provided instructions
- OpenAI-provided instructions of assistants
- Your provided tool specifications
- OpenAI internal undisclosed tool specifications of assistants
- Conversation history from the thread (as much as you budget, or the maximum the assistant sends)
- Files from the thread and files attached to the assistant, inserted by RAG automation
- Files from the thread and files attached to the assistant, retrieved by a function call
- Code interpreter results
- Past AI function-call language and results added to the conversation thread
Uses output tokens:
- The AI generating language for all those internal purposes besides writing to the user
No per-run token usage is provided to you by the Assistants API; you get only the daily bar graphs on a web page.
When making direct calls to models through the completions endpoints, token usage is either returned or directly calculable. You see all the tokens you sent and received in each individual call (except for slight obfuscation of the counted/billed overhead of tools/functions).
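As a sketch of that accounting, per-call cost can be computed directly from the `usage` counts a completions call returns. The prices below are placeholders I chose for illustration, not current OpenAI rates:

```python
def call_cost(prompt_tokens: int, completion_tokens: int,
              input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Compute the billed cost of one completions call from its usage counts."""
    return (prompt_tokens / 1000) * input_price_per_1k \
         + (completion_tokens / 1000) * output_price_per_1k

# Example: a call carrying 50k context tokens that generates 500 tokens,
# at hypothetical rates of $0.01/1k input and $0.03/1k output:
cost = call_cost(50_000, 500, 0.01, 0.03)
print(f"${cost:.3f}")  # → $0.515
```

Note that the input side dominates for long conversations: here the 50k resent context costs $0.50 while the fresh answer costs only $0.015.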
AI models are memoryless; they must be loaded with prompt information on each call before they can generate based on it, including instructions (system messages) and earlier chat.
So “context” generally means: the additional information we humans need in order to understand how to answer; and also all the information the AI needed to know on every single model call, loaded into a memory area called the “context window”, where the answer is also formed.
Hopefully that covers all facets of usage and billing you might encounter.
So based on your explanations, the Context Tokens usage in a Thread grows at each interaction, since the AI model is memoryless and each interaction adds context to the Thread?
ContextUsage(t+1) = MIN(ContextUsage(t), MAX_CONTEXT_LENGTH) + (userInput(t+1) + functionCall(t+1) + ...)
Meaning that if the Context Tokens cost of a Thread reaches $1, all subsequent conversation calls will cost at least $1?
The answers are yes.
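The recurrence above can be sketched numerically. The per-turn token count and the $0.01-per-1k input price are illustrative assumptions, not actual rates:

```python
MAX_CONTEXT_LENGTH = 128_000   # e.g. a 128k-context model
INPUT_PRICE_PER_1K = 0.01      # hypothetical input price, $ per 1k tokens

def next_context(context_usage: int, new_tokens: int) -> int:
    """ContextUsage(t+1) = MIN(ContextUsage(t), MAX_CONTEXT_LENGTH) + new input."""
    return min(context_usage, MAX_CONTEXT_LENGTH) + new_tokens

usage = 0
for turn in range(1, 11):
    # new user input + internal tool calls + outputs fed back into the thread
    usage = next_context(usage, 15_000)
    cost = usage / 1000 * INPUT_PRICE_PER_1K
    print(f"run {turn}: {usage:>7} context tokens -> ${cost:.2f} input cost")
    # by run 7 the input side alone costs over $1 per call
```

Once the thread fills the context window, the `min()` cap stops the growth, but every subsequent run still pays for a near-full window of input tokens.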
I read this thread, and it answers all my questions. I hit a case in a debugging session with no context-token limitation on a gpt-4-turbo-preview model that burned all my credit ($20), because subsequent Run calls were charged more than $1 each …
This charging mechanism should be documented in detail somewhere in the OpenAI docs; my whole app workflow needs to be reworked to account for this growing Thread context mechanism.

I need to think about solutions to reduce a Thread’s context, like summarizing the Thread context when it reaches a certain limit and creating a new Thread with that summarized information as a pre-prompt history context.
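One sketch of that rollover idea, with a stubbed summarizer (a real one would be a cheap model call) and a rough 4-characters-per-token estimate, both invented here for illustration:

```python
def estimate_tokens(messages: list[dict]) -> int:
    """Rough token estimate: ~4 characters per token (rule of thumb, not exact)."""
    return sum(len(m["content"]) for m in messages) // 4

def maybe_rollover(messages: list[dict], limit: int, summarize) -> list[dict]:
    """If the thread's estimated tokens exceed `limit`, replace it with a
    fresh thread seeded by a summary message; otherwise keep it as-is."""
    if estimate_tokens(messages) <= limit:
        return messages
    summary = summarize(messages)
    return [{"role": "user", "content": f"Summary of prior conversation: {summary}"}]

# Stub summarizer for illustration; in practice, send the messages to a model.
fake_summarize = lambda msgs: f"{len(msgs)} messages about token billing"

thread = [{"role": "user", "content": "x" * 5_000}]
thread = maybe_rollover(thread, limit=1_000, summarize=fake_summarize)
print(thread[0]["content"])  # the new thread starts from the summary
```

The design choice is the trade-off: you pay one summarization call and lose detail, but every later run carries a few hundred tokens of summary instead of the whole history.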