Hi, I am learning the API so please pardon me if the question is weak. Specifically for the Assistants API (although this might apply to all APIs): I start by giving the Assistant some fairly lengthy instructions. Does OpenAI charge me tokens for the same initial instructions on every thread? Suppose the instructions are 1000 tokens and the assistant executes 100 threads; does that mean I will be charged 100,000 tokens just for the instructions part? And if that's the case, is there a way to make this less token-expensive, something that gets me charged for the same instructions only once? How about other APIs? What I have seen so far is that you have to start from zero instructing ChatGPT every time an interaction starts; with thousands of interactions in a day this can add up real fast and become too expensive. I really hope I got this one wrong, thanks.
The way an AI language model works over the API for chat or new knowledge:
- You have to supply all the data you want it to know, such as the previous chat, in every subsequent independent API call (see the sketch after this list).
- Processing the input tokens up to the point where the AI can generate its own tokens is computationally expensive.
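A minimal sketch of what that looks like with the openai Python package (the model name and message contents here are just placeholder choices): the system instructions and every prior turn you want "remembered" are resent, and billed, on every single call.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The model itself is stateless: your lengthy instructions and any
# prior turns must travel in the request, and be paid for, every time.
messages = [
    {"role": "system", "content": "You are a helpful support bot."},  # your 1000-token instructions, resent each call
    {"role": "user", "content": "Where is my order?"},
    {"role": "assistant", "content": "Could you share your order number?"},
    {"role": "user", "content": "It's 12345."},
]

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # assumed model; any chat model works
    messages=messages,
)
print(response.choices[0].message.content)
```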
Assistants aggravate this:
- They place no limit on how long a conversation (a thread) can grow until the model hits its context length limit.
- They inject additional text instructions, and they load knowledge without regard to its present relevance, etc.
- They can iterate internally multiple times on a task, each iteration a new call to an AI model, even looping on errors.
A costly proposition, with many forum cautionary tales.
So the place to start is the chat completions API method and your own code, where you have direct access to the input of the AI, and can send just the instructions and the amount of chat history needed each turn to maintain the topic or quality you desire.
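As a sketch of that per-turn assembly (the helper name and turn count are my own illustrative choices, not anything from the API): fixed instructions plus only the recent history you decide is worth paying for.

```python
def build_messages(instructions: str, chat_history: list[dict], max_turns: int = 10) -> list[dict]:
    """Assemble each request from fixed instructions plus only the
    most recent turns, rather than the whole conversation."""
    return [{"role": "system", "content": instructions}] + chat_history[-max_turns:]

chat_history = [
    {"role": "user", "content": "knock knock"},
    {"role": "assistant", "content": "Who's there?"},
    {"role": "user", "content": "Mick."},
]
# Each turn, you send only what you choose to pay for:
messages = build_messages("You are a terse assistant.", chat_history)
```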
Hi Jay,
I am looking at the Chat Completions API and I am not sure I understand its advantage over the Assistants API. If I understand correctly, in Chat Completions I would have to send the entire history of the conversation every time there is a reply, since the model does not maintain a history. Meaning, if the assistant and user go back and forth 20 times, I would have sent the full, ever-growing chat history 20 times. That would be a humongous amount of tokens. Example:
user: knock knock
assistant: who's there?
user: Mick.
assistant: Mick who?
In this interaction of 4 lines, the first time the message array size is 1; the second time the size is 2 (3 messages sent so far); the third time the history size is 3 (6 messages so far); the fourth time the messages size is 4, and now the total message history sent is 10! I would have to pay tokens for 10 messages. Could you imagine if the interaction gets to 50 lines?
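The growth is quadratic: if the full history is resent every time, the cumulative number of messages sent over n lines is 1 + 2 + … + n = n(n+1)/2. A quick illustration:

```python
# Cumulative messages resent over n lines when the whole history
# accompanies every request: 1 + 2 + ... + n = n * (n + 1) / 2
for n in (4, 20, 50):
    print(f"{n} lines -> {n * (n + 1) // 2} messages sent in total")
# 4 lines -> 10, 20 lines -> 210, 50 lines -> 1275
```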
In the Assistants API you don't have to send the entire history, as the history is maintained, I assume, by tracking the threadId. Can you please correct me if I am wrong, which I hope I am.
Thanks Jay
In the Assistants API, you don't have to send the conversation, because the assistants backend sends it to the AI model for you, without limit. The AI model is still without memory; other software fills its context with the conversation for each internal API call (and this model-calling can repeat internally multiple times). You have no control over what the agent is loaded with, while you still pay for the input tokens. Your choice of clever context management or budget consciousness is of no concern to this agent.
Was keeping everything the assistant answered important above for understanding the topic?
Now imagine every conversation loaded with every tool call and every tool return, and every past knowledge retrieval, without concern for obsolescence, or without knowing the accuracy of the iterated conversation. That's their product.
And what is the benefit? Imagine you want to make ChatGPT’s interface yourself (or an even better chat bot):
Assistant thread:
Retrieval of chats per customer? NO
Retrieval of chat titles? NO
Retrieval and inspection of each internal tool call? NO
Controlled truncation without altering chat display? NO
Placing multiple roles? NO
Branching conversations with edits? NO
Disabling and re-enabling messages? NO
Altering prior responses? NO
and the NOs can go on…
Control over the tokens of additional unseen instructions? NO
Control over the relevance of knowledge retrieved? NO
…
With your own chat completions, managing conversation length is as easy as `chat_history[-10:]`, or by counting tokens. Put an input-token slider in the interface that adjusts what is sent, greying out what can't be sent, and showing the price dynamically.
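A sketch of the token-counting variant using the tiktoken package (the budget value, model name, and function name are illustrative, not prescribed); the returned total is what such a slider would display:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")  # assumed model

def trim_to_budget(chat_history, max_input_tokens=2000):
    """Keep the most recent messages whose combined content fits the
    budget. The chat format adds a few overhead tokens per message,
    so treat the count as an estimate, not an exact bill."""
    kept, total = [], 0
    for message in reversed(chat_history):
        cost = len(enc.encode(message["content"]))
        if total + cost > max_input_tokens:
            break  # everything older than this is greyed out / dropped
        kept.insert(0, message)
        total += cost
    return kept, total
```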
The Assistants API keeps your entire conversation, whereas the Completions API allows you to prune, summarize, etc.
So if the assistant and user have a 100-message conversation, Assistants will pass all 100 messages every single time. With Completions, you can decide to keep maybe only the last 10, or the most important 10.
One other thing: with Completions, you get back exactly how many prompt tokens and how many completion tokens were used, so you can keep a running tally of how much your usage is costing. The Assistants API does not give you that information; you have to check your usage page to see how much you've spent.
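For example, with the current openai Python package the per-call usage comes back on the response object, so a running tally is only a few lines (the model and message here are placeholders):

```python
from openai import OpenAI

client = OpenAI()
total_prompt = total_completion = 0

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # assumed model
    messages=[{"role": "user", "content": "knock knock"}],
)

# The usage object reports exactly what this call consumed.
total_prompt += response.usage.prompt_tokens
total_completion += response.usage.completion_tokens
print(f"running tally: prompt {total_prompt}, completion {total_completion}")
```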
Until OpenAI gives us more information about the Assistant API usage/costs, I would stick with the Completions API.