Hi, I am using the beta version of the Assistants API.
I implemented it as described in the documentation, but I am seeing very high prompt_tokens usage. I reuse the same thread to run messages, and prompt_tokens gets higher on every run. Please help.
Here is my code to get the response:
private async runThreadByAssistant(threadId: string): Promise<any> {
  // Start a run of the assistant on the existing thread
  const run = await this.openai.beta.threads.runs.create(threadId, {
    assistant_id: 'asst_3123213'
  });
  return run;
}
public async getResponse(threadId: string): Promise<any> {
  let run = await this.runThreadByAssistant(threadId);

  // Poll until the run leaves its transient states
  while (['queued', 'in_progress', 'cancelling'].includes(run.status)) {
    await new Promise(resolve => setTimeout(resolve, 1000)); // wait for 1 second
    run = await this.openai.beta.threads.runs.retrieve(
      run.thread_id,
      run.id
    );
  }

  let response: any;
  if (run.status === 'completed') {
    // Fetch only the newest message, which is the assistant's reply
    const messages = await this.openai.beta.threads.messages.list(
      run.thread_id,
      { order: 'desc', limit: 1 }
    );
    response = messages.data[0].content[0];
  } else {
    response = run.status;
  }
  return response;
}
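For reference, the per-run numbers can be read from the completed run object itself, assuming run.usage is populated in this beta:

if (run.status === 'completed' && run.usage) {
  // usage reports the tokens billed for this single run
  console.log(
    `prompt_tokens=${run.usage.prompt_tokens}, ` +
    `completion_tokens=${run.usage.completion_tokens}, ` +
    `total_tokens=${run.usage.total_tokens}`
  );
}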
There are many ways that Assistants, which is a free-running iterative agent, will consume model tokens beyond just your input and output. You can’t fully see or control what happens inside those opaque operations.
@_j thank you for the information. It is weird that when I create a new thread for each request the token usage is not that high; however, when I use the same thread and run multiple messages, the token usage gets higher on each run.
I have a question: is there a limit on creating threads (other than the request rate limit)? I plan to create a new thread for each of my users’ requests.
The messages are saved in a thread, growing forever as long as the chat session continues, and what gets sent into the AI model on each call can be nearly the maximum context length of the specified model. You are not saving money by letting Assistants manage the conversation; you are maximizing the expense.
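If you do stay on Assistants, newer API versions expose run-level controls that cap how much of the growing thread gets resent; assuming your SDK version supports truncation_strategy and max_prompt_tokens on run creation, something like this keeps the input bounded:

const run = await openai.beta.threads.runs.create(threadId, {
  assistant_id: assistantId, // your assistant id
  // Only resend the most recent thread messages instead of (nearly) the full context window
  truncation_strategy: { type: 'last_messages', last_messages: 10 },
  // Upper bound on input tokens per run; runs that hit it may end with an 'incomplete' status
  max_prompt_tokens: 4000
});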
It is almost like they said “we made preloading of the input state take only milliseconds regardless of length, by degrading the model attention mechanism, now how can we maximize billing for something that costs little?”
Unless you have a specific use case that you simply can’t code yourself, chat completions and direct model interaction are preferable.
There is no stated limit on the number of threads that can be created per organization. But you can’t really “create a new thread for each request” unless you want to discard everything the AI said previously and give your chatbot no memory.
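With chat completions you keep the memory yourself and decide exactly how much of it is resent. A minimal sketch, assuming the official openai Node SDK; the model name and the ten-turn window are just placeholders:

import OpenAI from 'openai';

const openai = new OpenAI();
const history: { role: 'user' | 'assistant'; content: string }[] = [];

async function ask(question: string): Promise<string> {
  history.push({ role: 'user', content: question });

  // Resend only the system prompt plus the most recent turns, so input tokens stay bounded
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o-mini', // placeholder model name
    messages: [
      { role: 'system', content: 'You are a helpful assistant.' },
      ...history.slice(-10)
    ]
  });

  const answer = completion.choices[0].message.content ?? '';
  history.push({ role: 'assistant', content: answer });
  return answer;
}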
It’s not transparent how the tokens are calculated when using an Assistant. I uploaded a 300KB JSON file, containing title, URL and context attributes, to the assistant and asked two questions:
Is the product GDPR compliant? (868 tokens)
What are the available plans and features? (33,253 tokens)
33K tokens is almost $1 per message; that’s more expensive than satellite messaging.
I compared the token usage and answer quality by uploading embeddings of the same 300KB JSON to a vector DB, retrieving the relevant context from a vector search, and passing it to chat completions for the answer:
Is the product GDPR compliant? (2,109 tokens)
What are the available plans and features? (2,217 tokens)
The vector-search approach of course uses far fewer tokens, since chat completions only has to refine the context returned by the vector search. However, the quality of the answers is not as smart as the Assistant’s.
I might end up choosing the latter option: building our own assistant and storing the embeddings in a third-party vector DB. The answers are not as great as the Assistant’s, but at least the token usage is somewhat reasonable.
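Roughly the flow I described, as a sketch; the vector DB client here is a stand-in for whatever store you use, and the model names are placeholders:

import OpenAI from 'openai';

const openai = new OpenAI();

// Stand-in for your vector DB client (Pinecone, pgvector, etc.)
declare const vectorDb: {
  search(vector: number[], opts: { topK: number }): Promise<{ text: string }[]>;
};

async function answerWithVectorSearch(question: string): Promise<string> {
  // 1. Embed the question (embedding model name is a placeholder)
  const embedded = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: question
  });

  // 2. Pull the closest chunks of the 300KB JSON from the vector DB
  const matches = await vectorDb.search(embedded.data[0].embedding, { topK: 5 });
  const context = matches.map(m => m.text).join('\n---\n');

  // 3. Let chat completions answer from the retrieved context only
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o-mini', // placeholder
    messages: [
      { role: 'system', content: `Answer using only this context:\n${context}` },
      { role: 'user', content: question }
    ]
  });
  return completion.choices[0].message.content ?? '';
}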
For me, the problem was solved when I used the new version:
// createAndPoll starts the run and polls it until it reaches a terminal state
let run = await openai.beta.threads.runs.createAndPoll(
  thread.id,
  {
    assistant_id: assistant.id,
    instructions: "Please address the user as Jane Doe. The user has a premium account."
  }
);
I don’t think so; I am using an Assistant with streaming, plus retrieval/function calling. As @_j mentioned, creating a new thread for every request will lower your cost, but it will leave the assistant with no memory.
There is no reason to use an Assistant this way. I can use chat completions + function calling or embeddings to achieve the same purpose at an even lower cost.