Assistants API token usage and pricing breakdown clarification

Like many others, I have been confused about how the Assistants API is implemented and how it is billed. At first the feature looks like a game changer that eliminates a lot of the backend coding and setup required to develop LLM applications, but with its current economics it is not usable in production.

Pricing / Billing

I wanted to find out more about how assistant messages are billed, so I ran a few tests. Here is what I inferred (please correct me if I’m wrong).

I created a new thread T1 for an assistant A1 and sent the same message repeatedly in that thread, generating an output each time. My goal was to compare the token usage shown in the billing dashboard after each update and gauge the usage (I used an account on which only this test was recording usage, to eliminate any discrepancies).

It took some time because I had to wait for the billing dashboard to refresh after every call, but I made 4 calls and compared the input tokens (now called context tokens) and output tokens for each. Here is what I found.

I’m aware that the number of generated tokens decreases with each call. That was expected for this test, and I verified that the actual natural-language responses were all valid.

As you can see, the first call used 143 input tokens and the count kept increasing linearly from there, which clearly indicates that the entire history is passed every time and we are billed for it.

Assistant Instruction: 26 Tokens
Thread Message: 8 Tokens

Call 1

The instruction and message total 34 tokens, but the input came out to 143 tokens (I’m guessing the remaining 109 tokens are OpenAI’s own instructions and special tokens).

Call 2

After call 1, we had 242 tokens of history (input + output). In the next call, however, the input was 256 tokens, which is 14 tokens more. 8 of those are my message, which leaves 6 extra tokens. After playing around with the OpenAI tokenizer, my best guess is that this is the formatting:

user:
assistant:

The above comes out to 4 tokens in the OpenAI tokenizer.

The remaining 2 tokens would be the instruction/message start and stop tokens.

Calls 3 and 4

Using the same logic, the extra buffer of 6 tokens remains the same for both calls.
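The arithmetic above can be written as a small model. This is only a sketch of the inference in this post, not OpenAI's documented billing logic: it assumes each call's input is the previous input plus the previous output, plus the new 8-token message and the 6-token formatting overhead. The function name `next_input_tokens` is made up for illustration.

```python
def next_input_tokens(prev_input: int, prev_output: int,
                      message_tokens: int = 8, overhead_tokens: int = 6) -> int:
    """Predicted input (context) tokens for the next call on the same thread.

    Assumption: the whole history (previous input + previous output) is
    resent, plus the new message and a fixed formatting overhead.
    """
    return prev_input + prev_output + message_tokens + overhead_tokens


# Call 1 from the test: 143 input tokens and 242 tokens of history afterwards,
# so the output of call 1 was 242 - 143 = 99 tokens.
call1_input = 143
call1_output = 242 - call1_input

call2_input = next_input_tokens(call1_input, call1_output)
print(call2_input)  # 256, the input figure observed for call 2
```

If the model holds, feeding each call's measured input and output back into the function should reproduce the next dashboard figure.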

Suggestion

Instead of passing the entire message history as context, could we maybe implement RAG on the history and pass in only the relevant parts? This would save a lot of tokens on every call, with a small trade-off in performance that I’m guessing everyone would be willing to make given the current pricing (you can try working out projections for your estimated usage from the approach above).
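To make the suggestion concrete, here is a minimal sketch of picking only the relevant past messages. It is not how the Assistants API works internally. It swaps real embeddings for a toy word-overlap (Jaccard) score so that it runs with the standard library alone, and the names `words` and `select_relevant` are hypothetical.

```python
import re


def words(text: str) -> set[str]:
    """Lowercased word set; a toy stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_relevant(history: list[str], query: str, k: int = 2) -> list[str]:
    """Return up to k past messages most similar to the query, in original order."""
    q = words(query)
    scored = []
    for idx, msg in enumerate(history):
        w = words(msg)
        union = q | w
        score = len(q & w) / len(union) if union else 0.0
        scored.append((score, idx))
    # Highest score first; ties go to the earlier message.
    top = sorted(scored, key=lambda s: (-s[0], s[1]))[:k]
    # Drop messages with no overlap at all, then restore chronological order.
    keep = sorted(idx for score, idx in top if score > 0)
    return [history[i] for i in keep]


history = [
    "user: What is the refund policy for annual plans?",
    "assistant: Annual plans can be refunded within 30 days.",
    "user: Which Python version do you support?",
    "assistant: Python 3.9 and later are supported.",
]
query = "Can I still get a refund on my annual plan?"
context = select_relevant(history, query)
print(context)
```

Only the selected messages would then be sent along with the new question, instead of all four. A real implementation would use embeddings and a vector store, and would probably always keep the last few turns as well.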


Thanks for doing this experiment. I noticed that my run with a small test Assistant used 109k tokens, and when I spent 2 or 3 hours using AutoGen previously it wasn’t nearly that costly. Even ChatDev cost me fewer tokens.
I think your idea of using RAG or some kind of system to keep from sending the entire chat every time needs to be implemented.


I noticed some weird stuff on the GPT-3.5-turbo model where messages were re-generated in a loop several times. Here is my issue / bug report. Maybe you want to check your usage and see if you have the same issue:

I have been having issues with the Assistants API as well. I ran it twice a few minutes ago, asking the same question with the same instructions:

First run did not return anything (error after 30 seconds) and used 40k input tokens… Yes 40,000. With GPT-3.5 which doesn’t have that big of a context window to begin with.

Second run returned an accurate message in 4 seconds and used 1500 tokens.


I’m in the same situation as you.
I’m using the GPT-4 Turbo model.
I asked a question and the input tokens turned out to be as high as 40k.

That exact jump in token usage happened to me too: one response randomly used 40k tokens and then the next response (with the same input) dropped to 3k.

For the 40k token response, the usage reported from the list run step API call only showed ~4k tokens used in total across two run steps. Not sure where those 36k tokens are hiding.
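For anyone who wants to repeat this comparison, here is a rough sketch of adding up the per-step usage and comparing it with the dashboard. It assumes each run step has been converted to a dict with a `usage` entry holding `prompt_tokens`, `completion_tokens` and `total_tokens` (or `None` while the step is still in progress). The step values below are placeholders chosen to total about 4k, not the real numbers from my run, and `sum_step_usage` is a made-up helper name.

```python
def sum_step_usage(steps: list[dict]) -> dict:
    """Add up token usage across run steps, skipping steps without usage."""
    totals = {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}
    for step in steps:
        usage = step.get("usage") or {}  # usage may be None or missing
        for key in totals:
            totals[key] += usage.get(key, 0)
    return totals


# Placeholder values for two run steps (illustrative only).
steps = [
    {"id": "step_1", "usage": {"prompt_tokens": 2300, "completion_tokens": 200, "total_tokens": 2500}},
    {"id": "step_2", "usage": {"prompt_tokens": 1400, "completion_tokens": 100, "total_tokens": 1500}},
]

reported = sum_step_usage(steps)
dashboard_total = 40_000  # the figure shown on the usage dashboard
unexplained = dashboard_total - reported["total_tokens"]
print(reported["total_tokens"], unexplained)
```

A large `unexplained` value is the gap described above: tokens that are billed but do not show up in any run step.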

Has the encoding of the GPT-4-1106-preview model changed in the update on the 25th? Since then, the number of tokens in my prompts has started to fluctuate.

Same here. It was very pricey. I’m wondering if it’s because I’m starting a new thread with a single message every time I call the Assistants API (where it returns parameter info for one function). I’m just messing around at the moment, but this week I’m going to see if it’s more efficient to just keep the thread open and send messages to it.

My stats for today, and I only used it for about 10 minutes:
$6
170 api requests
9,000 generated tokens
176,000 context tokens
i.e. about 185,000 total tokens used

Can anyone confirm that re-using the same thread and sending messages to it is much more efficient cost-wise than opening a new thread each time? It seems obvious, but I’m just making sure.
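One way to reason about this is to project both options from the figures in the first post. The sketch below is only a back-of-the-envelope model, not a measurement: it assumes a first call of 143 input tokens, a constant 99-token reply (the actual replies in that test got shorter), and 8 + 6 tokens added per new message. The function names are hypothetical.

```python
def reused_thread_inputs(calls: int, first_input: int = 143, output_tokens: int = 99,
                         message_tokens: int = 8, overhead_tokens: int = 6) -> list[int]:
    """Input tokens per call when every call resends the growing thread history."""
    inputs = []
    current = first_input
    for _ in range(calls):
        inputs.append(current)
        # The next call carries this call's input and reply, plus the new message.
        current += output_tokens + message_tokens + overhead_tokens
    return inputs


def fresh_thread_inputs(calls: int, first_input: int = 143) -> list[int]:
    """Input tokens per call when every call starts a new single-message thread."""
    return [first_input] * calls


calls = 4
reused = reused_thread_inputs(calls)
fresh = fresh_thread_inputs(calls)
print(reused, sum(reused))
print(fresh, sum(fresh))
```

Under these assumptions a reused thread costs more input tokens per call as it grows, so it only pays off when the model actually needs the earlier messages. It does not explain the 40k spikes reported above.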

Were you able to find any reason or solution? I am facing the same issue too.

No, I haven’t found any solution for this.

But when I asked a different question, the input tokens came to 5k rather than 40k.

Because OpenAI’s Assistants API is closed source, we have no idea how they chunk or embed the data.

I rebuilt my whole application with LangChain, Pinecone, and the regular API instead of the Assistants API. It is significantly faster, more accurate, and uses about 60-70% fewer tokens on average.

IMHO, the Assistants API says it is in BETA, and it definitely is in BETA: nowhere near ready for prime time.
