Like many others, I was confused about how the Assistants API is implemented and how it is billed. At first the feature looks like a game changer, eliminating much of the backend coding and setup needed to build LLM applications, but with its current economics it is not usable in production.
I wanted to find out exactly how assistant messages are billed, so I ran a few tests. These are my inferences (please correct me if I’m wrong).
I created a new thread T1 for an assistant A1 and sent the same message repeatedly in that thread, generating an output each time. My goal was to compare the token usage on the billing dashboard after each update and gauge the usage per call. (I used an account where only this test was recording usage, to eliminate any discrepancies.)
It took some time because I had to wait for the billing dashboard to refresh after every call, but I made 4 calls and compared the input tokens (now called context tokens) and output tokens for each. Here is what I found.
I’m aware that the number of generated tokens decreases with each call; that was expected for this test, and I verified that the actual natural language responses were all valid.
As you can see, the first call used 143 input tokens, and the count kept increasing linearly from there, clearly indicating that the entire history is passed on every call and that we are billed for it.
Assistant Instruction: 26 Tokens
Thread Message: 8 Tokens
The instruction and message together total 34 tokens, but the billed input was 143 tokens. I’m guessing the remaining 109 tokens are OpenAI’s own instructions and special tokens.
After call 1, the history held 242 tokens (input + output). In the next call, however, the input was 256 tokens, 14 more than the history. My new message accounts for 8 of those, which leaves 6 extra tokens. After playing around with the OpenAI tokenizer, my best guess is that they come from the formatting.
The above comes out to 4 tokens using the OpenAI tokenizer.
The remaining 2 tokens would be the instruction/message start and stop tokens.
Using the same logic, the extra buffer of 6 tokens stays the same for both calls.
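The accounting above can be written down as a small Python sketch. The constants are the figures reported in this post; the function names and the idea of projecting with a fixed reply length are my own assumptions for illustration, not anything the API exposes, and real replies will vary in length.

```python
# Figures reported in the post (billing dashboard + tokenizer counts).
INSTRUCTION_TOKENS = 26
MESSAGE_TOKENS = 8
CALL1_INPUT = 143      # billed input tokens for call 1
CALL1_HISTORY = 242    # input + output after call 1
CALL2_INPUT = 256      # billed input tokens for call 2

# Tokens in call 1 that are neither my instruction nor my message.
fixed_overhead = CALL1_INPUT - (INSTRUCTION_TOKENS + MESSAGE_TOKENS)

# Output tokens of call 1, derived from the history total.
call1_output = CALL1_HISTORY - CALL1_INPUT

# Extra tokens per new message beyond the message text itself
# (guessed in the post to be formatting plus start/stop tokens).
per_turn_overhead = CALL2_INPUT - CALL1_HISTORY - MESSAGE_TOKENS


def next_input_tokens(prev_input, prev_output,
                      message_tokens=MESSAGE_TOKENS,
                      overhead=per_turn_overhead):
    """Input tokens of the next call if the whole history is resent."""
    return prev_input + prev_output + message_tokens + overhead


def project_input_tokens(n_calls, assumed_output_tokens):
    """Hypothetical projection: assumes every reply has the same length."""
    inputs = [CALL1_INPUT]
    for _ in range(n_calls - 1):
        inputs.append(next_input_tokens(inputs[-1], assumed_output_tokens))
    return inputs


if __name__ == "__main__":
    print(fixed_overhead, call1_output, per_turn_overhead)
    print(next_input_tokens(CALL1_INPUT, call1_output))
    print(sum(project_input_tokens(20, assumed_output_tokens=100)))
```

Because each call resends everything before it, the per-call input grows linearly and the total billed input grows roughly with the square of the number of calls, which is why long threads get expensive quickly.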
Instead of passing the entire context, could we implement RAG on the message history and pass only the relevant parts as context? This would save a lot of tokens on every call, with a small trade-off in performance that I’m guessing everyone would be willing to make given the current pricing. (You can work out projections for your own estimated usage from the approach above.)
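To make the idea concrete, here is a minimal sketch of retrieval over message history in Python. It does not call any OpenAI API. `select_context` is a hypothetical helper, and the bag-of-words cosine similarity is only a stand-in for whatever embedding-based retrieval a real implementation would use. It keeps the most recent messages and adds the older ones most similar to the new message.

```python
import math
import re
from collections import Counter


def _bag_of_words(text):
    """Lowercased word counts; a crude stand-in for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def select_context(history, new_message, top_k=2, keep_last=1):
    """Pick a subset of `history` (list of strings) to send as context.

    Keeps the last `keep_last` messages and the `top_k` older messages
    most similar to `new_message`, returned in their original order.
    """
    query = _bag_of_words(new_message)
    split = max(len(history) - keep_last, 0)
    older = list(range(split))
    recent = list(range(split, len(history)))
    ranked = sorted(
        older,
        key=lambda i: _cosine(query, _bag_of_words(history[i])),
        reverse=True,
    )
    chosen = sorted(ranked[:top_k] + recent)
    return [history[i] for i in chosen]


if __name__ == "__main__":
    history = [
        "My order number is 12345.",
        "I like hiking on weekends.",
        "What is the weather today?",
        "Thanks, talk later.",
    ]
    print(select_context(history, "Where is my order?", top_k=1))
```

With a selection step like this, the input size per call is bounded by `top_k + keep_last` messages instead of growing with the thread, at the cost of possibly dropping context the model needed.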