Assistants API Pricing and Token Usage

Also interested in this.
The feature seems really nice, but it's also a bit of a black box. I can't really deploy it like this, since costs could spiral out of control.

Accompanying the roll-out was the apparent removal of the easy-to-retrieve token usage reports that showed call counts per five-minute window in your account. All you can see now is daily token totals.

Take a model that costs a tenth of a penny per 1,000 tokens. People have had the assistant agent go crazy with loops and browsing data, racking up over $1 per call. That's exactly the kind of product where you'd obscure the usage data.

An AI that is “our fastest, lowest computation, lowest cognition, chat model yet” is given your bank password.

I agree they need to give us a better view into:

- token utilization
- RAG utilization
- costs of I/O, retrieval, etc.

GPT being stateless and needing to re-send the entire conversation history on every call makes cumulative cost grow quadratically with the length of the conversation. Hopefully one day we can have a stateful GPT where context is stored on the server side, at least for a time.
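As a back-of-the-envelope illustration of that growth (all numbers are hypothetical assumptions, not measurements):

```python
# Sketch: cumulative input tokens when every call re-sends the full history.
TOKENS_PER_TURN = 200        # assumed average size of one user+assistant turn
PRICE_PER_1K_INPUT = 0.01    # GPT-4 Turbo input price at launch, $ per 1k tokens

total_input_tokens = 0
for turn in range(1, 51):
    # Call number `turn` re-sends all previous turns plus the new message,
    # so its input size is proportional to `turn`.
    total_input_tokens += turn * TOKENS_PER_TURN

print(total_input_tokens)                              # 255,000 tokens after 50 turns
print(total_input_tokens / 1000 * PRICE_PER_1K_INPUT)  # ~$2.55 of input for one chat
```

Double the conversation length and the input cost roughly quadruples.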

100% agree … Good analysis Ryan.

It's unclear how many tokens are used for retrieval context. So I'm wondering: if a 30,000-word PDF is uploaded for retrieval, will the entire 30,000 words be used in each Assistants API call? (That would really increase the cost of every call.)

I can’t say I have a definitive answer, but in the example I posted above, I asked the model to cite the specific sections where it found the previous answer. It responded that it needed to read through the entire (750 page) document again to do that. I don’t know if it read through the entire document, or just up to the point where it found the citations, but it is logical to assume it charged tokens for however much it had to re-read. If it has to do this for every subsequent question it is asked, the larger your document, the more expensive each one of those questions is going to be. Yikes!

The docs mention that they chunk documents for retrieval.

Is it bad practice to take each response and upload it as a File for later RAG?

Really appreciate the feedback here, everyone!

We certainly plan to expose more details about token usage and billing on Runs and in the usage page of the dashboard. We have plenty to improve during this beta period (streaming, usage, so much more!), but wanted to get the Assistants API in your hands early.

Please keep the feedback coming, and we’ll share more as these features ship.

Yeah, I was also trying to understand the actual cost per run but couldn't find any information in the documentation or on the pricing page. Hopefully it gets updated soon!

It would also be better if I could specify maximum input and output token limits for cases where I don't need the 128k context window but still want to leverage GPT-4 Turbo. With 128k, once messages fill the whole window, the cost will be insane.
Thanks atty!

That's why I've started to think that using the Assistants API means giving up any control. I was really excited about this roll-out, but now I will stick to the standard Chat Completions API, where I'm in control of token usage and can manage the conversation context by summarizing older messages instead of holding all of them in the context window (see the sketch below). It seems that using an Assistant gives quite a low ratio of benefits to new problems.
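For anyone taking the same route, here is a minimal sketch of that summarize-the-oldest-turns approach. It assumes the openai v1 Python SDK and tiktoken; the token budget, models, and summarization prompt are illustrative choices, not official guidance:

```python
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4-class models

TOKEN_BUDGET = 4_000  # hypothetical cap, deliberately far below the 128k window


def count_tokens(messages):
    # Rough estimate: ignores the few tokens of per-message framing.
    return sum(len(enc.encode(m["content"])) for m in messages)


def compress_history(messages):
    """Fold the oldest turns into a model-written summary until the
    history fits the budget, instead of re-sending everything."""
    while count_tokens(messages) > TOKEN_BUDGET and len(messages) > 4:
        oldest, messages = messages[:4], messages[4:]
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in oldest)
        summary = client.chat.completions.create(
            model="gpt-3.5-turbo",  # a cheap model is fine for summarizing
            max_tokens=200,         # hard cap on the summary's size
            messages=[{
                "role": "user",
                "content": "Summarize this conversation fragment, keeping "
                           "any facts needed later:\n\n" + transcript,
            }],
        ).choices[0].message.content
        messages = [{"role": "system",
                     "content": "Summary of earlier turns: " + summary}] + messages
    return messages
```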

Same here. I think it's good if you're just starting out with an MVP, but if you already have things like summarization and a vector DB in place, there doesn't seem to be much benefit in using this, unless we're missing something. I hope the team comes up with answers and solutions that are actually an upgrade over what most of us already have set up.

Regarding token utilization, you can already see that information in analytics; at least you can see the split between 'Context tokens' and 'Generated tokens.'

The usage page has been significantly downgraded. On purpose?

At best you can get per-day reports, and only after you learn to go to the Activity view and select the hidden models. Usage costs can show up many hours later.

Was my fine-tune job free? No, it just took another day to see what it cost.

Same here: the API doesn't return any details about token usage. Waiting for a fix.

I thought I was going crazy because no usage was returned when a Run reached 'completed.' While token usage for the context window does look huge relative to token usage for generation, the per-token cost for context is fairly reasonable.

From what I can tell, it seems to be dumping the contents of files directly into the context window up to a certain size (as Sam shared in the keynote). I haven't seen any evidence of embedding/RAG-style retrieval in my tests; every call seems to use a large amount of tokens.

I'm thinking a better approach for me moving forward will be to use traditional chat calls and mix in Assistant calls when it comes time to add in very domain-specific data…

Or better yet, use an Assistant as a RAG step to generate the specific data needed from a dataset, store that temporarily in a DB (or in a Thread), and inject it into prompts as custom context.
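Here is a minimal sketch of that hybrid pattern, assuming the openai v1 Python SDK and the beta Assistants endpoints as they shipped at DevDay; the file name, prompts, and where you cache the extract are all hypothetical, and error handling is omitted:

```python
import time
from openai import OpenAI

client = OpenAI()

# One-time, expensive step: let the Assistant's retrieval tool read the document.
doc = client.files.create(file=open("product_manual.pdf", "rb"), purpose="assistants")
assistant = client.beta.assistants.create(
    model="gpt-4-1106-preview",
    tools=[{"type": "retrieval"}],
    file_ids=[doc.id],
)
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="Extract the warranty terms and return policy as bullet points.",
)
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)
while run.status in ("queued", "in_progress"):
    time.sleep(2)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

# Most recent message first; grab the extracted text and cache it (DB, cache, etc.).
extracted = client.beta.threads.messages.list(thread_id=thread.id).data[0].content[0].text.value

# Cheap, repeated step: inject the cached extract into ordinary chat calls.
answer = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Known product facts:\n" + extracted},
        {"role": "user", "content": "Can I return an opened item after 20 days?"},
    ],
)
print(answer.choices[0].message.content)
```

This way you pay retrieval-scale token costs once per document instead of on every user question.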

I'm really hoping they release more documentation about their retrieval system. I have no doubt they will keep upgrading it, but I'd really like to know at least their philosophy so I know how I should be preparing my documents.

At this point I'm staying away from it. The loss of control and insight is way too much.

All of the OP's questions are great. One more, related to instruction token usage:

with the new API, I no longer need to send the instructions in every message, which is a great improvement. But do the instructions still count toward token usage on every single message?

Yes, we discuss it here.

Wow, the assistants are great, but the context token usage is out of this world. I used almost 300k tokens for a short chat with three fairly small documents (in the 3-5 MB range). Good idea, but I will shelve it for now, as it is completely unaffordable for any kind of real-world use case.
