Assistant API tokens usage

There is no usage information when retrieving a run/message/assistant with the Assistants API, unlike the Chat Completions API.
I wanted to know whether there is a way to get the cost/usage, or whether you are working on that.

Thank you very much! :slight_smile:

2 Likes

There is no method to see how much you’ve been charged per run.

The only report that you get is daily, by model, not exclusive to assistants.

Hiding how much this actually costs in the usage page was rolled out alongside assistants.

That should be the first warning.

+1

It would be beneficial to have usage metrics per assistant, per thread, per message, and per run request, similar to the call reply in the completion API. My objective is to divide usage per assistant and prevent the sending of messages once a specified usage limit (in tokens) is reached. This is a crucial API feature for many of us, and it should ideally be implemented already.
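The cut-off behavior described above can be approximated client-side today by recording each run's usage yourself. A minimal sketch, assuming you capture the "usage" dict from every completed run (the class and limit value here are hypothetical, not part of the API):

```python
# Sketch of a per-assistant token budget. You record the "usage" object
# returned with each completed run, then refuse to send new messages
# once a cap is hit. The class and the cap are assumptions, not API features.
from collections import defaultdict


class TokenBudget:
    """Tracks total_tokens per assistant and blocks sends over a cap."""

    def __init__(self, limit_per_assistant: int):
        self.limit = limit_per_assistant
        self.used = defaultdict(int)

    def record_run(self, assistant_id: str, usage: dict) -> None:
        # usage is the dict attached to a completed run, e.g.
        # {"prompt_tokens": 123, "completion_tokens": 456, "total_tokens": 579}
        self.used[assistant_id] += usage.get("total_tokens", 0)

    def may_send(self, assistant_id: str) -> bool:
        return self.used[assistant_id] < self.limit


budget = TokenBudget(limit_per_assistant=1000)
budget.record_run("asst_abc123", {"total_tokens": 579})
print(budget.may_send("asst_abc123"))  # True: 579 < 1000
budget.record_run("asst_abc123", {"total_tokens": 500})
print(budget.may_send("asst_abc123"))  # False: 1079 >= 1000
```

This only divides usage per assistant; per-thread or per-run splitting would just mean keying the dict differently.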

Are you currently working on this feature?

Thank you.

+1.

It would be helpful to see usage per assistant, even if not by thread or run.

1 Like

For any developers currently building an app on the API and looking to bring it to market: there is no way we can actually charge our customers based on usage if we cannot calculate the costs per assistant or per thread.

It is a big blocker in terms of building actual commercial solutions for customers.

1 Like

We love the new assistants API, but being able to meter usage is a crucial feature we’d need for us to be able to adopt it.

1 Like

I was thinking about calculating the charge myself after each run, using the input/output prices defined for the model. Wouldn't that be accurate?

You could calculate the monetary cost, knowing the model behind the assistant ID and its input and output pricing (divided by 1,000 or 1,000,000, depending on how the price is quoted).

Since the time this topic was created, and past the obfuscated costs of early assistants, you now get usage billing information back in the run response:

{
  "id": "run_abc123",
  "object": "thread.run",
  "created_at": 1698107661,
  "assistant_id": "asst_abc123",
  "thread_id": "thread_abc123",
  "status": "completed",
    ...
  "usage": {
    "prompt_tokens": 123,
    "completion_tokens": 456,
    "total_tokens": 579
  },
...
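Turning that usage block into a dollar figure is then a one-liner per run. A minimal sketch; the prices in the table below are assumptions taken from the gpt-3.5-turbo-1106 figures discussed later in this thread, so verify them against the current pricing page:

```python
# Convert a run's "usage" dict into a cost in dollars.
# Prices are per 1M tokens and are assumptions; check the pricing page.
PRICE_PER_1M = {"gpt-3.5-turbo-1106": {"input": 1.00, "output": 2.00}}


def run_cost(model: str, usage: dict) -> float:
    p = PRICE_PER_1M[model]
    return (usage["prompt_tokens"] * p["input"]
            + usage["completion_tokens"] * p["output"]) / 1_000_000


usage = {"prompt_tokens": 123, "completion_tokens": 456, "total_tokens": 579}
print(f"${run_cost('gpt-3.5-turbo-1106', usage):.6f}")  # $0.001035
```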

The actual token costs of using Assistants will overflow a 16 bit unsigned int…

Are the prices at Pricing | OpenAI the ones used for assistants?
Let's say a user is chatting with

gpt-3.5-turbo-1106
input $1.00 / 1M tokens
output $2.00 / 1M tokens

after a run I will do the math on

{
  prompt_tokens: 1288,
  completion_tokens: 527,
  total_tokens: 1815,
  prompt_token_details: { cached_tokens: 0 }
}

Assuming prompt_tokens = input and completion_tokens = output: if the user (hypothetically) uses 1M prompt_tokens and 1M completion_tokens, should I deduct $3 from their balance?
Or are there more computations that I am not aware of, so that if I do this I will end up broke :sweat_smile:

The request can fail, in which case you get no usage report, or an inaccurate one.

This can happen if you hit a model rate limit while the assistants backend continues calling the model internally, or from some general malfunction where the model was used but the call wasn't received and accounted for properly by the backend.

So besides the value you add on top, you should allow some buffer for failures and estimate them. That is especially important when streaming chat completions: the call can time out on you despite the model having run, or the o1 model may give you a content policy error despite billing you for its thinking. Even when you cancel a streaming API call, you have to run a token counter over the input and whatever you partially received to make a guess, because the usage report comes last.
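One simple way to build in that buffer is a safety multiplier on the measured cost before you bill the customer. The 10% factor here is an arbitrary assumption; tune it to your observed failure and under-reporting rate:

```python
# Apply a safety multiplier when metering customers, so failed or
# unreported runs don't come out of your margin. The 1.10 factor is
# an assumption; tune it to your observed failure rate.
def billable_cost(measured_cost: float, buffer: float = 1.10) -> float:
    return measured_cost * buffer


print(billable_cost(0.002342))  # ~10% over the measured cost
```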

If you are using assistants, there are additional, completely unpredictable, unreturned costs: $0.03 code interpreter sessions per thread that time out whenever they want after an hour or so, and vector stores billed per GB per day that may involve user-uploaded files expiring after 7 days (or not), with no usage report for any of it.

If you make similar thread calls in close proximity, you may also get a 50% cache discount on cached prompt tokens. You can subtract it in your calculation after the fact, but you can't know it ahead of time.
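Applying that discount after the fact looks roughly like this. A sketch, assuming the usage dict carries prompt_token_details.cached_tokens as in the terminal output above, a 50% discount on cached input tokens, and the $1.00 / $2.00 per-1M prices from earlier in the thread:

```python
# Cost calculation that credits the 50% cache discount on cached
# prompt tokens. Prices and the discount rate are assumptions taken
# from this thread; verify them against the current pricing page.
def cost_with_cache(usage: dict,
                    in_price: float = 1.00,
                    out_price: float = 2.00) -> float:
    cached = usage.get("prompt_token_details", {}).get("cached_tokens", 0)
    uncached = usage["prompt_tokens"] - cached
    return (uncached * in_price
            + cached * in_price * 0.5          # cached input at half price
            + usage["completion_tokens"] * out_price) / 1_000_000


usage = {"prompt_tokens": 1288, "completion_tokens": 527,
         "prompt_token_details": {"cached_tokens": 1000}}
print(f"${cost_with_cache(usage):.6f}")  # $0.001842 vs $0.002342 uncached
```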