Assistants API / costs / where do I find my token consumption in assistants|messages|threads

Hello there

I’m using the Assistants API in the sandbox, in a very typical retrieval one-shot logic.
I have written the instructions (aka the system prompt).
I have uploaded a simple .txt file that contains about 2000 lines (approx 36k tokens).
The system prompt basically says something like “I’m going to give you an input; now identify the line in my text file that best matches the input”. You can think of it as a silly classification game with ~2000 classes.

I’m happy with the outcome. It just works. Now, why do I use an assistant instead of just sending the entire big long list in a classic system prompt and using the completions endpoint?

  1. I want to learn/discover how to use the Assistants API
  2. I wanted to check how good it is compared to a classic “completion” approach
  3. the cost

Using this about 15 times today cost me about $2.13, so approx 213/15 ≈ 14 cents per run (using GPT-4 Turbo).

With the previous chat completions API, in Python I would typically use something like:

import datetime
import openai

client = openai.OpenAI()
t0 = datetime.datetime.now()
response = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[{"role": "system", "content": system_prompt},
              {"role": "user", "content": user_input}],
    response_format={"type": "json_object"},  # or {"type": "text"}
)
t1 = datetime.datetime.now()
res = response.choices[0].message.content
input_tokens = response.usage.prompt_tokens
output_tokens = response.usage.completion_tokens
total_tokens = response.usage.total_tokens
duration = (t1 - t0).total_seconds()

and then I would keep the token counts and the model used so I could tell how much money a given run cost me.
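That bookkeeping can be sketched like this. The per-1K-token prices below are placeholders I made up for illustration, not official numbers; check the current pricing page before relying on them:

```python
# Hypothetical per-1K-token prices in USD (placeholders, not official values).
PRICES = {
    "gpt-4-turbo-preview": {"input": 0.01, "output": 0.03},
}

def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one run from its usage counts."""
    p = PRICES[model]
    return (input_tokens / 1000) * p["input"] + (output_tokens / 1000) * p["output"]

# e.g. a ~36k-token prompt plus a short answer:
cost = run_cost("gpt-4-turbo-preview", 36000, 50)
```

With the chat completions API the two token counts come straight from `response.usage`, which is exactly what the Assistants API was not exposing.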

But with the new API, I’m kind of lost: I don’t know / understand / can’t find how to tell how much money a given run costs me.

How am I supposed to do this?

There is no usage returned. The account usage page was also changed to only give daily per-model statistics rounded to the whole cent, the very day that assistants debuted. It is clearly intentional that you do not know but still pay.

You’d think that you could use the by-the-minute rate limits to probe your remaining rate-limit tokens in the response headers, but an assistant can make multiple calls and spin its wheels for quite a while, and the quota refreshes by tokens per millisecond, so that workaround idea is also elusive.

Users staying away purely because of these policies should be OpenAI’s reward.


Absolutely agree here. Just say TBNT until OpenAI steps up and lets people know what they are paying for and how much they are paying. It certainly does seem intentional to just let assistants do whatever without the dev knowing what and how much. Not sure if it is a scam, but it is just bad business and a bit shady.
I do like that they are trying to streamline things, but I’m happy to store my chat history and files etc. on my own server and use chat completions. At least there I know the token counts and pretty much everything that is happening, whereas with the Assistants API I am totally in the dark.
Say no go!


Well, for me… this is nearly a no-go.

And that gave me the motivation to learn and experiment with embeddings. And I kind of like what I have seen so far. It’s more work, but much more control and understanding of what’s going on.

I just need to find a way to slice and dice the documents into the right “chunk size” before I calculate the embeddings. Any clue about the right way to split a given document? Any rule of thumb that everybody in the AI / machine learning game knows but I happen to be missing?

You should be able to find lots of tutorials about chunking. Especially if you’re using Python. Best of luck, you’ll get this nailed down for sure.
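As a minimal starting point, one common rule of thumb is fixed-size chunks with some overlap, so sentences that straddle a boundary appear in both neighboring chunks. The 500/50 numbers below are arbitrary starting values to tune, not established constants:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with `overlap` shared chars."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

parts = chunk_text("a" * 1200, chunk_size=500, overlap=50)
# chunks start at 0, 450, 900; the last one is shorter than chunk_size
```

In practice people often split on paragraph or sentence boundaries instead of raw character offsets, and size chunks in tokens rather than characters, but the overlap idea carries over either way.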