Whoa, 35k tokens in one go?!

My document is 3 pages long. I asked for something that is not in the document, and it took 31k TOKENS!


It happened 2 more times. Is this normal?

Wow, that seems unexpected. Have you uploaded some files as knowledge?

Yeah, I just have a tiny file, only 3-4 pages long.
Even if it were 100 pages, I don't think it should cost this much, and it happened 3 times in a row.

I am so happy I didn't use the Assistants API with any of my clients; I would've had $500 gone in an hour.

Is the uploaded document in a language other than English?

All of it is in English. I don't know what's causing the problem.

The bot also doesn't reply.

I think OpenAI is having some problems.

If other people are experiencing this, it's going to make the Assistants API look really bad.

There is a partial outage for API and ChatGPT right now.

I think it's probably not exactly expected, but it does seem to be somewhat "normal."

I'm tagging in @_j since they have vastly more experience with assistants, but my understanding is that it's very easy for token use with the Assistants API to spiral out of control very quickly.

In this case, what might have happened is that it read the entire contents of your 3-page file (perhaps ~1,500 tokens) into context. Then it made a series of calls to the model, including that context along with everything else, every single time.

At this point, because you do not have any direct control over how the assistants API manages context or the number of calls it will make in a single run, you don’t really have much in the way of control over your API costs.
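To see why the bill grows so fast, here is a back-of-the-envelope sketch of that pattern: the full file and instructions are re-sent on every internal call, and each call's output is appended to a growing conversation. The numbers below are illustrative assumptions, not the actual Assistants API internals.

```python
def run_cost(file_tokens, instructions_tokens, per_call_output, n_calls):
    """Estimate total input tokens when the full file context is resent on
    every call and each call's output is appended to the conversation."""
    total_input = 0
    conversation = 0
    for _ in range(n_calls):
        # Every call re-sends the file, the instructions, and the
        # ever-growing conversation so far.
        total_input += file_tokens + instructions_tokens + conversation
        conversation += per_call_output
    return total_input

# A ~1,500-token file, 500 tokens of instructions, and 300-token
# tool outputs, over 10 internal calls:
print(run_cost(1500, 500, 300, 10))  # → 33500
```

Ten internal calls over a tiny file already lands in the ~33k input-token range, which is the same ballpark as the numbers reported above.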


Well described, except that retrieval has an additional token-burning feature.

Not only is the maximum amount of text from the files placed into the AI context regardless of relevance, retrieval also exposes a search function, with no message telling the model "using this is pointless because you already received all of the uploaded file text." The myfiles_browser tool has independent controls like "see search result part", "scroll", and so on, and it even has to emit a back() call, requiring another full-context AI invocation, just to switch to a different file id or search result.

Then your conversation includes all of those function calls and returns, and keeps growing too.

The first control they could offer is limiting the 125k model context length to a smaller number, to which the algorithms would adapt. I replied to an OpenAI staffer with a more extensive list of possible API controls, describing by example schema what they don't document…

Solution: use chat completions. Inject via RAG only the documentation that is contextually relevant to the present need. Then you can even make additional cheap model calls to ask "should I summarize this?" or "should I discard this?" against past chat and injected content, by writing a chat context manager that combines embeddings, language AI, and token counts.


A simple no-code solution would be to use a third-party service like dify.ai.

It has a lot of cool features and offers better control.

Wow, man, you always reply to me and wow me each time. How can I thank you?

I'm going to stick with LlamaIndex / LangChain for the time being; the Assistants API is milking my API keys.
