Assistants API Pricing and Token Usage

I did the same with a 50k token document, that was a very expensive conversation


Is there a way to limit the context token usage? When running an assistant as an AI chat bot with conversation history, it really chews through those tokens like nothing else.

It sounds really unsustainable to me, since longer conversations start to eat up more tokens and the developer has very little control over it outside of limiting user interactions with the bot.

1 Like

Even before the Assistants API, most people used other libraries like langchain to get this kind of higher level abstraction (and others). However, one issue with such higher level abstractions in those libraries was that they chose not to expose a lot of the underlying details and also did not provide hooks into all the things people might want to override/customize within the pipeline for each higher level class. It is definitely possible to provide higher level abstractions with more transparency around usage, override hooks for everything in the pipeline.
Assistants API provides an awesome powerful higher level abstraction. And it is great they have made it available early for people to try out. I sincerely hope that they follow it up with the transparency and hooks needed (as the staff member mentions above in this thread, and one of the Dev Day speakers mentioned supporting different strategies in the future). Based on this thread and my own experiments with the API, using this in production will be much more viable once those things are added.

1 Like

If you’re looking for a significantly cheaper and more customizable option, check out GitHub - transitive-bullshit/OpenOpenAI: Self-hosted version of OpenAI’s new stateful Assistants API

It’s 100% spec compliant with the official OpenAI OpenAPI spec for all Assistant-related resources, so you can use it with the official OpenAI sdks at a fraction of the cost.

The current version still uses OpenAI’s chat completion API under the hood to handle tool invocation and responses, but I plan on adding support for more model providers shortly.


Also in the case that a session covers all the code of 1 group of instructions, what are the limits :
Is it 1 session per run ?(And in that case you better not need many calculations)
Is it 1 session per thread ?
And in that case does the session cover all the assistants in the tread ? (highly doubt that)

1 Like

Could you explain how it works and how does it reduce so drasticaly the tolken usage ?

It allows you to self-host the Assistants API, so you won’t pay excessively for storage and retrieval. Off the shelf AWS S3 (or Cloudflare R2), Postgres, and Redis are all pretty cheap, and unlike the official Assistants API, the only thing you’ll pay for by volume is the underlying OpenAI chat completion API calls.

This is meant for advanced developer use only, since it takes a bit of work to set up and is an OSS project built on top of a Beta API. See the GitHub readme for more info.

Maybe try using tiktoken ?

import tiktoken
from openai import AsyncOpenAI

async def classify_sentiment(phrase, model="text-davinci-003"):
    client = AsyncOpenAI()  # Initialize the client

    # Initialize the tokenizer for the model
    enc = tiktoken.encoding_for_model(model)

    # Tokenize the phrase and create a prompt
    tokens_phrase = enc.encode(phrase)
    prompt = f"Classify the sentiment of this text: '{phrase}'. Is it positive, negative, or neutral?"
    tokens_prompt = enc.encode(prompt)

    # Make the API request
    response = await client.completions.create(prompt=prompt, model=model)
    response_text = response.choices[0].text.strip()

    # Tokenize the response
    tokens_response = enc.encode(response_text)

    # Print the token statistics
    print('-' * 60)
    print(f"Phrase: {phrase}\nNumber of tokens in phrase: {len(tokens_phrase)}")
    print(f"Prompt: {prompt}\nNumber of tokens in prompt: {len(tokens_prompt)}")
    print(f"Response: {response_text}\nNumber of tokens in response: {len(tokens_response)}")
    print(f"Total tokens used: {len(tokens_phrase) + len(tokens_prompt) + len(tokens_response)}")
    print('-' * 60)

    return response_text

async def main():
    phrases = [
        "I love sunny days",
        "I hate when it rains all day",
        "It doesn't matter if it's raining or sunny",
        # ... more phrases
    tasks = [classify_sentiment(phrase) for phrase in phrases]

    # Gather and run tasks concurrently
    await asyncio.gather(*tasks)

# Assuming this code is within a Jupyter notebook
await main()

1 Like

@oieieio This is an entirely different beast.

You are using the old(:joy: ) .completion paradigm.

We are discussing the new .beta.assistant where it’s all abstracted from you, which is great, but you lose control about what makes it into the context…

As an example:
I’ve been building my own wrapper around it and a terminal UI so far

An afternoon conversation with not that much messages costed me $15, I didn’t even use retrieval.

It’s so freaking cool tho, the only thing preventing you from having Jarvis now is you, we just need a bit more control and clarity.

1 Like

ok, I just trying to be helpful.


You said it! I was able to get gpt4 to reason in 3d space coherently using an assistant and code interpreter. This was one of the last big hurdles, my Jarvis is well underway and looking epic.

To make things work really well I run two concurrent conversations and compare them as you go. This works incredibly well. 128k context with an assistant, 2 conversations (main and shadow) and a local vectordb of the messages with multiple summaries of the ideas being discussed in the conversation? It would do everything I want it to. I have no doubt about this. However, as much as I want to continue working on it, I couldn’t even afford to test it the way things are right now.

edit: Haha, had the same thing. Sent roughly 4k tokens total and after a few brutal runs usage was 600k tokens.

1 Like

it’s $0.2/GB/Day PLUS the amount if input / output tokens consumed . I don’t think he will charge on the RAG context retrieved as we have no control on that internally so far . so if you have a total knowledge base 1 GB for retrieval , each day you will pay$0.2 irrespective as a base cost , and then per input output tokens for all the thread messages . that’s what I understood

The assistant api was looping infinitely to get the function response even though it has been provided. And I got this message in the terminal:

I was surprised like why this error. I was seeing this error for the first time. And I thought that openai is down again :frowning:

But when I looked in my dashboard I have already exhausted the my $20 limit :sob: And when I looked at the credits usage, man :rage:

I thought that I would use maximum of $20 in a month to build my project. But I don’t think anyone’s project will survive out there if they are using the assistant api :frowning: it’s too costly.

If OpenAI want to bring business to them using assistant api then they should optimize their assistant api by:

  1. bringing its response time down to under 1 second,
  2. bring down the cost or maybe allow the users to add limits to the bot in terms of max input and output tokens usage.
  3. They should also optimize their assistant api function calling feature since I don’t think it is working fine.

I think an open source assistant api is the only way ahead to build a production ready realtime chatbot application which is cost effective, highly customizable, optimized for speed and we can choose resources to use like database, file storage, caching and code interpreter. :slightly_smiling_face:

May be it will get better when google and other competitors step into the market along with their products. :roll_eyes:

What do you guys think?


Your approach to start with open source and move to more effective and expensive models as you solidify your project direction is solid.

If you’re doing active development built on an assistant, or any other API that incurs costs for that matter, you could be caching output and stubbing your calls to the API while you work on everything else about your application to save money.


6 million tokens!!

Wow, that’s wild.

You can monitor the steps and stop the run in situations like this. If it’s an api thing add error handling, but if it’s the assistant making too many calls you can probably just add ‘If you have to try anything more than 3 times stop and consult the user.’ to your assistant


I think they meant api, not model? I’m guessing for testing things like how a run and steps work.

Maybe something like Langchain? Works a bit different and does much more, might be overkill. Shoot, if it’s not too complex these days you can just ask gpt4 to write a queueing system and you’re good.

1 Like

Yes, and still the current version of assistant api is not ready for production usage. I have added 2 functions which I want the assistant to run. Every time the function output is submitted to the run, the run status changes from ‘requires_action’ to ‘queued’ and then this loop went on for 5 to 6 times until my 30s time limit triggers. Which eventually cancel the run on the thread. I checked the code multiple times and the run status never went to any other states.

Maybe it can be time consuming to build a perfect open source assistant api which can be use with any LLMs out there whether it is from openai or google or custom. The main thing is to manage the context window I think. All the other things like queuing, storage, caching, streaming is not that hard to build.

1 Like

I use chromadb and gpt3 summaries to manage the conversation history and assemble a unique message list for each exchange that points everything in the right direction. This message list is tuned to get gpt4 to give the best response possible. What you see in the chat is your message and the replies, but the actual conversation is different. Keeps the conversation focused and in the context window, works great.

(You case sounds like maybe the outputs aren’t making it over?) The assistats api isn’t bulletproof for sure but it does work with relative consistency for me. If it bombs out now and again I figure it’s the api, but if it’s constantly failing it’s probably me.

1 Like

Assistants API options:

  • Gpt3.5 - very bad at function calling when you add complexity.
  • GPT4 Turbo - 128k tokens and no methods to limit them.

So both options are flawed.

Also to note:

  • The way of retrieving file contents is unnecessarily complex(unless I’m missing something)
  • The polling kills me.

I think it’s fair since it’s in beta… but with all the drama I’m not sure nothing is getting done anytime soon, I hope I’m wrong :smile:

I served users through the previous chat completion endpoint using more expensive per token models, had no issues regulating cost, but this one for now is not yet ready for production in my books…


Yep, 3 isn’t smart enough and gpt4 runs can create hundreds of thousands of tokens.

I think there is too much momentum for this to just end, and I can’t see things stagnating, we’re in the middle of a revolution. I’d bet in the next weeks and months the assistants api will improve, possibly dramatically if recent history is anything to go by.

You said it, the term ‘beta’ literally means software not ready for production.