How exactly do you get charged for using the API for assistants?

I racked up $200 over an hour of running ChatGPT on some data it was cleaning. I had an assistant with probably 5 paragraphs of text in the intro box, and 1 paragraph of content on every request, limited to 8,000 characters.

What goes into the calculation for the tokens? I see $0.03 for gpt-4, which is what I was using, but does my 5-paragraph intro text go into the calculation on every request? Or does the response data size count against me? What exactly goes into the calculation? I can’t figure it out.

At first I was using gpt-3.5 and was getting charged $1.50, $2; then I switched to gpt-4, ran it for an hour, and I’m at $200.

Everything in the conversation goes to the model every time you make a call; that is how the conversational aspect of the model works. The model is internally stateless and so needs to be fed all of the context required for that query, which means all of the prior context: A+B, then A+B+C, then A+B+C+D, and so on.
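A toy sketch of that growth, with made-up per-message token counts, shows how the billed input is the running sum of every prior message:

```python
# Toy illustration: each API call resends the whole conversation,
# so the billed input size is the running sum of all prior messages.
# Token counts below are made up for demonstration.
message_tokens = [500, 300, 400, 350]  # tokens in messages A, B, C, D

context_sizes = []
running_total = 0
for tokens in message_tokens:
    running_total += tokens          # A, then A+B, then A+B+C, ...
    context_sizes.append(running_total)

print(context_sizes)       # input tokens billed on each successive call
print(sum(context_sizes))  # total input tokens billed across all calls
```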

Does this mean the entire message history is fed in every request, so each request gets larger and larger each time?

Edit, OIC, that was it then… Dang… A+B , then A+B+C then A+B+C+D then …

It doesn’t work well without the message history…

Yes, up to a maximum determined by the model you are using; after that, the system will truncate the message history and lose things from the start. If you are using a 128K model, that can be around 300 pages’ worth of text before truncating.

You can of course handle the thread yourself by removing elements and keeping it smaller, but you may lose contextual accuracy and awareness.

Maybe you could get away with only giving it some snippets of the very long sections in the message history?

If I am using message history, then it is going to max out my token usage after a few calls, right? So 8192 tokens every call: 8.192 × $0.03 ≈ 25 cents per input, and about 50 cents per output at $0.06 per 1K. That is basically 75 cents a call! Am I doing this correctly?

At that point its cheaper to hire a human to do what I’ve been doing lol.


Count your input tokens for the entire call, divide by 1,000, then multiply by 0.03.

Here is the calculation, yep: 8192 / 1000 ≈ 8.2, and 8.2 × $0.03 ≈ 25 cents.
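That arithmetic as a quick script, using gpt-4’s then-current list prices ($0.03 per 1K input tokens, $0.06 per 1K output tokens) as assumed rates:

```python
# Per-call cost at gpt-4's (then-current) list rates:
# $0.03 per 1K input tokens, $0.06 per 1K output tokens.
def call_cost(input_tokens, output_tokens,
              input_rate=0.03, output_rate=0.06):
    return (input_tokens / 1000) * input_rate + (output_tokens / 1000) * output_rate

# A maxed-out 8192-token context on input alone:
print(round(call_cost(8192, 0), 4))     # ~0.2458, i.e. about 25 cents
# Input plus an equally large reply:
print(round(call_cost(8192, 8192), 4))  # ~0.7373, roughly 75 cents
```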

What is your technique for limiting the request size? If I don’t use message history anymore, does the intro text count against me for the assistant? Would it be better to just go back to the completions API?

(How do I trim message history size even)

It sounds like you’re using gpt-4 rather than gpt-4-1106-preview? The preview model is around 2.75 times cheaper, and it can take a higher Tokens Per Minute (probably 300,000 depending on your usage tier).

You don’t need to add the entire conversation into your new prompt as context either; just append the latest message.

        if prompt := st.chat_input("What is up?"):
            st.session_state.messages.append({"role": "user", "content": prompt})
            with st.chat_message("user"):
                st.markdown(prompt)

            with st.chat_message("assistant"):
                message_placeholder = st.empty()
                full_response = ""
                # 'client' is an OpenAI() instance and the model name is
                # stored in session state earlier in the script.
                for response in client.chat.completions.create(
                    model=st.session_state["openai_model"],
                    messages=[
                        {"role": m["role"], "content": m["content"]}
                        for m in st.session_state.messages
                    ],
                    stream=True,
                ):
                    # 'choices' is a list, so use index access for the first Choice.
                    choice = response.choices[0]

                    # Streamed chunks carry a ChoiceDelta on the 'delta' attribute.
                    choice_delta = choice.delta

                    # The new text for this chunk (can be None on the final chunk).
                    message_content = choice_delta.content

                    # Append the extracted content to the running response string.
                    full_response += message_content if message_content is not None else ""

                    # Re-render the partial response with a cursor character.
                    message_placeholder.markdown(full_response + "▌")

                message_placeholder.markdown(full_response)
            st.session_state.messages.append({"role": "assistant", "content": full_response})

This is how I do mine for my streamlit interface (it’s like GPT Plus for my co-workers without actually getting GPT Plus :sunglasses:)

Ignore all of the #'d out notes, I’m bad at coding lol

Where can I find a JS variant haha, python is hard to parse in my head atm.

gpt-4-1106-preview has a 200 request per DAY limit I think, which is why I switched. I need to make 10,000 requests.

You cannot “handle the thread yourself”. With what? There is no “truncate chat” feature offered. You can add more metadata, but it is unclear whether that is just more tokens for the AI to read and ignore.

You can only put user messages in; only the AI can write assistant messages, which is a hard firewall against utility and against constructing a smaller conversation yourself.

The role of the entity that is creating the message. Currently only user is supported.


Assistants will empty your account by design



They keep updating it, I think they’re looking for the sweet spot. Currently tiers 1-5 have a 10,000 RPD limit.

First, does message history provide a material benefit to your use case? If so, consider reducing it by half; remember we are dealing with compounding data here, so halving the history makes a much larger reduction in total tokens.

If message history has no value… don’t use it.
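A rough sketch of that compounding effect, with a hypothetical 400-token message size and a 10-call run (all figures made up for illustration):

```python
# Compounding effect of history length on total billed input tokens.
# Assume every message is 400 tokens (a made-up figure) across 10 calls.
def total_input_tokens(num_calls, tokens_per_message, max_history):
    total = 0
    for call in range(1, num_calls + 1):
        history = min(call, max_history)   # messages actually resent this call
        total += history * tokens_per_message
    return total

full = total_input_tokens(10, 400, max_history=10)
halved = total_input_tokens(10, 400, max_history=5)
print(full, halved)  # 22000 vs 16000 total input tokens over 10 calls
```

The longer the run goes past the cap, the closer the savings get to the full 50%.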


If message history has no value… don’t use it.

Simple, yet effective


Please explain your method for doing so.

Even if you avoid a context completely filled with messages, by restarting so the assistant AI can’t even answer “what about the other one?”, any retrieval will still make sure the context is filled before the AI is set loose iterating on function calls against your API or the code interpreter.


The thread object and the messages it contains can be modified, so just reduce the messages to 50% of what they were; as in, lose the top 50%. Next time you perform a run, there will be less data to process. Context may be lost, but that is the cost of token reduction. Do this when the messages are, let’s say, >4096 tokens’ worth.
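With the chat completions API, where you control the message list yourself, that trim might look like the sketch below. The 4096-token threshold and the chars/4 token estimate are placeholders, not a real tokenizer:

```python
# Sketch: drop the oldest half of the chat history once it grows too large.
# Token estimate here is a rough chars/4 heuristic, not a real tokenizer.
def estimate_tokens(messages):
    return sum(len(m["content"]) // 4 for m in messages)

def trim_history(messages, max_tokens=4096):
    # Keep the system prompt (if any) plus the newest half of the rest.
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while estimate_tokens(system + rest) > max_tokens and len(rest) > 1:
        rest = rest[len(rest) // 2:]   # lose the top (oldest) 50%
    return system + rest

msgs = [{"role": "system", "content": "x" * 400}] + \
       [{"role": "user", "content": "y" * 4000} for _ in range(8)]
print(len(trim_history(msgs)))  # the system prompt plus the newest messages
```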


The models tend to repeat the instructions they were given anyway. You could fairly easily enforce a max-token context or a maximum number of prompt-response interactions (maybe you only want the most recent 3 back-and-forths included). I agree with @Foxabilo
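Keeping only the most recent N back-and-forths could be as simple as the helper sketched here (a hypothetical function, assuming one user and one assistant message per exchange):

```python
# Sketch: keep the system prompt plus only the last N user/assistant exchanges.
def last_n_exchanges(messages, n=3):
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-2 * n:]   # each exchange = one user + one assistant turn

history = [{"role": "system", "content": "You clean data."}]
for i in range(5):
    history.append({"role": "user", "content": f"request {i}"})
    history.append({"role": "assistant", "content": f"reply {i}"})

trimmed = last_n_exchanges(history, n=3)
print(len(trimmed))  # 7: the system prompt plus the 3 most recent exchanges
```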
