How exactly do you get charged for using the API for assistants?

They keep updating it, I think they’re looking for the sweet spot. Currently tiers 1-5 have a 10,000 RPD limit.

First, does message history provide a material benefit to your use case? If so, consider cutting it in half. Remember, we are dealing with compounding data here, so halving the history produces a much larger reduction in total tokens billed.
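To see why halving compounds, here is a rough back-of-the-envelope sketch. The 200-token-per-exchange figure is a made-up placeholder, not a measured value:

```python
# Rough model of cumulative billing: every run resends the accumulated
# history, so prompt tokens compound across the conversation.
def cumulative_prompt_tokens(turns, tokens_per_exchange=200, keep_fraction=1.0):
    total = 0    # tokens billed across all runs
    history = 0  # tokens resent on the next run
    for _ in range(turns):
        history = int(history * keep_fraction) + tokens_per_exchange
        total += history
    return total

full = cumulative_prompt_tokens(10)                       # resend everything
halved = cumulative_prompt_tokens(10, keep_fraction=0.5)  # trim history by half each run
```

With these toy numbers, resending everything over 10 turns bills roughly three times as many prompt tokens as halving the history before each run.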

If message history has no value… don’t use it.


Simple, yet effective

1 Like

Please explain your method for doing so.

If you restart threads to avoid a context completely filled with messages, the assistant can’t even answer “what about the other one”. And any retrieval will also fill the context before the AI is set loose iterating on function calls against your API or the code interpreter.

1 Like

The thread object and the messages it contains can be modified, so just reduce all messages to 50% of what they were; that is, lose the top (oldest) 50%. The next time you perform a run, there will be less data to process. Context may be lost, but that is the cost of token reduction. Do this if the messages are, let’s say, >4096 tokens’ worth.
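A minimal sketch of that rule, assuming a crude 4-characters-per-token estimate (swap in a real tokeniser for accuracy):

```python
def halve_thread(messages, token_limit=4096):
    """Drop the oldest half of the messages when the thread exceeds the limit.
    Token counts here are a rough length/4 estimate, not real tokenisation."""
    estimated = sum(len(m) // 4 for m in messages)
    if estimated > token_limit:
        return messages[len(messages) // 2:]  # lose the top (oldest) 50%
    return messages
```

Short threads pass through untouched; anything estimated over the limit loses its oldest half in one cut.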

1 Like

The models tend to repeat the instructions they were given anyway. You could fairly easily enforce a max-token context or a maximum number of prompt-response interactions (maybe you only want the most recent 3 back-and-forths to be included). I agree with @Foxabilo

1 Like

Yeah, there may be some very easy-to-implement check for duplication like that…

1 Like

Assume I am really stupid, but only just read all the documentation and API reference.

How do I simulate what locally is just system + chat[-turns*2:] + user to pass the number of turns that fit in my token budget?

No need to answer hastily, I’ll give you all the time you need.
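For reference, the local pattern above might be sketched like this. It is a plain-Python sketch (crude len/4 token estimate, nothing Assistants-specific), shrinking the number of turns until the context fits the budget:

```python
def build_context(system, chat, user, budget=3000):
    """Largest number of recent turns such that
    system + chat[-turns*2:] + user fits the token budget.
    Token counts are a rough len/4 estimate."""
    def cost(msgs):
        return sum(len(m["content"]) // 4 for m in msgs)

    turns = len(chat) // 2
    while turns > 0 and cost([system] + chat[-turns * 2:] + [user]) > budget:
        turns -= 1  # drop the oldest back-and-forth and re-check
    tail = chat[-turns * 2:] if turns else []
    return [system] + tail + [user]
```

This is exactly what there is no obvious way to express against a server-side thread: you cannot pass a turn count or token budget, only the whole thing.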

Then we have the utility of “answer not ready yet, still looping calling your API endlessly with the same thing”.

def truncate_messages(thread):
    max_length = 4096
    truncated_thread = []

    for message in thread:
        # Truncate message to 4096 characters if it's longer
        truncated_thread.append(message[:max_length])

    return truncated_thread

Not tokenised, but you get the idea.


Just to clarify: we manage our conversations anyway, ignoring the thread feature?

Because if you try to send an actual user/assistant conversation into a new thread, I expect you will get a nice traceback:

openai.BadRequestError: Error code: 400 - {'error': {'message': "'assistant' is not one of ['user'] - 'messages.1.role'", 'type': 'invalid_request_error', 'param': None, 'code': None}}

(simulated, even though errors are free)

Assistants have no other way to receive conversation history except through a thread. An assistant can receive an “instructions” string upon creation, and a thread can receive “user” input messages.

Will it believe you when you say:

user: (no this is actually what you said)


I’ve not tried it. Are you saying there is 100% no way to modify an existing message, or to manage message content in a thread at all? If so, that seems like an oversight.

Anti-multishot technology ™, because the API developer is an adversary not to be trusted.

1 Like

So the way I’m doing it is retrieving and reusing the same thread for an existing conversation. I noticed that IT IS quite expensive. Is it correct that every time you use the same thread, everything from the first to the latest message of the conversation will be charged in tokens?

let currentThreadId = null; // Global variable to store the current thread ID
let conversationHistory = []; // Stores the history of the conversation
const assistantId = '***';

async function askGPT(question) {
    try {
        conversationHistory.push({ role: "user", content: question });

        // Retrieve the Assistant
        const myAssistant = await openai.beta.assistants.retrieve(assistantId);
        console.log(myAssistant); // For debugging

        if (!currentThreadId) {
            // Create a new thread if there isn't an existing one
            const thread = await openai.beta.threads.create();
            currentThreadId =;
        } else {
            // Optional: retrieve the existing thread (for verification or additional logic)
            // const thread = await openai.beta.threads.retrieve(currentThreadId);
        }

        // Add a message to the thread with the user's question
        await openai.beta.threads.messages.create(currentThreadId, {
            role: "user",
            content: question
        });

        // Run the assistant to get a response
        const run = await openai.beta.threads.runs.create(currentThreadId, {
            assistant_id: assistantId
        });

        let runStatus = await openai.beta.threads.runs.retrieve(currentThreadId,;

        // Polling for run completion
        while (runStatus.status !== 'completed') {
            await new Promise(resolve => setTimeout(resolve, 1000)); // Wait for 1 second
            runStatus = await openai.beta.threads.runs.retrieve(currentThreadId,;
        }

        // Retrieve the messages after the assistant run is complete
        const messagesResponse = await openai.beta.threads.messages.list(currentThreadId);
        const aiMessages = => msg.run_id === && msg.role === "assistant");

        // Assuming the last matching message is the assistant's response
        return aiMessages[aiMessages.length - 1].content[0].text.value;
    } catch (error) {
        console.error('Error in askGPT:', error.response ? : error);
        return 'An error occurred while processing your request.';
    }
}
Is it much cheaper to create a new thread for every input and output, and not remember the conversation any more, to lessen the usage of tokens?

1 Like

Which of course really nullifies most of the benefits. This API is ridiculously expensive.

Nope. No way. We have no control over messages at all besides modifying metadata, adding a new user message, and attaching assistants.

100%. There is no way this is ready for production. It’s incredibly slow, expensive, and lacking in a lot of features.

I have hopes for the future though.


Well, this is what beta development is about: testing the product and refining it. I can see a context-limiting setting becoming a thing, and also a message length and size limit setting as well.

Constructive feedback forms an essential part of the development cycle.

1 Like

Am I right that there is the ability to limit tokens on the Chat API, but not on the Assistant API?

IMHO, this is generic functionality that many of us have already built in various ways locally, to get around the fact that no one had yet offered a decent “assistant API”.

However, this version was poorly thought out, too simplistic and should have better anticipated client sensitivity to cost. Paying ~25c per call is ridiculous and will not fly in Production.

It would be nice to have a more open approach to improving this API.

The developers should:

  • summarise the feedback they’ve received so far
  • inform us what improvements they are looking into … and then
  • update us on the new version when it becomes available.

Perhaps that’s in progress already and someone could share the official comms?

1 Like

That’s not a beta release. This is not even alpha. It’s not working as intended.

Check out my bug report:

Why is it generating messages over and over again? There is no thread context to be aware of. To be honest, it feels like a big rip-off and a way to make some extra money.


I only just figured this out tonight. I’ve been busy building out additions, and when dumping the returned data I saw this happening. I thought it was my local JSON being appended until I deleted it. It seems you may be better off generating a new thread on every call and grabbing the last returned messages to append your new message to. Keep track of the last used and current thread IDs. Would that not get the same response? I have to try. It would reduce cost drastically if it adheres to the message structure the same way. I’ll have to experiment.
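The experiment described above might be sketched roughly like this (hypothetical helper, not the poster’s actual code; it assumes carried-over turns can be re-sent as “user” messages, which is exactly what would need testing):

```python
def seed_new_thread(prev_messages, new_question, carry=2):
    """Build the message list for a fresh thread: keep only the last
    `carry` messages from the old thread, then append the new question.
    Prior assistant turns are re-labelled 'user' since threads may
    reject the 'assistant' role."""
    tail = [{"role": "user", "content": m["content"]} for m in prev_messages[-carry:]]
    return tail + [{"role": "user", "content": new_question}]
```

Each call would then bill only `carry + 1` messages of context instead of the whole accumulated thread, at the obvious cost of everything older being forgotten.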

1 Like