Assistants API (gpt-3.5-turbo-16k) usage exceeds limit due to message loop

I am seeing a huge issue with a very basic implementation of the Assistant API on the gpt-3.5-turbo model.

Background information on our use case: we analyse documents and extract information from them. For this, we created one assistant with instructions on what to do with each document (the document is passed in via the user message).

For every document (roughly 6000 in total) we do the following:

  1. Create a thread for the assistant we created via the playground
  2. Add the “user” message to the thread. The user message is the markdown of the document we want to process.
  3. We start the “run” on the thread

… process next document

We do not wait for these runs to finish but proceed with batch processing the other documents and reap the messages later (a minimal sketch of this loop follows below).
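For reference, here is a minimal sketch of that loop using the Python SDK’s beta Assistants endpoints (the assistant ID and the `documents` list are placeholders; our real script differs in its details):

```python
from openai import OpenAI

client = OpenAI()

ASSISTANT_ID = "asst_..."  # placeholder: the assistant we created in the playground


def submit_document(markdown: str) -> tuple[str, str]:
    """Create a thread, add the document as a user message, and start a run."""
    thread = client.beta.threads.create()
    client.beta.threads.messages.create(
        thread_id=thread.id,
        role="user",
        content=markdown,  # the document to analyse, converted to markdown
    )
    run = client.beta.threads.runs.create(
        thread_id=thread.id,
        assistant_id=ASSISTANT_ID,
    )
    return thread.id, run.id


# fire-and-forget over the whole batch; results are reaped in a later pass
# (documents is a placeholder for our list of markdown strings)
jobs = [submit_document(doc) for doc in documents]
```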

By accident, we used the gpt-4 model in the beginning, but the cost per run was too high, so we switched to “gpt-3.5-turbo-16k” after roughly 800 documents (burning 80 USD, which was expected and not a problem).

After the switch to gpt-3.5-turbo-16k we noticed a very steep increase in cost, with roughly 35,000,000 (yes, million) tokens used.
We had calculated the cost of the test beforehand and expected roughly 8,000 input tokens + 2,000 output tokens => 10,000 tokens per document × 5,000 jobs (documents/runs), which works out to roughly 0.012 USD per document and 60 USD in total.
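For what it’s worth, that estimate is easy to reproduce. The per-1K-token prices below are assumptions for illustration (roughly the gpt-3.5-turbo list prices at the time), not figures taken from our invoice:

```python
# assumed prices per 1K tokens (illustrative only, not from our invoice)
INPUT_USD_PER_1K = 0.001
OUTPUT_USD_PER_1K = 0.002

input_tokens, output_tokens, jobs = 8_000, 2_000, 5_000

per_document = (input_tokens / 1000) * INPUT_USD_PER_1K + (output_tokens / 1000) * OUTPUT_USD_PER_1K
total = per_document * jobs
print(per_document, total)  # 0.012 USD per document, 60.0 USD in total
```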

After 10 minutes, I noticed that the cost was already over 100 USD even though only about 707 documents had been processed. Shocked, I stopped the process and took some time to dig deeper.

What happened?

I used the API to check all threads, the messages created under these threads, and the runs themselves.
With gpt-3.5, almost all runs failed. Only 33 out of the 707 had the status “completed” without any issues.

A lot of the runs produced up to 21 messages, eating through the tokens! Normally, there should be only one response message per run, so only 2 messages per thread in total and 1 per run. On average, we had 7.2 messages per run.
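For anyone who wants to reproduce this kind of audit, here is a rough sketch of how the per-run message counts can be pulled via the API (the thread IDs are placeholders; my real script also exports everything to Excel and paginates past 100 messages):

```python
from collections import Counter

from openai import OpenAI

client = OpenAI()


def assistant_messages_per_run(thread_id: str) -> Counter:
    """Count how many assistant messages each run produced in one thread."""
    counts = Counter()
    page = client.beta.threads.messages.list(thread_id=thread_id, limit=100)
    for message in page.data:
        if message.role == "assistant":  # only count model output
            counts[message.run_id] += 1
    return counts


# thread_ids is a placeholder for the IDs recorded while submitting the batch
for thread_id in thread_ids:
    for run_id, n in assistant_messages_per_run(thread_id).items():
        if n > 1:  # a healthy run should produce exactly one assistant message
            print(f"{thread_id} {run_id}: {n} assistant messages")
```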

Issues I observed with the gpt-3.5 model:

  • Runs with status “cancelled” and a lot of messages created
  • Runs with status “completed” but with multiple (up to 21) very similar messages generated. All messages have the same run_id. The messages look fine, and I have no idea why the model re-generates them.
  • Runs with status “failed” but no messages with “last_error” = rate_limit_exceeded: Rate limit reached for gpt-3.5-turbo-16k in organization org-JfHjVQvS0VBIEGRAYcgitAoO on tokens_usage_based per min: Limit 1000000, Used 995047, Requested 14138. Please try again in 551ms. Visit https://platform.openai.com/account/rate-limits to learn more.

    Most likely these runs are not accepted by OpenAI since other runs are still pending (and keep generating messages over and over again without any obvious reason). It is unclear why this happens. The messages under a thread share the same run_id.

My Questions:

  • What is wrong here? Why does it work with gpt-4 but not with the gpt-3.5 model?
  • Why is the model creating multiple messages for one run?
  • Is it normal that messages get created several times during a run?
  • How can you prevent the model from re-creating messages several times?
  • Will I get my money back?
  • Why should I pay for a failed run? We even see chat completion timeouts on non-beta products and have to pay for them!
  • Who should I contact to get a refund for the tokens?
  • Is this normal for beta software? I have been a developer for over 20 years but have never seen anything like this.

Here is a screenshot of my Excel file, which gave me insights into the threads and the messages produced.

I used the API to create this report myself.
You can see that there are no issues under gpt-4, and as soon as you switch to gpt-3.5, it starts randomly re-creating the messages over and over again.

In this screenshot you can see that the same run_id is producing multiple messages.

I can provide all details, including my Excel file, to the devs and help dig deeper.

5 Likes

On the usage page, the money amounts shown are not live. They can be delayed by several hours or even trickle in after a day.

You therefore may be misattributing charges originating from a different period of use.

You can choose the “activity” tab to see the token counts per model, but only by day. Many models will be hidden, though, and need to be selected with another drop-down.

You have done the work others won’t do of producing logs. Additionally, you can go into the run steps and dump out that data (a sketch of this follows below).
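A rough sketch of dumping run steps with the Python SDK, in case it helps (thread_id and run_id are placeholders for IDs taken from your own report):

```python
from openai import OpenAI

client = OpenAI()

# thread_id and run_id are placeholders for IDs recorded in your own logs
steps = client.beta.threads.runs.steps.list(thread_id=thread_id, run_id=run_id)
for step in steps.data:
    # each step is either a message_creation or a tool_calls step
    print(step.id, step.type, step.status, step.created_at)
```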

The purpose of this new usage page that came out alongside the assistants, replacing one where you could see the number of calls and tokens in five-minute increments, is clear: “We know Assistants are a broken product that will empty customer accounts with iterations and uncontrolled context loading of any content they can get.” “Customers must not be able to see how they’ve been defrauded by our promises.”

You can share your experience widely, as you’ve done, while I might as well compile the dozens of similar forum reports. To get reimbursed, the avenue you’d have to pursue is sending a message via the help.openai.com assistant, first navigating its own questions.

2 Likes

Thank you for the response. I can be 100% sure that it was caused by this “import” script, since we never used the GPT-3.5 model before. Usage also stopped increasing roughly 30 minutes after stopping the script. Thank you for the advice about the run steps… I will enrich my Excel file with this data.

I also wrote via the chat, but I doubt we will see an answer there. I will try to reach out to some people on LinkedIn to see if we can get in contact with someone and find a solution. It seems this product is not even an alpha release. Everything I touch on this new Assistants API is highly flawed and costs a lot of money.

Developing on the platform should be straightforward for pros, not this time-consuming, even on a beta release.

2 Likes

Can’t believe how little attention this post got.

I read it the other day from a device I couldn’t log in on, and came here expecting an interesting discussion.

Thank you for sharing!

1 Like

I would say many people are wondering about their high usage but do not analyse where it comes from. I hope this gets more attention and that OpenAI gives us our money back and fixes it. The system is not usable at all.

1 Like

Hi there – sorry you’ve been having trouble with this API. I would love to dive deeper into this with you to figure out what went wrong.

To answer one of your questions: It’s normal for the model to create multiple messages per run as it “thinks” through the problem. There’s no way to prevent this from happening at the moment. But happy to look into what happened here in your specific use-case to provide a better explanation.

If you could email me at nikunj@openai.com with your excel file, I’ll take a deeper look.

Hi Nikunj.

Thank you for your reply and the attention OpenAI is giving to this issue. I have sent you the Excel report file by e-mail. Please let me know how I can assist you. I will keep this thread updated for others as well.

I want to emphasize that everything works smoothly when GPT-4 is used. Of course it is even slower then, but at least it does not fail, and the bottom line is that it is cheaper than GPT-3.5.

2 Likes

I experienced a similar problem using gpt-3.5-turbo-16k. While my costs were significantly lower (I stopped before they reached USD $4.00), I did notice the model submitting the same tool call dozens of times, then failing or forcing me to terminate the script before it kept looping. The more tokens were involved in processing the request, the more likely it was to lose direction.

Here’s one hypothesis (though it’s entirely untested and unconfirmed):

  • We set up the Assistant with its context and initial message.
  • The Assistant runs tools to gather additional context.
  • Those tools return too much data.
  • Since that data is newer than the initial prompt, the initial prompt and request get bumped from the LLM context window.
  • Without the initial prompt and request, GPT loses track of its original instructions, and gets lost like an ant who has lost its scent trail and can’t find home again. :cry:
  • We get billed for all the tokens it uses while on its fruitless journey.

The more tools we provide to the Assistant, the more tokens are required to describe them all. The more data a tool returns, the more tokens it uses. The more data is returned by the LLM, the more tokens. Etc, etc.

Since the Assistants feature seems to be black-box technology (I’m unaware of any technical documentation of the inner workings of the OpenAI Assistants API), I’m unable to confirm whether this is the case.

However, I have heard that the latest GPT-3.5 model (gpt-3.5-turbo-1106) has “improved instruction following [and] parallel function calling”, so perhaps it will be better suited for this task?

That said, @adaptiv, considering the size of the documents you’re processing, you may be hitting that context limit and losing context quickly. For your purposes, the gpt-4-1106-preview model may be better suited, as it has a significantly larger context window.

Yeah, I am in contact with @nikunj from OpenAI and we will check that. Our inputs are roughly 6,000–8,000 tokens, so there should be plenty of room left in the context window for the answer. And you are right, GPT-4 handles it well. But 3.5 should be capable too; otherwise, there should be a fail-safe from OpenAI to prevent me from using the model. I can clearly see that the generated messages are fine, but it re-generates them over and over again. They are not 100% identical due to the internal seed.

@nikunj I’m using a chatbot with the GPT-3.5 Turbo model on my website to assist my site’s visitors. Since 17 Nov 2023, the ChatGPT API has been charging me 6 to 10 times more than usual. The token usage by visitors of my site is the same as it was before 17 Nov, whereas the usage shown on the API usage page of my OpenAI account is 6–10 times higher. I contacted the OpenAI support team on 18 Nov via the online Help option of my OpenAI account, but I haven’t received any response from them yet.

I have tried revoking the previous API key and generating new ones, tried base models too, and even tried a completely different account to generate the API key, but to no avail. In all cases, the API is charging me extra and reporting more token usage than is actually used.

Please refund the extra charges OpenAI billed me and resolve this issue permanently soon, because without this issue of extra costs and token usage being resolved, we can’t keep our projects based on the ChatGPT API running.

And I hope you understand that the poor performance of our projects due to this problem is causing us not only financial loss but also the loss of the trust our users have in us.

1 Like

@nikunj
I didn’t receive any response, neither from nikunj@openai.com nor from the online Help option of OpenAI. I’m still facing the issue; it’s causing me huge losses on many levels, as it’s badly affecting my live project.

I have been contacting OpenAI since 18 Nov with no response from them so far, and this is quite unprofessional on their part.

Hi,

Staff who can support at this level are very limited in number, and they are working with adaptlv at the moment to locate the issue; hopefully a resolution is found soon.

1 Like

Are you an OpenAI staff member?
How do you know staff are “working with adaptlv at the moment to locate the issue”?

It seems that they’re not even acknowledging there is an issue.

1 Like

I do not work for OpenAI. I am, however, terminally online and on this forum, so I read every single post. I saw a member of OpenAI staff post a reply to their issue, and also that they had been communicating about it.

Hopefully that is a sufficient explanation of my comment.

2 Likes

@jforte
@rohitoai
@logankilpatrick
@rohancalum
@brittany_oai
@michellep
@shyamal
@abw
@mario
@jessica.james
Please resolve this issue. Thanks

I have exactly the same issue. Has anyone resolved it yet?

Your “exactly the same issue” is likely not “exactly the same issue”, as others like you have jumped on this thread, which is about unaccountable assistant usage, and diverted it to other symptoms and undescribed uses.

  1. Are you using assistants via API?
  2. Are you allowing very long thread conversations?
  3. Are you using the retrieval tool?
  4. Are you not using gpt-4-1106-preview or gpt-3.5-turbo-1106?

While each of these will increase autonomous token usage outside your control, all together they are a recipe for an account-emptying disaster. Only the latter models can produce the tool output which retrieval functions require.

then

  1. Are you using a world language with many upper-range UTF-8 or Unicode characters, such as accented letters?

That also will cause tool failures due to issues with the model.

Don’t use assistants until their autonomous iteration, errors, and maximum use of model context length are under control by future OpenAI mitigations.


or

  • Have you generated a new API key and deleted all others, changed your password, kept that API key private, and desisted from API use long enough to see no charges for a day?
  • Do you then put that API key back into a poorly-coded client application or site where hackers could extract it, and see incredible usage on models you didn’t specify?

  1. Yes, I’m using assistants. Both the API and the playground produce the same results: an infinite message loop (max: 21 messages per thread).
  2. No, it starts sending messages after the first or second message.
  3. Yes.
  4. Both gpt-4-1106-preview and gpt-4 produced the same results.
  5. Yes. However, I will try without the retrieval tool soon.

If you are able to code for the complex machinations of assistants, you can also chart a path already forged by others: Chat Completions with gpt-4 or gpt-3.5-turbo, conversation management on your application server, semantic search of documents via an embeddings vector database, programmatic extraction of files, function calls to indexed browsing, function calls to external knowledge enhancements and actions, a function call to a Python interpreter sandbox execution environment… with each of those methods used in exactly the manner that suits your application.
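As a rough illustration only (the model name, instructions, and document variable here are placeholders, not a prescription), the plain Chat Completions equivalent of the document-extraction use case keeps every token under your own control:

```python
from openai import OpenAI

client = OpenAI()

# document_markdown is a placeholder for text your own code extracted from the file
response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    messages=[
        {"role": "system", "content": "Extract the key fields from the document."},
        {"role": "user", "content": document_markdown},
    ],
    max_tokens=2000,  # hard cap on output spend per call
)
print(response.choices[0].message.content)
print(response.usage)  # exact token accounting for this single request
```

Nothing there iterates or reloads context unless your own code asks for it.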

If the assistants framework loops because of a failure the AI isn’t correcting, and another AI model just fails differently, a few words in the instructions won’t be able to do more than make the AI treat the useless method as “off”, just as if you had never enabled it.