I am seeing a huge issue with a very basic implementation of the Assistants API on the gpt-3.5-turbo model.
Background information on our use case: we analyze documents and extract information from them. For this, we created a single assistant with instructions on what to do with each document (the document is passed in as a user message).
For every document (roughly 6000 in total) we do the following:
- Create a thread (the assistant itself was created via the Playground)
- Add the “user” message to the thread. The user message is the markdown of the document we want to process.
- We start the “run” on the thread
… process next document
We do not wait for these runs to finish but proceed with the batch processing of the other documents and retrieve the messages later.
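To make the flow concrete, here is a minimal sketch of that loop (not our actual code) using the openai Python SDK v1 beta Assistants endpoints; the client, assistant id, and document list are placeholders passed in from elsewhere:

```python
def start_runs(client, assistant_id, documents):
    """Create one thread + user message + run per document.

    Runs are started but not awaited; (thread_id, run_id) pairs are
    collected so the results can be retrieved ("reaped") later.
    `client` is an openai.OpenAI() instance.
    """
    pending = []
    for doc_markdown in documents:
        # One fresh thread per document
        thread = client.beta.threads.create()
        # The user message is the Markdown of the document
        client.beta.threads.messages.create(
            thread_id=thread.id,
            role="user",
            content=doc_markdown,
        )
        # Start the run, but do not poll it here
        run = client.beta.threads.runs.create(
            thread_id=thread.id,
            assistant_id=assistant_id,
        )
        pending.append((thread.id, run.id))
    return pending
```

With the real SDK this would be called as `start_runs(openai.OpenAI(), "asst_...", docs)`.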
By accident we used the gpt-4 model in the beginning, but the cost per run was too high, so we switched to “gpt-3.5-turbo-16k” after roughly 800 documents (burning 80 USD in the process, which was clear to us and not a problem).
After the switch to gpt-3.5-turbo-16k we noticed a very steep increase in cost and a usage of roughly 35,000,000 (yes, millions) tokens used.
We estimated the cost beforehand: roughly 8000 input tokens + 2000 output tokens => 10,000 tokens per document × 5000 jobs (documents/runs), which comes out to roughly 0.012 USD per document and 60 USD in total.
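For the record, that back-of-the-envelope estimate in code form (all numbers taken straight from this post):

```python
# Estimate from the post: 8000 input + 2000 output tokens per document.
tokens_per_doc = 8_000 + 2_000
docs = 5_000
total_tokens = tokens_per_doc * docs   # 50,000,000 tokens for the full batch
cost_per_doc = 0.012                   # USD, per the post's estimate
total_cost = cost_per_doc * docs       # 60 USD for the full batch

# For comparison: the 707 documents actually processed should have
# used about 7 million tokens, not the ~35 million that were billed.
expected_tokens_707 = 707 * tokens_per_doc
```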
After 10 minutes I noticed that the cost was already over 100 USD, with only about 707 documents processed. Shocked, I stopped the process and took some time to dig deeper.
I used the API to inspect all threads, the messages created under those threads, and the runs themselves.
With gpt-3.5, almost all runs failed: only 33 out of the 707 had the status “completed” without any issues.
Many of the runs produced up to 21 messages, burning through the tokens! Normally there should be only one “response” message per run, i.e. two messages per thread in total and one per run. On average we had 7.2 messages per run.
- Runs with status “cancelled” but a lot of messages created
- Runs with status “completed” but multiple (up to 21) very similar messages generated. All of these messages share the same run_id. The messages themselves look fine, and I have no idea why the model regenerates them.
- Runs with status “failed” and no messages; “last_error” = rate_limit_exceeded: Rate limit reached for gpt-3.5-turbo-16k in organization org-JfHjVQvS0VBIEGRAYcgitAoO on tokens_usage_based per min: Limit 1000000, Used 995047, Requested 14138. Please try again in 551ms. Visit https://platform.openai.com/account/rate-limits to learn more.
Most likely these runs are rejected by OpenAI because other runs are still pending (and generating messages over and over again without any obvious reason). It is unclear why this happens; the duplicated messages under a thread all carry the same run_id.
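A minimal sketch of the kind of audit I ran per thread (field names per the v1 Python SDK; a healthy run should yield exactly one assistant message, so any higher count flags the duplication described above):

```python
from collections import Counter

def audit_thread(client, thread_id):
    """Return (run_id, status, assistant_message_count) per run in a thread.

    `client` is an openai.OpenAI() instance; runs and messages are
    listed via the beta Assistants endpoints.
    """
    runs = client.beta.threads.runs.list(thread_id=thread_id)
    messages = client.beta.threads.messages.list(thread_id=thread_id)
    # Count only assistant messages, grouped by the run that created them
    per_run = Counter(
        m.run_id for m in messages.data if m.role == "assistant"
    )
    return [
        (run.id, run.status, per_run.get(run.id, 0))
        for run in runs.data
    ]
```

Any row with a count greater than 1 (or a status of “failed”/“cancelled” alongside created messages) corresponds to the anomalies listed above.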
- What is wrong here? Why does it work with gpt-4 but not with the gpt-3.5 model?
- Why is the model creating multiple messages for one run?
- Is it normal that messages get created several times during a run?
- How can you prevent the model from re-creating messages several times?
- Will I get my money back?
- Why should I pay for a failed run? We even see chat completion timeouts on non-beta products and have to pay for them!
- Who should I contact to get a refund for the tokens?
- Is this normal for beta software? I have been a developer for over 20 years but have never seen anything like this.
Here is a screenshot of my Excel file, which gave me insight into the threads and the messages they produced.
I used the API to create this report myself.
You can see that there are no issues under gpt-4, but as soon as you switch to gpt-3.5 it starts randomly re-creating messages over and over again.
In this screenshot you can see that the same run_id produces multiple messages.
I can provide all details, including my Excel file, to the devs and help dig deeper.