Gpt-3.5-turbo-1106 is very slow

@TonyAIChamp In my case:

  1. the response format is set to json_object
  2. JSON output is also mentioned in my prompt
  3. temperature is set to 0.2

If it was a problem with the prompt, wouldn’t it fail all the time? And it’s a timeout, not an error. In your shared sample, you set the timeout to 5; what is the unit there?

If it was a problem with the prompt, wouldn’t it fail all the time? And it’s a timeout, not an error.

Actually, no. Please check above: Gpt-3.5-turbo-1106 is very slow - #12 by TonyAIChamp

In your shared sample, you set the timeout to 5; what is the unit there?


@TonyAIChamp Could you share the prompt that works without timeout?

My system prompt:

Reply in JSON format with the following structure:

  {
    "title": "title of the book",
    "author": "name of the author"
  }
// other instructions

model: gpt-3.5-turbo-1106
temperature: 0
response_format: {'type': 'json_object'}

Still get timeout every few times.
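For illustration, here is a minimal, self-contained sketch of that setup. The payload mirrors the settings above, the `parse_reply` helper is hypothetical, and no actual API call is made; with the openai v1 Python client you would pass this payload to `client.chat.completions.create(**payload)`:

```python
import json

# Hypothetical request payload matching the settings above; the real
# call would go through the openai client and is not made here.
payload = {
    "model": "gpt-3.5-turbo-1106",
    "temperature": 0,
    "response_format": {"type": "json_object"},
    "messages": [
        {"role": "system", "content": (
            "Reply in JSON format with the following structure:\n"
            '{"title": "title of the book", "author": "name of the author"}'
        )},
        {"role": "user", "content": "Who wrote Dune?"},
    ],
}

def parse_reply(raw: str) -> dict:
    """Validate that the model reply is the expected JSON object."""
    data = json.loads(raw)  # raises ValueError on garbage output
    missing = {"title", "author"} - data.keys()
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return data

# Example with a well-formed reply string:
book = parse_reply('{"title": "Dune", "author": "Frank Herbert"}')
print(book["author"])  # Frank Herbert
```

Validating the reply locally like this also makes it obvious whether a failure was a timeout or a malformed response.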

Are you streaming the answer so that you can see the progress?

Because now you can’t check the usage page every five minutes to see what you were charged in tokens: that existing functionality was deliberately obscured, a modification meant to harm the user and make ill-conceived products such as “assistants” unaccountable.

If you aren’t streaming the completion, you won’t see that the AI model has gone into a loop of nonsense tokens until it fills up the context length and times out on you without returning an answer.

Setting your own timeout will ensure you never see the 4000 “\n\n\n” tokens that you paid for. That is the likely result of a new AI model on the edge of breaking all the time, so much so that although it was “trained” to produce JSON, it still needs to be instructed not to produce garbage.

I would see if you can just prompt for the JSON output that you want the AI to produce, and NOT use the separately-trained json-output model that doesn’t work right, and which is invoked by the response_format parameter.
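If you do stream, you can guard against that runaway-newline loop yourself. Here is a generic sketch where a plain iterable stands in for the streamed token chunks (with the real API you would iterate over the `stream=True` response instead); the threshold is an arbitrary choice:

```python
def collect_stream(tokens, max_whitespace_run=20):
    """Accumulate streamed tokens, aborting early if the model gets
    stuck emitting nothing but whitespace (the "\n\n\n" loop above)."""
    out = []
    whitespace_run = 0
    for tok in tokens:
        if tok.strip() == "":
            whitespace_run += 1
            if whitespace_run >= max_whitespace_run:
                raise RuntimeError("model looping on whitespace tokens, aborting")
        else:
            whitespace_run = 0
        out.append(tok)
    return "".join(out)

# A healthy stream completes normally:
print(collect_stream(['{"a"', ':', ' 1}']))  # {"a": 1}

# A runaway stream is cut off instead of billing the whole context:
try:
    collect_stream(['{'] + ['\n'] * 100)
except RuntimeError as e:
    print(e)
```

The point is that streaming lets you abort after 20 junk tokens instead of paying for thousands before a timeout fires.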


Yes, here is my bug report, which focuses on the excessive usage of tokens due to messages being re-created over and over again.

                You always reply in the json format with the fields:
                - ai_message: your message to the Human
                - status: assessment status
                - status_comment (very detailed comment to the status, not to be shown to the Human)

This is one of the examples. In some other ones I show actual JSON, like you do.


I like this idea of providing the JSON structure. It might be more intuitive to the model than getting bogus JSON output back.


Hey, so I am currently creating batches and running requests asynchronously. Our company is Tier 4 and I’m trying to get closer to the 1 million tokens-per-minute limit.

The issue I’m experiencing is very long response times for single requests in the async batch (several minutes). This currently happens exclusively with the new 1106 model. GPT-3.5-turbo-0613 does not run into this issue, and I get much higher throughput with it (roughly 4x, I would guess).

Now to my question: since the new model is slower, will this change? Also, do the new 1106 API prices also apply to GPT-3.5-turbo-0613, or is 0613 currently more expensive than 1106?
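A semaphore-capped gather is one common way to keep an async batch from firing everything at once against a tokens-per-minute budget. This is a generic sketch with a stand-in worker rather than real API calls; the concurrency cap of 8 is an arbitrary assumption:

```python
import asyncio

async def run_batch(prompts, worker, max_concurrent=8):
    """Run worker(prompt) coroutines concurrently, capped by a
    semaphore so a rate limit isn't blown in a single burst."""
    sem = asyncio.Semaphore(max_concurrent)

    async def guarded(p):
        async with sem:
            return await worker(p)

    # gather preserves the order of the input prompts
    return await asyncio.gather(*(guarded(p) for p in prompts))

# Demo with a stand-in worker instead of a real API call:
async def fake_worker(prompt):
    await asyncio.sleep(0.01)  # pretend latency
    return prompt.upper()

results = asyncio.run(run_batch(["a", "b", "c"], fake_worker))
print(results)  # ['A', 'B', 'C']
```

With a cap like this, one multi-minute 1106 straggler only blocks its own slot instead of the whole batch.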


Hello, I have a similar issue: I have a new use case that only 1106 can do well, but I’m unable to migrate to 1106 because about 20% of the time the responses take over 20 seconds. For 0613 and 0301, this happens 1% of the time or less. What is the ETA for a fix?

This will get better with time and more hardware and software refinement. At this stage it’s all beta and should be treated as such. While beta development can be difficult, it does give you access to high-value markets with little in the way of direct competition, at least to start with, and the first-to-market advantage is often enough to carry products over the line for the long term.

Those are feel-good words.

We can only hope it gets better after introduction, just as one might only have been able to foresee GPT-4 getting better after its initial release.


Works fine for me.


I cannot stop being amazed at how people complain about the magic wand they were suddenly given, because the free socks they wish from the wand came in a slightly wrong shade of color…


So this is still happening. I switched back to the old 3.5-turbo and it doesn’t have the problem.

Well, OpenAI is a company that offers its services and asks for payment, so I’d expect some ‘quality’ to their services.

It’s not just about complaining and whining. The earlier model does not have the slowness issue, but the newer one does. As we’re building commercial applications based on that, we’d like this problem to be fixed. Right now, with those delays, the model is simply unusable for lots of users (me included).

OpenAI is a commercial company now. If a service is substandard (and it is; I’d never experienced delays like these before), then customers will complain until it’s fixed. Looks normal to me.


The whole time I thought my code was the problem. I switched back to the old model, and everything is okay.

Don’t blame your code; unless there’s an actual syntax error, the issue is confirmed to be OpenAI’s, not your code’s :rofl:


Adding a 10-second timeout seems to resolve it.

A ten-second timeout will result in an error being thrown while the AI is still writing a long response.

It might only seem to work for you because the Python library has retries built in and hidden from you.
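To make those hidden retries explicit, you can wrap the call yourself. A minimal sketch, where the `flaky` stand-in simulates an API call that times out twice before succeeding; the timeout, retry count, and backoff values are arbitrary assumptions:

```python
import time

def call_with_retries(fn, timeout_s=10.0, max_retries=3, backoff_s=0.0):
    """Retry fn(timeout=...) explicitly instead of relying on the
    client library's hidden retries, so each failure is visible."""
    last_exc = None
    for attempt in range(1, max_retries + 1):
        try:
            return fn(timeout=timeout_s)
        except TimeoutError as exc:
            last_exc = exc
            print(f"attempt {attempt} timed out after {timeout_s}s")
            time.sleep(backoff_s)
    raise last_exc

# Stand-in for an API call that times out twice, then succeeds:
calls = {"n": 0}
def flaky(timeout):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("request timed out")
    return "ok"

print(call_with_retries(flaky))  # ok
```

Logging each attempt this way shows you whether a “fixed” timeout is actually just silent retries burning extra tokens.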

Friends, I just confirmed: OpenAI released 3.5-turbo-1106 just to annoy devs! I initially thought you were all making noise over nothing, because I was busy trying to defend the company while studying the Assistants API in the playground. The playground seems reasonably hospitable to a user; the API is really hostile, eh? It will literally choose when to reply to a user and when to even process your input tokens. It was released just to drink input tokens without delivering results. I think we can all agree that anyone can decide to stay on the previous model and keep sorting out their input tokens and contexts, because 1106 is definitely drunk.