Slow response from the GPT-3.5 Turbo API

Hi All,
I’m developing an application that sends moderate-size Word documents to the GPT-3.5 Turbo API and receives a response. The problem is that processing one request takes about 80 seconds, which is huge given that I intend to make this a corporate-grade application that would need to process hundreds of documents, so it will be far too slow for users. Any clue how to speed things up?

Sometimes the API can be slow due to high demand. You could try using Azure OpenAI, which is approximately twice as fast for half the cost.

If you don’t want to (or can’t) use Azure OpenAI, you could try breaking your requests into smaller bits. For example, instead of processing the entire document at once, you could split it into, say, five pieces and send them to the API separately.
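A minimal sketch of that splitting step (the chunk size and overlap values here are just assumptions — tune them to your documents):

```python
# Sketch: split a large document into word-based chunks before sending each
# piece to the API as its own request. chunk_size/overlap are placeholder values.

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping word chunks so context isn't lost at boundaries."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

document = "word " * 1200  # stand-in for a moderate-size document
pieces = chunk_text(document.strip())
print(len(pieces))  # each piece can now be sent as a separate API request
```

Each chunk can then go into its own chat-completions request, which also lets you parallelize them.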

Additionally, AI services employ a technique called streaming, where the text is generated progressively as the response unfolds, creating the effect you see in ChatGPT. It’s more visually appealing and would give your users a better experience.
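The consumption pattern looks roughly like this — `stream_reply` shows the shape of the real call with the openai Python client (the model name is an assumption), while `consume` holds the generic loop, demonstrated on a stand-in iterator so you can see it without an API key:

```python
# Sketch of consuming a streamed response token-by-token.

def consume(deltas) -> str:
    """Print tokens as they arrive and return the full text."""
    parts = []
    for delta in deltas:
        if delta:  # deltas can be None/empty at the start or end of a stream
            print(delta, end="", flush=True)
            parts.append(delta)
    return "".join(parts)

def stream_reply(client, prompt: str) -> str:
    # Real call shape with the openai Python client's streaming mode:
    stream = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    return consume(chunk.choices[0].delta.content for chunk in stream)

# Stand-in stream so the loop can be exercised without an API key:
fake_stream = iter(["Hel", "lo", None, " world"])
full = consume(fake_stream)
```

Streaming doesn’t make generation faster overall, but the user sees the first words almost immediately instead of staring at a spinner for 80 seconds.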


Thanks a lot, I thought Azure OpenAI would be more expensive.

Indeed, speed and performance can fluctuate quite a bit. Here are a few tips and reference links you could have a look at:

  • Data Chunking: For processing large texts, consider chunking them into smaller, manageable pieces without losing context. I remember finding this video pretty useful: video on Data Preparation for LLMs.
  • Asynchronous Processing: Handle API responses in the background. Asynchronous Programming.
  • Enterprise Solutions: Have a read of this NVIDIA blog post on their approach to retrieval-augmented generation apps.

Hi aelfouly. You have a good question and one that is only hinted at within your account and documentation.

Usage tiers

You can view the rate and usage limits for your organization under the limits section of your account settings. As your usage of the OpenAI API and your spend on our API goes up, we automatically graduate you to the next usage tier. This usually results in an increase in rate limits across most models. Organizations in higher tiers also get access to lower latency models.

What OpenAI doesn’t articulate clearly is “those that haven’t paid at least $50 into their account will now have their organization set to a reduced token production rate, slowing the responses from AI models.”

On that linked limit page, it will show how to increase to the next tier at the bottom (by more prepayment), but won’t say exactly the additional amount needed - you have to do the math yourself.

“Assistants” themselves are slow if you are using that, especially when you are asking the API to process files into text the AI can understand. Then the assistants AI is also slowed, needing to chunk and browse that text if large. There is also no reporting of per-run tokens consumed to let you understand assistants API call costs.

Corporate grade means making your own vector database with knowledge, and using semantic search to return relevant results to a user’s input.
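A toy illustration of that semantic-search idea — the vectors here are made up; in practice you would generate embeddings with an embeddings model and store them in a real vector database:

```python
# Minimal sketch of semantic search over a small in-memory "vector database".
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy "knowledge base": snippet -> embedding (placeholder vectors)
knowledge = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "warranty terms": [0.8, 0.2, 0.1],
}

def search(query_vec: list[float], top_k: int = 2) -> list[str]:
    """Return the top_k snippets most similar to the query embedding."""
    ranked = sorted(knowledge, key=lambda k: cosine(query_vec, knowledge[k]), reverse=True)
    return ranked[:top_k]

hits = search([1.0, 0.0, 0.0])  # a query "about refunds", as a toy vector
print(hits)
```

The retrieved snippets are then placed into the prompt, so the model only has to read the relevant pieces instead of whole documents.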

This is really informative, thank you so much.

Have a look at one of the community leaders’ tips on thanking, with references provided in the thread. I’ll link it here:


Thanks, but this talks about the limits, not the processing speed. What I’m facing is 80 seconds of delay when processing a single request, which could be drastic when scaled up to a large number of documents. Of course the limits problem will appear then, but it’s not what I’m facing now. I need the request processed near-instantaneously, the same way ChatGPT processes it.

I did some quick testing using a free-tier account. 3.5-turbo-1106 has decent speed. 3.5-turbo outputs at half speed, which is not as bad as it was in October and November.

To diagnose a speed issue, the first thing to do is go to the playground and compare streaming speed with free ChatGPT.

Playground IS your API account and its capabilities.

So you get an evaluation similar to what you’d see if you programmed it yourself. Programming it yourself would let you measure output speed better, though, in terms of the time to get the first token and then the token generation rate after that, when using the chat completions endpoint with streaming.
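Measuring those two numbers is simple once you’re streaming. A sketch — the stream here is a stand-in generator; with the real API you would iterate the chat-completions stream the same way:

```python
# Sketch: measure time-to-first-token, then the overall token rate.
import time

def measure(stream):
    """Return (seconds until first token, tokens per second overall)."""
    start = time.monotonic()
    first_token_at = None
    count = 0
    for token in stream:
        if first_token_at is None:
            first_token_at = time.monotonic() - start
        count += 1
    elapsed = time.monotonic() - start
    rate = count / elapsed if elapsed > 0 else float("inf")
    return first_token_at, rate

def fake_stream():
    # Stand-in for an API stream: 5 tokens, 10 ms apart
    for _ in range(5):
        time.sleep(0.01)
        yield "tok"

ttft, rate = measure(fake_stream())
print(f"first token after {ttft:.3f}s, ~{rate:.0f} tokens/s")
```

Time-to-first-token is mostly queueing/prompt processing; the rate after that is what your tier’s throttling actually affects.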

However, I think the overall concern in this topic might be the use of “assistants”, which has not been clarified. Assistants do not stream interactively for user satisfaction; instead you must wait for the entire queued job to be done.

Comparison to ChatGPT? ChatGPT has its own independent qualities. It can show a fast response while your API account is being throttled for being at low payment tier - or ChatGPT can be down and nonfunctional while your API account and model remains in service. They might be testing a model unavailable to you.