I’m making HTTP calls to the gpt-4 model API from a NextJS app. They work quite well, with one exception: the wait for the server response. When I look at the network requests and responses in the browser dev tools, all metrics are in the milliseconds range, but “waiting for server response” is in the 20 to 30 second range. The queries involve simple math and algebra questions, and total tokens are around 1,000. Anyone have any suggestions? [UPDATE: In reviewing the OpenAI online help, I found this regarding rate tiers: “Organizations in higher tiers also get access to lower latency models.” Since I only started testing the API recently and have only spent about US$12 so far, that may explain the slow response: I’m in a low usage tier, which presumably puts me on the higher-latency side.]
You could listen for a streaming response instead of waiting for the full completion to be generated.
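Roughly, that means setting `stream: true` in the request and reading the body as server-sent events instead of one JSON payload. Here’s a rough sketch over plain fetch from a Node/NextJS route handler (untested; the model and prompt are placeholders, and the line-by-line parsing is simplified — a real implementation should buffer chunks that split a JSON line):

```ts
// Sketch: streaming a Chat Completions response with fetch (Node 18+ / NextJS route handler).
// Assumes OPENAI_API_KEY is set in the environment; error handling omitted.
const res = await fetch("https://api.openai.com/v1/chat/completions", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
  },
  body: JSON.stringify({
    model: "gpt-4",
    stream: true, // server-sent events instead of one final JSON body
    messages: [{ role: "user", content: "Solve 3x + 5 = 20 for x." }],
  }),
});

// The body arrives as lines like `data: {...}`, ending with `data: [DONE]`.
const reader = res.body!.getReader();
const decoder = new TextDecoder();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  for (const line of decoder.decode(value).split("\n")) {
    const data = line.replace(/^data: /, "").trim();
    if (!data || data === "[DONE]") continue;
    const token = JSON.parse(data).choices[0]?.delta?.content;
    if (token) process.stdout.write(token); // tokens show up as they are generated
  }
}
```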
May I know what method you are using?
Chat Completion or Assistant?
I’ll look into that, thanks. That said, the response takes less than a second to download, so it’s not a lot of content. Attached is a screenshot of typical response timing.
I’m using Chat Completion. I get similar results with the gpt-4 and gpt-4-1106-preview models.
I suppose streaming might give you a better experience, since at least you can see that it’s responding. That said, a response time below 1 minute for 1k tokens seems quite reasonable.
I just gave both Chat Completion and Assistant a shot. Both came in below 1k for a simple interaction like “give me a code snippet to parse a CSV file in Python”.
Thanks very much for your help. I’ve been developing the app for a few weeks and have run hundreds of similar math/algebra queries as part of testing. In the Playground or ChatGPT-4 the LLM responds in seconds; it’s only the API calls that are slow, and the problem has gotten worse in the last few weeks. That suggests to me OpenAI isn’t handling the API request queue very well, but it doesn’t seem many other people are having a similar problem.
I think streaming isn’t so much about download speed as about how soon the server begins sending the response. Do you wait for the full 1,000 tokens to be generated, or start receiving after the first ten? If that makes sense.
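If you go through the official openai Node package, it handles the event-stream plumbing for you. Something like this, as a rough sketch (v4-style client, untested; the model and prompt are placeholders):

```ts
// Sketch: streaming with the official openai npm package (v4+).
// The client reads OPENAI_API_KEY from the environment by default.
import OpenAI from "openai";

const openai = new OpenAI();

async function main() {
  const stream = await openai.chat.completions.create({
    model: "gpt-4",
    stream: true,
    messages: [{ role: "user", content: "Solve 3x + 5 = 20 for x." }],
  });

  // Deltas arrive as tokens are generated, so time-to-first-token stays short
  // even if the full answer takes 20-30 seconds to finish.
  for await (const chunk of stream) {
    process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
  }
}

main();
```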
Thanks, I’m just using the axios library for async calls. I’ll try the OpenAI library (which supports streaming) and see if that helps. I appreciate your response.