Working with a developer on a web app. It's taking somewhere between 50 seconds and a minute and a half to generate the completions.
This isn't acceptable; no one is going to wait around that long on the internet for content to load.
I've been told we can use a lesser model or try to reduce the tokens… not sure how to proceed, but it feels like we're at a bit of an impasse if we can't get the speeds down.
Any ideas?
Have you tried using streaming responses? That gets the first token to the user much faster. I have a little wrapper that makes it easier in TypeScript/JavaScript.
It should be relatively easy to do in Python too, if you like, with a little bit of effort.
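The poster's wrapper itself isn't shown, but as a rough sketch, streaming with the official openai Node library (v4 API) looks roughly like this; the model name and prompt here are placeholders:

```typescript
import OpenAI from "openai";

// Reads OPENAI_API_KEY from the environment.
const openai = new OpenAI();

async function main() {
  // stream: true returns an async iterable of chunks
  // instead of a single response object.
  const stream = await openai.chat.completions.create({
    model: "gpt-3.5-turbo", // placeholder; use whatever model you're on
    messages: [{ role: "user", content: "Tell me about New Zealand." }],
    stream: true,
  });

  // Tokens arrive as they are generated, so the user sees output
  // within a second or two instead of waiting for the full completion.
  for await (const chunk of stream) {
    process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
  }
}

main();
```

Note that streaming doesn't change the total generation time; it just cuts the perceived latency, because the first tokens show up almost immediately.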
Will your wrapper work if the completion is in a language other than English?
I don't see any reason why it shouldn't. In my service, users write in whatever language they want, and in the background I use this wrapper.
What model are you using? What size of prompt (input and output)? Multiple parallel requests?
I have a developer working on my project. Are you available to review and help? We could discuss payment if you can identify the problem and provide a solution.
It would be odd if your developer didn't know how to implement this, but if they run into issues, I'd be glad to help.
I feel your pain; it's mainly GPT-4 that has the slow response times. Another thing I try is getting answers in shorter form by giving strong commands in the system message. Without streaming, the response is only sent once the whole message has finished, and often it's padded with all sorts of irrelevant courtesies.
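A minimal sketch of that idea, again assuming the openai Node library; the exact system-prompt wording and the max_tokens cap are illustrative choices, not the poster's actual settings:

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Ask for terse answers up front, and cap output length as a safety net.
async function terseAnswer(question: string): Promise<string | null> {
  const res = await openai.chat.completions.create({
    model: "gpt-3.5-turbo", // placeholder model
    messages: [
      {
        role: "system",
        content:
          "Answer in at most two sentences. No greetings, apologies, or closing pleasantries.",
      },
      { role: "user", content: question },
    ],
    max_tokens: 120, // hard cap; fewer output tokens means a faster response
  });
  return res.choices[0].message.content;
}
```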
Just one more tip, just in case:
If the completion is shorter, the response speed increases.
The underlying reason is that models like GPT-4 generate output sequentially, one token at a time, so longer outputs naturally take longer.
That explains the difference in completion speed between prompts like "Tell me about New Zealand?" and "Hello.": the latter completes roughly five times faster.
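You can see this for yourself with a quick timing harness like the hypothetical one below; the model name and prompts are placeholders, and the actual numbers will vary with load:

```typescript
import OpenAI from "openai";

const openai = new OpenAI();

// Time a single non-streaming completion and report how many tokens it produced.
async function timeCompletion(prompt: string): Promise<void> {
  const start = Date.now();
  const res = await openai.chat.completions.create({
    model: "gpt-4", // placeholder; latency differs a lot between models
    messages: [{ role: "user", content: prompt }],
  });
  const tokens = res.usage?.completion_tokens ?? 0;
  console.log(`"${prompt}" -> ${tokens} tokens in ${Date.now() - start} ms`);
}

async function main() {
  await timeCompletion("Hello.");                     // short answer, fast
  await timeCompletion("Tell me about New Zealand."); // long answer, slow
}

main();
```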