Working with a developer on a web app. It is taking something like 50 seconds to a minute and a half to generate the completions.
This is not acceptable; no one is going to wait around that long on the web for content to load.
I have been told we can use a smaller model or try to reduce the token count… not sure how to proceed, but it feels like we are at a bit of an impasse if we can't get speeds up.
But it should be doable relatively easily in Python too if you like, with a little bit of effort.
Will your wrapper work if the completion is in a language other than English?
I don't see any reason why it shouldn't. In my service, users write in any language they want, and in the background I use this wrapper.
What model are you using? How large are your prompts and completions (input and output tokens)? Are you making multiple parallel requests?
I have a developer working on my project. Are you available to review and help? We could look at payment if you can identify the problem and provide a solution.
It would be odd if your developer did not know how to implement this, but if they run into issues, I would be glad to help.
I see your pain; it's mainly GPT-4 that has slow response times. Something else I try is getting shorter answers by giving strong instructions in the system message. The response is only sent once the whole message has finished, and oftentimes it is filled with all sorts of irrelevant courtesies.
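Since the response only arrives once the whole message has finished, one way to improve *perceived* latency is to render tokens as they stream in rather than waiting for the end. Here is a minimal sketch of that consumption pattern; the chunks and the `stream_completion` helper are stand-ins for illustration (the real OpenAI client exposes a similar behavior via a streaming option, but this example deliberately avoids any network call):

```python
# Sketch: show partial text as pieces "arrive" instead of waiting for the
# full completion. `stream_completion` and the fake chunks are illustrative
# stand-ins, not the actual client API.

def stream_completion(chunks):
    """Yield the growing partial text as each piece arrives,
    so a UI could render output incrementally."""
    shown = []
    for piece in chunks:
        shown.append(piece)       # append the newly arrived piece
        yield "".join(shown)      # current partial text so far

# Simulated chunks standing in for server-sent deltas
chunks = ["Hel", "lo, ", "world", "!"]
partials = list(stream_completion(chunks))
print(partials[-1])  # full text once the stream ends: Hello, world!
```

The user sees `Hel`, then `Hello, `, and so on; total generation time is unchanged, but the wait before *anything* appears drops to the time of the first chunk.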
Just one more tip, just in case:
If the completion is shorter, the response speed increases.
The underlying reason is that models like GPT-4 generate output sequentially, one step at a time, which naturally takes longer for longer outputs.
That explains why there is a difference in completion speed between queries like "Tell me about New Zealand?" and "Hello."
For instance, “Hello.” is completed approximately five times faster.