So it turns out this is not a performance issue, it's just the nature of transformers.
If you make a plain POST request instead of streaming, the model has to generate the response token by token, autoregressively feeding the text generated so far back in to predict the next token. Each iteration takes roughly 0.1 to 0.5 seconds, so a large paragraph can take 15 to 45 seconds to fully generate.
To mitigate this, use HTTP streaming to send the response to the user as it is generated, so they see each word appear, know something is happening, and don't lose patience.
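As a rough sketch of the difference (using a toy token generator as a stand-in for the model, not the actual API), streaming lets the client show each token the moment it is produced, while a plain request shows nothing until the whole response exists:

```python
import time

def generate_tokens(prompt, n_tokens=5, delay=0.01):
    # Stand-in for an autoregressive model: each token costs `delay`
    # seconds, mimicking the 0.1-0.5 s per-token cost described above.
    for i in range(n_tokens):
        time.sleep(delay)
        yield f"token{i} "

# Non-streaming: nothing is shown until the full string is assembled.
full = "".join(generate_tokens("hello"))
print(full)

# Streaming: each token is displayed as soon as it is generated,
# so the perceived latency is one token, not the whole paragraph.
for token in generate_tokens("hello"):
    print(token, end="", flush=True)
```

The total generation time is the same either way; streaming only changes when the user starts seeing output.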
I am getting a bit worried. I am currently building an application on GPT-3 that needs 2 to 5 requests to the completion endpoint to produce a final result. Each request is taking increasingly more time; I make them in parallel, but any of them can take from 12 to 40+ seconds, and the last request is sequential, so I can't avoid adding those extra 12 to 40+ seconds on top. Is there any chance this situation is temporary, and is OpenAI working on improving performance? I need to know because I am making a serious investment in this project. I really appreciate any help you can provide.
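For what it's worth, the structure you describe can be sketched like this (with `call_completion` as a hypothetical stand-in for one API call, and tiny sleeps in place of the real 12-40+ second latencies): the independent requests cost only as much as the slowest one, while the dependent final request is added on top no matter what.

```python
import asyncio
import time

async def call_completion(name, seconds):
    # Hypothetical stand-in for one completion request; `seconds`
    # simulates its latency.
    await asyncio.sleep(seconds)
    return f"{name} done"

async def pipeline():
    start = time.monotonic()
    # Independent requests run concurrently: this stage takes roughly
    # as long as the slowest request, not the sum of all of them.
    first_stage = await asyncio.gather(
        call_completion("a", 0.03),
        call_completion("b", 0.05),
        call_completion("c", 0.02),
    )
    # The final request depends on the earlier results, so its latency
    # is unavoidably added on top of the first stage.
    final = await call_completion("final", 0.04)
    return first_stage, final, time.monotonic() - start

results, final, elapsed = asyncio.run(pipeline())
```

So the best case for end-to-end latency is max(parallel requests) + final request, which is why the last call's 12-40+ seconds can't be hidden.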