I’m working on an app that uses Davinci Codex. I’ve noticed that each API call to https://api.openai.com/v1/engines/davinci-codex/completions takes approximately 30–50 seconds.
My app is deployed on Vercel and uses a Vercel Serverless Function as a middle layer (this is to hide my OpenAI API key from the user).
The issue is that Vercel requests automatically time out after 10 seconds. This timeout cannot be changed and is strictly enforced by Vercel. My app works locally (with the same latency), but when it’s deployed, requests time out.
Is there a way to reduce the latency to less than 10 seconds?
Unfortunately, the reason Davinci-Codex takes so long to respond is also the reason it is more accurate than the Cushman-Codex model.
I looked at the documentation on Vercel’s website, and the Enterprise plan does increase the serverless execution timeout to 30 seconds, but I’m not sure if that’s an option for you.
I wonder if setting up your function call to stream the data in chunks, instead of receiving it all at once, would let the response start arriving before the max timeout limit is hit.
Does anyone know if that would work? I haven’t implemented streaming in my own workflow yet, but I plan to.
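For what it’s worth, here’s a rough sketch of what that could look like with the completions endpoint, assuming Node 18+ (built-in `fetch`) and an `OPENAI_API_KEY` environment variable — both are my assumptions, not from the thread. With `stream: true` the endpoint sends server-sent events, one `data: {...}` line per chunk, ending with `data: [DONE]`:

```javascript
// Pure helper: extract the completion text from one SSE line,
// or null if the line carries no completion text.
function parseSSELine(line) {
  if (!line.startsWith("data: ")) return null;
  const payload = line.slice("data: ".length).trim();
  if (payload === "[DONE]") return null;
  try {
    const json = JSON.parse(payload);
    return json.choices?.[0]?.text ?? null;
  } catch {
    return null;
  }
}

// Hypothetical streaming call (not a definitive implementation):
// forward chunks to the client as they arrive, so the serverless
// function starts responding well before the 10-second timeout.
async function streamCompletion(prompt, onText) {
  const res = await fetch(
    "https://api.openai.com/v1/engines/davinci-codex/completions",
    {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      },
      body: JSON.stringify({ prompt, max_tokens: 256, stream: true }),
    }
  );
  const decoder = new TextDecoder();
  for await (const chunk of res.body) {
    for (const line of decoder.decode(chunk).split("\n")) {
      const text = parseSSELine(line);
      if (text !== null) onText(text);
    }
  }
}
```

The key point is that the function writes something to the response early, instead of sitting silent for 30+ seconds.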
Unfortunately I’m not able to get the Enterprise plan. And even then, most of my requests have taken slightly more than 30 seconds, so they would still time out even with the 30-second limit.
Your “streaming data in chunks” idea sounds nice, but I’ve got no idea how to implement it with OpenAI’s APIs.
I just thought of another way, but it would require deploying part of my app on another provider. AWS Lambda functions can run for up to 15 minutes. The downside is that I’d have to redeploy my endpoint on Lambda while the site stays on Vercel.
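If I go that route, the Vercel-hosted frontend would just call the Lambda endpoint directly. A minimal sketch, assuming a placeholder API Gateway URL in front of the Lambda (the URL and the request shape are my assumptions):

```javascript
// Placeholder URL for a hypothetical API Gateway stage in front of the Lambda.
const LAMBDA_ENDPOINT =
  "https://example.execute-api.us-east-1.amazonaws.com/prod/complete";

// Build the request the frontend sends; the OpenAI API key
// never leaves the Lambda side.
function buildCompletionRequest(prompt) {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt }),
  };
}

// Only the Lambda (with its 15-minute limit) waits on OpenAI.
async function getCompletion(prompt) {
  const res = await fetch(LAMBDA_ENDPOINT, buildCompletionRequest(prompt));
  return res.json();
}
```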
Anytime! If you have any difficulties getting it set up, just let us know and we can certainly help! I personally did a lot of head-scratching when first learning AJAX and then trying to implement it correctly in my Django project. The feeling after I understood how it all worked was quite rewarding, though!
I wonder what it could be, then. I’m not too familiar with the service, but it sounds like something is definitely slowing the incoming packets: service degradation, or perhaps a firewall scanning the packets to ensure the data isn’t malicious. That is quite odd, though.
Edit: I was wrong. I didn’t know max_tokens would affect performance to that degree even when the limit isn’t reached, but I constantly practice gauging what max_tokens should be set to. That’s interesting!
@DutytoDevelop It’s not a network issue. @m-a.schenk is right.
It’s the tokens. I experimented with different max_tokens sizes and it definitely performs much faster with smaller values. Although now I’m wondering how I can re-submit after a partial completion to fully generate the code.
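One way I’m considering (a sketch based on the standard completions response shape, not an official “resume” feature): append the generated text to the prompt and submit again, repeating while `finish_reason` is `"length"`:

```javascript
// Append the partial completion to the prompt to form the next request's prompt.
function buildContinuationPrompt(prompt, response) {
  const choice = response.choices[0];
  return prompt + choice.text;
}

// Example: a completion cut off at max_tokens (finish_reason "length").
const firstPrompt = "def fizzbuzz(n):";
const partial = {
  choices: [{ text: "\n    for i in range(1, n + 1):", finish_reason: "length" }],
};
const nextPrompt = buildContinuationPrompt(firstPrompt, partial);
// nextPrompt ends exactly where the model stopped; send it as the new prompt.
```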
Not sure if this helps, but if you pass "echo": true in the request, the endpoint will return the concatenation of the prompt and the completion text as part of the response. That way you won’t have to manually edit the prompt in the request!
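For example, a request body along these lines (the prompt and max_tokens values are just illustrative) would make `choices[0].text` in the response start with the prompt itself:

```json
{
  "prompt": "def add(a, b):",
  "max_tokens": 64,
  "echo": true
}
```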
Yep, it’s the max_tokens parameter that’s causing such long request times. Total request time is roughly proportional to max_tokens, with each additional token adding around 50 ms for the current version of davinci-codex.
I recommend lowering max_tokens and potentially using cushman-codex.
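As a quick sanity check on that figure (a back-of-the-envelope sketch; the 2-second headroom for network overhead is my own assumption):

```javascript
// Rough token budget for staying under Vercel's 10-second timeout,
// assuming ~50 ms per generated token as mentioned above.
const MS_PER_TOKEN = 50;
const TIMEOUT_MS = 10_000;
const HEADROOM_MS = 2_000; // assumed slack for network + prompt processing

const maxTokensBudget = Math.floor((TIMEOUT_MS - HEADROOM_MS) / MS_PER_TOKEN);
// maxTokensBudget === 160, so max_tokens should stay well under ~200
```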