Davinci Codex on Vercel Timing out

I’m working on an app that uses Davinci Codex. I’ve noticed that each API call to https://api.openai.com/v1/engines/davinci-codex/completions takes approximately 30 to 50 seconds.

My app is deployed on Vercel and uses a Vercel serverless function as a middle layer (this keeps my OpenAI API key hidden from the user).

The issue is that Vercel requests automatically time out after 10 seconds. This timeout cannot be changed and is strictly enforced by Vercel. My app works locally (with the same latency), but once it’s deployed, it times out.

Is there a way to reduce the latency to less than 10 seconds?

Hello @sagarpatel1025,

Unfortunately, the reason Davinci-Codex takes a long time to respond is also the reason it beats the Cushman-Codex model on accuracy.

I looked at the documentation on Vercel’s website: there is an Enterprise plan that increases the serverless execution timeout to 30 seconds, but I’m not sure whether that’s an option for you.

I wonder if setting up your function call to stream the data in chunks, instead of receiving it all at once, would let the response start arriving well before the limit and work around the possibility of hitting the max timeout.
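Roughly what I have in mind, as an untested sketch (it assumes `"stream": true` makes the completions endpoint emit server-sent-event lines of the form `data: {...}`, ending with `data: [DONE]`, and Node 18+ where `fetch` is global):

```javascript
// Untested sketch: request a streamed completion so tokens arrive as
// they are generated instead of all at once.

// Pure helper: pull the completion text out of one SSE line, or return
// null for blank lines and the [DONE] terminator.
function parseSseLine(line) {
  if (!line.startsWith("data: ")) return null;
  const payload = line.slice("data: ".length);
  if (payload === "[DONE]") return null;
  return JSON.parse(payload).choices[0].text;
}

// Illustrative network loop (not exercised here): forward each piece
// onward as soon as it arrives so the connection isn't idle.
async function streamCompletion(prompt, apiKey, onPiece) {
  const response = await fetch(
    "https://api.openai.com/v1/engines/davinci-codex/completions",
    {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({ prompt, max_tokens: 256, stream: true }),
    }
  );
  const decoder = new TextDecoder();
  for await (const chunk of response.body) {
    for (const line of decoder.decode(chunk, { stream: true }).split("\n")) {
      const piece = parseSseLine(line.trim());
      if (piece !== null) onPiece(piece);
    }
  }
}
```

Only `parseSseLine` is pure; the network part is illustrative.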

Does anyone know if that would work? I haven’t implemented streaming in my own workflow yet, but I plan to.

Unfortunately I’m not able to get on the Enterprise plan. It wouldn’t help anyway: most of my requests have taken slightly more than 30 seconds, so even on the Enterprise plan they would still time out.

Your “streaming data in chunks” idea sounds promising, but I have no idea how to implement it with OpenAI’s APIs.

I just thought of another option, though it would require redeploying part of my app on another provider. AWS Lambda functions can run for up to 15 minutes before timing out. The downside is that I’d have to redeploy my endpoint on Lambda while the site itself stays on Vercel.

Maybe something is wrong with your prompt, because you shouldn’t have to wait 30–50 s for a reply.

@fvrlak that can definitely be a factor!

This post may help @sagarpatel1025:

Well, in the Playground the responses are very quick. My results are accurate to an acceptable degree, but I can still try rewriting my prompt and see if it makes a difference.

Thanks for the quick response!

Anytime! If you have any difficulties getting it set up, just let us know and we can certainly help! I personally did a lot of head-scratching when first learning AJAX and trying to implement it correctly in my Django project :sweat_smile:. The feeling once I understood how it all worked was quite rewarding, though!

Hi! We’re not seeing any service issues. A few questions to help debug:

  1. How many tokens are in your prompt?
  2. How many tokens are you requesting in your sample?
  3. Are you using the best_of parameter?

Hey,

  1. 111 tokens in the request prompt.
  2. max_tokens is 1500
  3. I’m not using the best_of parameter.

I got access to codex a few days ago but I’ve had access to gpt3 for a few months now. I recently added a new payment method since my free tokens expired.

It looks like Vercel has been having issues in their network:


Source: https://www.vercel-status.com/

Thanks for checking! But that issue was yesterday. The status page shows no issues today, and all systems are operating normally.

I wonder what it could be, then. I’m not too familiar with the service, but it sounds like something is definitely slowing the incoming packets: service degradation, or perhaps a firewall scanning the packets to ensure the data isn’t malicious. That is quite odd, though.

Edit: I was wrong; I didn’t know max_tokens would affect performance to that degree even when the limit isn’t reached. I’ll keep practicing gauging what max_tokens should be set to. That’s interesting!

@DutytoDevelop It’s not a network issue. @m-a.schenk is right.
It’s the tokens. I experimented with different max_tokens values, and requests definitely complete much faster with smaller values. Now I’m wondering how I can resubmit after a partial completion to fully generate the code.

Not sure if this helps, but if you pass "echo": true in the request, the endpoint will return the concatenation of the prompt and the completion text as part of the response. That way I won’t have to manually edit the prompt in the request!

Source: OpenAI API
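Combining that with resubmission, a rough sketch of the loop I’m considering (`callApi` stands in for the actual HTTP call; all names are illustrative):

```javascript
// Sketch of resubmitting after a partial completion: append the returned
// text to the prompt and call again until the model stops on its own.
// callApi is a hypothetical wrapper around the completions request that
// returns { completion, finishReason }; keeping the loop pure makes it
// easy to test with a fake.

async function completeFully(prompt, callApi, { maxRounds = 5 } = {}) {
  let text = prompt;
  for (let i = 0; i < maxRounds; i++) {
    const { completion, finishReason } = await callApi(text);
    text += completion;
    // "length" means max_tokens was hit mid-generation; anything else
    // (e.g. "stop") means the model finished.
    if (finishReason !== "length") break;
  }
  return text;
}
```

Capping the rounds guards against a runaway loop if the model never emits a stop.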

Yep, it’s the max_tokens parameter that’s causing such long request times. The total request time will be roughly proportional to max_tokens, with each additional token adding around 50 ms for the current version of davinci-codex.

I recommend lowering max_tokens and potentially using cushman-codex.
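To make the arithmetic concrete, a quick estimate from the ~50 ms/token figure (the constant is approximate and model-dependent):

```javascript
// Back-of-the-envelope latency estimate: request time grows with
// max_tokens at roughly 50 ms per token (approximate figure).
function estimateSeconds(maxTokens, msPerToken = 50) {
  return (maxTokens * msPerToken) / 1000;
}

// max_tokens = 1500 predicts ~75 s, far past Vercel's 10 s limit;
// max_tokens = 150 predicts ~7.5 s, just under it.
```

By this estimate, a max_tokens budget under about 200 is needed to finish inside the 10-second window.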

Just set up your own custom cluster, and the problem goes away.
