Consistent slow communication after API call

Hi! We are having trouble consistently getting quick responses from GPT-4 Vision when making calls from our ESP32 microcontroller. Two out of three times it takes 5-10 seconds before any response at all comes back to the microcontroller (not the text response, but literally any indication that the two are communicating), which used to cause our HTTP requests to time out. This seems to happen regardless of the base64 image size. Since we can stream images over the internet at ~20 FPS with the device, I don’t think it’s a Wi-Fi or hardware problem, so I am not sure what to do… We always end up getting a text response; it just takes 8-12 seconds on average, which seems slow relative to what you would see in their web UI. Please let me know if you have experienced this or if you have any ideas!

Here’s my guess:

Image models are currently slow because there’s a huge amount of data to be processed per image (compared with the text-only models), and that processing is preceded by the automatic image resizing done on the API side. A lot is likely happening behind the scenes with the vision model.

In my experience, the response time gets longer with every image added to the request, owing to data transfer time among other things.

A large value for the max_tokens param is also known to extend the response time.
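
For illustration, something along these lines is what I have in mind; this is only a rough sketch with the Python SDK (v1.x), and the model name, file path, and prompt are placeholders:

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("frame.jpg", "rb") as f:  # placeholder image file
    b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this frame briefly."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
                },
            ],
        }
    ],
    max_tokens=100,  # keeping this small tends to shorten the response time
)
print(response.choices[0].message.content)
```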

This response time might improve as vision models come out of the preview.


Thanks so much for your reply! Do you know if there is any way to resize my image/JSON before sending it, to make the image resizing faster/easier for the API (like making it the ideal size the API wants beforehand)?


Also, I know this is a stretch, but do you have any clue when they might come out of preview? Haha.

I recommend the managing images guide from the docs, as it’s quite useful for understanding how images are handled on the API side.

Also, how you choose to send an image to the model, 'url' vs 'data', affects the response time; both have their own pros and cons.
For example, if a request has a large number of images, it’s recommended to use 'url' to reduce data transit time.
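
Roughly, the two variants look like this in Python (the URL and file name are placeholders); both end up in the same image_url part, just with a web URL in one case and a base64 data URL in the other:

```python
import base64

# Option 1: a public URL -- OpenAI downloads the image itself
url_part = {
    "type": "image_url",
    "image_url": {"url": "https://example.com/frame.jpg"},  # placeholder URL
}

# Option 2: embed the bytes directly as a base64 data URL
with open("frame.jpg", "rb") as f:  # placeholder file
    b64 = base64.b64encode(f.read()).decode("utf-8")

data_part = {
    "type": "image_url",
    "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
}
```

Either dict goes into the 'content' list of a user message alongside the text part.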

Image size can be reduced by choosing optimum compression.
For example, use PNG for text and simple graphics, and JPEG or WEBP (preferred) for photographs/images with complex color gradients.
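
As a rough sketch with Pillow (the file names and quality values are only examples):

```python
from PIL import Image  # pip install pillow

img = Image.open("frame.png")  # placeholder source image

# Photographs / complex gradients: lossy formats are usually far smaller than PNG
img.convert("RGB").save("frame.jpg", format="JPEG", quality=80)
img.save("frame.webp", format="WEBP", quality=80)

# Text and simple graphics: PNG keeps edges crisp
img.save("frame_lossless.png", format="PNG", optimize=True)
```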

Haha! I’m not an OAI employee, so I don’t have the faintest idea. 🙂


Okay, thank you so much for all this information, I really appreciate it! Is it okay if I reply to this again if I eliminate the possible sources of delay you mentioned and the problem persists?


One thing not mentioned is the detail parameter: setting it to 'low' ensures that only one “tile” is used, and the image is downsized by OpenAI to fit into a 512px box. Otherwise, large images are still resized (to a larger target than 512px) and broken into several tiles as a way to give higher-quality inspection.
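
A minimal sketch of what that looks like in the request payload, assuming a base64 data URL as in the earlier snippets (the file name is a placeholder):

```python
import base64

with open("frame.jpg", "rb") as f:  # placeholder file name
    b64 = base64.b64encode(f.read()).decode("utf-8")

image_part = {
    "type": "image_url",
    "image_url": {
        "url": f"data:image/jpeg;base64,{b64}",
        "detail": "low",  # one 512px tile instead of several high-detail tiles
    },
}
# image_part goes into the "content" list of a user message,
# as in the request example earlier in the thread.
```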

Resizing and recompressing yourself (WebP, anyone?) is certainly possible, and on a device with a fast CPU it could easily improve the speed.
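
Something like this, as a sketch with Pillow (paths and quality are assumptions, and it would run off-device or on a faster host), leaves nothing for the API to resize at low detail:

```python
import base64
import io

from PIL import Image  # pip install pillow

img = Image.open("frame.jpg")  # placeholder source image
img.thumbnail((512, 512))      # shrink in place to fit a 512x512 box, keeping aspect ratio

buf = io.BytesIO()
img.save(buf, format="WEBP", quality=80)  # quality is just an example value
b64 = base64.b64encode(buf.getvalue()).decode("utf-8")
data_url = f"data:image/webp;base64,{b64}"  # ready to drop into an image_url part
```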

I don’t agree with the idea that using a URL instead of sending your own base64 is a speedup. If you already have the image (and you can be creative, prefetching and resizing as soon as your user input box gets the location), how would it be faster to have OpenAI unreliably download it from another site by itself, where the image may be much larger and out of your control? Or for you to spend time uploading it to another site first instead of sending it directly to the model at the desired destination size? The only case where you may see some acceleration is if you continue to include (and pay for) past images in a chat history, in which case there may be some web caching server-side.

The vision AI model is just downright slow, though; even plain chatting with it has pokey latency and a slow token production rate.

Okay, thanks! I totally see your point with the URL argument and will give the detail parameter a try. Also, yeah, I agree a lot of this is definitely just the model; I might switch to Claude 3 Haiku if need be…

Yes. Post an update even if it works, to help other devs cut their response times.