You are probably waiting on the response to be generated.
Set the response to max_tokens = 1 and see the processing time of an image input:
SENT:
Count the number of humans. In a list, offer a short description of each by ethnicity and features, starting at left.
Uploaded images:
brochure.jpg[first token: 3.07 seconds]
There
[elapsed: 3.08 seconds]
(gpt-4-turbo was 4.19 seconds to return the first token)