It takes me 12 seconds to get the response. Where am I going wrong?

Most of that time is probably spent waiting for the response text to be generated, not for the image to be processed.

Set max_tokens = 1 so that almost nothing is generated, then look at how long the image input alone takes to process:

Count the number of humans. In a list, offer a short description of each by ethnicity and features, starting at left.

Uploaded images:

[first token: 3.07 seconds]
[elapsed: 3.08 seconds]

(gpt-4-turbo took 4.19 seconds to return the first token)
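
If you want to take the same kind of measurement yourself, here is a minimal sketch of a timer for any streamed response. It is not the script that produced the numbers above; `time_stream` and `fake_stream` are hypothetical names made up for illustration, and `fake_stream` only simulates a delay so the example runs without an API key.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def time_stream(start_stream: Callable[[], Iterable]) -> Tuple[float, float, int]:
    """Return (seconds to first chunk, total seconds, number of chunks).

    `start_stream` is called after the timer starts, so the time spent
    sending the request and processing the input is included.
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _ in start_stream():
        if first is None:
            first = time.perf_counter() - start
        count += 1
    elapsed = time.perf_counter() - start
    if first is None:
        # The stream produced nothing; report the total time instead.
        first = elapsed
    return first, elapsed, count


def fake_stream(delay: float, n: int) -> Iterator[str]:
    """Hypothetical stand-in for an API stream: wait, then yield n chunks."""
    time.sleep(delay)
    for i in range(n):
        yield f"chunk-{i}"


if __name__ == "__main__":
    first, elapsed, count = time_stream(lambda: fake_stream(0.05, 1))
    print(f"[first token: {first:.2f} seconds]")
    print(f"[elapsed: {elapsed:.2f} seconds]")
```

With the OpenAI Python client you would, as far as I know, pass something like `lambda: client.chat.completions.create(model=..., messages=..., max_tokens=1, stream=True)` in place of the fake stream. If the first-token time is small compared to your 12 seconds, the rest is generation time, and a shorter answer or streaming the output is what will help.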