When will vision API become available?

Is there any timeline on when the API will become available for uploading images and having a conversation about them?


There’s no official timeline, but the release statements said it would come to developers “soon after”

So I’d say somewhere between 2 weeks and ~1 month, maybe :thinking:


I would be a little more conservative.

1–3 months, almost certainly by the end of the year.

It’s also something I could see being announced at Dev Day.


What’s included? Not worth building then


I’d like to know as well. I imagine they are not doing anything sophisticated on the backend. Probably just a vision classifier/describer that injects its results into an LLM, which then spits out text based on some instructions. I think a lot of people overestimate the craziness of what OpenAI is doing on the backend.

The magic of what they have built is the LLM; pretty much all of the other stuff is well known and done better by someone else. Depending on cost and need, it might be worth building it in-house. It wouldn’t be that difficult. Both Amazon and Microsoft have vision APIs you can bootstrap a project with, and you’d probably get it done way faster than waiting on the OpenAI team. It only took me about four days to integrate a local Whisper instance with Chat Completions to get a voice agent. I suspect visual inspection and format detection would be easy enough to integrate.
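As a rough sketch of that bolt-on approach: caption the image with a third-party vision API, then inject the caption into the chat prompt. Azure Computer Vision’s v3.2 `describe` call is a real endpoint, but the endpoint URL, key, and prompt wording below are placeholders, and the message format just assumes the usual chat-completions schema.

```python
import json
import urllib.request


def describe_image(image_url: str, cv_endpoint: str, cv_key: str) -> str:
    """Get a one-line caption from Azure Computer Vision (v3.2 'describe').

    cv_endpoint/cv_key are placeholders for your own resource and key.
    """
    req = urllib.request.Request(
        f"{cv_endpoint}/vision/v3.2/describe",
        data=json.dumps({"url": image_url}).encode(),
        headers={
            "Ocp-Apim-Subscription-Key": cv_key,
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    captions = body["description"]["captions"]
    return captions[0]["text"] if captions else ""


def build_messages(caption: str, question: str) -> list[dict]:
    """Inject the caption into a chat prompt; the LLM never sees pixels."""
    return [
        {
            "role": "system",
            "content": f"An image captioner described the user's image as: "
                       f"{caption!r}. Answer the user's questions about that image.",
        },
        {"role": "user", "content": question},
    ]
```

You’d pass the result of `build_messages` to a chat-completions call; anything the captioner drops (counts, positions, distances) is of course invisible to the LLM, which is the main weakness of this design.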


If I am remembering correctly, the LLM was trained on image data as well, so I think it’s a bit more sophisticated than upcycling some CLIP output.

I’ve been experimenting more with Bing Chat and Bard image uploads in anticipation of GPT-4V dropping soon and they’re starting to get good, but there’s still a lot of room for improvement.


I find this consistent developers-second approach concerning, tbh. I understand why OpenAI pushes its own products first, but these delays and limits on the API versus their own product do make me wonder how big a priority developers are for OpenAI.


I am trying to curb my expectations, but I hope it will not only describe what is in the image

input: an image of a fruit
banana: 0.76,
apple: 0.23,
orange: 0.15,
grapes: 0.08

but it can also answer my questions regarding the image

input: some random image

query: what is the purple object in the image?
output: barney the dinosaur

query: how many purple objects in the image?
output: 3

query: what is the location of the purple object in the image?
output: { top: 50, right: 327, bottom: 125, left: 85 }

query: how far is the purple object from the camera?
output: 5m

Are you saying that GPT-4V can’t do those?




ChatGPT with GPT-4V can, but the GPT API still does not offer image input, even for GPT-4.
