ChatGPT goes Multimodal! Sound and vision is rolling out on ChatGPT

Exciting news update! ChatGPT goes multimodal!


This does not mention API usage. When will I be able to submit images through the API?

Any information / estimates are helpful, thanks!


Nothing official yet; we just need to be patient. I’m sure the API will follow soon.


Very exciting news! I am looking forward to speaking with ChatGPT.


Wow. This is incredible. Although I haven’t received the update on my phone yet, I can’t wait to try out some of these features. Going on hikes, spotting birds, even discussing national wonders such as Machu Picchu just got so much more interesting :heart_eyes:

Less than a year ago, Davinci convinced me that I had to remove the brake lines on my car just so that I could remove the rotor (bad), and didn’t suggest flushing the lines before driving off (very bad). So the good ol’ mechanic test will also be interesting. Although, looking at the report, it seems like the model heavily leans towards “Nope, not doing that”. Which, I think, is fair.

I am very interested in knowing how the API will work. Will it be possible to generate and return embeddings of images? I could embed images of mushrooms for my database & determine if they are safe to eat. Start with GPT identifying what it knows and then build on top of that.
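Since no image API has been announced, here is only a rough sketch of the lookup side of that idea, assuming some future endpoint returns an embedding vector per image. The vectors, labels, and function names below are all made up for illustration; the only real technique shown is a cosine-similarity nearest-neighbour search over stored embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_label(query, database):
    """Return (label, similarity) for the stored embedding closest to query."""
    best_label, best_score = None, -1.0
    for label, vector in database:
        score = cosine_similarity(query, vector)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Hypothetical 3-dimensional vectors standing in for real image embeddings.
mushroom_db = [
    ("chanterelle", [0.9, 0.1, 0.0]),
    ("fly agaric", [0.1, 0.9, 0.2]),
]

# Hypothetical embedding of a new photo.
label, score = nearest_label([0.8, 0.2, 0.1], mushroom_db)
print(label, round(score, 3))
```

A match like this only says which stored picture the new one most resembles, so the similarity score would need a threshold and a human check before trusting it for anything like edibility.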

But, I am also worried by this. I really do appreciate their stance on identifying & discussing people. Using this someone could track and publish the actual whereabouts of public figures through public camera systems.


So exciting, can’t wait to try this!

We’re rolling out voice and images in ChatGPT to Plus and Enterprise users over the next two weeks. Voice is coming on iOS and Android (opt-in in your settings) and images will be available on all platforms.


What makes you say this? To my knowledge, capabilities like web browsing were not released to the API. I understand they are very different things, just curious if you have some extra insight.

I would imagine people will want to include images with prompts now that ChatGPT can do it. Most of the other features can already be done, and to some extent so can multimodal, but it would be nice to have a fully integrated image and text API. We’ll have to wait and see.

March 2023, gdb. (No, ChatCompletions does not support submitting a list.)

Reading through the documentation I found that they gave beta access for Be My Eyes. I think it is amazing seeing all the ways this wonderful new technology can help people.

So now it has eyes and ears. Much closer to having an actual understanding of what an apple is. Looking forward to trying that ASAP.


What do you mean? It’s definitely an iPod.


I’m eagerly awaiting the API. The fact that ChatGPT is becoming multimodal is truly amazing. However, without access to the APIs, my options are limited. Therefore, my current task is to persuade my boss and colleagues that the API isn’t available yet. Often, when they come across information from OpenAI, they assume the APIs are already prepared and stable. :sweat_smile:

You’ll have to brace yourself for a few more weeks :wink:

Plus and Enterprise users will get to experience voice and images in the next two weeks. We’re excited to roll out these capabilities to other groups of users, including developers, soon after.

(Emphasis is mine)


Watching the video about image chatting in “ChatGPT can now see, hear, and speak” got me thinking…

The thumbnail shows the image zoomed in with a part circled. I initially thought this was going to be from ChatGPT.

While it was still very impressive, it got me thinking—how awesome would it be if you could send ChatGPT a picture of something and it could draw on the image (circles, arrows, etc) to point things out to you…

Especially if it was able to connect in to DALL-E to produce illustrated guides.

Hell, connect it to the Internet too.

In the future, I imagine a model will:

  • Accept the picture of the bike
  • Identify the bike brand and model
  • Locate the manual for the bike
  • Provide detailed and illustrated step-by-step instructions for lowering the seat including a picture and description of the required tool

In the far future maybe it’ll create a quick tutorial video where an avatar demonstrates lowering the seat on an exact copy of the bike…


It is doable. The AI already knows the position of the object in the image. If you check other object/face recognition projects on the web, they usually show a bounding box around the detected parts, even in real time. But I hope that if they implement it in ChatGPT, they’ll use a scribbled circle, as if it were drawn by pen or marker. It would be visually pleasing that way.
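As a rough illustration of that last idea, here is a minimal sketch that turns a detector-style bounding box into a slightly wobbly circle outline. It assumes the box arrives as pixel corners `(x0, y0, x1, y1)`; the function names, the wobble amount, and the example box are my own inventions, not anything ChatGPT exposes.

```python
import math
import random


def circle_around_box(x0, y0, x1, y1, padding=0.0):
    """Centre and radius of the circle passing through the box corners."""
    cx = (x0 + x1) / 2
    cy = (y0 + y1) / 2
    radius = math.hypot(x1 - x0, y1 - y0) / 2 + padding
    return cx, cy, radius


def hand_drawn_outline(cx, cy, radius, points=36, wobble=0.03, seed=0):
    """Points on the circle with small random radius jitter, pen-stroke style."""
    rng = random.Random(seed)  # fixed seed keeps the outline reproducible
    outline = []
    for i in range(points):
        angle = 2 * math.pi * i / points
        r = radius * (1 + rng.uniform(-wobble, wobble))
        outline.append((cx + r * math.cos(angle), cy + r * math.sin(angle)))
    return outline


# Hypothetical bounding box for a detected object, in pixels.
cx, cy, radius = circle_around_box(10, 20, 40, 60)
outline = hand_drawn_outline(cx, cy, radius)
print(cx, cy, radius, len(outline))
```

The resulting list of points could be handed to any drawing library as a closed polyline over the original image.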

When a plugin generates an image how can the model see it?

Oh, it’s absolutely doable but it’s another layer on top of what they’re already working on.

I don’t expect we’ll see it this year, or even maybe next.

I agree. OpenAI’s team creates incredible programs, but that takes time. After ChatGPT4’s release they had more time to work on DALL-E and recently announced DALL-E 3. They are switching their main focus between different programs, and I personally am fine with it.

I’m pretty sure they are mostly different teams working on their own products. The underlying technologies are very different.
