Chat with images is rolling out now

Can you try Where’s Wally, please? Give it an image of Where’s Wally and see if ChatGPT can locate him.


Here it is:


Nice! Not seeing it yet here, but I’m happy to wait…


I don’t know if it’s right or not.

I’m colorblind and hate these damn things. Lol.

Edit: I think I found him.

I’m not sure GPT-4V was accurate in the description. But it’s a pretty low-res image.


I asked it to automatically split the text of a cookie recipe into several paragraphs and then create an on-the-fly knowledge graph.

It does work.

Edit 1: Apparently GPT-4V and Advanced Data Analysis (Code Interpreter) are two separate models, so we cannot use the new functionality there.

Edit 2: I gave it a good photograph of information about fire alerts in German and asked it to explain it. The model auto-translated the text into English. The output was an organized mix of both languages, but this confirmed the tool can read languages other than English.

Edit 3: I then returned to an earlier conversation where an image had been uploaded, and the file was still accessible, unlike Advanced Data Analysis, where another session may cancel older ones.


Playing around with the new feature, here are some more initial findings. I will mostly focus on technical details, since those are relevant in any case.

  1. You can upload up to 4 images at a time.
  2. The maximum size of an image appears to be around 10-15 MB. When trying to upload an image that is too large, one of three things happens:
  • the request times out without further notice,
  • an error message appears after clicking “send” but before a reply is generated,
  • or the upload finishes but the model refuses to analyze the image (too large). When asked about the maximum file size for an image upload, the model will hallucinate a reply. It does not know.
    Note: It may also be that the maximum file size is currently being adjusted due to demand. Take this number with a grain of salt.
  3. It is possible to upload several large image files within a conversation and thus slightly exceed the previously stated limit. Apparently there is a per-message limit; once exceeded, the model replies with the error message before sending a reply.
  4. When a message to the model has an image attached, it is not possible to later edit that message and branch out the conversation. This very helpful option is still available for standard messages without images.
  5. The model appears to have issues with filenames. When asked to compare enlarged_full_3.png and enlarged_full_2.png, it provided the correct answer but mixed up the file names.

And on a side note:
6. The model can work with numbers, but you need to be careful. At first I asked it to identify the id of an element from a webpage using a screenshot, and it was not able to return the correct value. From there, the model started to hallucinate more wrong answers in the same reply, and regenerating the reply did not resolve the issue. Note that it was a screenshot from a 4K display, so the quality was very good.
Then I cropped the image to show only the relevant element with the number, and it was still incorrect.
Then I enlarged the cropped element by 300%, and now it was able to read it.
Then I enlarged the whole image by 300%, and it was wrong again.
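The file-size and count limits from points 1-3 can be checked client-side before wasting a request. A minimal sketch; the numbers are guesses from the observations above, not documented values, and may change:

```python
import os

# Assumed limits, inferred from trial and error above; not official.
MAX_IMAGES_PER_MESSAGE = 4
MAX_IMAGE_BYTES = 10 * 1024 * 1024  # conservative end of the 10-15 MB range

def check_upload(paths):
    """Return a list of problems that would likely make an upload fail."""
    problems = []
    if len(paths) > MAX_IMAGES_PER_MESSAGE:
        problems.append(
            f"too many images: {len(paths)} > {MAX_IMAGES_PER_MESSAGE}"
        )
    for path in paths:
        size = os.path.getsize(path)
        if size > MAX_IMAGE_BYTES:
            problems.append(f"{path}: {size} bytes exceeds {MAX_IMAGE_BYTES}")
    return problems
```

An empty result means the batch is within the observed limits; otherwise each string names the offending file or count.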


Can you send images through the API?

Not at the moment, there will be an announcement when that is supported.
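If the API does eventually accept images, the payload will presumably need the image bytes serialized somehow; base64 data URLs are a common convention for inlining images in JSON. A hypothetical sketch of just that serialization step (the actual request format has not been announced, so this is only a guess):

```python
import base64

def to_data_url(image_bytes, mime="image/png"):
    """Encode raw image bytes as a data URL, a common way to inline
    images in a JSON payload. The real request format is unknown at
    this point, so treat this as a guess at the serialization step."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"
```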


Relevant bits from a conversation export:

And, while you can upload many pictures in subsequent messages, there’s definitely some context culling happening.


I am still waiting on DALL-E 3 and Vision. I am based in Canada.

The JSON is the request part, right? Do you also have the response part? I am curious about the format of the output, what to expect. I know that this is for ChatGPT but perhaps it will retain a similar format for the API.

Here’s an example of the response:

I don’t see anything particularly interesting there, other than that the model slug is just gpt-4, so they’re not sending the messages to a different, specialized version of the model.


What is interesting? Watch for stop token 100265 when using function/plugin/GUI output.

In your log, you see a stop token 100260. You will also see 100265 in use. It gives you hints of how the AI has been trained on containers for output other than just text… that’s about the only thing interesting to be learned from communication dumps.

As you can see, the endpoint is no longer encoding AI-produced text into tokens; it has to emit them directly. Maybe because someone trained the AI on producing them even though they are filtered from input…


Has anyone tried ChatGPT with Lego?

Take a picture of random Lego pieces and ask ChatGPT to build something for you.


I was wondering why, in my previous tests, the numbers from the screenshot were read incorrectly, even though it was a high-quality picture taken from Chrome developer tools. On the same day, I watched some YouTube videos that displayed use cases where the model essentially excelled at this task.

As it turns out, the large image dimensions have a detrimental effect on the quality of the readings. When provided with a screenshot from Chrome developer tools with dimensions of 3840x2160 (4K) and asked for a number, the model can recognize which specific number is referred to but cannot read the exact number. However, when provided with an equivalent screenshot with dimensions of 1920x1080, the model reads the number correctly. Additionally, the overall amount of information in the image plays a role. When cropping the image to a section containing the relevant information and then scaling up to 4K dimensions, the model can read the number even though the image quality is reduced.

Intuitively, this makes sense. Ensure that you only include relevant content for the best results, and this can be further improved by reducing the image dimensions.
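That preprocessing, cropping to the region of interest and then enlarging it, can be sketched with Pillow. The crop box coordinates are assumptions you would pick per screenshot; nothing here is specific to ChatGPT:

```python
from PIL import Image

def focus_region(img, box, scale=3.0):
    """Crop a screenshot to the region of interest, then enlarge it.

    `box` is (left, top, right, bottom) in pixels; the coordinates are
    something you choose per screenshot, e.g. a box around the number
    you want the model to read.
    """
    region = img.crop(box)
    w, h = region.size
    return region.resize((int(w * scale), int(h * scale)), Image.LANCZOS)
```

In the experiments above, cropping to the element and then enlarging by 300% was the only variant that was read reliably.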


There is an arXiv paper from Microsoft where they explore the capabilities of the GPT-4V model in depth.

Here is the link to the arxiv pre-print:

For those who are looking for a TL;DR, here is the video from AI Explained:


Looking forward to the API access for this feature.


If you have time this weekend, can you feed it a page or two of this? I’m curious as to what it “sees”…


Actually a good idea, it would be quite interesting to see what happens :thinking:
