How Does ChatGPT Match Generated Images to Reference Grids Without Analyzing Them?

Hi OpenAI community,

I recently had an interesting experience with ChatGPT that I can’t quite explain, and I’m hoping someone here can shed some light on it. Here’s what happened:

I uploaded a grid of character images (a single image with multiple characters) and asked ChatGPT to generate a specific character as a standalone image, specifying their position in the grid (e.g., “the first one”). Without me describing the character in detail, ChatGPT generated an image that seemed to closely match the character in the grid, capturing their style and essence quite well.

Here’s what I don’t understand:

  1. How does this work? ChatGPT claims it cannot analyze images directly, so how was it able to reproduce the style and characteristics of the character in the grid so accurately?

  2. Limitations of visual analysis: If ChatGPT doesn’t process the visual content of images, how does it achieve these results?

Any insights into how the system operates in scenarios like this would be greatly appreciated. It was a very cool experience, but I’m curious about the mechanics behind it!

Thanks in advance for your thoughts!

Welcome to the forum!

One example does not make a reasonable sample size. Try the process about 50 more times and see whether the results are consistent.