How to add correct examples for image-to-text task

shure.alpha · December 27, 2023, 4:34am

I’m now using GPT-4 Vision to describe simple objects with simple text as you can see in the attached image.

The description includes the shape, color, and texture of objects.
The images are very simple, however, GPT4 Vision cannot answer correctly. The performance was seriously terrible.

Then I want to improve the performance from a prompt engineering perspective, specifically, I want to add image-text paired examples.
First of all, I want to find precedents, my question is “Are there actually people who already tried this approach, adding image-text pair as the correct example for the image-to-text task?”

I actually tried to did this approach however, it didn’t work as I expected. Specifically,

The first image shows the Letter T shape, as an example.
The second image shows checkerboard texture as an example.

Then I showed GPT4 other T-shaped objects and a checkerboard object.
Then, it didn’t work.

So my question is do you know if there are already people who tried this, I want to see the research paper for this concept.

I know OpenAI announced already the limitation of GPT4-Vision officially. However, I was surprised the inaccurate performance.

And let me know the performance of GPT4 in your use-case, does that performs well for you?

scene_front_7
scene_front_5

Diet · December 27, 2023, 4:54pm

I dunno bro, that doesn’t look like a T-shaped object to me

There are some tricks that help with specific tasks, but I don’t really understand what you’re trying to do. Are you just trying to classify images?

shure.alpha · December 27, 2023, 10:14pm

Kind of, yes.

I have the image set of objects to be described, and all options of shape, texture, and color to be chosen to describe objects.

Like CLIP for image to text, classification tasks.

Sorry for the confusion about the attached images, I just wanted to show the image-specific example.

shure.alpha · December 27, 2023, 10:15pm

What I want to do is very simple.
I want to describe image with simple description using GPT4-V
But GPT4V doesn’t work well
So is there good idea?

EricGT · December 27, 2023, 10:28pm

This would typically be done using fine-tuning.

From GPT-4 with Vision FAQ

Can I fine-tune the image capabilities in gpt-4?

No, we do not support fine-tuning the image capabilities of gpt-4 at this time.

Using a prompt to improve performance (results) would typically be done using few-shot prompt but I don’t see that working in this case, and would probably get expensive very fast in trying.

Yes. Do you know your AI history? Ever heard of MNIST?

Long story short, GPT-4 with Vision is not currently enabled to do what you seek. It is essentially show it picture(s), it gives a response.

shure.alpha · December 29, 2023, 2:24pm

I found interesting research

Topic		Replies	Views
Gpt-4 vision few shot prompting with images API	3	3834	May 29, 2024
How to Include Image-Text Pairs as Few-Shot Examples in Prompts? Prompting api	3	290	April 17, 2025
It is possible to have better performance by using few-shot prompting with image inputs and structured outputs? Prompting gpt-4	0	72	March 20, 2025
Image mapping with prompts API gpt-4 , chatgpt , gpt-4-vision	1	991	July 19, 2024
Prompt upscaling image text prompt on chatgpt by gpt before going to dall-e? Prompting chatgpt , prompt , prompt-engineering , dalle3	0	506	December 13, 2024

How to add correct examples for image-to-text task

Related topics