I’m currently using GPT-4 Vision to describe simple objects in simple text, as you can see in the attached image.
The descriptions include the shape, color, and texture of the objects.
Even though the images are very simple, GPT-4 Vision cannot answer correctly; the performance was seriously poor.
I therefore want to improve performance from a prompt-engineering perspective, specifically by adding image-text paired examples.
First of all, I want to find precedents. My question is: has anyone already tried this approach of adding image-text pairs as correct examples for an image-to-text task?
I actually tried this approach myself, but it didn’t work as I expected. Specifically:
The first example image shows a letter-T shape.
The second example image shows a checkerboard texture.
I then showed GPT-4 other T-shaped objects and a checkerboard object, but it still didn’t describe them correctly.
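For reference, here is roughly how I structured the few-shot prompt, as a minimal sketch using the OpenAI chat completions API. The image URLs and the example descriptions below are placeholders, not my actual data:

```python
# Minimal sketch of a few-shot image-text prompt for GPT-4 Vision.
# All URLs and descriptions are placeholders standing in for the real examples.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "system",
            "content": "Describe the shape, color, and texture of the object in each image.",
        },
        # Few-shot example 1: a letter-T shape paired with its correct description.
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/letter_t.png"}},
            ],
        },
        {"role": "assistant", "content": "A gray, smooth, T-shaped object."},
        # Few-shot example 2: a checkerboard texture paired with its correct description.
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/checkerboard.png"}},
            ],
        },
        {"role": "assistant", "content": "A black-and-white checkerboard texture."},
        # The actual query image that GPT-4 Vision should describe.
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/query.png"}},
            ],
        },
    ],
    max_tokens=200,
)
print(response.choices[0].message.content)
```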
So my question is: do you know of anyone who has already tried this? I’d like to see a research paper on this concept.
I know OpenAI has already officially announced the limitations of GPT-4 Vision, but I was still surprised by how inaccurate it was.
Also, please let me know how GPT-4 performs in your use case. Does it work well for you?
No, we do not support fine-tuning the image capabilities of gpt-4 at this time.
Using a prompt to improve results would typically be done with few-shot prompting, but I don’t see that working in this case, and it would probably get expensive very fast to try.
Yes. Do you know your AI history? Ever heard of MNIST?
Long story short, GPT-4 with Vision is not currently able to do what you seek. Essentially, you show it picture(s) and it gives a response.