Seeking Advice: Enhancing Accuracy of GPT-4 with Vision

Hello everyone,

I’ve been using the OpenAI API for some basic testing so far, and I’m now planning to integrate it into a project of mine. Here’s the situation: I have a large set of multiple-choice questions, each accompanied by an image, the question itself, and the possible answers.

My initial idea was to use ‘gpt-4-vision-preview’ to analyze the image and explain why a given answer, say ‘X’, is correct. But then I thought of a different approach: instead of giving the model the correct answer, I could let it identify the correct answer itself and then provide an explanation. This, I believe, would add credibility to its explanations, since the model would be identifying the correct answer independently.
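For the second approach, the request needs to carry the image, the question, and the options, and the prompt has to ask the model to commit to an answer before explaining it. A minimal sketch of building such a payload is below; the model name comes from the post, while the helper name, the example question, and the image URL are illustrative assumptions.

```python
def build_quiz_messages(image_url, question, options):
    """Build a chat payload that asks the model to pick an answer
    itself and only then explain it (names/URL here are illustrative)."""
    option_text = "\n".join(f"{label}. {text}" for label, text in options)
    prompt = (
        "Look at the image and answer the multiple-choice question.\n"
        f"Question: {question}\n"
        f"Options:\n{option_text}\n"
        "First state the letter of the correct answer, then explain why."
    )
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]

messages = build_quiz_messages(
    "https://example.com/question-1.png",  # hypothetical image URL
    "Which structure is highlighted in the image?",
    [("A", "Mitochondrion"), ("B", "Ribosome"), ("C", "Golgi apparatus")],
)
# The payload would then go to the chat completions endpoint, e.g.:
# client.chat.completions.create(model="gpt-4-vision-preview",
#                                messages=messages, max_tokens=500)
```

Asking for the answer letter first also makes the response easy to score automatically against the known correct answer.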

In my initial tests yesterday, the model performed flawlessly, not missing a single question. However, in today’s tests, I noticed some errors. The model was confusing certain questions due to similar terms used in the answers. In these cases, the specific terminology is crucial, even though the terms might generally refer to the same thing.

To tackle this, I’m thinking of using ‘gpt-4-vision-preview’ to describe the images in detail and then fine-tuning a model with every guideline from a comprehensive document I have. This might lead to more accurate results.

Since I’m relatively new to the OpenAI API, I’m not entirely sure if this is the best solution. Does anyone have any suggestions or know of any articles that might help?


Hi and welcome to the Developer Forum!

This is all new! The best thing to do is try it and see what results you get.

The reason AI is generating such a buzz is that this is more like the discovery of electricity than a progression of computing. It’s all new, everything is up for grabs, and there are no dusty old textbooks full of best practices yet.

Experiment and let us know how you get on.


Thanks for the feedback!

I plan to continue experimenting and will use this thread to share my results.


Hello, friend. I’d like to know what kind of pictures you want GPT to process, since it shows different abilities for different kinds of images. For example, it cannot count the number of objects, but it can recognize a location just from some buildings. Do you have any way to solve this problem?

Hi @pzdzxlx, I needed help analysing images and providing accurate responses to the associated questions. The process is akin to a quiz: each question is accompanied by an image, the question itself, and multiple possible answers. To improve accuracy, I addressed the model’s weakness with specific terminology by building a comprehensive vector database of all the relevant knowledge. I then embed the top query result from the vector database into the prompt, which yields more precise and reliable outcomes.


I tried a different approach for my use case.
I had to derive data from images, and I was getting a lot of hits and misses in the output.

Reading the docs, I learned that vision breaks images into 512 × 512 tiles.

So if your input image is larger than 512 by 512, it will be converted down to 512 by 512 (I also noticed that if the image was originally highly detailed, accuracy was good). The problem arises when the image is smaller than 512 by 512: in that case, it gets stretched to match the expected format.

I converted all my images to 512 by 512 by default, in combination with OCR enhancements set to true and detail set to "high" in the payload, and, voilà, most of my images produced correct outputs.
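The payload side of this is just attaching the (already resized) image bytes with the detail level set. A minimal sketch, assuming PNG input; the actual 512 × 512 resizing could be done beforehand with Pillow (e.g. `Image.open(path).resize((512, 512))`), which is left out here to keep the snippet stdlib-only. The helper name is an assumption, not an API function.

```python
import base64

def image_message(image_bytes, detail="high"):
    """Wrap raw PNG bytes as a base64 data-URL image part,
    with the vision 'detail' level set in the payload."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:image/png;base64,{b64}", "detail": detail},
    }

# Fake bytes for illustration; in practice, read the resized file:
# with open("question-1-512x512.png", "rb") as f: data = f.read()
part = image_message(b"\x89PNG fake bytes")
```

The `part` dict then goes into the `content` list of a user message alongside the question text.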

Preprocessing the image might also help: converting it to grayscale, sharpening it, etc.
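To make the grayscale idea concrete, here is the standard luminance conversion in pure Python on RGB tuples; in practice a library call does this in one line (e.g. Pillow's `Image.open(path).convert("L")`, and `ImageFilter.SHARPEN` for sharpening). The function name and sample pixels are illustrative.

```python
def to_grayscale(pixels):
    """Convert (r, g, b) tuples to single luminance values using the
    ITU-R BT.601 weights (the same weighting Pillow's "L" mode uses)."""
    return [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in pixels]

gray = to_grayscale([(255, 255, 255), (0, 0, 0), (255, 0, 0)])
# white -> 255, black -> 0, pure red -> 76
```

Dropping color this way can reduce noise the model has to deal with, though whether it helps likely depends on whether color carries meaning in the quiz images.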