I am doing some research on computer vision using a multimodal LLM (GPT-4o), where I want the model to return the exact pixel coordinates of the bounding box around a given object in an image.
What I find is that the bounding box returned by GPT-4o is often off by a non-trivial margin. My question is whether GPT-4o can be fine-tuned to improve at this at all, and if so, what kind of training data is required?
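For concreteness, my current guess is that each training example would pair the same kind of prompt plus an image with the ground-truth box as the assistant's answer. Below is a minimal sketch of how I would build one such line, assuming OpenAI's chat-style vision fine-tuning JSONL format; the image URL and the coordinates are placeholders I made up:

```python
import json

# One training example in what I understand to be the vision fine-tuning
# JSONL format: each line pairs a user turn (prompt + image) with the
# desired assistant answer containing the ground-truth box.
example = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Give me the bounding box coordinates "
                         "(top-left and bottom-right points) in pixels "
                         "of a knife in this image"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/kitchen_0001.jpg"}},
            ],
        },
        {
            "role": "assistant",
            # Ground-truth box from human annotation (made-up numbers here).
            "content": "Top-left corner (x1, y1): (812, 1430)\n"
                       "Bottom-right corner (x2, y2): (1015, 1580)",
        },
    ]
}

# Append one example per line to the training file.
with open("train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```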
Following is an example (imagine a home-robot use case) just to demonstrate what I am talking about:
Me: Give me the bounding box coordinates (top-left and bottom-right points) in pixels of a knife in this image
GPT-4o: Based on the analysis of this image, the approximate bounding box coordinates for the knife are as follows:
Top-left corner (x1, y1): (1000, 1500)
Bottom-right corner (x2, y2): (1200, 1650)
These coordinates should enclose the knife seen on the floor in the image. Let me know if you need further adjustments!
The green rectangle was added by me to show the location of the bounding box returned by GPT-4o.
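For reference, this is roughly how I overlay the returned box and quantify how far off it is, using intersection-over-union (IoU) against a box I labeled by hand; the file names and the ground-truth coordinates below are placeholders from my own setup:

```python
from PIL import Image, ImageDraw

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

predicted = (1000, 1500, 1200, 1650)    # box returned by GPT-4o above
ground_truth = (812, 1430, 1015, 1580)  # box I labeled by hand (placeholder)

# Draw the predicted box (green) and the hand-labeled box (red).
img = Image.open("kitchen.jpg")
draw = ImageDraw.Draw(img)
draw.rectangle(predicted, outline="green", width=5)
draw.rectangle(ground_truth, outline="red", width=5)
img.save("kitchen_with_boxes.jpg")

print(f"IoU = {iou(predicted, ground_truth):.3f}")  # low IoU = large miss
```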