I am doing some research on computer vision using a multimodal LLM (GPT-4o), where I want the model to return the exact pixel coordinates of the bounding box around a given object in an image.
What I find is that the bounding box returned by GPT-4o is often off by a non-trivial margin. My question is whether GPT-4o can be fine-tuned to improve at this at all, and if so, what kind of training data is required?
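For concreteness, my current guess is that each training example would pair the same kind of prompt plus an image with the ground-truth box as the assistant's answer. Below is a minimal sketch of how I would build one such line, assuming OpenAI's chat-style vision fine-tuning JSONL format; the image URL and the coordinates are placeholders I made up:

```python
import json

# One training example in what I understand to be the vision fine-tuning
# JSONL format: each line pairs a user turn (prompt + image) with the
# desired assistant answer containing the ground-truth box.
example = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Give me the bounding box coordinates "
                         "(top-left and bottom-right points) in pixels "
                         "of a knife in this image"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/kitchen_0001.jpg"}},
            ],
        },
        {
            "role": "assistant",
            # Ground-truth box from human annotation (made-up numbers here).
            "content": "Top-left corner (x1, y1): (812, 1430)\n"
                       "Bottom-right corner (x2, y2): (1015, 1580)",
        },
    ]
}

# Append one example per line to the training file.
with open("train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```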
Following is an example (imagine a home-robot use case) just to demonstrate what I am talking about:
Me: Give me the bounding box coordinates (top-left and bottom-right points) in pixels of a knife in this image
GPT-4o: Based on the analysis of this image, the approximate bounding box coordinates for the knife are as follows:
Top-left corner (x1, y1): (1000, 1500)
Bottom-right corner (x2, y2): (1200, 1650)
These coordinates should enclose the knife seen on the floor in the image. Let me know if you need further adjustments!
The green rectangle was added by me to show the location of the bounding box returned by GPT-4o.
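For reference, this is roughly how I overlay the returned box and quantify how far off it is, using intersection-over-union (IoU) against a box I labeled by hand; the file names and the ground-truth coordinates below are placeholders from my own setup:

```python
from PIL import Image, ImageDraw

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

predicted = (1000, 1500, 1200, 1650)    # box returned by GPT-4o above
ground_truth = (812, 1430, 1015, 1580)  # box I labeled by hand (placeholder)

# Draw the predicted box (green) and the hand-labeled box (red).
img = Image.open("kitchen.jpg")
draw = ImageDraw.Draw(img)
draw.rectangle(predicted, outline="green", width=5)
draw.rectangle(ground_truth, outline="red", width=5)
img.save("kitchen_with_boxes.jpg")

print(f"IoU = {iou(predicted, ground_truth):.3f}")  # low IoU = large miss
```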