Identifying objects' positions on a whiteboard with grid

bsamuele · June 30, 2025, 7:05am

Hello,

I am trying to accomplish the following task:

Given an image of a whiteboard containing handwritten text, transcribe the contents of the whiteboard and, for each element, identify its pixel position inside the whiteboard.

Given that vision models aren’t very good at finding bounding boxes and the like, and following the suggestions in another thread, I tried overlaying a grid on the image.

Here’s an example image:

Here’s the prompt I’m using with GPT-4o:

You are given an image of a whiteboard. A transparent grid with fixed-size cells (e.g., 50x50 px) is overlaid on it. Each cell has a unique identifier (e.g., "032", "140") visible in the top-left corner of the cell.

Your task is to answer with:

1. **A description of the content**: transcribe the text exactly as it appears (for example: `equation "3x + 5 = 10"`).

2. **The list of cells occupied by that element**, in the format: `[032, 033, 040]`.

Example output:

Initial equation 3x + 5 = 10 - cells: [0002, 0003, 0004]

Next step 3x = 5 - cells: [0010, 0011, 0012]

Final result x = 5/3 - cells: [0020, 0021]

Arrow connecting step 2 to 3 - cells: [0015, 0016]

and here’s its response to the above image:

As you can see, the results are only partly accurate: it gets some cells right, hallucinates others, and misses some entirely.
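For context, here is a minimal sketch of how a response in the format above could be mapped back to pixel positions. It assumes the cell IDs are row-major indices (ID = row * columns + column) and that the grid has 20 columns; both are assumptions for illustration, and `parse_line` and `cells_to_bbox` are hypothetical helper names.

```python
import re

CELL = 50   # cell size in pixels, as in the post
COLS = 20   # ASSUMPTION: number of grid columns (depends on image width)

# Matches lines such as: Final result x = 5/3 - cells: [0020, 0021]
LINE_RE = re.compile(r"^(?P<desc>.+?)\s*-\s*cells:\s*\[(?P<cells>[^\]]*)\]\s*$")

def parse_line(line):
    """Return (description, [cell ids]) or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    cells = [int(c) for c in re.findall(r"\d+", m.group("cells"))]
    return m.group("desc"), cells

def cells_to_bbox(cells, cell=CELL, cols=COLS):
    """Pixel bounding box (x0, y0, x1, y1) covering all listed cells.

    ASSUMPTION: IDs are row-major, i.e. id = row * cols + col.
    """
    if not cells:
        return None
    rows_cols = [divmod(c, cols) for c in cells]
    rows = [r for r, _ in rows_cols]
    columns = [c for _, c in rows_cols]
    return (min(columns) * cell, min(rows) * cell,
            (max(columns) + 1) * cell, (max(rows) + 1) * cell)

desc, cells = parse_line("Final result x = 5/3 - cells: [0020, 0021]")
print(desc, cells, cells_to_bbox(cells))
```

The resulting box is only as precise as the cell size, so a single wrong or missing cell shifts it by a full 50 px.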

What could I do to improve the results?

I imagine some of the errors come from cell IDs being partially or completely covered by content, so drawing the IDs on top of the content in a more contrasting color might help. The grid cell size could be adjusted too; it’s currently 50x50 px.
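To experiment with different cell sizes, the grid layout could be computed separately from the drawing. The sketch below only produces the label text and top-left position for each cell, so the labels can be drawn last (on top of the content) with whatever imaging library is in use. The row-major numbering, the zero-padding to a fixed width, and the name `grid_labels` are assumptions, not something the current overlay necessarily does.

```python
import math

def grid_labels(width, height, cell=50):
    """Return (cols, rows, labels) for a grid covering a width x height image.

    Each label is (id_text, x, y), where (x, y) is the top-left pixel of the cell.
    ASSUMPTION: IDs are row-major and zero-padded to one fixed width.
    """
    cols = math.ceil(width / cell)
    rows = math.ceil(height / cell)
    digits = len(str(rows * cols - 1))  # same number of digits for every ID
    labels = []
    for row in range(rows):
        for col in range(cols):
            cell_id = row * cols + col
            labels.append((f"{cell_id:0{digits}d}", col * cell, row * cell))
    return cols, rows, labels

# Compare how many IDs the model has to read at two cell sizes.
for size in (50, 100):
    cols, rows, labels = grid_labels(1000, 600, cell=size)
    print(size, cols, rows, len(labels), labels[0], labels[-1])
```

Larger cells mean fewer IDs for the model to read but coarser positions, so the cell size is a trade-off rather than a pure improvement.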

Have you ever dealt with this type of task? On my frontend I can already obtain the bounding boxes of all the elements on the whiteboard (i.e. the individual strokes making up the text), although they obviously carry no semantic meaning until I send them to GPT. Given that, are there better ways to get accurate results?
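One way the stroke bounding boxes might be used is as a sanity check on the model's answer: any cell the model reports that contains no stroke at all is likely hallucinated. The sketch below assumes boxes are `(x0, y0, x1, y1)` in pixels with exclusive right and bottom edges, row-major cell IDs, and a 20-column grid; `cells_for_box` and `filter_cells` are hypothetical names.

```python
import math

def cells_for_box(box, cell=50, cols=20):
    """Set of cell IDs overlapped by a pixel box (x0, y0, x1, y1).

    ASSUMPTIONS: x1/y1 are exclusive, IDs are row-major (id = row * cols + col).
    """
    x0, y0, x1, y1 = box
    first_col = int(x0 // cell)
    first_row = int(y0 // cell)
    # ceil(...) - 1 gives the last cell touched; never go below the first one.
    last_col = min(max(first_col, math.ceil(x1 / cell) - 1), cols - 1)
    last_row = max(first_row, math.ceil(y1 / cell) - 1)
    return {row * cols + col
            for row in range(first_row, last_row + 1)
            for col in range(first_col, last_col + 1)}

def filter_cells(model_cells, stroke_boxes, cell=50, cols=20):
    """Split the model's cells into (kept, dropped) using stroke occupancy."""
    occupied = set()
    for box in stroke_boxes:
        occupied |= cells_for_box(box, cell, cols)
    kept = [c for c in model_cells if c in occupied]
    dropped = [c for c in model_cells if c not in occupied]
    return kept, dropped

strokes = [(40, 10, 120, 60)]              # one stroke box from the frontend
print(sorted(cells_for_box(strokes[0])))   # cells the stroke actually touches
print(filter_cells([1, 2, 3, 45], strokes))
```

This only removes cells that cannot be right; it does not recover cells the model left out, which would need the opposite check (occupied cells that no element claims).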

Many thanks
