Identifying objects' positions on a whiteboard with a grid overlay

Hello,

I am trying to accomplish the following task:

Given an image of a whiteboard containing handwritten text, transcribe the contents of the whiteboard and, for each element, identify its pixel position inside the whiteboard.

Given that vision models aren’t very good at finding bounding boxes and the like, and following the suggestions in another thread, I tried overlaying a grid on the image.

Here’s an example image:

Here’s the prompt I’m using with GPT-4o:

You are given an image of a whiteboard. A transparent grid with fixed-size cells (e.g., 50x50 px) is overlaid on it. Each cell has a unique identifier (e.g., "032", "140") visible in the top-left corner of the cell.

Your task is to answer with:

1. **A description of the content**: transcribe the text exactly as it appears (for example: `equation "3x + 5 = 10"`).

2. **The list of cells occupied by that element**, in the format: `[032, 033, 040]`.

Example output:

Initial equation 3x + 5 = 10 - cells: [0002, 0003, 0004]

Next step 3x = 5 - cells: [0010, 0011, 0012]

Final result x = 5/3 - cells: [0020, 0021]

Arrow connecting step 2 to 3 - cells: [0015, 0016]

and here’s its response to the above image:

As you can see, the results are only somewhat accurate: it gets some cells right, hallucinates others, and misses others entirely.

What could I do to improve the results?

I suspect some of the errors come from cell IDs being partially or completely covered by the content, so drawing the IDs on top of the content in a more contrasting color may help. The grid cell size, currently 50 x 50 px, could also be adjusted.
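To make the cell size easy to experiment with, the ID scheme and the mapping back to pixels can be parameterised. Below is a minimal sketch, not the actual overlay code: it assumes IDs are assigned row-major starting from 0000 and zero-padded to 4 digits, and the function names (`cell_id`, `cell_box`, `cells_to_bbox`) are hypothetical.

```python
def cell_id(row, col, cols, width=4):
    """Zero-padded ID for a cell, assuming row-major numbering from 0."""
    return str(row * cols + col).zfill(width)


def cell_box(cid, cols, cell=50):
    """Pixel box (left, top, right, bottom) covered by one cell ID."""
    row, col = divmod(int(cid), cols)
    return (col * cell, row * cell, (col + 1) * cell, (row + 1) * cell)


def cells_to_bbox(cids, cols, cell=50):
    """Smallest pixel box enclosing every cell the model reported."""
    boxes = [cell_box(c, cols, cell) for c in cids]
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


# Example: a grid that is 10 cells wide, with 50 px cells.
print(cells_to_bbox(["0010", "0011", "0012"], cols=10))
```

With this in place, changing `cell` (say to 100) only requires redrawing the overlay with the same value; the answer parsing stays the same.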

Have you ever dealt with this type of task? On my frontend I can obtain the bounding boxes of all the elements on the whiteboard (i.e. the individual strokes making up the text), although they obviously carry no semantic meaning until I send them to GPT. Given that, are there better ways to get accurate results?
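One direction I'm considering is to use the stroke bounding boxes to compute the positions deterministically and ask the model only for the transcription. The sketch below is an untested idea rather than a working pipeline: it merges stroke boxes that lie within a `gap` of each other into groups, then lists the grid cells each group covers. The 20 px gap, the row-major 4-digit IDs, and all function names are assumptions for illustration.

```python
def near(a, b, gap):
    """True if boxes a and b are within `gap` pixels of each other."""
    return (a[0] - gap <= b[2] and b[0] - gap <= a[2]
            and a[1] - gap <= b[3] and b[1] - gap <= a[3])


def union(a, b):
    """Smallest box (left, top, right, bottom) containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def group_strokes(boxes, gap=20):
    """Merge nearby stroke boxes repeatedly until no two groups are close."""
    groups = [tuple(b) for b in boxes]
    changed = True
    while changed:
        changed = False
        merged = []
        for box in groups:
            for i, g in enumerate(merged):
                if near(box, g, gap):
                    merged[i] = union(box, g)
                    changed = True
                    break
            else:
                merged.append(box)
        groups = merged
    return groups


def cells_for_bbox(bbox, cols, cell=50):
    """IDs of all cells touched by a pixel box (row-major, 4 digits)."""
    left, top, right, bottom = bbox
    c0, r0 = int(left // cell), int(top // cell)
    c1 = max(c0, int((right - 1) // cell))
    r1 = max(r0, int((bottom - 1) // cell))
    return [str(r * cols + c).zfill(4)
            for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]


strokes = [(10, 10, 30, 40), (35, 12, 60, 38), (300, 200, 340, 230)]
for group in group_strokes(strokes):
    print(group, cells_for_bbox(group, cols=10))
```

Each group could then be cropped or highlighted and sent to the model with a short label, so it only has to say what the group contains, not where it is.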

Many thanks