I’m building an application that achieves the following:
the user submits a handwritten essay
the application recognizes the handwritten text, splits it into the main logical / conceptual blocks, and makes comments / annotations / corrections for each block
the annotations are overlaid on the original images, each annotation being placed near the part of the text it refers to
For the second step, I am issuing a prompt to gpt-4o that looks like this: "You are a teacher and you are correcting the errors in the solution of a math exercise. You are provided with an image containing the handwritten text submitted by the user.
Identify the main logical blocks that make up the essay. For each block, indicate whether it is correct, partially correct, or incorrect and, if it is not correct, explain the reason for the error."
I am using structured outputs to get an array of objects containing the required fields, and it’s working pretty well, both the handwritten text recognition part and the annotation part.
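For reference, this is roughly what the call looks like, heavily simplified (the schema field names here are illustrative, not my exact ones):

```python
import base64
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class Block(BaseModel):
    text: str         # the recognized handwritten text of the block
    verdict: str      # "correct" | "partially correct" | "incorrect"
    explanation: str  # reason for the error, empty if the block is correct

class Correction(BaseModel):
    blocks: list[Block]

with open("essay.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "You are a teacher and you are correcting the errors in the "
                    "solution of a math exercise. [...]"},
        {"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ]},
    ],
    response_format=Correction,
)

blocks = completion.choices[0].message.parsed.blocks
```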
I have noticed that the model does not seem to be able to identify the bounding boxes of the parts of the text that make up the blocks, not even as a rough approximation.
What could be a working solution to implement this?
An idea I’ve had is to use an object detection model specifically trained to detect separate lines or blocks of text, but this would only partially help, because the division between blocks is logical and isn’t always marked by blank space, as it is between paragraphs.
Is there a way I could use the existing models to get even an approximation of the bounding boxes that enclose the parts of recognized handwritten text in an image?
Use an OCR for this task, not a VLM (Vision Language Model). Not only are you seriously risking hallucinated data (handwriting varies a lot in readability), but VLMs like GPT-4o also do not return bounding boxes.
You could use a VLM as a complementary service, but OCR is specifically made for this task.
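Just to show the kind of data an OCR gives back, here is a minimal sketch with Tesseract (weak on handwriting, so in practice you would reach for a handwriting-capable engine such as Google Vision or Azure Read, which return word and line boxes in the same spirit):

```python
import pytesseract
from pytesseract import Output
from PIL import Image

img = Image.open("essay.jpg")

# Word-level results: recognized text plus pixel bounding boxes and confidence scores.
data = pytesseract.image_to_data(img, output_type=Output.DICT)

for text, left, top, width, height, conf in zip(
    data["text"], data["left"], data["top"],
    data["width"], data["height"], data["conf"]
):
    if text.strip():
        print(f"{text!r}: box=({left}, {top}, {width}, {height}), conf={conf}")
```

Those per-word boxes are exactly what GPT-4o won't give you reliably.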
Thank you for your input. I thought a VLM could be a good fit for this task precisely because, unlike a plain OCR (I don’t know if what I’m saying makes sense), it can use the context around the text it is recognizing, possibly compensating for parts that aren’t very legible by “figuring out” what is supposed to go there.
I would also need to check whether there are good OCR engines for recognizing math expressions and turning them into LaTeX (something I failed to mention earlier: the text can include math). The VLMs seemed very good at this, although I realize that having context and “understanding” of the text may bias the model toward what it expects to find, that is, what is “supposed” to be there, and therefore lead to hallucinations like you mentioned.
Yes, VLMs have the benefit of inferring characters from context, while OCR extracts them in a more deterministic way.
The golden issue with VLMs (and LLMs) is hallucinations.
What I like to do is run both an OCR and a VLM: first the OCR to see how digestible the content is, then a VLM guided by the OCR’d content to finalize it.
Since the OCR provides bounding boxes, you can also work on specific areas: you have the option of cropping, adjusting, and focusing on particularly difficult regions.
Best of both worlds IMO. Would love to hear your results if you try it.
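A rough sketch of the flow I have in mind, with the OCR and VLM calls left as placeholders for whichever services you pick:

```python
from PIL import Image

def run_ocr(image_path):
    """Placeholder: call whatever OCR you pick; return a list of entries like
    {"text": str, "box": (left, top, width, height), "conf": float}."""
    ...

def ask_vlm(prompt, image):
    """Placeholder: send the prompt plus the (possibly cropped) image to the VLM."""
    ...

image = Image.open("essay.jpg")
words = run_ocr("essay.jpg")

# 1. Hand the VLM the OCR transcript as a draft so it corrects rather than invents.
draft = " ".join(w["text"] for w in words)
blocks = ask_vlm(
    "Here is a rough OCR transcript of the handwritten page:\n"
    f"{draft}\n"
    "Fix any OCR mistakes using the image, then split the text into logical blocks.",
    image,
)

# 2. For low-confidence words, crop around the OCR box and ask the VLM to re-read
#    just that patch; focusing on a small region tends to cut down hallucination.
PAD = 20
for w in words:
    if w["conf"] < 60:  # threshold is arbitrary, tune it
        left, top, width, height = w["box"]
        crop = image.crop((max(0, left - PAD), max(0, top - PAD),
                           left + width + PAD, top + height + PAD))
        w["text"] = ask_vlm("Transcribe exactly the handwriting in this crop.", crop)

# 3. Since every word carries a box, an approximate bounding box for each logical
#    block is just the union of the boxes of the words assigned to that block.
```

The key point is that the bounding boxes always come from the OCR; the VLM only corrects the text and decides which words group into a logical block.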
There has been a lot of work in this area. Maybe someone else could chime in.