I am working on a feature to read blood pressure information from photos of blood pressure monitors. Some images are accurately interpreted by the API, but others are not. For example, in the attached image, the systolic blood pressure reading is 166, but the API consistently reads it as 156, sometimes as 165, and other times the error is much larger. Please help me.
wow, that’s a brave application for a preview model!
do you get a more reliable result when you improve the angle of the image?
At this stage of vision-model development, text really needs to be within a degree or two of level (normal text orientation) to be perceived accurately. It might be worth exploring more mature OCR software for this purpose, as digital-display-reading code already exists for heavy industrial automation.
I find it very interesting that this isn’t working well, I’ve tested it a bunch of times on odometer readings (mileage from a car) and it’s been 100% spot on each time.
Is it possible that the image is too large and when the model cuts it into 512 x 512 squares it’s splitting the numbers? Have you tried reducing the resolution or zooming out?
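A quick way to sanity-check that hypothesis is the tiling arithmetic itself, assuming the model cuts images into 512×512 tiles (the tile size is mentioned in the docs and later in this thread):

```python
from math import ceil

def tile_grid(width, height, tile=512):
    """How many 512x512 tiles the vision model would cut this image into.
    Digits sitting on a tile boundary can end up split between tiles."""
    return ceil(width / tile), ceil(height / tile)

# A 1024x768 photo spans a 2x2 grid, so digits near the centre can be
# cut in half; downscaled to 512x384 it fits in a single tile.
print(tile_grid(1024, 768))  # (2, 2)
print(tile_grid(512, 384))   # (1, 1)
```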
I’ve changed the perspective of the image, but it’s still incorrect.
I just straightened it and it worked, tried it twice for good measure
Edit: Just tried it through the API as well. I got 156 with the original image and a successful 166 with the straightened one. Seems like it just needs the text to be straight.
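For anyone automating the straightening step, the deskew angle can be estimated from two points picked on the same horizontal edge of the display (the coordinates below are made up for illustration):

```python
import math

def skew_angle(p1, p2):
    """Angle in degrees of the line through two points on the same
    horizontal edge of the display; 0 means the text is level."""
    (x1, y1), (x2, y2) = p1, p2
    return math.degrees(math.atan2(y2 - y1, x2 - x1))

# Hypothetical corner coordinates read off the photo:
angle = skew_angle((40, 120), (300, 145))
# Rotating the image by this angle (e.g. image.rotate(...) in Pillow;
# the sign depends on your y-axis convention) levels the text before
# sending it to the model.
```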
The dimensions of this image are only 360x360.
It’s expected; you can read about the current limitations of the vision model here: OpenAI Platform
Try it again with this one
That image is still rotated more than the acceptable limit, try straightening out the text by rotating the image.
I also tested the OCR abilities of GPT vision on many cases and it often misreads things. I think Google’s OCR via Google Document AI is currently the most reliable (I tested your image and it got it 100% correct, even with the slight tilt). It is also much cheaper for plain OCR. You could run Google OCR (or some other engine) to extract the text and then give that to GPT for further processing/interpretation.
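A minimal sketch of the hand-off step in that pipeline: parse the OCR engine’s raw text into structured numbers before passing them to GPT. The `SYS`/`DIA`/`PUL` labels and the sample string below are assumptions; adjust them to whatever your monitor actually displays:

```python
import re

def parse_bp_text(ocr_text):
    """Extract systolic/diastolic/pulse readings from raw OCR output.
    The label names are assumptions; match them to the device's display."""
    readings = {}
    for label in ("SYS", "DIA", "PUL"):
        m = re.search(label + r"\D{0,5}(\d{2,3})", ocr_text, re.IGNORECASE)
        readings[label] = int(m.group(1)) if m else None
    return readings

# Hypothetical OCR output for a 166/96 reading with pulse 80:
print(parse_bp_text("SYS. 166 mmHg DIA. 96 mmHg PUL. 80 /min"))
# {'SYS': 166, 'DIA': 96, 'PUL': 80}
```

Feeding GPT the parsed numbers instead of the image sidesteps the vision model’s digit misreads entirely.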
Excellent, it is now accurate. Thank you very much. I tried to align the image earlier, but it still wasn’t precise.
The vision model base tile is 512x512.
I think a human can do it at 60x60
Also, you can imagine putting some green dots at the corners of the display for auto perspective correction by other machine vision.
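The corner-dot idea can be sketched without any vision library: once the four dots are located, a 3×3 perspective transform maps them onto the corners of an upright rectangle. Below is a pure-Python version (the dot coordinates in the example are made up; in practice you would feed the warp to something like OpenCV to resample the pixels):

```python
def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting for a small system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c and M[r][c]:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * m for a, m in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def homography(src, dst):
    """3x3 perspective transform taking the four marker dots (src)
    to the corners of an upright rectangle (dst)."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = solve(A, b) + [1.0]
    return [h[0:3], h[3:6], h[6:9]]

def warp_point(H, x, y):
    """Apply the transform to one point (with perspective divide)."""
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)

# Hypothetical green-dot positions in the tilted photo:
H = homography([(10, 20), (200, 30), (210, 220), (5, 200)],
               [(0, 0), (100, 0), (100, 100), (0, 100)])
print(warp_point(H, 10, 20))  # approximately (0.0, 0.0)
```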
OCR is all about contrast (more contrast = less noise).
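To illustrate on a single row of grayscale values (pure Python; a real image would go through Pillow or OpenCV), a linear contrast stretch spreads a washed-out range over the full 0–255 scale:

```python
def stretch_contrast(pixels):
    """Linearly rescale grayscale values so the darkest pixel becomes 0
    and the brightest 255, maximizing contrast before OCR."""
    lo, hi = min(pixels), max(pixels)
    if lo == hi:
        return pixels[:]
    return [round(255 * (p - lo) / (hi - lo)) for p in pixels]

# A washed-out LCD row with values bunched between 90 and 170:
print(stretch_contrast([90, 120, 150, 170]))  # [0, 96, 191, 255]
```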
Analyse the text and numbers on this picture. Report them back in plain text.
The image shows a digital blood pressure monitor with the following readings displayed:
- Systolic blood pressure:
- Diastolic blood pressure:
- Pulse: 80 beats per minute
- Memory slot indicator:
On the monitor, there are also labels and symbols indicating the type of measurement each number represents, such as “SYS.” for Systolic, “DIA.” for Diastolic, and a heart symbol for Pulse. There are buttons with symbols for power and settings adjustments as well.
You can also just raise the threshold levels.
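In the same spirit, raising the threshold is just moving the cut-off in a binarization step (shown here on a list of grayscale values rather than a real image):

```python
def binarize(pixels, threshold=128):
    """Map grayscale values to pure black (0) or white (255).
    Raising the threshold pushes faint background pixels to black."""
    return [255 if p >= threshold else 0 for p in pixels]

print(binarize([40, 130, 200], threshold=128))  # [0, 255, 255]
print(binarize([40, 130, 200], threshold=160))  # [0, 0, 255]
```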
When rotating (and adding contrast) the 5 and 6 are still “swapped”.
Correct! For testing, https://scantextai.com could be used.