Improve image processing with number "1" and "7"

Hello! I’m having an issue using the GPT4o model. It gets confused when identifying the numbers 7 and 1 for a task involving analyzing an order on a sheet of paper.
I’m doing this through the OpenAI playground with a custom assistant, and I provide this prompt:


Instructions for Image Analysis ## Objective Perform a detailed analysis of the information contained in the image, ensuring the accurate interpretation of alphanumeric characters. ## Specific Criteria 1. Analyze each character carefully and precisely. 2. If you identify the number “7,” check if it has a horizontal line in the middle: - With a line: Identify it as “7.” - Without a line: Interpret it as “1.” ## Accuracy Prioritize precision in data extraction and interpretation, ensuring the visual information is reflected as faithfully as possible.


I need it to identify the number 7 only when it has a “line” in the middle

If someone could help me, I would be very grateful.

This is an intriguing task that would face inherent limitations:

"Dear AI: Please lie about contents of images. I know that you have but 85 tokens of understanding about a passed image, but I want you to look past that, focus your attention on the appearance of the text within provided images meant for text extraction, and translate ANY number “7” that you encounter into a number “1”, unless the number 7 is also is represented by a typographical glyph for the number 7 with a horizontal strike-though bar through the middle.’

You may want to provide additional training pairs of user image input and assistant output messages before the final user image input, that could serve as training and grounding on the task.

In the examples of successfully performing the task, you’d generally want the AI responding similarly to how it does naturally, on the type of challenges you face, but you could have additional lines of reinforcement that the AI produces such as:
{"original text extraction": "Beach flip-flops $37.57", "Detected number 7s having no crossbar": True, "revised text extraction", "Beach flip-flops $31.57"}

Still, that may be an inauthentic representation of what’s going on. You may just want to demonstrate correct text extraction with the characters for 1 and 7 as they are typed, with the proper translation.

You’re not going to be able to do that.

I’d recommend using an OCR which is suited perfectly for Optical Character Recognition. Using something like GPT-4o is not only expensive, but inefficient.

Usually when I see people doing this it’s because their goal is to structure the document.

The pipeline should be

Image → Markdown (Using an OCR)
Markdown → Structure (Using an LLM)

Here’s a wonderful library created by IBM that performs the OCR you’re looking for. You can plug in your own OCR model.

If you decide to continue your path then the only solution truly would be to have a higher resolution, either by inspecting the boundary boxes or by scaling the image.

Actually, my project is based on the fact that the advisors of my magazine send the order with their codes, sizes, and quantities, and I have to process it. Is this still the best recommendation?

It’s applicable in any OCR task, as in: Any task that involves extracting the characters from an image/document.

Whatever external parameters you may have are irrelevant.

1 Like

I did this but now the model just replaces the 7’s with the 1’s, it seems like it couldn’t process the horizontal line thing.

Yes, I would expect that to be beyond the AI’s vision ability.

Consider the training on vision: labeled data such as “beach ball” or “Bill Clinton”. Then the space you are employing being “The formula is 67 + 79”. There may not be any quality of looking past existing understanding.

The only other technique would be slicing text images into small sections, so there is more image recognition and less distraction on where to focus this need for fine-grained identification.

For an ultimate test of abilities, let’s turn a well-described task for a single extracted word at high resolution over to “gpt-4-vision-preview” (before its shut-off).

image

The text “Banana” presented in the image is rendered in a sans-serif typeface. Here is a detailed analysis of each glyph:

  1. ‘B’ - The uppercase ‘B’ has a closed bowl with a straight back. The upper and lower bowls are of similar size, with the upper bowl starting at the top of the stem and the lower bowl ending at the baseline. The transition from the bowls to the vertical stem is smooth, without any brackets. The stem is of uniform thickness, and the counter shapes (the enclosed parts of the ‘B’) are rounded, suggesting a geometric influence. The terminal of the top bowl has a horizontal cut, and the bottom bowl has a similar terminal. There is no serif detailing at the ends of the strokes.

  2. ‘a’ - The lowercase ‘a’ is a single-story form, with a rounded bowl that extends above the x-height slightly before curving down into a straight, vertical stem on the right side. The bowl meets the stem with a slight curve. There is no tail or exit stroke, and the aperture (the opening at the right side of the ‘a’) is moderately wide. The stroke is consistent in weight, and the overall shape is compact and geometric.

  3. ‘n’ - The lowercase ‘n’ features a straight vertical stem on the left that extends to the baseline with a slight overshoot. The right side of the ‘n’ is a curved stroke that mirrors the bowl of the ‘a’, with a similar aperture. The transition from the stem to the curved stroke is smooth, and the stroke ends with a blunt terminal at the baseline. The x-height is consistent with the ‘a’, and the stroke weight is uniform throughout.

  4. ‘a’ - This character is identical to the first ‘a’ described, maintaining the same geometric and structural qualities.

  5. ‘n’ - This ‘n’ is a repeat of the earlier ‘n’, with no variation in form or weight, ensuring a consistent typeface design.

  6. ‘a’ - The final ‘a’ is also identical to the previous 'a’s, maintaining the uniformity of the typeface.

Overall, the font appears to be a bold or semi-bold weight due to the substantial stroke width. The characters exhibit a geometric simplicity with a modern and clean aesthetic. The consistent stroke weight, lack of serifs, and geometric quality of the glyphs suggest a typeface that could be used for display purposes or for easy readability in text. The font could be classified within the realm of geometric sans-serif typefaces, similar to fonts like Futura or Avant Garde, though it is not an exact match to either of these. The geometric influence is particularly noticeable in the single-story ‘a’ and the perfectly round counters of the ‘B’. The uniformity of stroke width and the lack of decorative elements or serifs make the font appear contemporary and straightforward.

You can see that there are things within the realm of possible but generally impractical.