GPT-4o Model: Image Coordinate Recognition

I need the GPT API's assistance in locating coordinates for image segmentation. The image is a middle school exam paper. Suppose there are N questions on the exam; the goal is to cut the image into N separate images, each containing one individual question and its corresponding answer. I need GPT to provide the segmentation coordinates according to these requirements.

Problem: GPT returned coordinates, but the accuracy is relatively poor. I understand that GPT may sometimes provide an answer even when it cannot complete the specified task accurately. So I would like to ask the forum members whether GPT-4o currently has the capability to perform this task effectively. If it has the basic capability, I will work on improving the accuracy of my usage; if not, I will try other methods to accomplish this task. Thank you very much for your help.

1 Like

No, an LLM (even a multimodal one) can't do this right now.

Is there a common format for the exams? If so, you should be able to cut them up before you send them.

1 Like

It's not a universal format, and manually cropping a large number of images is not feasible, so I'm trying to leverage GPT's capabilities. I have made a little progress: GPT is returning coordinates, but it still doesn't fully understand how to combine the printed question text with the handwritten answer, so the coordinates aren't very accurate.

I believe there’s potential to explore GPT’s capabilities from different perspectives. After all, it’s the product of the creativity and dedication of many brilliant minds.

Here’s a demonstration of GPT’s capabilities.
import numpy as np  # assumed import; GPT returned only the two lines below

line1 = np.array([[0, 100], [512, 100]])  # Question 3
line2 = np.array([[0, 950], [512, 950]])  # Question 5

Therefore, I think my focus should be on crafting prompts that help GPT understand what is meant by “a question, including the printed question text and the handwritten answer.” Challenges bring progress.

Anyway, thank you very much for the response.

1 Like

Is there some required value inside the image, or in its characters? Or can you simply extract the text?

Regardless, what you are looking for are bounding boxes.

Typical OCR technologies offer this functionality.
Depending on how serious this is, you can build a powerful pipeline that combines these bounding boxes with a vision model to capture the text and images found inside each box (or expand the box if the model determines content is missing).

6 Likes

Thank you very much for the reply. I don't know much about OCR technology, but I can learn and understand it through GPT. I will try the approach you suggested. Thanks again.

2 Likes

Good idea. I'd also recommend checking Tesseract out. GPT can help you get acquainted.

image_to_data returns a result containing box boundaries, confidences, and other information.
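
For instance, a minimal sketch using pytesseract (assuming Tesseract and the pytesseract package are installed; the image path is hypothetical):

import pytesseract
from PIL import Image

# image_to_data returns per-word bounding boxes, confidences, and layout info
data = pytesseract.image_to_data(Image.open('exam_page.png'),
                                 output_type=pytesseract.Output.DICT)

for i, word in enumerate(data['text']):
    if word.strip():
        x, y, w, h = data['left'][i], data['top'][i], data['width'][i], data['height'][i]
        print(f"'{word}' at ({x}, {y})-({x + w}, {y + h}), conf {data['conf'][i]}")

Grouping the word boxes by their block_num or line_num fields is often enough to recover line- or paragraph-level regions.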

3 Likes

Thank you very much for your guidance. I’ve already asked GPT to help me find relevant information and explain these technical terms, and I’ve made some progress.

Do you know how I perceive GPT? Its greatest contribution is helping people raise their lower bound of ability. I didn’t have the capability to do these things before because, prior to encountering GPT, I hadn’t written a single line of code. However, I’ve known GPT for two months now, and today I’ve asked GPT to help me write nearly 10,000 lines of code. Although there were many errors, although it required numerous revisions, GPT has indeed provided me with tremendous help. I believe we should thank everyone involved in developing and expanding GPT’s capabilities. Thank you very much.

You may not be asking for my perception, but GPT provides responses based on the general knowledge contained in its training dataset.

Therefore, it can be a very useful tool for those who are not well-versed in certain subjects.
However, because the model does not inherently possess ‘correctness,’ it returns the most likely answers.
The usefulness of GPT’s answers thus depends on the dataset it has learned from and the expectations placed on the model.

From my perspective, the current GPT does not yet extend sufficiently to making reliable inferences.

1 Like

Nice to see you again.
I’m using GPT to help me do this. I’ve made a little bit of progress. But there is still some distance to go before success. I hope that the advancement of GPT will help me complete this work.
Thanks again for helping me.

1 Like

Question: Given that GPT can recognize images (which is a premise), why can’t it return the identified coordinates?

ChatGPT Response: While GPT-4 (multimodal) can “understand” the content of images and generate descriptions or answer questions based on visual information, it doesn’t have the ability to return pixel-level coordinates directly. This is because:

  1. Model Design and Purpose: The multimodal capabilities of GPT are primarily designed for understanding and generating natural language descriptions based on image content, not for image processing or pixel-level operations. Identifying content within an image (such as text or objects) and returning corresponding coordinates involves computer vision and image processing techniques, which are beyond the current capabilities of GPT.
  2. Data Representation: When GPT processes images, the image data is transformed into an internal representation that allows the model to understand the content, but this representation does not directly retain pixel-level coordinate information. This means the model can “understand” the content of the image and make high-level inferences but cannot directly access or manipulate specific pixel information.
  3. Difference in Tools and Applications: Returning coordinate information is more akin to the tasks handled by OCR tools or image processing algorithms, such as Tesseract or OpenCV, which are specifically designed for these tasks. GPT’s core strength lies in text generation and natural language processing based on understanding.

Therefore, while GPT can understand image content, if you need to obtain the coordinates of specific content within an image, it is recommended to use computer vision tools (such as OpenCV, Tesseract) that can precisely process images and return coordinate data.

In summary, the multimodal capabilities of GPT and the design goals of traditional computer vision tools differ, which is why it cannot return specific coordinate information.

========================================================
Here’s GPT’s explanation on this issue. Therefore, GPT cannot actually accomplish the task of providing precise coordinates. It is necessary to consider how to combine GPT’s ‘intelligence’ with the ‘precision’ of other tools like OpenCV and Tesseract.

1 Like

In fact, I’m just one step away from success. I’ve already had GPT accurately identify and locate the cutting points I need.
However, due to the reasons mentioned above, it cannot provide me with the precise coordinates of these points. If I could just obtain the coordinates of these points, I would be able to complete the task.

Thank you very much.

It’s quite natural that GPT’s vision capabilities cannot return coordinates.
The very rough process of training GPT with vision capabilities is as follows:

  1. A large amount of text data (descriptions of image content).
  2. A large amount of image data.

Using these two types of data, the model is trained to associate the images (processed by convolutional neural networks or other vision architectures) with the corresponding textual descriptions used in training.

When an image is presented to the GPT model, it returns the textual data (descriptions of the image content) that it learned during the training process. Since pixel information is not annotated as part of the textual data describing the image, the model can only provide general descriptions related to the image.

Fei-Fei Li, who built ImageNet, might be considered a pioneer in this field.

https://www.image-net.org/

1 Like

It can*

*Sort of


An old screenshot from the lab (gpt-4o-2024-05-13):

Your job is to find the location of ‘the black talk to claude button’.

But it's pretty tricky, and finicky to boot.

Some believe that once we have proper embodied models, this stuff will become easier. You can perhaps think of it as hand-eye coordination. Babies really struggle with it, and certain neurological conditions make it more difficult for adults. And the current models are definitely missing certain human faculties.

I'm not going to shill my own products here; I do recommend following @RonaldGRuckus's advice and going with Tesseract or similar, if your use case allows.

5 Likes

Isn’t this a task that you could do easily without AI?

Assumptions:

  1. There could be a page header where students write their names, etc.
  2. Questions can be multi-line but will have at least two lines of space between questions

Given a starting blank (background color) row in the image’s pixel array below the header (or the first row if no header), go down the rows until you hit a pixel. Now you are on the top of the first question.

Read down the rows until you find a full row of whitespace. Now you know the font height, but it might be multiline, so go down until you are sure you have passed at least two font-height rows of blank pixels. That’s the bottom of the first segment.

Continue from there to identify the remaining segments, letting the bottom of the previous question’s text be the top of the segment and the top of the next question’s text be the bottom of the segment.

Stretch Goal. Tell the AI all of this and have it write the code for you :slight_smile:
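
Here is a rough sketch of that row-scanning idea (the background threshold and the two-line gap ratio are just illustrative assumptions, and the image path is hypothetical):

import cv2
import numpy as np

def find_question_segments(image_path, gap_ratio=2.0):
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # A row counts as "ink" if any pixel is clearly darker than the background.
    ink_rows = (gray < 200).any(axis=1)

    # Collect runs of consecutive ink rows (text bands).
    bands, start = [], None
    for y, has_ink in enumerate(ink_rows):
        if has_ink and start is None:
            start = y
        elif not has_ink and start is not None:
            bands.append((start, y))
            start = None
    if start is not None:
        bands.append((start, len(ink_rows)))
    if not bands:
        return []

    # Estimate font height from the first band, then merge bands whose vertical
    # gap is smaller than gap_ratio * font height (multi-line questions).
    font_height = bands[0][1] - bands[0][0]
    segments = [list(bands[0])]
    for top, bottom in bands[1:]:
        if top - segments[-1][1] < gap_ratio * font_height:
            segments[-1][1] = bottom          # same question, extend it
        else:
            segments.append([top, bottom])    # a new question starts here
    return [tuple(seg) for seg in segments]

for top, bottom in find_question_segments('exam_page.png'):
    print(f'Question segment spans rows {top}-{bottom}')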

Thank you for your reply.
The difficulties here are:

  1. The format of the test papers is not fixed.
  2. It is difficult to determine where one question ends and the next begins without the help of AI.
  3. It is difficult to determine which question a handwritten answer belongs to.

1 Like

Yeah, I was going to suggest using OCR. If you OCR everything first, then GPT can easily partition out each question and answer from the extracted text.

I might be a bit worried about the quality of the answers extracted via OCR, as I'm assuming these answers will be handwritten.
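
A rough sketch of that OCR-then-partition idea (assuming the pytesseract and openai packages are installed, an API key in the environment, and a purely illustrative prompt):

import pytesseract
from PIL import Image
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Step 1: OCR the whole page into plain text (handwriting quality will vary).
page_text = pytesseract.image_to_string(Image.open('exam_page.png'))

# Step 2: ask the model to partition the extracted text into question/answer pairs.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "Split the following exam text into separate questions, "
                   "each paired with its answer, numbered as on the paper:\n\n" + page_text,
    }],
)
print(response.choices[0].message.content)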

4 Likes

Thank you for your reply.
GPT can extract most of the content for me.
The reason I need to cut is that if I upload a whole test paper at once and extract the content, the accuracy of extraction is poor due to the token limit and some other reasons I don't know.
If I upload it after cutting and ask GPT to extract, the accuracy is greatly improved.

Seems like a generation / reflection cycle could work. Tell it to first find the location of the button, and then reflect on its own output (the image which highlights that button) to determine the button's actual coordinates within the image.
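
As a rough illustration of that cycle (assuming the openai SDK; the prompts, file names, and the expectation of clean JSON output are all optimistic assumptions):

import base64, json
from openai import OpenAI
from PIL import Image, ImageDraw

client = OpenAI()

def ask_about(prompt, image_path):
    with open(image_path, 'rb') as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]}],
    )
    return resp.choices[0].message.content

# Generation step: ask for a first guess at the box (hoping for clean JSON).
guess = ask_about('Return the bounding box of the "talk to claude" button as '
                  'JSON like {"x1": 0, "y1": 0, "x2": 0, "y2": 0}. JSON only.',
                  'screenshot.png')
box = json.loads(guess)  # in practice this needs far more robust parsing

# Reflection step: draw the guess and ask the model to critique or correct it.
img = Image.open('screenshot.png').convert('RGB')
ImageDraw.Draw(img).rectangle([box['x1'], box['y1'], box['x2'], box['y2']],
                              outline='red', width=4)
img.save('annotated.png')
print(ask_about('Does the red box cover the button? If not, return a corrected '
                'JSON box.', 'annotated.png'))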

1 Like

Maybe, but you’ll quickly run into a ton of issues:

what color do you make the bounding box (or cursor) so the model doesn’t confuse it with a feature? Do you do it dynamically? How do you place the cursor so it doesn’t occlude the feature? How many iterations can you keep in a prompt before you start confusing the model? etc, etc, etc…

I do encourage you to give it a shot though! Maybe you’ll have some ideas on how to generate a more robust solution.

3 Likes

You can also just ask ChatGPT to write the code for you to do what you want.

Example prompt I used:

using python, write a program that will detect chunks of text on an image with a white background. also provide image coordinates of the bounding box for each chunk of text

Returns this code:

# install this first if you haven't already
# pip install opencv-python pillow numpy

import cv2
import numpy as np
from PIL import Image

def detect_text_chunks(image_path):
    # Load the image
    image = Image.open(image_path)
    image = np.array(image)
    
    # Convert to grayscale
    gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
    
    # Threshold the image to binary (white background will become 255, text will become 0)
    _, binary = cv2.threshold(gray, 240, 255, cv2.THRESH_BINARY_INV)
    
    # Find contours
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    
    # Initialize a list to store bounding box coordinates
    bounding_boxes = []
    
    for contour in contours:
        # Get the bounding box for each contour
        x, y, w, h = cv2.boundingRect(contour)
        bounding_boxes.append((x, y, x+w, y+h))
    
    return bounding_boxes

# Example usage
image_path = 'path_to_your_image.png'
boxes = detect_text_chunks(image_path)

# Print out the bounding boxes
for box in boxes:
    print(f"Bounding Box: Top-left (x1={box[0]}, y1={box[1]}), Bottom-right (x2={box[2]}, y2={box[3]})")

You may need to tweak the code, but this gets you started.

3 Likes