How to extract text from images using API?

karolis6571 · January 27, 2025, 12:03am

Hello, I am a beginner with OpenAI. I am creating a project where I want to extract data from invoice images. Right now I am stuck at extracting text from a photo. In the Vision documentation, I see that the model used is gpt-4o-mini and the photo is uploaded as base64. My code, written in C#, looks like this:

var payload = new
{
    model = "gpt-4o-mini",
    messages = new[]
    {
        new
        {
            role = "user",
            content = new object[]
            {
                new { type = "text", text = "Extract all the text from the following image." },
                new { type = "image_url", url = new { data = new
                    {
                        url = $"data:image/jpeg;base64,{base64Image}"
                    } } }
                
            }
        }
    },
    max_tokens = 300
};

As a response, I get:

Error: BadRequest, {
  "error": {
    "message": "Invalid content type. image_url is only supported by certain models.",
    "type": "invalid_request_error",
    "param": "messages.[0].content.[1].type",
    "code": null
  }
}

What am I missing? I looked through the documentation and the deprecation notices but could not find an answer.
Can somebody help me? Thank you in advance.

_j · January 27, 2025, 1:48am

You should send to the model gpt-4o-2024-11-20 or other gpt-4o.

gpt-4o-mini actually costs twice as much per image – were it to be working.

The “message” of a user will have an image (or more than one image) in this format:

{
    "role": "user",
    "content": [
        {"type": "text", "text": "Here is an image of a cat."},
        {
            "type": "image_url",
            "image_url": {
                "url": "https://example.com/cat.png",
                "detail": "high"
            }
        }
    ]
}

where url is either an actual internet location that OpenAI can retrieve, or a data URL containing your base64-encoded file (not raw image bytes).
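For the base64 form, a minimal sketch of building that data URL in C# could look like the following. The class name ImageDataUrl and its method names are made up for illustration, and the MIME type is assumed to be image/jpeg unless you pass another one.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of any OpenAI library): turns image bytes
// into the "data:<mime>;base64,<payload>" string that goes into image_url.url.
static class ImageDataUrl
{
    // Base64-encodes the bytes and prefixes them with the data URL header.
    public static string FromBytes(byte[] imageBytes, string mimeType = "image/jpeg")
        => $"data:{mimeType};base64,{Convert.ToBase64String(imageBytes)}";

    // Convenience overload: reads the file from disk, then encodes it.
    // The caller is responsible for passing a MIME type that matches the file.
    public static string FromFile(string path, string mimeType = "image/jpeg")
        => FromBytes(File.ReadAllBytes(path), mimeType);
}
```

Usage would be something like ImageDataUrl.FromFile("invoice.jpg"), where invoice.jpg is a placeholder file name; the result is what you assign to url.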

Your C# code does not exactly match the required format for the messages array and has extra keys or structural differences. Let’s analyze the issues and suggest corrections.

Key Issues in the C# Code:

Key Mismatch:
- In the required format, the content part has an image_url key whose value is an object with url as a direct property. Your C# code names that key url instead of image_url, and wraps the actual URL under an unnecessary data object.
- Required format:
  "image_url": { "url": "..." }
- Your code:
  "url": { "data": { "url": "..." } }
Structure Mismatch:
- The image_url object in the required format can also carry a detail key (e.g., "detail": "high" or "detail": "low"), which your C# code does not include. It is worth choosing deliberately: the API defaults to high, despite documentation that says the default is “auto”, which is supposed to pick a setting for you.
Unnecessary Wrapping of Keys:
- The required JSON structure is simpler, with flat key-value pairs. Your C# code introduces extra nesting (e.g., { "data": { "url": ... } }), which deviates from the specification.

Corrected C# Code:

Here’s a corrected version of your C# code that matches the required JSON format for the messages array:

var payload = new
{
    model = "gpt-4o-2024-11-20",
    messages = new[]
    {
        new
        {
            role = "user",
            content = new object[]
            {
                new { type = "text", text = "Extract all the text from the following image." },
                new 
                { 
                    type = "image_url", 
                    image_url = new 
                    { 
                        url = $"data:image/jpeg;base64,{base64Image}", 
                        detail = "high" 
                    } 
                }
            }
        }
    },
    max_tokens = 1000
};

Resulting JSON Output:

The corrected C# code will produce a JSON structure like this:

{
    "model": "gpt-4o-2024-11-20",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Extract all the text from the following image."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "data:image/jpeg;base64,...",
                        "detail": "high"
                    }
                }
            ]
        }
    ],
    "max_tokens": 1000
}
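If it helps, here is a rough sketch of how a payload like the one above might be serialized and posted with HttpClient and System.Text.Json. The class and method names (VisionClient, BuildRequest, ExtractTextAsync) are invented for this example, the API key is assumed to be supplied by the caller, and the response is assumed to have the usual choices[0].message.content shape.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

// Hypothetical wrapper around the Chat Completions endpoint.
static class VisionClient
{
    const string Endpoint = "https://api.openai.com/v1/chat/completions";

    // Serializes the payload (e.g. the anonymous object above) to JSON and
    // attaches the bearer token. Anonymous-type property names such as
    // image_url and max_tokens are written out unchanged by default.
    public static HttpRequestMessage BuildRequest(object payload, string apiKey)
    {
        var json = JsonSerializer.Serialize(payload);
        var request = new HttpRequestMessage(HttpMethod.Post, Endpoint)
        {
            Content = new StringContent(json, Encoding.UTF8, "application/json")
        };
        request.Headers.Authorization = new AuthenticationHeaderValue("Bearer", apiKey);
        return request;
    }

    // Sends the request and returns the text of the first choice.
    // On a non-success status, the error body is surfaced in the exception.
    public static async Task<string> ExtractTextAsync(HttpClient http, object payload, string apiKey)
    {
        using var request = BuildRequest(payload, apiKey);
        using var response = await http.SendAsync(request);
        var body = await response.Content.ReadAsStringAsync();
        if (!response.IsSuccessStatusCode)
            throw new HttpRequestException($"{(int)response.StatusCode}: {body}");

        using var doc = JsonDocument.Parse(body);
        return doc.RootElement
            .GetProperty("choices")[0]
            .GetProperty("message")
            .GetProperty("content")
            .GetString();
    }
}
```

Including the error body in the exception is deliberate here: it is what shows messages like the invalid_request_error from the original question.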

I hope that advances you closer to a solution.

karolis6571 · January 31, 2025, 9:06am

Hi,

Thank you for your response.

The code seems to be working, I am getting responses back.

But let’s say I upload the same image to ChatGPT with the GPT-4 model and compare the two responses, the one I get in my program and the one in chat: the chat response is waaay better than the one I get in my code.
What should I work on to get matching responses?

Thanks again.
