Using the Vision API: best practices

Hello,

First time writing, hopefully I’m in the right place.

I’ve been using ChatGPT for some time now and I want to take it a step further and leap into the API.

I’ll be building a backend with Node.js. Users of my ERP will upload images, in various formats, of things such as bills, tickets, and invoices, and those images will be sent to the server in base64.

What I’m trying to achieve is to call the API once my backend receives the image, and get back just a JSON response with the total of the bill/ticket/invoice, and that’s pretty much it.

So before diving into it I’d like to hear from more experienced people about which model I should use (I was thinking gpt-4o-mini) and what bottlenecks I might find. Keep in mind that there won’t be many requests, as I only have a few hundred users. However, I’d like to follow best practices and not end up with a surprising bill myself.

In the prompt I can easily do this and it works 95% of the time:
“I’ll send you pictures with images of bills or tickets. For each image I want you to answer with a JSON, with a KEY called ‘total’. For each ticket or bill I want you to just answer with that JSON and key telling me the final amount that was paid. Just answer with that, nothing more.”
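A request along those lines can be sketched in Node.js like this (the model name and prompt text here are just placeholders, not a recommendation):

```javascript
// Sketch of a Chat Completions request body carrying a base64 image.
// Model name and prompt wording are placeholders; adjust to taste.
function buildVisionRequest(base64Image, mimeType = "image/jpeg") {
  return {
    model: "gpt-4o", // or another vision-capable model
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text",
            text: "Reply with a JSON object whose only key is 'total', the final amount paid on this bill or ticket.",
          },
          {
            // base64 images are sent as a data URL inside an image_url part
            type: "image_url",
            image_url: { url: `data:${mimeType};base64,${base64Image}` },
          },
        ],
      },
    ],
  };
}
```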

Thanks in advance,


mini = pay twice as much for the images…

Then you get to choose between the gpt-4-turbo models and the two gpt-4o models. gpt-4o has a better understanding of image contents, but quite a different quality when producing factual information.

If you demand a JSON output with a list and one key, you can use the json_schema response format and send a schema in your API request that the AI must adhere to, though only with gpt-4o-2024-08-06.

I have an AI preset with an understanding of tools and schemas. Let’s give it your prompt to transform. Here’s the product (after teaching it that a top-level array is a no-no):

{
  "name": "bill_totals",
  "strict": true,
  "schema": {
    "type": "object",
    "properties": {
      "totals": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "total": {
              "type": "number",
              "description": "The final amount paid as shown on the bill or ticket image."
            }
          },
          "required": ["total"],
          "additionalProperties": false
        }
      }
    },
    "required": ["totals"],
    "description": "For each bill or ticket image, extract the final amount paid and provide it as a JSON object with a single key 'total'.",
    "additionalProperties": false
  }
}

This schema defines a structured response whose root is an object with a single property “totals”, an array of objects. Each object in the array has one required property, “total”, a number representing the final amount paid as shown on the bill or ticket image. The “strict” attribute is set to true, meaning all properties are required and must be included in the response, which ensures the AI only responds with the total amount paid per bill.

A “description” field is added both to the “total” property and to the overall schema. This gives the AI context about what is expected: extracting the final amount paid from images of bills or tickets and responding with that amount under the key “total”.
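To use it, the schema goes into the response_format field of the API request. A minimal sketch (the variable and function names are mine):

```javascript
// The "bill_totals" schema shown above, as a JS object.
const billTotalsSchema = {
  name: "bill_totals",
  strict: true,
  schema: {
    type: "object",
    properties: {
      totals: {
        type: "array",
        items: {
          type: "object",
          properties: { total: { type: "number" } },
          required: ["total"],
          additionalProperties: false,
        },
      },
    },
    required: ["totals"],
    additionalProperties: false,
  },
};

// Attach the schema to a request body as a json_schema response format.
function withJsonSchema(requestBody, schema) {
  return {
    ...requestBody,
    response_format: { type: "json_schema", json_schema: schema },
  };
}
```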




First of all, thank you for the elaborate answer. I appreciate it!

I will try to do things right and use the JSON schema, but before that I want to try things out, just going back and forth between my server and the API.

I have this, trying to use gpt-4-turbo as you suggested, but I get this error from the API:

Returned error:

 error: {
    message: 'Invalid content type. image_url is only supported by certain models.',
    type: 'invalid_request_error',
    param: 'messages.[0].content.[1].type',
    code: null
  }

EDIT: ignore this, it was my mistake.
I’ll go ahead and try the JSON schema.

Also: a schema doesn’t give the AI a place or a way to answer differently, especially if you use a number type instead of a string. The AI isn’t going to write “you sent me a picture of a rabbit” or “it’s unclear, here’s a guess”.

You can add an enum of quality strings as an extra key, gathered for little extra cost in the same run, so that the AI can flag things that may need human eyes.
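For instance, an extra property along these lines (the key name and enum values are purely illustrative):

```javascript
// Illustrative extra schema property: an enum that lets the model flag
// low-confidence reads instead of silently guessing a number.
const qualityProperty = {
  image_quality: {
    type: "string",
    description: "How clearly the total could be read from the image.",
    enum: ["clear", "partially_legible", "unreadable", "not_a_receipt"],
  },
};
```

Merging this into the schema’s properties (and its required list, since strict mode needs every property listed) gives you a cheap human-review signal in the same call.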

Hey man, wanted to thank you for the examples and patience. 🙂

I’ve updated my code and used the JSON schema, though I’ve changed it a bit. I’m using gpt-4o-2024-08-06 as you suggested.

This is my schema:

{
    "name": "extracted_data",
    "description": "Extracting data from images",
    "strict": true,
    "schema": {
        "type": "object",
        "properties": {
            "type": {
                "type": "string",
                "description": "I want you to add the following: 'ALOJAMIENTO' - if the document is related to housing or hotels. 'ALIMENTACIÓN' - if the document is related to eating out, such as restaurants. 'DESPLAZAMIENTO' - if the document is related to gas stations or cab rides. And finally, 'TIPO NO DETECTADO' - if none of the previous 3 categories apply.",
                "enum": [
                    "ALOJAMIENTO",
                    "ALIMENTACIÓN",
                    "DESPLAZAMIENTO",
                    "TIPO NO DETECTADO"
                ]
            },
            "total": {
                "type": "string",
                "description": "Under the 'total' key I want you to show the FINAL AMOUNT that was paid for the document"
            },
            "document_type": {
                "type": "string",
                "description": "Under 'document_type' key, I want you to add the type of document",
                "enum": [
                    "TICKET",
                    "FACTURA",
                    "ALBARÁN",
                    "TIPO DOCUMENTO NO DETECTADO"
                ]
            },
            "currency": {
                "type": "string",
                "description": "Under 'currency' key, I want you to add the currency with words. For example, $ = 'DÓLARES', € = 'EUROS', and so on",
                "enum": [
                    "EUROS",
                    "DÓLARES",
                    "DIVISA SIN DETECTAR"
                ]
            }
        },
        "required": [
            "type",
            "total",
            "document_type",
            "currency"
        ],
        "additionalProperties": false
    }
}

Do you foresee any problems with this? Or you think it can be improved?
I was thinking of maybe using Tesseract to do my own OCR and send the model the OCR text instead of the base64 image. What do you think?

Any other ideas you could have are greatly welcome!
Thanks again!

Your decision hinges on two main factors:

  1. Whether you achieve better data extraction from actual OCR text passed to the AI, as opposed to potentially losing the format and presentation which could be crucial for contextual understanding.

  2. The cost comparison between processing images versus text, in terms of token usage.

Remember, a picture may be worth a thousand words, but it could also cost as much in terms of tokens.

Additionally, consider that processing images typically results in longer response times before the first token is generated. This may not be a significant issue if you are already reaching the rate limit.
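If you want to experiment with the OCR route, a rough sketch of the text-only side (assuming tesseract.js; the helper name is made up):

```javascript
// Build a plain-text chat message from OCR output, instead of an image part.
// This is the cheap half of the comparison: text tokens only, no image cost.
function buildTextMessage(ocrText) {
  return {
    role: "user",
    content: `Here is the OCR text of a receipt:\n\n${ocrText}`,
  };
}

// Usage with tesseract.js (not run here):
//   const Tesseract = require("tesseract.js");
//   const { data } = await Tesseract.recognize(imageBuffer, "spa");
//   const message = buildTextMessage(data.text);
```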

I see what you mean. I think that for now I’ll just keep sending the base64 image directly to the API, for the context.

Using “detail”: “low” for images seems to have a great impact on the cost. Do you recommend using it in my case?

“detail”: “low”, as the documentation informs us, provides the AI with an image resized down to 512px on the longest dimension, with no overlaid “tiles” of repeated parts of the image.

Resize your images to those dimensions and see if things are still legible.

With raw access to send an image of any size to be encoded as the 85 tokens of “low”, one quickly sees that information theory holds: the amount of text you can get reproduced back is fewer tokens than that before the hallucinations start.
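The documented gpt-4o image token accounting can be sketched as a quick calculator (the constants are from the vision pricing documentation; treat this as an estimate, not billing truth):

```javascript
// "low" detail is a flat 85 tokens. "high" detail first scales the image to
// fit within 2048x2048, then scales the shortest side down to 768px, and
// charges 170 tokens per 512px tile plus the 85-token base.
function imageTokens(width, height, detail = "high") {
  if (detail === "low") return 85;
  // Fit within a 2048 x 2048 square, preserving aspect ratio.
  const fit = Math.min(1, 2048 / Math.max(width, height));
  let w = width * fit;
  let h = height * fit;
  // Scale so the shortest side is at most 768px.
  const scale = 768 / Math.min(w, h);
  if (scale < 1) {
    w *= scale;
    h *= scale;
  }
  const tiles = Math.ceil(w / 512) * Math.ceil(h / 512);
  return 85 + 170 * tiles;
}
```

For example, a 1024×1024 image at “high” comes out to 765 tokens, versus a flat 85 at “low”.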

Thanks for everything, I’m all set up.
The only thing is that my users can only upload one picture at a time and I want the info to be visible immediately, so I can’t really batch it. I’d totally do it if I could, just to save some money.

You said -mini is more expensive for images. Is there a place where I can read about that? All I could find is that -mini is faster and cheaper, so I think I’m looking in the wrong places.

Edit: you’re talking about the calculator, I guess. I see what you mean then.


I noticed you are talking to the AI in specifications phrased as “I want you to…”, which is a bit odd but can still work.

You can compare the performance of the current language against language that describes what the API will accept, such as “This field sets the type of image detected” or “this JSON key receives the quality of the image, used to classify when the text or the total is unclear”. The AI should treat its output and specification compliance as more programmatic that way.
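As a quick illustration of the difference (both description strings are made-up examples, not taken from the schema above):

```javascript
// Conversational phrasing versus programmatic field-description phrasing.
const before = {
  total: {
    type: "string",
    description:
      "Under the 'total' key I want you to show the FINAL AMOUNT that was paid",
  },
};

const after = {
  total: {
    type: "string",
    description: "The final amount paid, as printed on the document.",
  },
};
```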
