Vision API through Azure - blind, or what am I missing?

Hi all,

I’m setting up GPT-4 with Vision through the Azure OpenAI API. I’m sending the image in as base64, but the hallucinations are out of control and I’m not getting the same results: it doesn’t work at all like it does in the ChatGPT Plus UI.

I do get a response and I’m able to parse everything. I set `detail` to `high`, but it still doesn’t work as well. I suspect this needs to be paired with an OCR step to extract the text, which I’d then send in alongside the image — or is that overkill? I’ve tried different image formats, and I’ve even converted the image to PDF, but nothing gets it to see the image as clearly as the ChatGPT UI does.
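For reference, the base64 step on my side looks roughly like this (the filename is a placeholder):

```python
import base64

def encode_image(image_path):
    """Read an image file and return its raw base64 string (no data-URI prefix)."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# base64_image = encode_image("order_form.jpg")  # placeholder filename
```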

Please help, what am I missing?

Note: I did notice ChatGPT appears to use Tesseract. Is that required to get vision to work well?


How are you implementing this? More info — like the API call with its params, the response received, and other specific details — can help us understand why that’s happening.


Hi, it’s an API call using the Azure endpoint. It’s a lot like this, but I also have `detail: "high"` in the parameters:

payload = {
  "model": "gpt-4-vision-preview",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What’s in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": f"data:image/jpeg;base64,{base64_image}"
          }
        }
      ]
    }
  ],
  "max_tokens": 300
}
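And the request itself goes to the Azure endpoint roughly like this — the resource name, deployment name, and api-version below are placeholders, not my real values:

```python
AZURE_RESOURCE = "my-resource"       # placeholder Azure OpenAI resource name
DEPLOYMENT = "my-gpt4v-deployment"   # placeholder deployment name
API_VERSION = "2023-12-01-preview"   # placeholder api-version

def build_request(api_key):
    """Return the URL and headers for an Azure OpenAI chat-completions call.

    The payload above is then POSTed as JSON, e.g. with
    requests.post(url, headers=headers, json=payload).
    """
    url = (
        f"https://{AZURE_RESOURCE}.openai.azure.com/openai/deployments/"
        f"{DEPLOYMENT}/chat/completions?api-version={API_VERSION}"
    )
    headers = {"api-key": api_key, "Content-Type": "application/json"}
    return url, headers
```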

I know it’s receiving the image, because when I sent in a simple picture of a cat it responded that it’s a cat and gave its color. But when I give it an image of an order form, it hallucinates and returns information that is not on the order form. When I upload that same order form in the ChatGPT playground, the response is exactly correct, with the information from the form.

Basically, I’m working on an automation project with order forms that are ‘noisy’ and not easily read by other OCRs. I know this vision feature can read these types of forms — but not through this API? Do I have to pair it with a separate text-extracting OCR to make it work correctly? Or is it that the image type is JPG and it doesn’t read JPG well?
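If it helps, here is one way I could test the OCR-pairing idea without much rework — `build_form_payload` is a hypothetical helper I’d write, and the prompt wording is just a guess at something that discourages hallucination:

```python
def build_form_payload(base64_image, ocr_text=None):
    """Build a vision payload; optionally include an OCR transcript
    (e.g. from Tesseract) so the model can cross-check text and image."""
    content = [
        {"type": "text",
         "text": ("Extract the fields from this order form. "
                  "Only report values that are visible in the image.")},
        {"type": "image_url",
         "image_url": {"url": f"data:image/jpeg;base64,{base64_image}",
                       "detail": "high"}},
    ]
    if ocr_text:
        # Insert the OCR transcript between the instruction and the image.
        content.insert(1, {"type": "text",
                           "text": f"OCR transcript (may contain errors):\n{ocr_text}"})
    return {"model": "gpt-4-vision-preview",
            "messages": [{"role": "user", "content": content}],
            "max_tokens": 1000}
```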

Can you also share a sample image that’s resulting in the problem?