Make OpenAI Vision API Match GPT4 Vision

Hello. I posted an image to GPT4 to get a transcript and it was perfect. I then passed the same image to the OpenAI Vision API and the result was a mess, even though I used the same prompt. How would I go about making the API’s performance match ChatGPT’s?

This is my code:

import base64

import openai


def encode_image(image_path):
    # Read the file and return its bytes as a base64 string.
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


img = encode_image("1.jpg")

PROMPT_MESSAGES = [
    {
        "role": "user",
        "content": [
            "This is an image of a page of a book. Get all the text from the image.",
            # The image as base64, with resize set to 768.
            {"image": img, "resize": 768},
        ],
    },
]

params = {
    "model": "gpt-4-vision-preview",
    "messages": PROMPT_MESSAGES,
    "max_tokens": 500,  # may cut off the transcript of a dense page
}

result = openai.chat.completions.create(**params)
print(result.choices[0].message.content)

We don’t know how ChatGPT’s backend preprocesses images before the vision model sees them.

However, we do know how the API handles them: an image over 512 pixels in any dimension is split into tiles, and the model then reads a main overview tile plus each of the subtiles.
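To get a feel for how many tiles a page ends up as, here is a rough sketch. The rescaling steps (fit within 2048 px, then bring the shortest side down to 768 px) are my assumption based on how the high-detail mode has been described, and `estimate_tiles` is just a name I made up for illustration; only the 512 px tile size and the 768 px limit come from what I described in this post.

```python
import math


def estimate_tiles(width, height, tile=512, max_side=2048, short_side=768):
    """Estimate the rescaled size and tile grid for an image (assumed rules)."""
    # Assumed step 1: shrink to fit inside max_side x max_side.
    scale = min(1.0, max_side / max(width, height))
    w, h = width * scale, height * scale
    # Assumed step 2: shrink again so the shortest side is at most short_side.
    scale = min(1.0, short_side / min(w, h))
    w, h = w * scale, h * scale
    # Count how many tile-sized cells are needed to cover the result.
    cols = math.ceil(w / tile)
    rows = math.ceil(h / tile)
    return round(w), round(h), cols, rows


if __name__ == "__main__":
    # A 768 x 1024 page needs a 2 x 2 grid, so lines of text can straddle tiles.
    print(estimate_tiles(768, 1024))
    # A 512 x 512 image fits in a single tile.
    print(estimate_tiles(512, 512))
```

If those assumptions hold, a book page at 768px wide is always cut into at least two columns of tiles, which is where lines of text can get split.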

For example, here is a high-quality PDF-to-image rendering made with Adobe tools at the maximum size the API will allow (only 768px wide), with the API tile size marked in red (although the tiles may actually be divided evenly).

That may add to the confusion, along with the ultimate low resolution. GPT-4-vision for OCR is a poor use of the AI on a nearly-solved problem.

Techniques:

  • try a maximum of 512 pixels per side to avoid tiling
  • try slices, cutting a page into shorter strips of text
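A minimal sketch of both techniques, using only arithmetic so it does not depend on any imaging library. The function names, the 512 px strip height, and the 32 px overlap are my own illustrative choices, not anything the API requires. The boxes are in (left, upper, right, lower) order, which is what Pillow's `Image.crop` expects if you use that for the actual cropping.

```python
def fit_within(width, height, max_side=512):
    """Return new (width, height) so the longest side is at most max_side."""
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


def slice_boxes(width, height, slice_height=512, overlap=32):
    """Return crop boxes that cut a page into horizontal strips.

    Consecutive strips overlap by `overlap` pixels so that a line of text
    cut at a boundary still appears whole in one of the two strips.
    """
    if not 0 <= overlap < slice_height:
        raise ValueError("overlap must be >= 0 and smaller than slice_height")
    boxes = []
    top = 0
    while True:
        bottom = min(top + slice_height, height)
        boxes.append((0, top, width, bottom))
        if bottom >= height:
            break
        # Start the next strip slightly above the previous cut.
        top = bottom - overlap
    return boxes


if __name__ == "__main__":
    print(fit_within(768, 1024))   # size for a single untiled request
    print(slice_boxes(768, 1024))  # strips to send as separate requests
```

With slices you would send each strip as its own request and join the transcripts afterwards, trimming any line that shows up twice because of the overlap.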

Thanks for the detailed reply. I hear what you’re saying about GPT-4-vision being overkill, but it works so well compared to the other services I tried, which include:

  • unstructured[dot]io
  • sensible[dot]so
  • gcp document ai
  • nanonets
  • airparser
  • docparser

If you have any recommendations for excellent OCR services without a lot of pre-processing of images I’d appreciate it.


Based on this, I understand it’s possible for the files ingested by GPT4 to contain images?
I have a knowledge base pdf containing a lot of screenshots I’d need to use.