Hello. I posted an image to GPT-4 to get a transcript and it was perfect. I then passed the same image to the OpenAI Vision API and the result was a mess, even with the same prompt. How would I go about making the API's performance match ChatGPT's?
This is my code:
import base64
import openai

def encode_image(image_path):
    # Read the image file and return its contents as a base64 string
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

img = encode_image("1.jpg")

PROMPT_MESSAGES = [
    {
        "role": "user",
        "content": [
            "This is an image of a page of a book. Get all the text from the image.",
            # Cookbook-style shorthand accepted by gpt-4-vision-preview:
            # a raw base64 payload plus a "resize" hint
            {"image": img, "resize": 768},
        ],
    },
]

params = {
    "model": "gpt-4-vision-preview",
    "messages": PROMPT_MESSAGES,
    "max_tokens": 500,
}

result = openai.chat.completions.create(**params)
print(result.choices[0].message.content)
We don't know how ChatGPT's backend preprocesses images for its vision feature.
For the API, though, the processing is documented: in high-detail mode the image is first downscaled to fit within a 2048 x 2048 px square, then scaled so its shortest side is 768 px, and finally split into 512 px tiles; a low-resolution overview of the whole image is read alongside the individual tiles.
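To make the tiling concrete, here is a small calculator following OpenAI's published token-accounting rules for high-detail vision input; the function name and the example page dimensions are my own illustration:

import math

def vision_token_cost(width: int, height: int) -> int:
    # Step 1: downscale to fit within a 2048 x 2048 square (aspect preserved)
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # Step 2: scale so the shortest side is 768 px (never upscaling, which is
    # an assumption for images already smaller than that)
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    # Step 3: count 512 px tiles; each tile costs 170 tokens, plus an
    # 85-token base charge for the low-resolution overview
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

# A typical scanned book page, e.g. 1700 x 2200 px:
print(vision_token_cost(1700, 2200))  # scales to 768 x 994 -> 2 x 2 tiles -> 765 tokens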
As a visual example, I rendered a high-quality PDF-to-image conversion with Adobe tools at the maximum size the API will actually use (768 px on the shortest side, i.e. the width for a portrait page), with the API's 512 px tile size overlaid in red (though the actual tiles may be divided evenly rather than at fixed offsets).
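The one preprocessing knob you do control from the API side is the documented message format with an explicit detail setting: send the image as a data URL and request "detail": "high" so the full tiled pass is used. A minimal sketch (the file name and prompt wording are placeholders):

import base64
import openai

with open("1.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

result = openai.chat.completions.create(
    model="gpt-4-vision-preview",
    max_tokens=1500,  # leave room for a full page of transcribed text
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe all of the text on this book page."},
                {
                    "type": "image_url",
                    "image_url": {
                        # Data URL carrying the base64-encoded JPEG
                        "url": f"data:image/jpeg;base64,{b64}",
                        # Request the high-detail (tiled) processing path
                        "detail": "high",
                    },
                },
            ],
        }
    ],
)
print(result.choices[0].message.content)

Also note that max_tokens=500 in your snippet is likely to truncate a full page of transcribed text, which can look like a quality problem when it is really truncation.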
Thanks for the detailed reply. I hear what you're saying about GPT-4-vision being overkill, but it works so well compared to the other services I tried, which include:
unstructured[dot]io
sensible[dot]so
gcp document ai
nanonets
airparser
docparser
If you have any recommendations for excellent OCR services that don't require a lot of image pre-processing, I'd appreciate it.
Based on this, am I right that it's possible for the files ingested by GPT-4 to contain images?
I have a knowledge-base PDF containing a lot of screenshots that I'd need to use.
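For what it's worth, my fallback plan is to rasterize each page myself and send the pages through the vision endpoint one at a time. A rough sketch of that step, assuming the pdf2image library (which requires poppler) and a placeholder file name:

from pdf2image import convert_from_path

# Render every page of the knowledge-base PDF to a PIL image at 200 dpi
pages = convert_from_path("knowledge_base.pdf", dpi=200)
for i, page in enumerate(pages):
    page.save(f"page_{i}.jpg", "JPEG")  # ready to base64-encode and send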