Hello. I posted an image to GPT-4 to get a transcript and it was perfect. I then passed the same image to the OpenAI Vision API and the result was a mess, even with the same prompt. How would I go about making the API's performance match ChatGPT's?
This is my code:
import base64
import openai

def encode_image(image_path):
    # Read the image file and return its contents as a base64 string
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

img = encode_image("1.jpg")

PROMPT_MESSAGES = [
    {
        "role": "user",
        "content": [
            "This is an image of a page of a book. Get all the text from the image.",
            # Cookbook-style shorthand accepted by gpt-4-vision-preview:
            # a raw base64 payload plus a "resize" hint
            {"image": img, "resize": 768},
        ],
    },
]

params = {
    "model": "gpt-4-vision-preview",
    "messages": PROMPT_MESSAGES,
    "max_tokens": 500,
}

result = openai.chat.completions.create(**params)
print(result.choices[0].message.content)
We don't know how ChatGPT's backend preprocesses images for its vision feature.
For the API, though, the processing is documented: in high-detail mode the image is first downscaled to fit within a 2048 x 2048 px square, then scaled so its shortest side is 768 px, and finally split into 512 px tiles; a low-resolution overview of the whole image is read alongside the individual tiles.
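To make the tiling concrete, here is a small calculator following OpenAI's published token-accounting rules for high-detail vision input; the function name and the example page dimensions are my own illustration:

import math

def vision_token_cost(width: int, height: int) -> int:
    # Step 1: downscale to fit within a 2048 x 2048 square (aspect preserved)
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # Step 2: scale so the shortest side is 768 px (never upscaling, which is
    # an assumption for images already smaller than that)
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    # Step 3: count 512 px tiles; each tile costs 170 tokens, plus an
    # 85-token base charge for the low-resolution overview
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

# A typical scanned book page, e.g. 1700 x 2200 px:
print(vision_token_cost(1700, 2200))  # scales to 768 x 994 -> 2 x 2 tiles -> 765 tokens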
As a visual example, I rendered a high-quality PDF-to-image conversion with Adobe tools at the maximum size the API will actually use (768 px on the shortest side, i.e. the width for a portrait page), with the API's 512 px tile size overlaid in red (though the actual tiles may be divided evenly rather than at fixed offsets).
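The one preprocessing knob you do control from the API side is the documented message format with an explicit detail setting: send the image as a data URL and request "detail": "high" so the full tiled pass is used. A minimal sketch (the file name and prompt wording are placeholders):

import base64
import openai

with open("1.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

result = openai.chat.completions.create(
    model="gpt-4-vision-preview",
    max_tokens=1500,  # leave room for a full page of transcribed text
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe all of the text on this book page."},
                {
                    "type": "image_url",
                    "image_url": {
                        # Data URL carrying the base64-encoded JPEG
                        "url": f"data:image/jpeg;base64,{b64}",
                        # Request the high-detail (tiled) processing path
                        "detail": "high",
                    },
                },
            ],
        }
    ],
)
print(result.choices[0].message.content)

Also note that max_tokens=500 in your snippet is likely to truncate a full page of transcribed text, which can look like a quality problem when it is really truncation.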
Thanks for the detailed reply. I hear what you're saying about GPT-4-vision being overkill, but it works so well compared to the other services I tried, which include:
unstructured[dot]io
sensible[dot]so
gcp document ai
nanonets
airparser
docparser
If you have any recommendations for excellent OCR services that don't require a lot of image pre-processing, I'd appreciate it.
Based on this, am I right that it's possible for the files ingested by GPT-4 to contain images?
I have a knowledge-base PDF containing a lot of screenshots that I'd need to use.
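For what it's worth, my fallback plan is to rasterize each page myself and send the pages through the vision endpoint one at a time. A rough sketch of that step, assuming the pdf2image library (which requires poppler) and a placeholder file name:

from pdf2image import convert_from_path

# Render every page of the knowledge-base PDF to a PIL image at 200 dpi
pages = convert_from_path("knowledge_base.pdf", dpi=200)
for i, page in enumerate(pages):
    page.save(f"page_{i}.jpg", "JPEG")  # ready to base64-encode and send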