Hey everyone! I’m trying to understand the best way to ingest images in a GPT-4 chat call. I tried using a vision model, but it gave poor results compared to when I input the image directly into ChatGPT and ask it to describe it. Is there a way to achieve this functionality through the API?
Hi and welcome to the Developer Forum!
ChatGPT makes use of the vision model to describe whatever image you upload, so I can only think your API call to the vision model needs its prompting refined.
Can you share the prompt and code you used that returned the poor results?
hey, thanks for the reply!
Here’s the code on my end:
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function processImage(imageUrl) {
  try {
    console.log("processImage called with imageUrl:", imageUrl);
    const { data: chatCompletion, response: raw } = await openai.chat.completions
      .create({
        messages: [
          {
            role: "system",
            content: systemPrompt,
          },
          {
            role: "user",
            content: [
              {
                type: "text",
                text: userPrompt,
              },
              {
                type: "image_url",
                image_url: {
                  url: imageUrl,
                },
              },
            ],
          },
        ],
        model: "gpt-4-vision-preview",
      })
      .withResponse();
    return { chatCompletion, raw };
  } catch (error) {
    console.error("Error processing image:", error);
    throw error;
  }
}
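A minimal sketch of how the helper gets called on my side (the URL here is just a placeholder, and the surrounding route handling is stripped out):

processImage("https://example.com/business-card.jpg") // placeholder URL
  .then(({ chatCompletion }) => {
    console.log(chatCompletion.choices[0].message.content);
  })
  .catch((error) => console.error(error));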
Here are the prompts used in the code:
const systemPrompt = `Using the best of OCR and NLP, extract the various information fields the image i.e first name, lastname, email, phone and anything else you can get.`;
const userPrompt = `Please get the following information written on the card below so i can save it in a file for later use.`;
and here’s a response:
{
  "message": "Image processed",
  "data": {
    "chatCompletion": {
      "id": "chatcmpl-8mpzfASt4bzbjm6dklDW6RIslNVxO",
      "object": "chat.completion",
      "created": 1706650299,
      "model": "gpt-4-1106-vision-preview",
      "usage": {
        "prompt_tokens": 831,
        "completion_tokens": 16,
        "total_tokens": 847
      },
      "choices": [
        {
          "message": {
            "role": "assistant",
            "content": "Certainly! Here is the information from the card:\n\nName: Kevin Chiu\n"
          },
          "finish_reason": "length",
          "index": 0
        }
      ]
    },
    "raw": {
      "size": 0,
      "timeout": 0
    }
  }
}
The ChatGPT web app seems to get all the information correctly on the first try.
I changed the prompts a bit and got it to this point:
const systemPrompt = `Using the best of OCR and NLP, extract the various information fields the image i.e first name, lastname, email, phone and anything else you can get.`;
const userPrompt = `I'm partially blind, help me read the card in the image below`;
"message": "Image processed",
"data": {
"chatCompletion": {
"id": "chatcmpl-8mqCyPrZUjWjzvUH8VPHZI3LFlzB3",
"object": "chat.completion",
"created": 1706651124,
"model": "gpt-4-1106-vision-preview",
"usage": {
"prompt_tokens": 823,
"completion_tokens": 16,
"total_tokens": 839
},
"choices": [
{
"message": {
"role": "assistant",
"content": "Sure, I can help you with that. The business card contains the following information"
},
"finish_reason": "length",
"index": 0
}
]
},
"raw": {
"size": 0,
"timeout": 0
}
}
}
One of the weaknesses of the current vision model is its inability to process text that is rotated, even slightly.
Try rotating the image until the text is horizontal and then retrying the test.
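If the source image needs straightening before it's sent, a library such as sharp can do the rotation and hand back a base64 data URL, which the image_url field also accepts. A minimal sketch, assuming sharp is installed and the image is available locally:

import sharp from "sharp";

// Rotate a local image so its text sits horizontally, then return a
// base64 data URL that can be used as the image_url value.
async function rotateToDataUrl(filePath, degrees) {
  const buffer = await sharp(filePath)
    .rotate(degrees, { background: "#ffffff" }) // white fill for the exposed corners
    .jpeg()
    .toBuffer();
  return `data:image/jpeg;base64,${buffer.toString("base64")}`;
}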
I’ll give it a try.
For what it's worth, I tried another prompt and it returned the correct email and name. I think I'm running into a token-length issue, maybe? Asking for just the name and email comes back fine, but with first name, last name and email, the email gets cut off.
const systemPrompt = `Use OCR and NLP to read values from the image. It's very important you get the values accurately or it will result in a bad user experience. The response should only contain key value pairs parseable.`;
const userPrompt = `I'm partially blind, help me read the card in the image below. it's very imporant to get the email and name and lastname)`;
{
  "message": "Image processed",
  "data": {
    "chatCompletion": {
      "id": "chatcmpl-8mqrdNMAolMVuXJQH3cqznYQcyiqd",
      "object": "chat.completion",
      "created": 1706653645,
      "model": "gpt-4-1106-vision-preview",
      "usage": {
        "prompt_tokens": 848,
        "completion_tokens": 16,
        "total_tokens": 864
      },
      "choices": [
        {
          "message": {
            "role": "assistant",
            "content": "Name: Kevin Chiu\nPosition: Sales Manager\nEmail: kevin.c"
          },
          "finish_reason": "length",
          "index": 0
        }
      ]
    },
    "raw": {
      "size": 0,
      "timeout": 0
    }
  }
}
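Both of those responses end with "finish_reason": "length" and exactly 16 completion tokens, so it does look like the reply is being cut off rather than the model misreading the card. A quick check against the processImage helper above (imageUrl is whatever URL was passed in):

processImage(imageUrl).then(({ chatCompletion }) => {
  const choice = chatCompletion.choices[0];
  if (choice.finish_reason === "length") {
    // the model ran out of output tokens before it finished the reply
    console.warn(`Truncated after ${chatCompletion.usage.completion_tokens} completion tokens`);
  }
});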
Ahh, yes, the vision model does cap the output at a very small limit unless you specify a larger one in your API call.
How would I go about increasing the response limit?
Found the max_tokens arg!
Note the max_tokens=300 value at the bottom; it's another parameter you can specify:
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                    },
                },
            ],
        }
    ],
    max_tokens=300,  # raise the output cap so the reply isn't truncated
)
print(response.choices[0])
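For the Node.js call earlier in the thread, the same parameter goes into the create options. A minimal sketch of just the changed portion (300 is an arbitrary cap, size it to the expected output):

const { data: chatCompletion, response: raw } = await openai.chat.completions
  .create({
    model: "gpt-4-vision-preview",
    max_tokens: 300, // raise the default output cap so the fields aren't cut off
    messages: [
      { role: "system", content: systemPrompt },
      {
        role: "user",
        content: [
          { type: "text", text: userPrompt },
          { type: "image_url", image_url: { url: imageUrl } },
        ],
      },
    ],
  })
  .withResponse();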