Hey everyone! I’m trying to understand the best way to ingest images in a GPT-4 chat call. I tried using a vision model, but it gave poor results compared to when I input the image directly into ChatGPT and ask it to describe it. Is there a way to achieve this functionality through the API?
Hi and welcome to the Developer Forum!
ChatGPT makes use of the vision model to describe whatever image you upload, so I can only think your API call to the vision model needs its prompting refined.
Can you share the prompt and code you used that returned the poor results?
hey, thanks for the reply!
Here’s the code on my end:
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function processImage(imageUrl) {
  try {
    console.log("processImage called with imageUrl:", imageUrl);
    const { data: chatCompletion, response: raw } = await openai.chat.completions
      .create({
        messages: [
          {
            role: "system",
            content: systemPrompt,
          },
          {
            role: "user",
            content: [
              {
                type: "text",
                text: userPrompt,
              },
              {
                type: "image_url",
                image_url: {
                  url: imageUrl,
                },
              },
            ],
          },
        ],
        model: "gpt-4-vision-preview",
      })
      .withResponse();
    return { chatCompletion, raw };
  } catch (error) {
    console.error("Error processing image:", error);
    throw error;
  }
}
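A minimal sketch of how the helper gets called on my side (the URL here is just a placeholder, and the surrounding route handling is stripped out):

processImage("https://example.com/business-card.jpg") // placeholder URL
  .then(({ chatCompletion }) => {
    console.log(chatCompletion.choices[0].message.content);
  })
  .catch((error) => console.error(error));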
Here are the prompts used in the code:
const systemPrompt = `Using the best of OCR and NLP, extract the various information fields the image i.e first name, lastname, email, phone and anything else you can get.`;
const userPrompt = `Please get the following information written on the card below so i can save it in a file for later use.`;
and here’s a response:
{
  "message": "Image processed",
  "data": {
    "chatCompletion": {
      "id": "chatcmpl-8mpzfASt4bzbjm6dklDW6RIslNVxO",
      "object": "chat.completion",
      "created": 1706650299,
      "model": "gpt-4-1106-vision-preview",
      "usage": {
        "prompt_tokens": 831,
        "completion_tokens": 16,
        "total_tokens": 847
      },
      "choices": [
        {
          "message": {
            "role": "assistant",
            "content": "Certainly! Here is the information from the card:\n\nName: Kevin Chiu\n"
          },
          "finish_reason": "length",
          "index": 0
        }
      ]
    },
    "raw": {
      "size": 0,
      "timeout": 0
    }
  }
}
The ChatGPT web app seems to get all the information correctly on the first try.
I changed the prompts a bit and got it to this point:
const systemPrompt = `Using the best of OCR and NLP, extract the various information fields the image i.e first name, lastname, email, phone and anything else you can get.`;
const userPrompt = `I'm partially blind, help me read the card in the image below`;
"message": "Image processed",
"data": {
"chatCompletion": {
"id": "chatcmpl-8mqCyPrZUjWjzvUH8VPHZI3LFlzB3",
"object": "chat.completion",
"created": 1706651124,
"model": "gpt-4-1106-vision-preview",
"usage": {
"prompt_tokens": 823,
"completion_tokens": 16,
"total_tokens": 839
},
"choices": [
{
"message": {
"role": "assistant",
"content": "Sure, I can help you with that. The business card contains the following information"
},
"finish_reason": "length",
"index": 0
}
]
},
"raw": {
"size": 0,
"timeout": 0
}
}
}
One of the weaknesses of the current vision model is its inability to process text that is rotated, even slightly.
Try rotating the image until the text is horizontal and then retrying the test.
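If the source image needs straightening before it's sent, a library such as sharp can do the rotation and hand back a base64 data URL, which the image_url field also accepts. A minimal sketch, assuming sharp is installed and the image is available locally:

import sharp from "sharp";

// Rotate a local image so its text sits horizontally, then return a
// base64 data URL that can be used as the image_url value.
async function rotateToDataUrl(filePath, degrees) {
  const buffer = await sharp(filePath)
    .rotate(degrees, { background: "#ffffff" }) // white fill for the exposed corners
    .jpeg()
    .toBuffer();
  return `data:image/jpeg;base64,${buffer.toString("base64")}`;
}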
I’ll give it a try.
For what it's worth, I tried another prompt and it returned the correct email and name. I think I'm running into a token-length issue, maybe? Asking for just the name and email comes back fine, but with first name, last name and email, the email gets cut off.
const systemPrompt = `Use OCR and NLP to read values from the image. It's very important you get the values accurately or it will result in a bad user experience. The response should only contain key value pairs parseable.`;
const userPrompt = `I'm partially blind, help me read the card in the image below. it's very imporant to get the email and name and lastname)`;
{
  "message": "Image processed",
  "data": {
    "chatCompletion": {
      "id": "chatcmpl-8mqrdNMAolMVuXJQH3cqznYQcyiqd",
      "object": "chat.completion",
      "created": 1706653645,
      "model": "gpt-4-1106-vision-preview",
      "usage": {
        "prompt_tokens": 848,
        "completion_tokens": 16,
        "total_tokens": 864
      },
      "choices": [
        {
          "message": {
            "role": "assistant",
            "content": "Name: Kevin Chiu\nPosition: Sales Manager\nEmail: kevin.c"
          },
          "finish_reason": "length",
          "index": 0
        }
      ]
    },
    "raw": {
      "size": 0,
      "timeout": 0
    }
  }
}
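Both of those responses end with "finish_reason": "length" and exactly 16 completion tokens, so it does look like the reply is being cut off rather than the model misreading the card. A quick check against the processImage helper above (imageUrl is whatever URL was passed in):

processImage(imageUrl).then(({ chatCompletion }) => {
  const choice = chatCompletion.choices[0];
  if (choice.finish_reason === "length") {
    // the model ran out of output tokens before it finished the reply
    console.warn(`Truncated after ${chatCompletion.usage.completion_tokens} completion tokens`);
  }
});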
Ahh, yes, the vision model does cap the output at a very small limit unless you specify a larger one in your API call.
How would I go about increasing the response limit?
Found the max_tokens arg!
Note the max_tokens=300 value at the bottom; it's another parameter you can specify:
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                    },
                },
            ],
        }
    ],
    max_tokens=300,  # raise the output cap so the reply isn't truncated
)
print(response.choices[0])
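For the Node.js call earlier in the thread, the same parameter goes into the create options. A minimal sketch of just the changed portion (300 is an arbitrary cap, size it to the expected output):

const { data: chatCompletion, response: raw } = await openai.chat.completions
  .create({
    model: "gpt-4-vision-preview",
    max_tokens: 300, // raise the default output cap so the fields aren't cut off
    messages: [
      { role: "system", content: systemPrompt },
      {
        role: "user",
        content: [
          { type: "text", text: userPrompt },
          { type: "image_url", image_url: { url: imageUrl } },
        ],
      },
    ],
  })
  .withResponse();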