Is it possible to classify multiple input images in a single API call with the GPT-4 Vision model? Please help with this use case

dxbbala355 · March 8, 2024, 8:35pm

I need to classify the template of each image (for example, PPT slides) into multiple categories with a single prompt. For example, in a PPT the first slide is an introduction slide, the second is a problem description slide, the third is a table, the fourth is a chart (bar, line, pie, etc.), the fifth is an email approval screenshot, and so on. These are the templates. I have to identify the template of each slide through a single prompt using the gpt-4-vision model. Is it possible?

Macha · March 8, 2024, 10:45pm

The short answer? No.

The long answer: .pptx is not a valid input format for GPT-4 vision. Plus, as far as I know, each call to the vision model takes a single image.

You would need to convert each frame/slide to its own image file and call gpt-4-v for each image.

stevenic · March 8, 2024, 10:52pm

So a better approach would be to identify the type of each slide individually, and then pass all of the individual slide identifications to a separate model call to identify the template.
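A rough sketch of that two-step flow, assuming the slides are already exported as PNG files and base64-encoded. The function names, the category list, and the prompt wording are illustrative assumptions, not something from this thread. Only the message building is shown; each returned list would be passed as `messages` to `client.chat.completions.create`.

```python
# Hypothetical category list; adjust it to your own templates.
CATEGORIES = [
    "introduction",
    "problem description",
    "table",
    "chart",
    "email approval screenshot",
]


def build_slide_request(image_b64: str) -> list[dict]:
    """Step 1: messages asking the vision model to label a single slide."""
    prompt = (
        "Classify this slide as exactly one of: "
        + ", ".join(CATEGORIES)
        + ". Reply with the category name only."
    )
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            # The image is passed inline as a base64 data URL (PNG assumed).
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }]


def build_summary_request(labels: list[str]) -> list[dict]:
    """Step 2: text-only messages that combine the per-slide labels."""
    listing = "\n".join(
        f"Slide {i}: {label}" for i, label in enumerate(labels, start=1)
    )
    return [{
        "role": "user",
        "content": "These are the categories of each slide in a deck:\n"
                   + listing
                   + "\nWhich overall template does the deck follow?",
    }]
```

The second request contains no images, so it could go to a cheaper text-only model.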

supershaneski · March 8, 2024, 11:29pm

Let’s assume you have your slides as separate image files already.

So for example, I have this prompt:

this is a sample powerpoint presentation slide template. i want you to classify each slide category like introduction, problem description, table, bar chart, etc. also give a brief description of the content of each slide.

And output is:

Based on the images provided, here is a classification and brief description of each slide:

1. Slide Category: Title Slide
   Description: This slide serves as the cover page for a presentation, featuring the title "Resources, energy & agriculture visuals" and a subtitle "Editable visuals toolbox." It includes icons representing various aspects of resources, energy, and agriculture, such as a tractor, an oil pump, a factory, and livestock.

2. Slide Category: Introduction/Inspiration Slide
   Description: This slide is designed to introduce the topic and inspire the audience. It asks if the presentation is about resources, energy, or agriculture production and suggests finding inspiration to make slides more visually appealing.

3. Slide Category: Information/List Slide
   Description: This slide provides an overview of natural resource types within the context of resources, energy, and agriculture. It includes a flow of icons and text describing the stages from extraction (e.g., mining) to transport (e.g., gas pipelines), power production, crop cultivation, and livestock production.

4. Slide Category: Comparison Slide
   Description: This slide compares different types of energy resources, specifically highlighting alternative energy resources. It contrasts traditional energy resources with renewable energy and nuclear energy, using icons and color coding to differentiate between the categories.

5. Slide Category: Data Visualization/Chart Slide
   Description: This slide presents a pie chart illustrating the gross inland energy consumption by fuel in the EU-28 for the year 2010. It uses icons and percentages to show the share of different energy sources, including oil, gas, nuclear, hard coal, lignite, and renewable energy.

6. Slide Category: Process/Flowchart Slide
   Description: This slide depicts a business flow chart for food production. It outlines the process from agricultural commodity to consumer market, with an emphasis on the food value-increasing processes that occur during food production. Icons are used to represent the agricultural commodity and the consumer market.

Each slide is designed with a specific purpose in mind, using visual elements to convey information effectively and engage the audience.

I am using my own custom chatbot, but it calls GPT-4V for the image analysis, sending the slide images together with the prompt above in one call. Not sure if this is what you want to do.
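If you need the categories as data rather than prose, a reply in the numbered format above can be parsed with a small regular expression. This is only a sketch: it assumes the model keeps the "N. Slide Category: ..." layout, which is not guaranteed, and `parse_slide_categories` is a made-up helper name.

```python
import re

# Matches lines such as "1. Slide Category: Title Slide".
CATEGORY_LINE = re.compile(
    r"^[ \t]*(\d+)\.[ \t]*Slide Category:[ \t]*(.+?)[ \t]*$",
    re.MULTILINE,
)


def parse_slide_categories(reply: str) -> dict[int, str]:
    """Map slide number -> category for every numbered category line."""
    return {int(num): cat for num, cat in CATEGORY_LINE.findall(reply)}


# Shortened stand-in for a model reply in the format shown above.
reply = """1. Slide Category: Title Slide
   Description: Cover page.

2. Slide Category: Comparison Slide
   Description: Compares energy resources.
"""
categories = parse_slide_categories(reply)
print(categories)
```

Comparing `len(categories)` with the number of images you sent is a cheap way to notice when the model skipped or merged slides.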

_j · March 8, 2024, 11:59pm

Contrary to what others say here, it is possible to send a set of images in one call and have them either considered as a whole to synthesize an answer, or answered about individually.

The only limitation is the model's cognition and coherence, especially when asking about text. You might have to run many trials to make sure the AI is answering from the actual contents and not giving a mere overview, but you will indeed get multiple images analyzed.

Example Python (including encoded images):

from openai import OpenAI

# The client reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI()
pngpre = 'iVBORw0KGgoAAAANSUhEUgAAAIAAAABACAMAAADlCI9NAAAAGXRFWHRTb2Z0d2FyZQBBZG9iZSBJbWFnZVJlYWR5ccllPAAAAAZQTFRF////'
example_images = [
'MzMzOFSMkQAAAPJJREFUeNrslm0PwjAIhHv//09rYqZADzOBqMnu+WLTruOGvK0lhBBCCPHH4E7x3pwAfFE4tX9lAUBVwZyAYjwFAeikgH3XYxn88nzKbIZly4/BluUlIG66RVXBcYd9TTQWN+1vWUEqIJQI5nqYP6scl84UqUtEoLNMjoqBzFYrt+IF1FOTfGsqIIlcgAbNZ0Uoxtu6igB+tyBgZhCgAZ8KyI46zYQF/LksQC0L3gigdQBhgGkXou1hF1XebKzKXBxaDsjCOu1Q/LA1U+Joelt/9d2QVm9MjmibO2mGTEy2ZyetsbdLgAQIIYQQQoifcRNgAIfGAzQQHmwIAAAAAElFTkSuQmCC',
'AAAAVcLTfgAAAPRJREFUeNrsllEKwzAMQ+37X3owBm0c2VZCIYXpfXVBTd9qx5uZEEIIIcQr8IHjAgcc/LTBGwSiz5sEoIwTKwuxVCAW5XsxFco3Y63A3BawVWDMiFgiMD5tvELNuh/r5sA9Nu1yiYaXvBBLBawUAGubsZU5UOy8HkNvINoAv27nMVZ1WC1wfwrspPk2FDMiVpYknNu6uIxAVWQsgBoSCCQxI2KEANFdXccXseZzuKMQQDFmt6pPwU9CL+CcADEJr6qFA1aWYIgZEesGEVgmTsGvfYyIdaPYwp6JwBRL5kD4Hs7+VWGSz8aEEEIIIYQQ/8VHgAEAxPsD+SYeZ2QAAAAASUVORK5CYII=',
'AAAAVcLTfgAAAPVJREFUeNrslsEOhCAMRNv//+nNbtYInRELoniYdyJC2hdsATMhhBBCiFfiG4vTT1XIx/LA0wJl0hUCIeU8g2QgSBiFelJOFoCq+I3+H8ox6aN8SeGK7QvW5XfghcA+B0WcFvBDgToWbEmVANvoigBO1AIGY6N9lKuBlgAsClJ0bLME2CKaB1Kx1RcEQmWxHfK7BFhpPyHAOus+AVxW9lG7BqYJ+IHAWRHajCKE+6/YgB6B4TaMBk4EPCPgwwIG5yfEOROIp3XvxU4fRO74UGr/d3J3pt837OqAm6cl0IrQ8zAcOacbERa+s4UQQgghhBBv5iPAAA3BAvjyKYgWAAAAAElFTkSuQmCC',
]
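# Prepend the shared PNG header so each entry is a complete base64-encoded PNG.
example_images = [pngpre + s for s in example_images]
user_message = [{
    "role": "user", "content": [
        {"type": "text", "text": "Describe the overall theme of these images"},
        # One image_url part per image, each passed as a base64 data URL
        # (the documented content-part format).
        *[{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img}"}}
          for img in example_images],
    ]
}]
# A near-zero top_p keeps the output close to deterministic.
response = client.chat.completions.with_raw_response.create(
    model="gpt-4-vision-preview", max_tokens=100, top_p=1e-9,
    messages=user_message)
print(response.parse().choices[0].message.content)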

Response:

The images you’ve provided all contain the names of different types of food, specifically fruits and a vegetable. The words “Apple,” “Banana,” and “Carrot” are written in a simple, plain text font. The overall theme is related to healthy eating or fresh produce.
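For slides stored as files, the same kind of multi-image message can be built in a loop. The sketch below is an assumption-laden variant, not the code used above: `build_multi_image_message` is a hypothetical helper, the files are assumed to be PNGs, and the "Slide N:" text label before each image is just one way to help the model answer about images individually.

```python
import base64
from pathlib import Path


def build_multi_image_message(prompt: str, image_paths: list[str]) -> list[dict]:
    """Build one user message with the prompt plus every image, each labeled."""
    content = [{"type": "text", "text": prompt}]
    for index, path in enumerate(image_paths, start=1):
        encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
        # A short text label before each image gives the model a name to refer to.
        content.append({"type": "text", "text": f"Slide {index}:"})
        content.append({
            "type": "image_url",
            # PNG is assumed here; change the MIME type for other formats.
            "image_url": {"url": f"data:image/png;base64,{encoded}"},
        })
    return [{"role": "user", "content": content}]
```

The returned list can be passed as `messages` to `client.chat.completions.create` along with a vision-capable model name.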
