Challenges with GPT Image Classification: Seeking Solutions

Hello everyone,

I’ve been working on an image classification project using GPT models and have run into a significant issue. Despite writing detailed prompts, the GPT models do not categorize images according to my predefined standards. Instead, they often generate a similar image rather than classifying the uploaded one, or they classify the uploaded images incorrectly. I’m looking for insights into why this is happening and how I can improve the process.

Here is my specific prompt, which I’ve used for guidance:

1. Read four CSV files from a knowledge base, each representing a different classification system. These files contain various analytical indicators for labeling images or textual content. Note: CSVs do not have column names.
2. For images, return the 10 most suitable tags: 3 from the first CSV (covering broad themes like cuisine, parenting, travel, home decor, etc.), 3 from the second CSV (focused on specific objects or concepts like animals, landmarks, food, beverages, furniture, tourist spots), 2 from the third CSV (covering emotions, styles, and themes like joy, sadness, romance, adventure, calmness), and 2 from the fourth CSV (specific features of images, e.g., ‘with/without a clear face shot’ and types of textual content).
3. For textual content, return 10 tags: 4 from the first CSV, 3 from the second CSV, and 3 from the third CSV. No tags from the fourth CSV are needed for text.
4. The fourth category includes tags like ‘with clear face shot’ or ‘without clear face shot’ for images, and ‘textual content’ or ‘image content’ based on the length of the text in the image.
5. Provide a 50-60 word description of the image.
6. The purpose of these tags and descriptions is to match content for marketing posts and images on the Xiaohongshu platform.
7. When a user uploads a file and says “analyze the uploaded file,” the classification and description should strictly follow these instructions, based solely on the four CSV documents from the knowledge base.
8. The final output should be a JSON-formatted list of 10 tag classifications and a text description, with no additional actions required.
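Since step 8 asks for strictly JSON-formatted output, it helps to validate each reply programmatically rather than by eye. Below is a minimal sketch of such a check, assuming a hypothetical output shape with a `tags` object keyed by `theme`/`object`/`mood`/`feature` (3+3+2+2 tags, matching step 2) plus a `description` string; adjust the keys to whatever your prompt actually specifies.

```python
import json

# Hypothetical reply shape (an assumption, not the actual model output):
# {"tags": {"theme": [...3], "object": [...3], "mood": [...2], "feature": [...2]},
#  "description": "..."}
EXPECTED_COUNTS = {"theme": 3, "object": 3, "mood": 2, "feature": 2}

def validate_image_output(raw: str) -> list[str]:
    """Return a list of problems with the model's JSON reply (empty if OK)."""
    problems = []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return [f"not valid JSON: {e}"]
    tags = data.get("tags", {})
    # Check the 3/3/2/2 split across the four CSV-derived categories.
    for key, want in EXPECTED_COUNTS.items():
        got = len(tags.get(key, []))
        if got != want:
            problems.append(f"{key}: expected {want} tags, got {got}")
    # Step 5 asks for a 50-60 word description.
    words = len(data.get("description", "").split())
    if not 50 <= words <= 60:
        problems.append(f"description is {words} words, expected 50-60")
    return problems
```

If a reply fails validation, you can feed the problem list back to the model in a follow-up message and ask it to correct its output.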

I would rearrange these steps. The model's proficiency is in describing images; jumping straight into a classification task is uncharted territory. So first ask for a full description of everything seen in the image, including bounding boxes and the percentage of the frame occupied by each object or theme.

From there, you can work on the language-task output (the tags and description) in the same response.
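The describe-first flow above can be sketched as a single request that forces the description step before classification. This only assembles the chat payload in the OpenAI vision message format; the model choice and the actual API call (e.g. via the official `openai` client) are left to you, and the prompt wording is illustrative.

```python
DESCRIBE_PROMPT = (
    "First, describe everything you see in the image: each object or theme, "
    "its rough bounding box, and the approximate percentage of the frame it "
    "occupies."
)
CLASSIFY_PROMPT = (
    "Then, based only on that description and the four CSV tag lists in the "
    "knowledge base, return a JSON object with the 10 required tags and a "
    "50-60 word description."
)

def build_messages(image_url: str) -> list[dict]:
    """Assemble one request whose prompt orders description before classification."""
    return [
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": DESCRIBE_PROMPT + "\n\n" + CLASSIFY_PROMPT},
        ]},
    ]
```

Because both steps happen in one response, the classification can attend to the description the model just produced, which is the point of the rearrangement.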

Great, thanks!

I also tested this: I asked GPT to describe the image first and then classify it, and I got better results.