GPT-4 Vision pre-classification image cropping: salience, faces, and character density

Context: gpt-4-vision-preview, OpenAI API.

I am running classification over several thousand images for book illustration and for use as book covers. My house-style cover has a white rectangle in the lower third, which obscures roughly a sixth of the image.

To favor visually balanced covers, I want to select images where the ‘salient’ part of the image falls in the upper third, or where character faces appear in the upper third or upper half.

My classification returns JSON with between 140 and 280 textual visual features describing the image and the figures in it, and produces extremely good results, except for judging the position of feature density.
The ‘salient’ content (eyes, faces, or figure density) needs to sit in the top half of the image, since I don’t want images whose key features end up underneath the white title rectangle, but gpt-4-vision consistently mis-identifies where this information density sits.
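For concreteness, here is a minimal sketch of the kind of call described above, assuming the openai (>=1.0) Python client. The prompt wording and the JSON field names (`salient_region`, `face_positions`) are hypothetical illustrations, not the actual 140–280-feature schema:

```python
# Hedged sketch, not the author's actual pipeline: ask the model to
# report where the salient content and any faces sit, vertically.
import base64
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_position(image_path: str) -> dict:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("Return JSON only, with keys 'salient_region' and "
                          "'face_positions', each one of: top-third, "
                          "middle-third, bottom-third.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}",
                               "detail": "high"}},
            ],
        }],
    )
    # May need stripping of markdown fences if the model wraps its JSON.
    return json.loads(response.choices[0].message.content)
```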

From the documentation, my 3:4 and other portrait-ratio images appear to be cropped to a square before features are extracted and returned. The cropping seems to proceed from the top of the image, because I frequently get high-density information reported as ‘lower third’ or ‘bottom half of image’ when ‘middle third’ or ‘top half of image’ would be more accurate.

If the OpenAI team reads these messages, a clarification of the cropping, or of how the model judges salience or image density, would help. I can easily pad the image to a square, but before spending a day and a few hundred images probing the mysteries of the cropping sequence, I want to know whether anyone else has hit this issue, or whether there are observations or documentation I could pursue.
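As a sketch of the padding workaround mentioned above (assuming Pillow, which the post does not specify): centering the portrait image on a square white canvas means a square crop cannot discard the top or bottom of the original frame.

```python
# Pad a portrait image to a square canvas so any server-side square
# crop keeps the full original frame. A sketch, assuming Pillow.
from PIL import Image

def pad_to_square(path: str, out_path: str, fill=(255, 255, 255)) -> None:
    img = Image.open(path).convert("RGB")
    side = max(img.size)
    canvas = Image.new("RGB", (side, side), fill)
    # Center the original so equal padding lands on both sides.
    canvas.paste(img, ((side - img.width) // 2, (side - img.height) // 2))
    canvas.save(out_path)

pad_to_square("cover_candidate.jpg", "cover_candidate_square.jpg")
```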


Yeah, it’s hard to control in my experience.

You can try prompt hints like “rule of thirds” or “suitable for book cover”, though the latter might make it produce an actual book cover sometimes…

I’ve been toying with a one-click cover art creator… Just building new ones, not using older ones, though. Interesting… Are you just using “good” covers? How do you decide? Or just slurping in all covers? Tracking genre? Fiction / Non-Fiction?

Thanks Paul, the ‘rule of thirds’ is how the book cover is laid out visually and textually; I just need to work out which of thousands of images would suit. Since gpt-4-vision is documented to crop an image to a square, I’m wondering what the specific crop area is… It is perhaps too nuanced for the documentation.
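One cheap way to probe the crop empirically (my suggestion, not from the thread): generate a tall test image with a distinct label in each third, send it through the same pipeline, and see which labels the model reports back. A Pillow sketch, with arbitrary band colors and labels:

```python
# Build a portrait probe image with labeled thirds; if the model only
# ever reports some labels, that hints at where the square crop lands.
from PIL import Image, ImageDraw

def make_probe(out_path: str, width: int = 768, height: int = 1024) -> None:
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    labels = ["TOP THIRD", "MIDDLE THIRD", "BOTTOM THIRD"]
    band = height // 3
    for i, label in enumerate(labels):
        # Alternate band shading so the thirds are visually distinct.
        shade = (230, 230, 230) if i % 2 else (200, 200, 200)
        draw.rectangle([0, i * band, width, (i + 1) * band], fill=shade)
        draw.text((width // 2 - 60, i * band + band // 2), label, fill="black")
    img.save(out_path)

make_probe("crop_probe.png")
```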