Context: gpt-4-vision-preview, OpenAI API.
I am classifying several thousand images for use as book illustrations and book covers. My house-style cover has a white rectangle in the lower third of the cover, which obscures roughly 1/6 of the image.
To favor visually balanced covers, I want to select images where the ‘salient’ part of the image falls in the upper third, or where character faces appear in the upper third or upper half of the image.
My classification returns a JSON object with between 140 and 280 textual visual features describing the image and the figures in it, and produces extremely good results, except for its judgement of where the feature density sits in the frame.
The ‘salient’ or prominent content (eyes, faces, figure density) needs to be in the top half of the image, since I don’t want images where key features end up underneath the white title rectangle, but gpt-4-vision consistently mis-identifies the position of this information density.
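For context, the requests look roughly like the sketch below (openai>=1.0 Python SDK); the prompt text and the helper here are simplified stand-ins for my actual classification prompt:

```python
# Rough sketch of the kind of request I'm sending (openai>=1.0 Python SDK).
# The prompt wording is simplified, not my exact production prompt.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_image(path: str) -> str:
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        max_tokens=1500,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("Return a JSON object of visual features for this image. "
                          "For faces, eyes, and other salient regions, state which "
                          "vertical third of the full frame they occupy.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}",
                               "detail": "high"}},
            ],
        }],
    )
    return response.choices[0].message.content
```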
From the documentation, my 3:4 and other portrait-aspect images appear to be cropped to a square before feature extraction. The cropping seems to proceed from the top of the image, because I frequently get high-density information reported as ‘lower third’ or ‘bottom half of image’ when it would more correctly be ‘middle third’ or ‘top half of image’.
If anyone from the OpenAI teams reads these messages, a clarification on the cropping, or on how the model judges salience and image density, would help. I can easily pad the images to square, but before spending a day and a few hundred images probing the mysteries of the cropping sequence, I want to know whether anyone else has hit this issue, or whether there are observations or documentation I could pursue.
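For what it’s worth, the padding step I have in mind is trivial; here is a sketch assuming Pillow, a white fill, and centered placement (file names are just placeholders):

```python
# Pad a portrait image onto a white square canvas so nothing gets cropped away.
from PIL import Image

def pad_to_square(path: str, out_path: str, fill=(255, 255, 255)) -> None:
    img = Image.open(path).convert("RGB")
    side = max(img.size)
    canvas = Image.new("RGB", (side, side), fill)
    # Center the original on the square canvas; for portrait images this adds
    # padding left and right only, so vertical positions are unchanged.
    canvas.paste(img, ((side - img.width) // 2, (side - img.height) // 2))
    canvas.save(out_path)

pad_to_square("candidate.jpg", "candidate_square.jpg")
```

Since these are portrait images, the padding lands on the left and right, so the vertical ‘thirds’ I care about should map straight back onto the original frame.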