Image selection with API - How to achieve high relevancy?

Hi everyone,

I’m trying to automate the selection & assembly of different images to illustrate a short voice over. Putting aside the cost & processing time for now, my main goal is to come up with the most relevant & accurate image selection.

Right now, despite many iterations, I still end up with results that are quite bad: at best only about 50% of the images are relevant, and usually fewer.

HERE IS THE PROCESS I CREATED SO FAR:

Step 1: Retrieve images from a database using 2 processes:

-Specific keywords: Generate very precise keywords for each paragraph of my voice over, hoping to retrieve images particularly relevant to that specific part of the voice over.

-Broad keywords: Generate general keywords related to the voice over as a whole, hoping to retrieve additional images that can serve as a backfill to illustrate my voice over in case the specific images cannot be used.

→ Approximately 600 images retrieved
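To make the two retrieval passes concrete, here is a minimal sketch of how their results might be merged into one pool, deduplicating by image ID and remembering which pass retrieved each image (all names and the sample data are illustrative, not the actual pipeline):

```python
# Merge the "specific" and "broad" keyword passes into one deduplicated pool.
# A specific-keyword hit outranks a broad one, so it sets the source tag first.

def merge_retrievals(specific_hits, broad_hits):
    """specific_hits / broad_hits: lists of (image_id, keyword) tuples."""
    merged = {}
    for image_id, keyword in specific_hits:
        merged.setdefault(image_id, {"source": "specific", "keywords": []})
        merged[image_id]["keywords"].append(keyword)
    for image_id, keyword in broad_hits:
        # setdefault keeps an existing "specific" entry intact.
        merged.setdefault(image_id, {"source": "broad", "keywords": []})
        merged[image_id]["keywords"].append(keyword)
    return merged

pool = merge_retrievals(
    [("img1", "spaceship cockpit"), ("img2", "alien city")],
    [("img2", "sci-fi"), ("img3", "sci-fi")],
)
```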

Step 2: Apply a broad filter to remove bad images (blur, duplicates, images with text, etc.).

→ Approximately 150 images remaining.
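The quality filter can be sketched as a chain of predicates plus duplicate removal. The checks below are stand-ins that read hypothetical metadata fields (`blur_score`, `has_text`, `hash`); the real blur, text, and duplicate detectors are assumed, not shown:

```python
# Sketch of Step 2 as a filter chain. Each predicate is a placeholder for a
# real detector; the threshold values are illustrative.

def is_sharp(image):
    return image.get("blur_score", 1.0) > 0.5    # hypothetical metadata field

def has_no_text(image):
    return not image.get("has_text", False)      # hypothetical metadata field

def quality_filter(images, checks=(is_sharp, has_no_text)):
    seen_hashes = set()
    kept = []
    for img in images:
        if img["hash"] in seen_hashes:           # drop exact duplicates
            continue
        seen_hashes.add(img["hash"])
        if all(check(img) for check in checks):
            kept.append(img)
    return kept
```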

Step 3: Send each image to GPT and retrieve a 2-3 sentence image description.

Step 4: Using a combination of the following elements:

-Image description (retrieved in step 3)

-Voice over script (2 pages)

-Contextual information related to the Voice over (short document, <20 pages, containing general info about the voice over)

I send batches of 10 images to GPT, asking it to exclude the most irrelevant / off-topic images.

→ Approximately 80 images remaining
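The batching in Step 4 is simple to sketch: chunk the descriptions into groups of 10 and build one exclusion prompt per batch. The prompt wording and helper names below are illustrative:

```python
# Split image descriptions into batches of 10 and build one exclusion
# prompt per batch. Prompt text is a placeholder, not the real prompt.

def chunk(items, size=10):
    return [items[i:i + size] for i in range(0, len(items), size)]

def build_exclusion_prompt(batch, script, context):
    numbered = "\n".join(f"{i + 1}. {desc}" for i, desc in enumerate(batch))
    return (
        "Voice-over script:\n" + script + "\n\n"
        "Context:\n" + context + "\n\n"
        "Image descriptions:\n" + numbered + "\n\n"
        "List the numbers of the images that are irrelevant or off-topic."
    )

batches = chunk([f"description {i}" for i in range(25)], size=10)
```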

Step 5: Final image selection:

-Going through each paragraph of the voice over.

-Focusing on images obtained with specific keywords (using their descriptions): asking GPT to select the top 3 images to illustrate the paragraph.

-Then, focusing on images obtained with broad keywords (using their descriptions): asking GPT to review the selected images and determine whether a broad image would be more relevant as a replacement for one of the specific images.

-Repeating the process until all available images have been reviewed.
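The selection logic of Step 5 for a single paragraph can be sketched like this. `score()` is a toy word-overlap stand-in for the GPT relevance judgment; the "broad replaces specific only if more relevant" rule is encoded as a tie-break that favors specific-keyword images:

```python
# Per-paragraph top-3 selection. score() is a placeholder for the GPT
# relevance call; ties between equal scores go to specific-keyword images.

def score(paragraph, description):
    para_words = set(paragraph.lower().split())
    return len(para_words & set(description.lower().split()))

def pick_top3(paragraph, specific, broad):
    """specific / broad: dicts of image_id -> description."""
    scored = [(score(paragraph, d), 1, i) for i, d in specific.items()]
    scored += [(score(paragraph, d), 0, i) for i, d in broad.items()]
    scored.sort(reverse=True)          # highest score first, specific on ties
    return [img_id for _, _, img_id in scored[:3]]
```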

What would you recommend to improve my selection relevancy? I’m kind of out of options at this point.

Thank you in advance for your help.


An embeddings-based semantic search over textual metadata would seem to be the best fit here.

The key would be an entity-extraction and summarization AI that produces the same style of metadata for either an image input or a textual input. Besides producing search keywords, the topic, and the nouns depicted, you could also prompt for something like "imagine someone describing what they just saw, and explaining to another person how it would be useful". Some kind of significant transformation of both text and image into a common form.

Then obtain embeddings for that metadata. Store them in a vector database, associated with the source item.

Then you can run a semantic-similarity search over a very large corpus of images (or even other text) at almost no further expense: a single AI transformation plus one embeddings call for the item you want to match, since every image already has its embedding and the search itself is pure algorithm.
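The search step described here is just cosine similarity over stored vectors. A minimal pure-Python sketch, with toy 3-dimensional vectors standing in for real embeddings from an API:

```python
# Rank images by cosine similarity between a query embedding and the stored
# embeddings of each image's metadata. Vectors are toy examples; in practice
# they would come from an embeddings API and live in a vector database.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_images(query_vec, image_vecs, top_k=5):
    """image_vecs: dict of image_id -> embedding vector."""
    scored = sorted(image_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [img_id for img_id, _ in scored[:top_k]]
```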

Thank you for your reply.

I did consider an embedding-based approach but eventually decided to go with a different option. I was under the impression that an embedding system would end up being less accurate than my current system, especially as my images are very specific to the theme I plan to cover in my voice over (for instance, a sci-fi movie).

Given that high accuracy is my main target right now, would you agree that an embedding approach would simply lead to a less accurate result?

It depends on the application: how it is expected to be used and what you actually do.

Classifying a bunch of images by a language AI with scores across categories, or asking "which of these 5 pics best suits the text" 100 times, or other ways of talking to an AI about 200 images versus many text segments, doesn't seem like a solution with a path to success.

Following my embeddings technique, your service could respond:

“here’s the best ranked 5 images from our library of thousands”

…in about 5-10 seconds.

or that selection could be what you ask the AI about, once.

Sure, let me provide more details. My main use case is the following one:

-Approximately 100 images pre-selected & filtered from a database. Each image is connected to a keyword search term that I used to search in the DB.

-Approximately 65% of the images retrieved are actually related to the keywords; the remaining 35% are totally unrelated.

-Each image has very limited metadata.

-A text document split into 15 segments. Among the 65% images, only a few will be relevant to illustrate the document.

-Each text segment is associated with a specific keyword search term, which corresponds to a sub-set of images retrieved from the DB.

-The whole text is also associated with broader keyword search terms, which likewise correspond to a sub-set of images retrieved from the DB.

MY MAIN GOAL:

-Select the most relevant images to illustrate each segment of the text document.

-Return the top 3 images to illustrate each segment. Among the 100 pre-selected images, never use the same image twice.

-I don’t care about the processing time or the cost, accuracy is the only thing that matters.
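The "never reuse an image" constraint across all segments can be enforced with a simple greedy assignment over a global score table. This is only a sketch under the assumption that a relevance score exists for every (segment, image) pair, whether it comes from GPT judgments or embedding similarity:

```python
# Greedy global assignment: walk all (segment, image) pairs in descending
# score order, giving each segment up to 3 images and never reusing an image.
# Scores are supplied directly here; producing them is a separate step.

def assign_images(scores, per_segment=3):
    """scores: dict of (segment_id, image_id) -> relevance score."""
    assignment = {}
    used = set()
    for (seg, img), _ in sorted(scores.items(),
                                key=lambda kv: kv[1], reverse=True):
        slots = assignment.setdefault(seg, [])
        if img not in used and len(slots) < per_segment:
            slots.append(img)
            used.add(img)
    return assignment
```

A greedy pass is not globally optimal (a Hungarian-style assignment would be), but it is transparent and directly respects the no-reuse rule.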