Simple text embedding or CLIP for RAG?

Hello everyone,
I realise this question has probably already been answered somewhere else here, but unfortunately I can’t find a clear answer…

I’d like to code a small RAG module that finds the correct row in a data collection. Each data point consists of an image, a description of the image, and a few tags - the ultimate goal is to get the image back for further use.

The user will search by prompting. Would it be more appropriate to generate embeddings from the description only (the label), or from the images together with their labels (using CLIP)?

It’s not exactly clear to me what the advantages of image+label embeddings are if, say, we already have detailed labels through GPT-4 Vision, for example.
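For the text-only route, here’s a minimal sketch of the retrieval loop. The `embed` function is a toy bag-of-words stand-in (in a real system you’d swap in an embeddings API or CLIP’s text encoder), and the row contents are made up for illustration:

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for a real text-embedding model: a sparse,
    L2-normalised bag-of-words vector keyed by token."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()}

def cosine(a, b):
    # Both vectors are already unit-length, so the dot product is the cosine.
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())

# Each row pairs an image URL with the text that gets embedded
# (description + tags concatenated).
rows = [
    {"image": "img/001.png", "text": "a red sports car parked on a street, car vehicle red"},
    {"image": "img/002.png", "text": "a tabby cat sleeping on a sofa, cat pet indoor"},
    {"image": "img/003.png", "text": "a mountain lake at sunrise, landscape nature water"},
]
index = [(row, embed(row["text"])) for row in rows]

def retrieve(query, top_k=1):
    """Embed the user's prompt and return the image URLs of the
    top-k closest rows, with their similarity scores."""
    q = embed(query)
    scored = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [(row["image"], cosine(q, vec)) for row, vec in scored[:top_k]]

print(retrieve("photo of a cat on a couch"))
```

The structure is the same whichever embedding model you plug in: only `embed` changes, and with CLIP you could additionally embed the images themselves into the same space.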

Appreciate any help, have a good day!

I’m bumping this up, in case someone has an answer to the question :slight_smile: .

You want to retrieve an image.

If an image is not provided as input for matching, then it does not make sense to use an image-embedding model.

Embeddings-based search returns the “closest” match, not necessarily the “correct” one.
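One common mitigation for the “closest, not correct” issue is a similarity cutoff: return the nearest row only if its score clears a threshold, and otherwise report no confident match. A minimal sketch (the threshold value and row names are illustrative, not tuned):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def best_match(query_vec, index, min_score=0.3):
    """index: list of (row_id, embedding) pairs. Returns the nearest
    row's id only if it clears the cutoff; otherwise None, so a
    merely-closest-but-wrong hit is rejected instead of returned."""
    row_id, vec = max(index, key=lambda item: cosine(query_vec, item[1]))
    return row_id if cosine(query_vec, vec) >= min_score else None

# Tiny made-up index with 3-dimensional embeddings for illustration.
index = [
    ("img/001.png", [1.0, 0.0, 0.2]),
    ("img/002.png", [0.0, 1.0, 0.1]),
]
print(best_match([0.9, 0.1, 0.2], index))       # near img/001.png, accepted
print(best_match([0.0, 0.0, 1.0], index, 0.9))  # nothing clears the cutoff
```

Picking `min_score` is empirical: too low and wrong images slip through, too high and valid queries come back empty.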

Thanks for the reply! If I’m not wrong, CLIP essentially produces a semantic representation of the image’s features, so wouldn’t that amount to the same thing as simply finding the closest-matching label (provided by GPT-4V, for example) stored in the same data collection as the image URL to retrieve?