Hello everyone,
I realise this question has probably already been answered somewhere else here, but unfortunately I can’t find a clear answer…
I’d like to code a small RAG module that retrieves the correct row from a data structure. Each data point consists of an image, a description of the image, and a few tags; the ultimate goal is to get the image back for further use.
The user will search with a text prompt. Would it be more appropriate to generate embeddings from the description only (the label), or from the images together with their corresponding labels (using CLIP)?
It’s not clear to me what the advantages of image+label embeddings would be if, say, we already had detailed labels generated by GPT4Vision.
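To make the question concrete, here is a rough sketch of the description-only variant I have in mind. It is only an illustration: `Row`, `embed`, `build_index` and `search` are names I made up, and `embed` is a toy bag-of-words stand-in for a real text-embedding model, so the ranking quality here says nothing about how a real model would behave.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Row:
    """One data point: the image to return, plus the text used to find it."""
    image_path: str
    description: str
    tags: list[str] = field(default_factory=list)


def embed(text: str) -> Counter:
    """Toy stand-in for a real embedding model: word counts as a sparse vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(rows: list[Row]) -> list[tuple[Counter, Row]]:
    """Embed each row's description and tags once, up front."""
    return [(embed(r.description + " " + " ".join(r.tags)), r) for r in rows]


def search(index: list[tuple[Counter, Row]], prompt: str, k: int = 1) -> list[Row]:
    """Return the k rows whose text embedding is closest to the prompt."""
    query = embed(prompt)
    ranked = sorted(index, key=lambda pair: cosine(query, pair[0]), reverse=True)
    return [row for _, row in ranked[:k]]


# Made-up example data, purely for illustration.
rows = [
    Row("img/cat.jpg", "A grey cat sleeping on a red sofa", ["cat", "indoor"]),
    Row("img/dog.jpg", "A brown dog running on a beach", ["dog", "outdoor"]),
    Row("img/car.jpg", "A blue car parked in front of a house", ["car", "street"]),
]
index = build_index(rows)
best = search(index, "dog on the beach")[0]
print(best.image_path)
```

As far as I understand, the CLIP variant would keep the same structure but build the index from an image encoder and embed the prompt with the matching text encoder, since CLIP maps both into one shared space. What I can’t judge is whether that buys anything over the text-only index above.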
Appreciate any help, have a good day!