Multi-modal RAG issue with images

Hello, I am developing a chatbot using the GPT API and RAG with LangChain. The chatbot is designed to help users learn from a book written in German and Luxembourgish. I have built vector stores with only the text of the book, and the embedding models that retrieved the most relevant documents were BGE and Voyage. Now I need to add images from the book as illustrations (not generated ones, since I should first present the book content to the user and only then generate new content).

I tried to follow the options in the LangChain documentation, but summarizing my images isn't the best choice for me. I also tried generating new images with DALL-E and Stable Diffusion, but they didn't produce good illustrations. In your opinion, what should I do?

I could use CLIP for multi-modal embeddings, but the images are linked to text and I don't know how to do the mapping after retrieving them. Additionally, CLIP does not perform well on German.

If anyone has any ideas or sees where I'm going wrong, please share.

Either add a link to the images in your vectors' metadata, or put tags in the metadata and use those to search for images.
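To illustrate the first suggestion: each text chunk can carry its image references in metadata, so that after text retrieval the app can look the images up directly. This is a minimal stdlib-only sketch with made-up URLs and a fake keyword "retriever"; in practice BGE/Voyage embeddings and a real vector store would do the similarity search.

```python
import json

# Toy "vector store": each chunk's metadata holds its image refs as a JSON string.
store = [
    {
        "text": "Dialog, Seite 1: Zwei Leute unterhalten sich im Park.",
        "metadata": {"images": json.dumps([
            {"url": "https://example.com/book/page1-fig1.png",
             "alt": "two people in the park"},
        ])},
    },
    {
        "text": "Dialog, Seite 2: Im Café.",
        "metadata": {"images": json.dumps([
            {"url": "https://example.com/book/page2-fig1.png",
             "alt": "at the café"},
        ])},
    },
]

def retrieve(query):
    # Stand-in for similarity search: naive keyword match on the chunk text.
    return [e for e in store if any(w in e["text"].lower() for w in query.lower().split())]

def images_for(entry):
    # Decode the image references stored alongside the retrieved chunk.
    return json.loads(entry["metadata"]["images"])

for hit in retrieve("Park"):
    for img in images_for(hit):
        print(img["url"], "-", img["alt"])
```

The point is that the mapping text-to-image never goes through the LLM at all; it rides along in the metadata of whichever chunks the retriever returns.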


What exactly was the issue with CLIP? If my understanding is correct, you used CLIP to get both image and text embeddings instead of BGE or Voyage, but CLIP performed worse on the German text than the other text embedding models; is that correct?

I’m also curious why summaries of the images don’t work for your use case.


Hi, thank you for your time and help.
My text is a dialogue and the images are illustrations of that dialogue, so a summary is just a description of the scene (e.g., two people in the park). What I want exactly is to show the user the dialogue with its image illustrations as-is, and then use that dialogue to generate a new one.

As for CLIP, from my point of view it will not perform well because my main languages are German and Luxembourgish (non-English).

Also, after retrieving the image from the vector store, I would have to pass it through gpt-4-vision-preview, which gives only textual output, and I want the image itself to be displayed.

I would appreciate more clarification about your idea. Are you suggesting I store the path as metadata and use a tool to display the image? I tried something like this with GPT-4 but didn't succeed; in my attempt, though, the path was in the retrieved text, not in the metadata.
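On the display question being discussed: one way around gpt-4-vision-preview returning only text is to not ask the model to output images at all, and instead have the application render URLs taken from the retrieved metadata (for example as Markdown in the chat UI). A small sketch with illustrative names and URLs, not a confirmed solution:

```python
# `retrieved` mimics documents returned by a vector store; only the text would
# be sent to the LLM, while the image URLs are rendered by the app itself.
retrieved = [
    {"page_content": "Dialog: Zwei Leute im Park ...",
     "metadata": {"image_url": "https://example.com/page1-fig1.png",
                  "alt": "two people in the park"}},
]

def build_answer(docs, llm_text):
    # Combine the model's textual answer with Markdown image tags built
    # from the metadata of the retrieved documents.
    parts = [llm_text]
    for d in docs:
        md = d["metadata"]
        if "image_url" in md:
            parts.append(f"![{md.get('alt', '')}]({md['image_url']})")
    return "\n\n".join(parts)

answer = build_answer(retrieved, "Hier ist der Dialog aus dem Buch: ...")
print(answer)
```

With this split, the LLM never needs to "see" or emit the image; the chat frontend displays whatever Markdown image tags the app appended.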

How do you store your vector data?

For example, here's a ChromaDB sample:

    # Chroma metadata values must be scalar (str, int, float, bool), so the
    # image list is stored as a JSON string; `collection.add` also needs ids:
    collection.add(
        ids=["page1-1", "page1-2", "page1-3", ...],
        documents=["page1-1 text...", "page1-2 text...", "page1-3 text...", ...],
        embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
        metadatas=[
            {"images": json.dumps([{"url": "https://...", "alt": "..."},
                                   {"url": "https://...", "alt": "..."}])},
            ...,
        ],
    )
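On the query side, the image references come back with the metadata of the top hits and just need to be decoded. The dict below hand-writes the shape that a chromadb `collection.query(...)` result has (nested lists, one inner list per query), with an illustrative URL:

```python
import json

# Illustrative shape of a chromadb `collection.query(...)` result:
# outer list = one entry per query, inner list = one entry per hit.
result = {
    "documents": [["page1-1 text..."]],
    "metadatas": [[{"images": json.dumps([
        {"url": "https://example.com/page1-fig1.png",
         "alt": "two people in the park"},
    ])}]],
}

# Unpack the image references stored alongside the best-matching chunk,
# then hand the URLs to whatever renders images in your chat UI.
top_meta = result["metadatas"][0][0]
for img in json.loads(top_meta["images"]):
    print(img["url"])
```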