Multi-modal RAG issue with images

Hello, I am developing a chatbot using the GPT API and RAG with LangChain. The chatbot is designed to help users learn from a book written in German and Luxembourgish. I have built vector stores with only the text of the book, and the embedding models that retrieved the most relevant documents were BGE and Voyage. Now I need to add images from the book as illustrations (not generated ones, since I should first present the book content to the user and only then generate new content).

I tried to follow the options in the LangChain documentation, but summarizing my images isn't the best choice for me. I also tried generating new images with DALL-E and Stable Diffusion, but they didn't produce good illustrations. In your opinion, what should I do?

I could use CLIP for multi-modal embeddings, but the images are linked to text and I don't know how to do the mapping after retrieving them. Additionally, CLIP does not perform well on German.

If anyone has any ideas or sees where I'm going wrong, please share.

Either add a link to the images in your vectors' metadata, or put tags in the metadata and use those to search for images.
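To illustrate the first suggestion: each text chunk can carry its image references in metadata, so that after text retrieval the app can look the images up directly. This is a minimal stdlib-only sketch with made-up URLs and a fake keyword "retriever"; in practice BGE/Voyage embeddings and a real vector store would do the similarity search.

```python
import json

# Toy "vector store": each chunk's metadata holds its image refs as a JSON string.
store = [
    {
        "text": "Dialog, Seite 1: Zwei Leute unterhalten sich im Park.",
        "metadata": {"images": json.dumps([
            {"url": "https://example.com/book/page1-fig1.png",
             "alt": "two people in the park"},
        ])},
    },
    {
        "text": "Dialog, Seite 2: Im Café.",
        "metadata": {"images": json.dumps([
            {"url": "https://example.com/book/page2-fig1.png",
             "alt": "at the café"},
        ])},
    },
]

def retrieve(query):
    # Stand-in for similarity search: naive keyword match on the chunk text.
    return [e for e in store if any(w in e["text"].lower() for w in query.lower().split())]

def images_for(entry):
    # Decode the image references stored alongside the retrieved chunk.
    return json.loads(entry["metadata"]["images"])

for hit in retrieve("Park"):
    for img in images_for(hit):
        print(img["url"], "-", img["alt"])
```

The point is that the mapping text-to-image never goes through the LLM at all; it rides along in the metadata of whichever chunks the retriever returns.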


What exactly was the issue with CLIP? If my understanding is correct, you used CLIP to get both image and text embeddings instead of BGE or Voyage, but CLIP performed worse on the German text than the other text embedding models; is that correct?

I’m also curious why summaries of the images don’t work for your use case.


Hi, thank you for your time and help.
My text is a dialogue and the images are illustrations of that dialogue, so a summary is just a description of the scene (e.g., two people in the park). What I want exactly is to show the user the dialogue with its image illustrations as-is, and then use that dialogue to generate a new one.

As for CLIP, from my point of view it will not perform well because my main languages are German and Luxembourgish (non-English).

Also, after retrieving the image from the vector store, I would have to pass it through gpt-4-vision-preview, which gives only textual output, and I want the image itself to be displayed.

I would appreciate more clarification about your idea. Are you suggesting I store the path as metadata and use a tool to display the image? I tried something like this with GPT-4 but didn't succeed; in my attempt, though, the path was in the retrieved text, not in the metadata.
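On the display question being discussed: one way around gpt-4-vision-preview returning only text is to not ask the model to output images at all, and instead have the application render URLs taken from the retrieved metadata (for example as Markdown in the chat UI). A small sketch with illustrative names and URLs, not a confirmed solution:

```python
# `retrieved` mimics documents returned by a vector store; only the text would
# be sent to the LLM, while the image URLs are rendered by the app itself.
retrieved = [
    {"page_content": "Dialog: Zwei Leute im Park ...",
     "metadata": {"image_url": "https://example.com/page1-fig1.png",
                  "alt": "two people in the park"}},
]

def build_answer(docs, llm_text):
    # Combine the model's textual answer with Markdown image tags built
    # from the metadata of the retrieved documents.
    parts = [llm_text]
    for d in docs:
        md = d["metadata"]
        if "image_url" in md:
            parts.append(f"![{md.get('alt', '')}]({md['image_url']})")
    return "\n\n".join(parts)

answer = build_answer(retrieved, "Hier ist der Dialog aus dem Buch: ...")
print(answer)
```

With this split, the LLM never needs to "see" or emit the image; the chat frontend displays whatever Markdown image tags the app appended.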

How do you store your vector data?

For example, here's a ChromaDB sample:

    # Chroma metadata values must be scalar (str, int, float, bool), so the
    # image list is stored as a JSON string; `collection.add` also needs ids:
    collection.add(
        ids=["page1-1", "page1-2", "page1-3", ...],
        documents=["page1-1 text...", "page1-2 text...", "page1-3 text...", ...],
        embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
        metadatas=[
            {"images": json.dumps([{"url": "https://...", "alt": "..."},
                                   {"url": "https://...", "alt": "..."}])},
            ...,
        ],
    )
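On the query side, the image references come back with the metadata of the top hits and just need to be decoded. The dict below hand-writes the shape that a chromadb `collection.query(...)` result has (nested lists, one inner list per query), with an illustrative URL:

```python
import json

# Illustrative shape of a chromadb `collection.query(...)` result:
# outer list = one entry per query, inner list = one entry per hit.
result = {
    "documents": [["page1-1 text..."]],
    "metadatas": [[{"images": json.dumps([
        {"url": "https://example.com/page1-fig1.png",
         "alt": "two people in the park"},
    ])}]],
}

# Unpack the image references stored alongside the best-matching chunk,
# then hand the URLs to whatever renders images in your chat UI.
top_meta = result["metadatas"][0][0]
for img in json.loads(top_meta["images"]):
    print(img["url"])
```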