Knowledge Retrieval: support for PDF images

Feature request: please add support for including images found in a PDF file as part of knowledge storage/retrieval. Specifically, use GPT-4 Vision to extract embeddings for each image and store them in the vector DB, along with relevant context (e.g., “Figure 4.3: This plot shows how …”).

This would enable users to ask questions about a figure by name (“please explain figure 4.3”) or query by content (“list any figures that plot exact accuracy on MNIST”). A nice bonus would be letting the API caller download the identified image (or displaying it, in the ChatGPT scenario). The overall scenario here is a user interacting with a book or paper.
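For concreteness, here is a rough sketch of what the indexing side could look like. This is only an illustration, not anything OpenAI offers today: it uses PyMuPDF to pull images out of the PDF and an open CLIP checkpoint (via sentence-transformers) as a stand-in for whatever multimodal embedding model might eventually be exposed, and the caption matching is deliberately naive (first “Figure …” text block on the page).

```python
# Sketch: extract images from a PDF and embed them alongside nearby caption text.
# Assumptions: PyMuPDF (pip install pymupdf) and a CLIP checkpoint from
# sentence-transformers as a placeholder multimodal embedder -- not an OpenAI API.
import io

import fitz  # PyMuPDF
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")  # maps text and images into one space


def index_pdf_images(path):
    records = []
    doc = fitz.open(path)
    for page in doc:
        # Naive caption heuristic: any text block on the page starting with "Figure".
        captions = [b[4].strip() for b in page.get_text("blocks")
                    if b[4].strip().lower().startswith("figure")]
        for xref, *_ in page.get_images(full=True):
            img_bytes = doc.extract_image(xref)["image"]
            image = Image.open(io.BytesIO(img_bytes)).convert("RGB")
            records.append({
                "page": page.number + 1,
                "caption": captions[0] if captions else "",
                "embedding": model.encode(image),  # image vector for the store
            })
    return records
```

A real pipeline would pair each image with its own caption (e.g. by bounding-box proximity) and write the records to an actual vector DB rather than a Python list.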


Multimodal embeddings would certainly be nice!

I haven’t seen anyone try to extract embeddings from LLaVA or CogVLM or similar models yet, but I’m hoping it’s just a matter of time.

I don’t know if OpenAI is actually planning on releasing a multimodal embeddings endpoint. Going by their track record, I would tend to guess not, considering they just canned the davinci (~GPT-3.5) embeddings; GPT-4 embeddings may therefore be unlikely.

Of course it’s possible that they have a smaller multimodal model in the pipeline :thinking:

Why do you think you need embeddings for retrieval?

That’s… literally what powers retrieval.

Hmm, the forums would not let me post a single phrase. Interesting.

How do you know?

Why do you think embeddings are the only and best way to create context?

How do you do retrievals?


I am also curious to know if there is any other way to retrieve based on semantic similarity without using embeddings.

  • full text search (heuristics, or generate the query with a fast LLM; see the sketch after this list)
  • extract text from the image using function calling and use it for full text search or semantic search
  • put the whole document in the prompt
  • use LLMs other than OpenAI’s that don’t have this issue
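To illustrate the first bullet, here is a minimal full-text-search sketch using SQLite’s built-in FTS5 (available in most standard Python builds). No embeddings anywhere; the table name, sample chunks, and query are made up for the example.

```python
# Sketch: context building via plain full-text search (SQLite FTS5), no embeddings.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE chunks USING fts5(content)")
conn.executemany(
    "INSERT INTO chunks(content) VALUES (?)",
    [("Figure 4.3 plots exact-match accuracy on MNIST.",),
     ("Chapter 2 introduces the transformer architecture.",)],
)

# BM25-ranked lookup; the matched chunks become the prompt context.
query = "MNIST accuracy figure"
rows = conn.execute(
    "SELECT content FROM chunks WHERE chunks MATCH ? ORDER BY rank LIMIT 3",
    (" OR ".join(query.split()),),
).fetchall()
print([r[0] for r in rows])
```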

Actually, many successful AI companies I’ve met have switched from embeddings to full text search for context building.

A lot of AI companies are “AI” companies

I understand that embeddings are not as trivial to use as people might hope, but going back to full text search is like throwing out the power loom because it doesn’t help you put on your socks, while your mom could both knit them and put them on your feet.

But going from word embeddings, to large-text embeddings, to multimodal embeddings is one of the biggest deals in AI, in my opinion.

I still think it’s a huge shame that they turned off davinci embeddings - that’s a step backwards.

But that is just my opinion, I understand that not everyone is as fond of the technology, and that’s fine.

Okay… I stumbled upon this thread and I agree with @Diet about the current progression in RAG: for retrieval to work effortlessly across a wide range of documents, multimodal embeddings are the next iteration. Text-only retrieval is proven, but most data is image + text, and a multimodal embedding model seems the most logical and efficient version 2 beyond OCR. I am evaluating CLIP vs ColPali in our RAG document-processing chain and would like to hear from others who are using these for their use cases.
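For anyone else evaluating this, here is a minimal sketch of the CLIP-style side: a text query scored against pre-computed figure/image embeddings by cosine similarity. The model name is just a sentence-transformers CLIP checkpoint used as a placeholder, the `records` format follows the indexing sketch earlier in the thread, and ColPali would replace this single-vector similarity with its late-interaction scoring.

```python
# Sketch: text-to-image retrieval in a shared CLIP embedding space.
# Assumes the image embeddings were produced by the same CLIP checkpoint.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")


def top_figures(query, records, k=3):
    """records: list of dicts with 'caption' and 'embedding' (image vector)."""
    q = model.encode(query)
    q = q / np.linalg.norm(q)
    scored = []
    for r in records:
        v = r["embedding"] / np.linalg.norm(r["embedding"])
        scored.append((float(np.dot(q, v)), r["caption"]))  # cosine similarity
    return sorted(scored, reverse=True)[:k]


# e.g. top_figures("plot of exact accuracy on MNIST", index_pdf_images("paper.pdf"))
```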
