Feature request: please add support for including images found in a PDF file as part of the knowledge storage/retrieval. Specifically, use GPT-4 Vision to extract embeddings for each image and store them in the vector DB, along with relevant context (e.g., “Figure 4.3: This plot shows how …”).
This would enable users to ask questions about a figure by name (“please explain figure 4.3”) or query by content (“list any figures that plot exact accuracy on MNIST”). A nice bonus would be letting the API caller download the identified image (or display it, in the ChatGPT scenario). The overall scenario here is a user interacting with a book or paper.
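To make the request concrete, here is a rough sketch of the pipeline I have in mind. Since there is no OpenAI multimodal embeddings endpoint today, the sketch stands in an open CLIP-style model via sentence-transformers, uses PyMuPDF for image extraction and chromadb as the vector store; the caption heuristic, file path, and collection name are just placeholders:

```python
# Rough sketch only: PyMuPDF (fitz) for extraction, an open CLIP model as a
# stand-in for a multimodal embeddings endpoint, chromadb as the vector store.
# The caption matching below is a naive heuristic, not a real layout parser.
import io
import re

import fitz  # PyMuPDF
import chromadb
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")           # joint image/text space
collection = chromadb.Client().create_collection("pdf_figures")

doc = fitz.open("paper.pdf")                           # placeholder path
for page_num, page in enumerate(doc):
    # Naive caption heuristic: grab lines that look like "Figure 4.3: ..."
    captions = re.findall(r"Figure\s+\d+(?:\.\d+)*[:.].*", page.get_text())
    for i, img_info in enumerate(page.get_images(full=True)):
        xref = img_info[0]
        img_bytes = doc.extract_image(xref)["image"]
        image = Image.open(io.BytesIO(img_bytes)).convert("RGB")

        caption = captions[i] if i < len(captions) else f"page {page_num + 1}, image {i + 1}"
        embedding = model.encode(image)                # image embedding
        collection.add(
            ids=[f"p{page_num}_img{i}"],
            embeddings=[embedding.tolist()],
            documents=[caption],                       # keep caption as context
            metadatas=[{"page": page_num + 1}],
        )

# Query by content ("list any figures that plot exact accuracy on MNIST")
query_emb = model.encode("figures that plot exact accuracy on MNIST")
hits = collection.query(query_embeddings=[query_emb.tolist()], n_results=3)
print(hits["documents"])
```

Query by figure name would work the same way, since the caption text is stored alongside each image embedding.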
I haven’t seen anyone try extracting embeddings from LLaVA or CogVLM or something like that yet, but I’m hoping it’s just a matter of time.
I don’t know if OpenAI is actually planning to release a multimodal embeddings endpoint. Going by their track record I would tend to guess not, considering they just canned the davinci (~GPT-3.5) embeddings. As such, GPT-4 embeddings may be unlikely.
Of course it’s possible that they have a smaller multimodal model in the pipeline.
I understand that embeddings are not as trivial to use as people might hope, but going back to full-text search is like throwing out the power loom because it doesn’t help you put on your socks, while your mom could both knit them and put them on your feet (admittedly a garbage analogy). Going from word embeddings to large text embeddings to multimodal embeddings is one of the biggest deals in AI, in my opinion.
I still think it’s a huge shame that they turned off davinci embeddings - that’s a step backwards.
But that is just my opinion, I understand that not everyone is as fond of the technology, and that’s fine.
Okay… I stumbled upon this thread and I agree with @Diet about the current progression in RAG: for retrieval to work effortlessly across a wide range of documents, multimodal embeddings are the next iteration. Text-only retrieval is proven, but most data is image + text, and a multimodal embedding model seems the most logical and efficient version 2 beyond OCR. I am evaluating CLIP vs ColPali in our RAG document-processing chain and would like to hear from others who are using these for their use cases.
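For anyone else weighing the same options, the core difference in how the two approaches score a page can be sketched in a few lines. The shapes and random data below are dummies; the real vectors would come from the actual models (e.g., open_clip for CLIP, the colpali-engine package for ColPali):

```python
# Toy comparison of the two retrieval styles (all shapes/data are made up):
# CLIP-style: one vector per page/image, scored with cosine similarity.
# ColPali-style: many patch vectors per page, scored with ColBERT-style MaxSim.
import numpy as np

def cosine_score(query_vec: np.ndarray, page_vec: np.ndarray) -> float:
    """Single-vector (CLIP-like) scoring."""
    return float(query_vec @ page_vec /
                 (np.linalg.norm(query_vec) * np.linalg.norm(page_vec)))

def maxsim_score(query_tokens: np.ndarray, page_patches: np.ndarray) -> float:
    """Late-interaction (ColPali-like) scoring: for each query token,
    take its best-matching page patch, then sum over query tokens."""
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    p = page_patches / np.linalg.norm(page_patches, axis=1, keepdims=True)
    return float((q @ p.T).max(axis=1).sum())

# Dummy shapes: 16 query tokens, 1024 page patches, 128-dim embeddings.
rng = np.random.default_rng(0)
print(cosine_score(rng.normal(size=128), rng.normal(size=128)))
print(maxsim_score(rng.normal(size=(16, 128)), rng.normal(size=(1024, 128))))
```

The trade-off, as I understand it: ColPali keeps one vector per patch and pays for it in index size and query cost, while CLIP collapses everything into a single vector, which is cheaper but loses fine-grained text-in-image detail.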