Feature request: please add support for including images found in a PDF file as part of the knowledge storage/retrieval. Specifically, use GPT-4 Vision to extract embeddings for each image and store them in the vector DB, along with relevant context (e.g., “Figure 4.3: This plot shows how …”).
This would enable users to ask questions about a figure by name (“please explain figure 4.3”) or query by content (“list any figures that plot exact accuracy on MNIST”). A nice bonus would be letting the API caller download the identified image (or display it, in the ChatGPT scenario). The overall scenario here is a user interacting with a book or paper.
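To make the request concrete, here is a rough sketch of the pipeline I have in mind. Since there is no OpenAI multimodal embeddings endpoint today, the sketch stands in an open CLIP-style model via sentence-transformers, uses PyMuPDF for image extraction and chromadb as the vector store; the caption heuristic, file path, and collection name are just placeholders:

```python
# Rough sketch only: PyMuPDF (fitz) for extraction, an open CLIP model as a
# stand-in for a multimodal embeddings endpoint, chromadb as the vector store.
# The caption matching below is a naive heuristic, not a real layout parser.
import io
import re

import fitz  # PyMuPDF
import chromadb
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")           # joint image/text space
collection = chromadb.Client().create_collection("pdf_figures")

doc = fitz.open("paper.pdf")                           # placeholder path
for page_num, page in enumerate(doc):
    # Naive caption heuristic: grab lines that look like "Figure 4.3: ..."
    captions = re.findall(r"Figure\s+\d+(?:\.\d+)*[:.].*", page.get_text())
    for i, img_info in enumerate(page.get_images(full=True)):
        xref = img_info[0]
        img_bytes = doc.extract_image(xref)["image"]
        image = Image.open(io.BytesIO(img_bytes)).convert("RGB")

        caption = captions[i] if i < len(captions) else f"page {page_num + 1}, image {i + 1}"
        embedding = model.encode(image)                # image embedding
        collection.add(
            ids=[f"p{page_num}_img{i}"],
            embeddings=[embedding.tolist()],
            documents=[caption],                       # keep caption as context
            metadatas=[{"page": page_num + 1}],
        )

# Query by content ("list any figures that plot exact accuracy on MNIST")
query_emb = model.encode("figures that plot exact accuracy on MNIST")
hits = collection.query(query_embeddings=[query_emb.tolist()], n_results=3)
print(hits["documents"])
```

Query by figure name would work the same way, since the caption text is stored alongside each image embedding.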
I haven’t seen anyone try extracting embeddings from LLaVA or CogVLM or something like that yet, but I’m hoping it’s just a matter of time.
I don’t know if OpenAI is actually planning to release a multimodal embeddings endpoint. Going by their track record I would tend to guess not, considering they just canned the davinci (~GPT-3.5) embeddings. As such, GPT-4 embeddings may be unlikely.
Of course it’s possible that they have a smaller multimodal model in the pipeline.
I understand that embeddings are not as trivial to use as people might hope, but going back to full-text search is like throwing out the power loom because it doesn’t help you put on your socks, while your mom could both knit them and put them on your feet (admittedly a garbage analogy). Going from word embeddings to large text embeddings to multimodal embeddings is one of the biggest deals in AI, in my opinion.
I still think it’s a huge shame that they turned off davinci embeddings - that’s a step backwards.
But that is just my opinion, I understand that not everyone is as fond of the technology, and that’s fine.
Okay… I stumbled upon this thread and I agree with @Diet about the current progression in RAG: for retrieval to work effortlessly across a wide range of documents, multimodal embeddings are the next iteration. Text-only retrieval is proven, but most data is image + text, and a multimodal embedding model seems the most logical and efficient version 2 beyond OCR. I am evaluating CLIP vs ColPali in our RAG document-processing chain and would like to hear from others who are using these for their use cases.
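For anyone else weighing the same options, the core difference in how the two approaches score a page can be sketched in a few lines. The shapes and random data below are dummies; the real vectors would come from the actual models (e.g., open_clip for CLIP, the colpali-engine package for ColPali):

```python
# Toy comparison of the two retrieval styles (all shapes/data are made up):
# CLIP-style: one vector per page/image, scored with cosine similarity.
# ColPali-style: many patch vectors per page, scored with ColBERT-style MaxSim.
import numpy as np

def cosine_score(query_vec: np.ndarray, page_vec: np.ndarray) -> float:
    """Single-vector (CLIP-like) scoring."""
    return float(query_vec @ page_vec /
                 (np.linalg.norm(query_vec) * np.linalg.norm(page_vec)))

def maxsim_score(query_tokens: np.ndarray, page_patches: np.ndarray) -> float:
    """Late-interaction (ColPali-like) scoring: for each query token,
    take its best-matching page patch, then sum over query tokens."""
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    p = page_patches / np.linalg.norm(page_patches, axis=1, keepdims=True)
    return float((q @ p.T).max(axis=1).sum())

# Dummy shapes: 16 query tokens, 1024 page patches, 128-dim embeddings.
rng = np.random.default_rng(0)
print(cosine_score(rng.normal(size=128), rng.normal(size=128)))
print(maxsim_score(rng.normal(size=(16, 128)), rng.normal(size=(1024, 128))))
```

The trade-off, as I understand it: ColPali keeps one vector per patch and pays for it in index size and query cost, while CLIP collapses everything into a single vector, which is cheaper but loses fine-grained text-in-image detail.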