I am trying to create an AI chatbot with all my data in PDFs. Those PDF files are full of images and text. I inserted them into a vector database, but when I query it, it returns only the text from the PDFs, not the corresponding images or figures. How do I embed the images from the PDFs, insert them into the vector database, and then query them together with the text?
I want the chatbot's answer to include both images and text.
Thanks a lot!
This sounds like a problem where you actually know the answer and can find it just by explaining the details to another developer; then you will see what you need.
Kind of like Rubber duck debugging but at a higher level.
This is often how I solve some of my harder problems, but I also do it with pen and paper, recording the details and then scratching out parts and changing others, etc.
Hint:
If you are using only text to search the vector database, then you need to add text related to the image into the vector database, and when that text is found, include the image.
So where will you get text for the image?
- Find associated text near the image. (Much easier said than done)
- Check for metadata in the PDF for the image
- Use OCR on the image to extract text
- Use another AI to add tags for the image such as banana, Paris, etc.
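The core of the hint can be sketched in a few lines: store the image-related text in the index with a reference to the image file alongside it, so a text query returns both. This is only a toy illustration, with a bag-of-words stand-in for a real embedding model and made-up filenames:

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; a real system would call an embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Each record stores the image-associated text plus a reference to the image.
records = [
    {"text": "bar chart of quarterly revenue by region", "image": "page3_fig1.png"},
    {"text": "photo of a banana on a table in Paris", "image": "page7_fig2.png"},
]
index = [(embed(r["text"]), r) for r in records]

def query(q, k=1):
    qv = embed(q)
    ranked = sorted(index, key=lambda e: cosine(qv, e[0]), reverse=True)
    # The matched record carries both the text and the image it points to.
    return [r for _, r in ranked[:k]]

print(query("revenue chart")[0]["image"])  # page3_fig1.png
```

Most vector databases support the same idea natively via per-document metadata, so the image path (or an image ID) rides along with the text chunk and never needs to be embedded itself.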
Which developer? I do not know how to embed the images from my own PDF files.
Not trying to be funny.
You would talk with the duck, or imaginary duck.
Don’t look at the hint yet, though it might help you answer this question. Try to answer it on your own, even if it takes hours of research or a few days of thinking; you will be happier in the end if you do. I have been working with PDFs and such for years, so I know what a PITA they are when you want anything more than just the text.
Thanks! Can you be more detailed?
In my PDFs, I have text and figures that help in understanding the context.
Did you peek at the hint?
First you have to understand that PDFs are essentially what I consider a canned website for one topic, but created in a proprietary manner. If you ever take apart a PDF, you find a page-description language derived from PostScript, plus resources like text, images, fonts, streams, etc.
If you are extracting info from a PDF and do not understand this, then I take it you are using an API which hides those details and hands you just the text. If so, please note the details of that API.
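To make that concrete, here is roughly what an image object looks like inside an uncompressed PDF, with a toy pattern-match over the raw bytes. The fragment below is hand-written and not a valid PDF (no xref table, placeholder pixel data); it only illustrates the object syntax. A real parser, or a library such as PyMuPDF, would walk the cross-reference table and decode filters instead of regex matching:

```python
import re

# A miniature, hand-written PDF fragment containing one image XObject.
pdf_bytes = b"""%PDF-1.4
1 0 obj
<< /Type /XObject /Subtype /Image /Width 2 /Height 2
/ColorSpace /DeviceRGB /BitsPerComponent 8 /Length 12 >>
stream
(12 bytes of RGB pixel data would go here)
endstream
endobj
%%EOF
"""

def find_image_xobjects(data):
    """Pattern-match image XObject dictionaries in raw PDF bytes.

    Only works on uncompressed, flat dictionaries; real PDFs need a
    proper parser.
    """
    found = []
    for d in re.findall(rb"<<[^>]*/Subtype\s*/Image[^>]*>>", data):
        w = int(re.search(rb"/Width\s+(\d+)", d).group(1))
        h = int(re.search(rb"/Height\s+(\d+)", d).group(1))
        found.append((w, h))
    return found

print(find_image_xobjects(pdf_bytes))  # [(2, 2)]
```

Once you can see images as first-class objects like this, "find the text near the image" becomes a question of which page and position the XObject is drawn at, which the libraries expose.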
Once you are able to access the image, places to look for associated text:
- Near the image which is visible in the rendered PDF
- In metadata with the image
- In the image itself which would require OCR to extract the text
- Using another AI to recognize parts of the image returning a textual description, e.g. CLIP, BLIP-2, etc.
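The four sources above could feed a single indexing function. Here is a sketch with the heavy steps injected as callables so the expensive dependencies (e.g. pytesseract for OCR, a BLIP-2 pipeline for captioning) stay out of the example; the stub functions and their outputs are made up:

```python
def text_for_image(nearby_text, metadata, ocr, caption, image_bytes):
    """Combine the sources of image-associated text listed above.

    ocr and caption are injectable callables, e.g.
    pytesseract.image_to_string or a BLIP-2 captioning call.
    """
    parts = []
    if nearby_text:
        parts.append(nearby_text)
    if metadata.get("alt"):
        parts.append(metadata["alt"])
    parts.append(ocr(image_bytes))       # text printed inside the image
    parts.append(caption(image_bytes))   # AI-generated description/tags
    return " ".join(p.strip() for p in parts if p and p.strip())

# Stub the heavy steps for demonstration.
fake_ocr = lambda img: "Q3 Revenue"
fake_caption = lambda img: "a bar chart with four columns"

doc_text = text_for_image(
    nearby_text="Figure 2: quarterly results.",
    metadata={"alt": ""},
    ocr=fake_ocr,
    caption=fake_caption,
    image_bytes=b"",
)
print(doc_text)
# Figure 2: quarterly results. Q3 Revenue a bar chart with four columns
```

The combined string is what you would embed and store, with the image reference kept in the record's metadata as discussed earlier.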
Make sense? If not, I don’t plan to write code; maybe someone else can point you to something that exists in a paper or on GitHub.
For a good overview of the PDF file structure see:
Thanks a lot for your help and time!
I can tell from that response that you knew what you needed when you saw it, and you probably knew enough to get there; you just did not know the right keywords. I know many think PDFs are like raw text files, or a bit more complicated, like RDF, but they are an entirely different game. Once you know their game, which is not easy, you can play ball.
Ran across this paper, which is a PDF, today in my morning current-events review.
“The Visual Language of Fabrics” by Valentin Deschaintre, Julia Guerrero-Viu, Diego Gutierrez, Tamy Boubekeur, Belen Masia (pdf)
Noting it here for two reasons:
- The PDF has images without text in the image but has related text with the images; think of it as a test case for your code.
- The paper notes
We introduce text2fabric, a novel dataset that links free-text descriptions to various fabric materials. The dataset comprises 15,000 natural language descriptions associated to 3,000 corresponding images of fabric materials.
If you don’t see the connection, think more abstract.
While not directly related to your question, I found the related browser quite interesting; it showcases the effectiveness of the technology.