How do I chunk PDFs with complex layouts in a RAG application?

shikhar.crpf · December 4, 2024, 10:33am

I am working on a RAG-based PDF query system, specifically for complex PDFs that contain multi-column tables, images, tables that span multiple pages, and tables with images inside them.

The parsing step is complete: I am using PyMuPDF4LLM as the parser, which converts the PDF content into markdown. All content in the PDF, including tables and images, is captured in the markdown output.
Images are saved to a dedicated directory and referenced from the markdown file.

Now I am on the chunking step, and I am stuck. Currently I am using RecursiveCharacterTextSplitter, but it can’t chunk images; it only handles the textual content (the text and tables in the markdown).

I have no idea how to properly chunk the images along with the text data and store them in a vector store.

j.wischnat · December 4, 2024, 10:41am

Take a look at LlamaIndex.
It offers a ton of different chunking solutions, including PDF chunking with image support.
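
If you want to see the general idea before picking a library, here is a minimal, library-agnostic sketch. This is not LlamaIndex's API. It assumes the parser writes image references in standard markdown syntax (`![alt](path)`), and the names `Chunk`, `chunk_markdown`, and `max_chars` are made up for illustration. It splits the markdown on blank lines, so a table or an image reference is never cut in half, packs the blocks greedily into chunks, and records each chunk's image paths as metadata that could be stored next to its text embedding.

```python
import re
from dataclasses import dataclass, field

# Matches markdown image references such as ![alt](path/to/img.png)
# and captures the path (assumed syntax; adjust to your parser's output).
IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)\s]+)[^)]*\)")


@dataclass
class Chunk:
    """Hypothetical container: chunk text plus the image paths it references."""
    text: str
    images: list = field(default_factory=list)


def _make_chunk(blocks):
    text = "\n\n".join(blocks)
    return Chunk(text=text, images=IMAGE_REF.findall(text))


def chunk_markdown(markdown, max_chars=1000):
    """Greedily pack blank-line-separated blocks into chunks.

    A block (paragraph, table, image line) is never split, so a single
    block longer than max_chars becomes its own oversized chunk.
    """
    blocks = [b.strip() for b in re.split(r"\n\s*\n", markdown) if b.strip()]
    chunks, current, size = [], [], 0
    for block in blocks:
        # Start a new chunk when adding this block would exceed the budget.
        if current and size + len(block) > max_chars:
            chunks.append(_make_chunk(current))
            current, size = [], 0
        current.append(block)
        size += len(block)
    if current:
        chunks.append(_make_chunk(current))
    return chunks
```

At indexing time you would embed `chunk.text` and store `chunk.images` as metadata in the vector store, so the retrieved chunk can point back to its image files. Tables that span multiple pages still need to be merged during parsing; this sketch only keeps each markdown block intact.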
