How do I chunk PDFs with complex layouts in a RAG application?

I am working on a RAG-based PDF query system, specifically for complex PDFs that contain multi-column tables, images, tables that span multiple pages, and tables that have images inside them.

The parsing step is complete: I am using PyMuPDF4LLM as the parser, and it converts the PDF content into Markdown. All of the PDF's content, including tables and images, is captured in the Markdown output.
The images are saved to a dedicated directory and referenced from the Markdown file.

Now I am on the chunking step, and I am stuck here. Currently I am using RecursiveCharacterTextSplitter, but it can’t chunk images; it only handles the textual content (the text and tables in the Markdown).

I am stuck on this step because I don't know how to properly chunk the images along with the text data and store them in a vector store.

Take a look at LlamaIndex.
They offer a ton of different chunking solutions, including PDF chunking with image support. :hugs:
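
If you would rather stay with the Markdown you already have, here is a minimal sketch of image-aware chunking using only the Python standard library. It is not LlamaIndex's or LangChain's API: `Chunk`, `split_blocks`, `make_chunk`, and `chunk_markdown` are hypothetical names made up for illustration, and the sketch assumes your images appear as standard Markdown references like `![](images/fig1.png)` and that tables are separated from surrounding text by blank lines. The idea is to split on blank lines so a table or image reference is never cut in half, pack blocks into chunks up to a size limit, and record each chunk's image paths as metadata.

```python
import re
from dataclasses import dataclass, field

# Matches Markdown image references such as ![alt](images/page1-0.png)
# and captures the path (assumption: paths contain no spaces or ")").
IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)\s]+)[^)]*\)")


@dataclass
class Chunk:
    text: str                                              # text to embed
    image_paths: list[str] = field(default_factory=list)   # metadata to store


def split_blocks(markdown: str) -> list[str]:
    """Split on blank lines so each table or image reference stays whole."""
    return [b.strip() for b in re.split(r"\n\s*\n", markdown) if b.strip()]


def make_chunk(blocks: list[str]) -> Chunk:
    """Join blocks into one chunk and collect the image paths it references."""
    text = "\n\n".join(blocks)
    return Chunk(text=text, image_paths=IMAGE_REF.findall(text))


def chunk_markdown(markdown: str, max_chars: int = 1000) -> list[Chunk]:
    """Greedily pack blocks into chunks of roughly max_chars characters.

    A single block longer than max_chars (e.g. a big table) is kept whole
    rather than split, so chunks can exceed the limit.
    """
    chunks: list[Chunk] = []
    current: list[str] = []
    size = 0
    for block in split_blocks(markdown):
        if current and size + len(block) > max_chars:
            chunks.append(make_chunk(current))
            current, size = [], 0
        current.append(block)
        size += len(block)
    if current:
        chunks.append(make_chunk(current))
    return chunks


if __name__ == "__main__":
    sample = (
        "# Title\n\nIntro text.\n\n![](images/fig1.png)\n\n"
        "| a | b |\n|---|---|\n| 1 | 2 |\n\nClosing paragraph."
    )
    for c in chunk_markdown(sample, max_chars=40):
        print(repr(c.text[:30]), c.image_paths)
```

You would then embed `chunk.text` and store `chunk.image_paths` as metadata next to the vector, so a retrieved chunk can point back to its image files. This does not handle tables that span pages or make the image content itself searchable; for that you would probably need to caption or embed the images separately.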