What is the best way to chunk a PDF file for RAG in a smart way that preserves the meaning during retrieval?

  • I used Llama-Index for my RAG task and found that I can chunk my text by sentences, paragraphs, or nodes. However, I noticed that sentence-level chunks lose the surrounding meaning at retrieval time, while paragraph-level chunks can become very large. I am planning to try sentence chunking with overlap, but I am not sure whether this is the best approach. Is there a smarter way to chunk my PDF based on the meaning of the text?
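
For reference, the sentence-with-overlap idea mentioned in the question could look roughly like the sketch below. It is plain Python and does not use the Llama-Index API. The function names (`split_sentences`, `chunk_sentences`) and the `window`/`overlap` parameters are hypothetical, chosen only for illustration, and the regex sentence splitter is a naive assumption that will mis-split abbreviations and similar cases.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(sentences: list[str], window: int = 3, overlap: int = 1) -> list[str]:
    """Group sentences into chunks of `window` sentences.

    Consecutive chunks share `overlap` sentences, so each chunk carries
    some context from its neighbour.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("require window > 0 and 0 <= overlap < window")
    step = window - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + window]))
        # Stop once the current chunk has reached the last sentence,
        # so no trailing chunk made only of overlap is emitted.
        if start + window >= len(sentences):
            break
    return chunks


if __name__ == "__main__":
    # Text extracted from the PDF would go here; this is placeholder text.
    sample = "A one. B two. C three. D four. E five."
    for chunk in chunk_sentences(split_sentences(sample), window=3, overlap=1):
        print(chunk)
```

A larger `window` gives each chunk more context at the cost of size, and a larger `overlap` reduces the chance that a thought is cut at a chunk boundary. Whether this beats paragraph chunking depends on the document, so it is worth comparing retrieval quality on a few real queries.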

This is the methodology I’ve used with some success: https://youtu.be/w_veb816Asg?si=bVUs297eLSkNXY6X