What is the best way to chunk a PDF file for RAG in a smart way that preserves the meaning during retrieval?

  • I used Llama-Index for my RAG task and found that I can chunk my text by sentences, paragraphs, or nodes. However, I noticed that chunking by sentence doesn’t preserve the meaning for the retrieval process, and chunking by paragraph can produce very large chunks of text. I am planning to try sentence chunking with overlap, but I am not sure whether that is the best approach. Is there a smarter way to chunk my PDF based on the meaning of the text?
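To make the overlap idea concrete, here is a minimal, dependency-free sketch of sentence chunking with overlap. It is not Llama-Index code (there you would typically configure a splitter with a chunk-overlap setting); the naive regex sentence split and the function name `sentence_chunks` are assumptions for illustration only:

```python
import re

def sentence_chunks(text, sentences_per_chunk=5, overlap=2):
    """Split text into sentences, then group them into chunks that
    share `overlap` sentences with the previous chunk, so context
    at chunk boundaries is not lost during retrieval."""
    # Naive sentence split: break after ., !, or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    step = sentences_per_chunk - overlap  # how far the window advances
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + sentences_per_chunk]))
        if start + sentences_per_chunk >= len(sentences):
            break  # last window already covers the tail
    return chunks
```

With 7 sentences, a window of 3, and an overlap of 1, this yields chunks covering sentences 1–3, 3–5, and 5–7, so every boundary sentence appears in two chunks.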

This is the methodology I’ve used with some success: https://youtu.be/w_veb816Asg?si=bVUs297eLSkNXY6X

Hey, can you provide a link to the code, so I can refer to the method? Thank you!

We had this very same conversation and came up with a more efficient and effective solution here: Using gpt-4 API to Semantically Chunk Documents - #95 by SomebodySysop
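The core idea behind semantic chunking is to split where the similarity between adjacent sentences drops, rather than at a fixed size. A minimal sketch of that breakpoint logic follows; real pipelines score adjacent sentences with an embedding model and cosine similarity, but here a toy word-overlap (Jaccard) similarity stands in so the example runs without any dependencies. The function names and the `threshold` value are assumptions, not code from the linked thread:

```python
import re

def word_set(sentence):
    """Lowercased set of word tokens, used as a crude 'embedding'."""
    return set(re.findall(r"\w+", sentence.lower()))

def jaccard(a, b):
    """Toy stand-in for cosine similarity between sentence embeddings."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def semantic_chunks(text, threshold=0.2):
    """Group consecutive sentences; start a new chunk whenever the
    similarity to the previous sentence falls below `threshold`."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if jaccard(word_set(prev), word_set(cur)) < threshold:
            chunks.append([cur])       # similarity dropped: topic shift, new chunk
        else:
            chunks[-1].append(cur)     # same topic: extend the current chunk
    return [" ".join(c) for c in chunks]
```

Swapping `jaccard` over word sets for cosine similarity over real embeddings gives the behavior discussed in the linked thread: chunk boundaries follow topic shifts instead of arbitrary character counts.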