What is the best way to chunk a PDF file for RAG in a smart way that preserves the meaning during retrieval?

ahmed-shaaban · February 26, 2024, 8:54am

I used LlamaIndex for my RAG task and found that I can chunk my text by sentences, paragraphs, or nodes. However, I noticed that chunking by sentence doesn’t preserve enough meaning for retrieval, while chunking by paragraph can produce very large chunks of text. I am planning to try sentence chunking with overlap, but I am not sure if this is the best approach. Is there a smarter way to chunk my PDF based on the meaning of the text?
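
For reference, this is roughly what I mean by sentence chunking with overlap. It is only a minimal sketch in plain Python, not LlamaIndex's own splitter: the function names, the regex-based sentence splitter, and the default sizes are my assumptions for illustration. Each chunk holds a fixed number of sentences, and consecutive chunks share the last few so that context carries across the boundary.

```python
import re


def split_sentences(text):
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    # Real PDFs usually need something sturdier (abbreviations, headings, tables).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_with_overlap(text, sentences_per_chunk=3, overlap=1):
    # Group sentences into windows; consecutive windows share `overlap` sentences.
    if not 0 <= overlap < sentences_per_chunk:
        raise ValueError("overlap must be >= 0 and smaller than sentences_per_chunk")
    sentences = split_sentences(text)
    step = sentences_per_chunk - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + sentences_per_chunk]))
        # Stop once a window has reached the last sentence, to avoid a
        # trailing chunk that only repeats already-covered text.
        if start + sentences_per_chunk >= len(sentences):
            break
    return chunks


if __name__ == "__main__":
    sample = "A one. B two. C three. D four. E five."
    for chunk in chunk_with_overlap(sample, sentences_per_chunk=3, overlap=1):
        print(chunk)
```

With three sentences per chunk and an overlap of one, the sample text yields two chunks that share the sentence "C three.". Whether this is enough to keep the meaning intact for retrieval is exactly what I am unsure about.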

SomebodySysop · February 26, 2024, 12:22pm

This is the methodology I’ve used with some success: https://youtu.be/w_veb816Asg?si=bVUs297eLSkNXY6X

keerthi03.rachamallu · May 15, 2024, 3:04pm

Hey, can you provide a link to the code so I can refer to the method? Thank you!

SomebodySysop · May 30, 2024, 12:03pm

We had this very same conversation and came up with a more efficient and effective solution here: Using gpt-4 API to Semantically Chunk Documents - #95 by SomebodySysop

casimiroruperez · October 23, 2024, 2:43pm

You could also use contextual retrieval, as Anthropic proposes here:
https://www.anthropic.com/news/contextual-retrieval
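
As I understand it, the core idea is to prepend a short, chunk-specific description of where the chunk sits in the document before indexing it. Here is a rough sketch of that step only. The `describe` callable is a hypothetical stand-in for the LLM call that would write the context, and `toy_describe` is a placeholder I made up so the example runs without any API; neither comes from the linked post.

```python
from typing import Callable, List


def contextualize_chunks(
    document: str,
    chunks: List[str],
    describe: Callable[[str, str], str],
) -> List[str]:
    # `describe(document, chunk)` is a hypothetical hook: in practice it would
    # ask an LLM for one or two sentences situating the chunk in the document.
    contextualized = []
    for chunk in chunks:
        context = describe(document, chunk).strip()
        # Prepend the context so it is embedded and indexed with the chunk.
        contextualized.append(f"{context}\n\n{chunk}" if context else chunk)
    return contextualized


def toy_describe(document: str, chunk: str) -> str:
    # Placeholder (assumption): uses the first line of the document as a title
    # instead of calling a model.
    lines = document.strip().splitlines()
    title = lines[0] if lines else "untitled"
    return f"From the document titled '{title}'."


if __name__ == "__main__":
    doc = "Q3 Report\nRevenue grew 3%.\nCosts fell."
    for text in contextualize_chunks(doc, ["Revenue grew 3%.", "Costs fell."], toy_describe):
        print(repr(text))
```

The strings returned here are what you would embed and index in place of the raw chunks, while still showing the original chunk text to the user at answer time.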

ashwinaravind · October 28, 2024, 8:53am

Have you tried a vision model like ColPali for retrieval?
