How to tackle similar section names in RAG?

saurabh_harak · December 18, 2024, 1:17pm

Hello everyone,

I am working on a Retrieval-Augmented Generation (RAG) use case involving documents like HR policies. While the application performs well with distinct data in the PDF files, we’re facing two major challenges:

1. Handling similar section names: For example, each state has a different PTO policy, but the section names in the document (e.g., “PTO Policy”) are the same. When querying the vector database about a specific state’s PTO policy, the retrieved top-k documents often contain information from other states, making it difficult to fetch accurate results. How can we address this issue and ensure the retrieval process is contextually accurate?
2. Handling large sections: Some topics, like specific policies, span 8-10 sections, each several pages long. When querying, we only get the first few sections, and the remaining sections are often excluded. We attempted to resolve this by increasing the number of top-k documents retrieved, but this led to inaccurate results due to the inclusion of irrelevant information.

Here are the technical details of our implementation:

We use RecursiveCharacterTextSplitter to split the documents into chunks, with a chunk size of 1000 and an overlap of 100.
The embedding model in use is the Azure OpenAI text-embedding-3-large model.
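
For readers unfamiliar with these settings, here is a rough sketch of what a chunk size of 1000 with an overlap of 100 means. It assumes sizes are measured in characters, and it uses a plain fixed-window splitter written for illustration (`split_with_overlap` is a hypothetical helper). It is not LangChain's actual algorithm, which also tries to break on separators such as paragraphs and newlines.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Naive fixed-window character splitter (illustrative stand-in only)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # each new chunk starts 900 characters after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Made-up 2500-character document, just to show the window arithmetic.
document = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(document)
print([len(c) for c in chunks])
```

Note that a chunk cut from the middle of a section carries only the text inside its window, so it may not include the section heading or the state name at all.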

What would be the best strategies or techniques to address these challenges? Any suggestions or insights are greatly appreciated!

j.wischnat · December 18, 2024, 2:47pm

You could check out semantic chunking. Cheers.
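
As a rough illustration of the idea: semantic chunking starts a new chunk where adjacent sentences stop being similar, instead of cutting at a fixed character count. The sketch below is a toy under stated assumptions. `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and the 0.2 threshold and the sample text are made up for the example.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Hypothetical stand-in for a real embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(text: str, threshold: float = 0.2) -> list[str]:
    """Group consecutive sentences; start a new chunk when similarity drops below threshold."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(toy_embed(prev), toy_embed(cur)) < threshold:
            groups.append([cur])    # topic shift: open a new chunk
        else:
            groups[-1].append(cur)  # same topic: extend the current chunk
    return [" ".join(g) for g in groups]


text = (
    "Employees accrue paid time off every month. "
    "Unused paid time off carries over each year. "
    "Expense reports need a manager signature. "
    "Expense reports are due within thirty days."
)
result = semantic_chunks(text)
for chunk in result:
    print(chunk)
```

With a real embedding model in place of `toy_embed`, the threshold would need tuning on your own documents.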
