How to tackle similar section names in RAG?

Hello everyone,

I am working on a Retrieval-Augmented Generation (RAG) use case involving documents like HR policies. While the application performs well with distinct data in the PDF files, we’re facing two major challenges:

  1. Handling Similar Section Names:
    For example, each state has a different PTO policy, but the section names in the document (e.g., “PTO Policy”) are the same. When querying the vector database about a specific state’s PTO policy, the retrieved top-k documents often contain information from other states, making it difficult to fetch accurate results. How can we address this issue and ensure the retrieval process is contextually accurate?
  2. Handling Large Sections:
    Some topics, like specific policies, span 8-10 sections, each several pages long. When querying, we only get the first few sections, and the remaining sections are often excluded. We attempted to resolve this by increasing the number of top-k documents retrieved, but this led to inaccurate results due to the inclusion of irrelevant information.

Here are the technical details of our implementation:

  • We use RecursiveCharacterTextSplitter to split the documents into chunks, with a chunk size of 1000 and an overlap of 100.
  • The embedding model is text-embedding-3-large, deployed through Azure OpenAI.
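
For reference, the chunking setup above behaves roughly like the sketch below. It is a simplified stand-in in plain Python, not the actual RecursiveCharacterTextSplitter (which also tries to break on separators such as paragraph breaks). It assumes sizes are measured in characters. The function name `split_with_overlap` and the sample text are made up for illustration. It shows why a chunk from the middle of a state's section carries no mention of the state unless the heading happens to fall inside the window.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Cut text into fixed-size character windows that overlap by `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Hypothetical document: a state heading followed by a long policy body.
sample = "California\nPTO Policy\n" + "Employees accrue paid time off each pay period. " * 60
chunks = split_with_overlap(sample)

# Only chunks whose window includes the heading mention the state at all.
chunks_naming_state = [c for c in chunks if "California" in c]
print(len(chunks), len(chunks_naming_state))
```

In this toy case only the first chunk names the state, so the later chunks look the same to the embedding model as the equivalent chunks from any other state's section.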

What would be the best strategies or techniques to address these challenges? Any suggestions or insights are greatly appreciated!

You could check out semantic chunking, which splits text where the meaning shifts instead of at a fixed character count. Cheers.
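
A rough sketch of the idea, in case it helps. This is a toy version under stated assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, the `threshold` of 0.2 is arbitrary, and all function names are hypothetical rather than taken from any library. A real setup would embed each sentence with the same model used for retrieval and tune the threshold on actual documents.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(text: str, threshold: float = 0.2) -> list[str]:
    """Start a new chunk wherever adjacent sentences are less similar than `threshold`."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(toy_embed(prev), toy_embed(cur)) < threshold:
            chunks.append([cur])      # topic shift: open a new chunk
        else:
            chunks[-1].append(cur)    # same topic: extend the current chunk
    return [" ".join(c) for c in chunks]


# Hypothetical policy text covering two unrelated topics.
text = (
    "Employees accrue PTO every pay period. Unused PTO carries over every year. "
    "Expense reports are due monthly. Late expense reports are rejected."
)
result = semantic_chunks(text)
for chunk in result:
    print(chunk)
```

Here the two PTO sentences end up in one chunk and the two expense sentences in another. Note that this only decides where chunks begin and end; it does not by itself record which state a chunk belongs to.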