I’ve been working with diverse datasets consisting of thousands of documents for about a year now, constantly trying to figure out how to get the best responses from the widest variety of queries. The biggest problem I’ve found so far? Noise. Cosine similarity does a great job of finding needles in haystacks, but in my experience, the smaller the haystack, the better the response.
Trust me – thousands of unstructured emails are going to have a LOT of noise. Remember that these models retrieve on semantic similarity, so the returned chunks may or may not have any contextual relationship with each other – i.e., noise.
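To make the haystack point concrete, here’s a rough sketch (plain NumPy, made-up embeddings) of what the retriever is actually doing: it ranks every chunk by cosine similarity and hands back the top-k whether or not those chunks have anything to do with each other.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for real embeddings: one query vector, a thousand email chunks.
query_vec = np.random.rand(1536)
chunk_vecs = [np.random.rand(1536) for _ in range(1000)]

# Rank every chunk by similarity; the top-k come back whether or not
# they share any real context with each other -- that's the noise.
scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
top_k = sorted(scores, key=lambda s: s[1], reverse=True)[:5]
```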
I address this in two ways (sketches of both below):

1. I try to organize my documents as much as possible according to their semantic hierarchies (i.e., Semantic Chunking), and
2. I utilize metadata and filtering as much as possible.
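For the chunking side, here’s a minimal sketch of the idea: split along the document’s own structure instead of fixed-size windows. The markdown-heading regex is just an assumption – use whatever structure your documents actually have.

```python
import re

def chunk_by_headings(text: str) -> list[dict]:
    """Split a document on its own headings so each chunk stays
    within one semantic unit instead of a fixed-size window."""
    sections = re.split(r"\n(?=#+ )", text)  # assumes markdown-style headings
    chunks = []
    for section in sections:
        heading = section.splitlines()[0].lstrip("# ").strip() if section else ""
        chunks.append({"heading": heading, "text": section.strip()})
    return chunks
```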
Emails typically have just two categorization elements: Date and Subject. If you are trying to build a product support knowledgebase, the date probably won’t be of much help, so the subject line will be the only thing you have to categorize and filter your chunks with. I would suggest, if possible, adding metadata to these emails like product/service, issue/complaint, resolution, etc. Being able to filter on this kind of metadata, even with data as unstructured as emails, will help reduce the size of your “haystacks” and improve the quality of the responses.
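Here’s a rough sketch of that kind of metadata filtering, using ChromaDB as a stand-in for whatever vector store you’re on (the field names and product are made up):

```python
import chromadb

client = chromadb.Client()
collection = client.create_collection("support_emails")

# Each email chunk carries the metadata fields suggested above.
collection.add(
    ids=["email-0017"],
    documents=["Customer reports the ProWidget app crashes on login..."],
    metadatas=[{"product": "ProWidget", "issue": "crash", "resolution": "patch 2.3.1"}],
)

# Filtering on metadata shrinks the haystack before similarity search runs.
results = collection.query(
    query_texts=["app crashes when logging in"],
    n_results=5,
    where={"product": "ProWidget"},
)
```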
I mentioned PDFs because that’s what I mostly work with. You’re right, you can use any text structure you like as long as your embedding structure supports it. But think about this: When a customer asks a question and the model responds, what do you give the customer to corroborate the response? I give them links back to the source PDFs. Consider doing something similar.
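A sketch of what that can look like, assuming your chunks kept a source_url (and maybe a page number) in their metadata when you indexed them – adjust the keys to whatever your pipeline actually stores:

```python
def format_answer_with_sources(answer: str, retrieved_chunks: list[dict]) -> str:
    """Append links to the source PDFs the retrieved chunks came from,
    so the customer can corroborate the response themselves."""
    sources = {
        (c["metadata"]["source_url"], c["metadata"].get("page"))
        for c in retrieved_chunks
    }
    lines = [answer, "", "Sources:"]
    for url, page in sorted(sources, key=lambda s: (s[0], s[1] or 0)):
        lines.append(f"- {url}" + (f" (page {page})" if page else ""))
    return "\n".join(lines)
```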
BTW, you can use the AI to do the categorizations for you!
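For example, something like this (OpenAI Python client, gpt-3.5-turbo; the label schema is just a suggestion) can tag each email before you index it:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def categorize_email(body: str) -> str:
    """Ask the model to tag an email with the metadata fields we filter on."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Label this support email. Return JSON with keys: "
                        "product, issue, resolution (or null if unresolved)."},
            {"role": "user", "content": body},
        ],
    )
    return response.choices[0].message.content  # parse/validate before storing as metadata
```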
You might find these helpful:
- Retrieval-Augmented Generation for Large Language Models: A Survey
- Lessons Learned on LLM RAG Solutions
I have found this to be the best strategy, for sure, when dealing with gpt-3.5: I now structure my prompts in XML and I’m getting far better responses.
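For what it’s worth, the tag names below are just illustrative – the point is that explicit XML sections give the model clearer boundaries than free-form prose:

```python
# Illustrative tag names; swap in whatever sections your prompts need.
prompt = """<instructions>
Answer the customer's question using only the provided context.
If the context is insufficient, say so.
</instructions>
<context>
{retrieved_chunks}
</context>
<question>
{customer_question}
</question>"""

filled = prompt.format(
    retrieved_chunks="...top-k chunks from the vector store...",
    customer_question="How do I reset my password?",
)
```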