How to Optimize Text Chunking for Improved Embedding Vectorization?

I’m currently using LangChain to split my texts into chunks, but I believe it doesn’t always yield the best vectors. My source material consists of lengthy articles, which often have contextual information distributed across the entire article. When I break these articles into smaller chunks, I risk losing important context.

Does anyone have any suggestions on how to enhance this process? I’ve been contemplating the idea of introducing a pre-vectorization step where I could transform all the articles into a “question and answer” format through an OpenAI request. However, I’m concerned that this approach might be costly, or perhaps there are more effective alternatives available. Any insights would be greatly appreciated.

Can you give an example of where a contextually relevant part would be missed by retrieving the top, let’s say, 5 chunks? By this I mean: are you certain that, with some chunk overlap, you are actually going to miss details vital to a correct answer? The reason I ask is that I don’t think even a human brain is able to do this; that’s fine if you are looking for superhuman abilities, but is it needed?

Indeed, my initial assumption might be mistaken. In my current setup, I return 2 chunks, and these 2 chunks can actually be from different articles, which, when combined, form the answer. I wasn’t aware of the option to retrieve nearby chunks from these initial ones (I’m just discovering this now), so perhaps that could be the solution to the problem presented here.

Well, you can pull back the closest K chunks to your search vector, but when you are embedding the text you can include some of the prior chunk and some of the next chunk. In effect you create a sliding window over your data that contains, for example, 25% of the prior chunk, 50% of the current chunk, and 25% of the next chunk. This gives you relevance and context across chunk boundaries. In addition, you can include metadata with each chunk, such as page numbers, index references, or keywords, so you can attach any information that may increase usefulness.
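A minimal sketch of that sliding window, assuming the articles are already split into an ordered list of chunks; the 25% fraction, the record layout, and the `embed()` call are illustrative placeholders rather than a specific library’s API:

```python
# Sliding-window pass over pre-split chunks: each embedded text carries the
# tail of the previous chunk and the head of the next one, roughly matching
# the 25% / 50% / 25% split described above.

def windowed_records(chunks: list[str], fraction: float = 0.25) -> list[dict]:
    records = []
    for i, chunk in enumerate(chunks):
        prev_tail, next_head = "", ""
        if i > 0:
            n = int(len(chunks[i - 1]) * fraction)
            prev_tail = chunks[i - 1][-n:] if n else ""
        if i + 1 < len(chunks):
            n = int(len(chunks[i + 1]) * fraction)
            next_head = chunks[i + 1][:n]
        records.append({
            "text": prev_tail + chunk + next_head,  # what actually gets embedded
            "chunk_index": i,                        # metadata: position in the article
            "original": chunk,                       # clean chunk kept for display
            # add page numbers, keywords, index references, etc. here
        })
    return records

# vectors = [embed(r["text"]) for r in windowed_records(chunks)]
```

At query time you still return the clean `original` text, but the vector was built from the windowed text, so hits near chunk boundaries are less likely to be missed.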

You can also ask the LLM itself to propose the best search term for any given user query. This allows the model to take past conversational context into account, and it keeps the user from directly skewing the search results with unusual characters or word choices.
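As a rough illustration of that query-rewrite step using the openai Python client; the model name, prompt, and `propose_search_query` helper are assumptions of mine, not something prescribed in this thread:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def propose_search_query(user_message: str, history: list[str]) -> str:
    """Ask the LLM to turn the raw user message (plus recent history)
    into a clean, self-contained search query for the vector store."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whichever model you prefer
        messages=[
            {"role": "system", "content": "Rewrite the user's latest message as a short, "
                                          "self-contained search query. Return only the query."},
            {"role": "user", "content": "Conversation so far:\n" + "\n".join(history)
                                        + "\n\nLatest message:\n" + user_message},
        ],
    )
    return response.choices[0].message.content.strip()
```

You then embed the rewritten query instead of the raw user message before searching.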

Curious if this process would work for potentially editing long legal documents.

Semantic Chunking
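One way to sketch semantic chunking without committing to a particular library is to embed individual sentences and start a new chunk wherever the similarity between neighbours drops (LangChain also ships an experimental SemanticChunker along these lines). In the sketch below, `embed_sentences` and the 0.75 threshold are placeholders:

```python
import numpy as np

def semantic_chunks(sentences: list[str], embed_sentences, threshold: float = 0.75) -> list[str]:
    """Group consecutive sentences, starting a new chunk where the cosine
    similarity between neighbouring sentence embeddings drops below `threshold`."""
    if not sentences:
        return []
    vectors = np.asarray(embed_sentences(sentences), dtype=float)  # (n_sentences, dim)
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)      # normalise for cosine similarity
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if float(vectors[i - 1] @ vectors[i]) < threshold:         # likely topic shift
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```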

Summarizations.

And, as opposed to creating a Q&A, use the LLM to create questions (5, 10, whatever) that each document (or document chunk) answers. Embed these questions along with the chunks and you’ve got a far more robust question-answering system.
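A hedged sketch of that question-generation idea, again using the openai client; the model names, prompt, and the `questions_for_chunk` / `index_records` helpers are illustrative assumptions:

```python
from openai import OpenAI

client = OpenAI()

def questions_for_chunk(chunk: str, n: int = 5) -> list[str]:
    """Ask the model for n short questions that this chunk answers."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{
            "role": "user",
            "content": f"Write {n} short questions that the following text answers, "
                       f"one per line:\n\n{chunk}",
        }],
    )
    return [q.strip() for q in response.choices[0].message.content.splitlines() if q.strip()]

def embed(texts: list[str]) -> list[list[float]]:
    result = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [item.embedding for item in result.data]

def index_records(chunks: list[str]) -> list[dict]:
    """Index both each chunk and its generated questions,
    all pointing back to the same source chunk for retrieval."""
    records = []
    for i, chunk in enumerate(chunks):
        texts = [chunk] + questions_for_chunk(chunk)
        for text, vector in zip(texts, embed(texts)):
            records.append({"vector": vector, "text": text, "source_chunk": i})
    return records
```

When a stored question matches the user’s query, you return its `source_chunk` rather than the question itself.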
