How to Optimize Text Chunking for Improved Embedding Vectorization?

I’m currently using LangChain to split my texts into chunks, but I believe it doesn’t always yield the best vectors. My source material consists of lengthy articles, which often have contextual information distributed across the entire article. When I break these articles into smaller chunks, I risk losing important context.

Does anyone have any suggestions on how to enhance this process? I’ve been contemplating the idea of introducing a pre-vectorization step where I could transform all the articles into a “question and answer” format through an OpenAI request. However, I’m concerned that this approach might be costly, or perhaps there are more effective alternatives available. Any insights would be greatly appreciated.

Can you give an example of where a contextually relevant part would be missed by retrieving the top, let’s say, 5 chunks? By this I mean: are you certain that, with some chunk overlap, you are actually going to miss details vital to a correct answer? The reason I ask is that I don’t think even a human brain is able to do this; that’s fine if you are looking for superhuman abilities, but is it needed?

Indeed, my initial assumption might be mistaken. In my current setup, I return 2 chunks, and these 2 chunks can actually be from different articles, which, when combined, form the answer. I wasn’t aware of the option to retrieve nearby chunks from these initial ones (I’m just discovering this now), so perhaps that could be the solution to the problem presented here.

Well, you can pull back the closest K chunks to your search vector, but when you are embedding the text you can include some of the prior chunk and some of the next chunk. In effect you create a sliding window over your data that contains, for example, 25% of the prior chunk, 50% of the current chunk, and 25% of the next chunk. This gives you relevance and context across chunk boundaries. In addition, you can include metadata with each chunk, such as page numbers, index references, or keywords, so you can attach any information that may increase usefulness.
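A minimal sketch of that sliding window, assuming the articles are already split into an ordered list of chunks; the 25% fraction, the record layout, and the `embed()` call are illustrative placeholders rather than a specific library’s API:

```python
# Sliding-window pass over pre-split chunks: each embedded text carries the
# tail of the previous chunk and the head of the next one, roughly matching
# the 25% / 50% / 25% split described above.

def windowed_records(chunks: list[str], fraction: float = 0.25) -> list[dict]:
    records = []
    for i, chunk in enumerate(chunks):
        prev_tail, next_head = "", ""
        if i > 0:
            n = int(len(chunks[i - 1]) * fraction)
            prev_tail = chunks[i - 1][-n:] if n else ""
        if i + 1 < len(chunks):
            n = int(len(chunks[i + 1]) * fraction)
            next_head = chunks[i + 1][:n]
        records.append({
            "text": prev_tail + chunk + next_head,  # what actually gets embedded
            "chunk_index": i,                        # metadata: position in the article
            "original": chunk,                       # clean chunk kept for display
            # add page numbers, keywords, index references, etc. here
        })
    return records

# vectors = [embed(r["text"]) for r in windowed_records(chunks)]
```

At query time you still return the clean `original` text, but the vector was built from the windowed text, so hits near chunk boundaries are less likely to be missed.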

You can also ask the LLM itself to propose the best search term for any given user query. This allows the model to take past conversational context into account, and it keeps the user from directly skewing the search results with unusual characters or word choices.
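As a rough illustration of that query-rewrite step using the openai Python client; the model name, prompt, and `propose_search_query` helper are assumptions of mine, not something prescribed in this thread:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def propose_search_query(user_message: str, history: list[str]) -> str:
    """Ask the LLM to turn the raw user message (plus recent history)
    into a clean, self-contained search query for the vector store."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whichever model you prefer
        messages=[
            {"role": "system", "content": "Rewrite the user's latest message as a short, "
                                          "self-contained search query. Return only the query."},
            {"role": "user", "content": "Conversation so far:\n" + "\n".join(history)
                                        + "\n\nLatest message:\n" + user_message},
        ],
    )
    return response.choices[0].message.content.strip()
```

You then embed the rewritten query instead of the raw user message before searching.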

Curious if this process would work for potentially editing long legal documents.

Semantic Chunking
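One way to sketch semantic chunking without committing to a particular library is to embed individual sentences and start a new chunk wherever the similarity between neighbours drops (LangChain also ships an experimental SemanticChunker along these lines). In the sketch below, `embed_sentences` and the 0.75 threshold are placeholders:

```python
import numpy as np

def semantic_chunks(sentences: list[str], embed_sentences, threshold: float = 0.75) -> list[str]:
    """Group consecutive sentences, starting a new chunk where the cosine
    similarity between neighbouring sentence embeddings drops below `threshold`."""
    if not sentences:
        return []
    vectors = np.asarray(embed_sentences(sentences), dtype=float)  # (n_sentences, dim)
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)      # normalise for cosine similarity
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if float(vectors[i - 1] @ vectors[i]) < threshold:         # likely topic shift
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```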

Summarizations.

And, as opposed to creating a Q&A, use the LLM to create questions (5, 10, whatever) that each document (or document chunk) answers. Embed these questions along with the chunks and you’ve got a far more robust question-answering system.
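A hedged sketch of that question-generation idea, again using the openai client; the model names, prompt, and the `questions_for_chunk` / `index_records` helpers are illustrative assumptions:

```python
from openai import OpenAI

client = OpenAI()

def questions_for_chunk(chunk: str, n: int = 5) -> list[str]:
    """Ask the model for n short questions that this chunk answers."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{
            "role": "user",
            "content": f"Write {n} short questions that the following text answers, "
                       f"one per line:\n\n{chunk}",
        }],
    )
    return [q.strip() for q in response.choices[0].message.content.splitlines() if q.strip()]

def embed(texts: list[str]) -> list[list[float]]:
    result = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [item.embedding for item in result.data]

def index_records(chunks: list[str]) -> list[dict]:
    """Index both each chunk and its generated questions,
    all pointing back to the same source chunk for retrieval."""
    records = []
    for i, chunk in enumerate(chunks):
        texts = [chunk] + questions_for_chunk(chunk)
        for text, vector in zip(texts, embed(texts)):
            records.append({"vector": vector, "text": text, "source_chunk": i})
    return records
```

When a stored question matches the user’s query, you return its `source_chunk` rather than the question itself.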
