Does anyone have insight on whether it’s going to give better results to use fewer, longer embedding texts, or to split these texts up into more smaller fragments?
(I’m not much worried about token limits; I can handle that either way.)
Here’s what I’m thinking about. I’d prefer to use longer embedding texts: 1) to reduce the number of vectors I need to search through (when running a dot-product comparison of the query against all my embedded texts), and 2) because a longer text can include relevant surrounding context that might not get captured if I split it into many smaller sections.
The reason not to use larger texts is that I’m concerned that when calculating the dot product of my query against a longer text, a lot of “noise” from all the additional text will tend to obscure the result I want, which is to find the text that most closely matches my query. But I don’t fully understand how these vectors are generated or what they signify, so maybe I’m not thinking about this correctly.
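For reference, the search itself is just one dot product per stored vector. Here’s a minimal sketch in Python/NumPy — random vectors stand in for real embeddings here; with OpenAI embeddings the vectors come back unit-normalized, so the dot product equals cosine similarity:

```python
import numpy as np

# Stand-in corpus of pre-computed embedding vectors (one row per chunk).
# Real embeddings would come from an embeddings API; these are random.
rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(5, 8))
doc_vectors /= np.linalg.norm(doc_vectors, axis=1, keepdims=True)

# Embed the query the same way, then compare against every chunk at once.
query = rng.normal(size=8)
query /= np.linalg.norm(query)

scores = doc_vectors @ query   # one dot product per chunk
best = int(np.argmax(scores))  # index of the closest chunk
```

Because everything is unit-normalized, each score lands in [-1, 1] and the argmax picks the chunk whose embedding points in the direction closest to the query’s.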
The general rule of thumb is to split the data at paragraph boundaries, on the assumption that the context surrounding the main text has been captured within that paragraph. However, I agree with @ruby_coder that you will have to run tests with different text lengths to see which best suits your needs.
Also, in terms of “to reduce the number of vectors I need to search through”: searching the vectors, whether in a vector database or on your local machine, will not take much time unless you have millions of them, and a vector database like Pinecone reduces that time even further. This shouldn’t be much of a concern for you, as this part can be optimised.
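To make the paragraph-splitting idea concrete, here’s a minimal sketch that splits on blank lines and then merges short paragraphs so chunks don’t end up too small to embed well (the `min_words` threshold is an illustrative knob, not a recommended value):

```python
def split_paragraphs(text, min_words=40):
    """Split text on blank lines, then merge consecutive paragraphs
    until each chunk reaches at least min_words words, so every chunk
    keeps enough surrounding context to embed well."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for p in paragraphs:
        current.append(p)
        if sum(len(c.split()) for c in current) >= min_words:
            chunks.append("\n\n".join(current))
            current = []
    if current:  # flush whatever is left over
        chunks.append("\n\n".join(current))
    return chunks
```

In practice you would also want a `max_words` cap and possibly sentence-level splitting for oversized paragraphs, but the merge-short-paragraphs step is the part that addresses the “too short to embed” problem.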
And @mattcorbin needs to ensure the segments are not too short, because embedding vectors do not work well for short phrases, keywords, etc.
@raymonddavey has suggested more than 200 to 300 words or tokens (I do not recall exactly), but I have tested extensively with short phrases and keywords, and embeddings perform very poorly when the text is very short.
This is a super interesting question. My own personal experience with it is:
You do not need to be very concerned about the “noise” that might be added when you embed longer texts. Even if you’re asking a very specific question that is only answered in a tiny portion of the embedded text, the semantic search mechanism should still be able to assign a high similarity to the pair (question, text).
However, there is still a trade-off between long and short texts. As @ruby_coder was pointing out, I’d avoid very short chunks because you definitely lose accuracy and context. But very long chunks have issues too. If you’re retrieving them to inject into a completion/chat prompt that needs to answer the question using these texts, I feel that injecting very long texts that are unrelated (except for a tiny portion of them) can make the answering model hallucinate further: it gets confused by the large amount of information you provide that has nothing to do with the question. Also, if you want to inject several texts into the prompt (because your answer might be spread across several of them), you won’t be able to do so with very long chunks.
So, the solution that works for me goes as follows:
I do a two-step semantic search for every question. I embed my chunks twice: as long texts (around 4k characters) and as short ones (around 1k characters). When a new question comes in, I first run the semantic search in the “long chunks” space. This gives me the long chunks I should focus on.
Then, I have a classifier that determines whether the question is a “general” or a “specific” question. Developing this classifier is the tricky part. But once it works, it basically determines whether your question needs generic (long-context) info (such as “summarize this text for me”, or “what are the main takeaways from this text?”) or specific (short-context) info (“what is the email of this customer?”).
If it’s a general question, I just try to answer it with the most relevant docs retrieved from the “long chunks”. If it’s a specific one, I run a second semantic search over the “short chunks” that belong to the “long chunks” I have already pre-selected, using the “short chunks” embedding space. And I use those instead to try to come up with an answer.
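The two-step flow above can be sketched roughly like this — all the names, thresholds, and the `short_to_long` parent mapping are illustrative assumptions, not the poster’s actual code:

```python
import numpy as np

def top_k(query_vec, matrix, k):
    """Rank rows of `matrix` by dot product with the query, best first."""
    scores = matrix @ query_vec
    return np.argsort(scores)[::-1][:k]

def two_step_retrieve(query_vec, long_vecs, short_vecs, short_to_long,
                      is_specific, k_long=3, k_short=5):
    """Sketch of the two-step search: first rank long chunks, then
    (for specific questions) re-search only the short chunks whose
    parent long chunk survived the first pass.
    short_to_long[i] is the index of short chunk i's parent long chunk."""
    long_ids = top_k(query_vec, long_vecs, k_long)
    if not is_specific:
        return ("long", list(long_ids))
    selected = set(long_ids)
    candidates = [i for i, parent in enumerate(short_to_long)
                  if parent in selected]
    ranked = top_k(query_vec, short_vecs[candidates], k_short)
    return ("short", [candidates[i] for i in ranked])
```

The classifier’s general/specific verdict is just the `is_specific` flag here; in the real system that decision comes from the fine-tuned model described below.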
It works reasonably well. I still feel that there are further innovations that might help in this regard. Hope that helps!!
Quick question: How do you do the classification to decide the question type?
Also, I’m assuming vector DB costs are not an issue in your case, right? (cause this essentially doubles the vector DB cost)
Another question: Are you doing any pre-processing with the user prompt? I sometimes notice that the chunks retrieved are pretty much just “grep” on the text (rather than a true semantic search). For example: Someone will be asking “What is your pricing?” and the vector search will miss obvious chunks where “cost”, “billing”, etc is mentioned. Any thoughts on that? (We use Pinecone)
Thanks for the feedback @alden. Those are all amazing questions:
The classification is done with a fine-tuned Ada model, to keep costs and latencies under control. I trained this classifier with around 2k samples of generic and specific questions. To do it, I used the data that my customers submitted to my app, to ensure that it’s tailored to my domain. Basically, I got text-davinci-003 to classify these questions for me to generate the training data. And then I used this training data to fine-tune the classifier. If you don’t have enough training data, you can also generate a synthetic dataset using a high quality model.
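For what it’s worth, training data for that kind of classifier ends up in the prompt/completion JSONL format that legacy OpenAI fine-tuning (e.g. for Ada-class models) expects. A minimal sketch of assembling it — the separator token and the label strings are illustrative choices, not the poster’s exact setup:

```python
import json

# Hypothetical labelled questions (in practice, ~2k samples labelled
# by a stronger model or by hand).
samples = [
    ("Summarize this document for me.", "general"),
    ("What is the customer's email address?", "specific"),
]

# One JSON object per line; the "###" separator marks the end of the
# prompt, and the leading space on the completion is a common convention
# for prompt/completion fine-tuning formats.
lines = [json.dumps({"prompt": q + "\n\n###\n\n", "completion": " " + label})
         for q, label in samples]
jsonl = "\n".join(lines)
```

At inference time you send the question with the same separator appended and read back the single-word label.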
Yeah, vector costs are not an issue. Storing vectors is usually very cheap, so I can double the number of docs and still not run into trouble.
About the pre-processing: that’s an extremely interesting question. In my experience, proper preprocessing improves the semantic search results dramatically. There are tons of suitable strategies here. For me, augmenting the context of each chunk with off-chunk info works really well. For instance: including metadata about the chunk (the title of the document the chunk comes from, the author, keywords extracted via NER, a short chunk/document summary, etc.). I explained this idea here: The length of the embedding contents - #7 by AgusPG
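A tiny sketch of that augmentation step — the metadata field names are illustrative; use whatever off-chunk info you actually have:

```python
def augment_chunk(chunk_text, metadata):
    """Prepend off-chunk context (title, author, keywords, summary) to a
    chunk before embedding it, so queries about e.g. 'pricing' can still
    match a chunk whose own words only say 'cost' or 'billing'."""
    header = "\n".join([
        f"Title: {metadata.get('title', '')}",
        f"Author: {metadata.get('author', '')}",
        f"Keywords: {', '.join(metadata.get('keywords', []))}",
        f"Summary: {metadata.get('summary', '')}",
    ])
    return header + "\n\n" + chunk_text
```

You embed the augmented string but typically still inject only the original `chunk_text` into the final answer prompt.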