Embedding - text length vs accuracy?

This is a super interesting question. My own personal experience with it is:

  • You do not need to be very concerned about the noise that might be added when you embed longer texts. Even if you’re asking a very specific question that is only answered in a tiny portion of the embedded text, the semantic search mechanism should still be able to assign a high similarity to the pair (question, text).

  • However, there is still a trade-off between long and short texts. As @ruby_coder was pointing out, I’d try to avoid very short chunks because you definitely lose accuracy and context. However, very long chunks also have some issues. If you’re retrieving them to be injected into a completion/chat prompt that needs to answer the question using these texts, I feel that injecting very long texts that are unrelated to the question (apart from a tiny portion of them) can make the answering module hallucinate further. It gets confused by the huge amount of information you provide that has nothing to do with the question. Also: if you want to inject several texts into the prompt (because your answer might lie in several of them), you won’t be able to do so with very long chunks.

  • So, the solution that works for me goes as follows:

    • I do a two-step semantic search for every question. I embed my chunks twice: with long texts (around 4k characters) and short ones (around 1k characters). When a new question comes in, I first conduct the semantic search in the “long chunks” space. This gives me the long chunks I should focus on.

    • Then, I have a classifier that determines if the question is a “general” or a “specific” question. Developing this classifier is the tricky part. But once it works, it basically determines whether your question needs generic (long-context) info (such as “summarize this text for me”, or “what are the main takeaways from this text?”) or specific (short-context) info (“what is the email of this customer?”).

    • If it’s a general question, I just try to answer it with the most relevant docs that I retrieved from the “long chunks” texts. If it’s a specific one, I conduct a second semantic search over the “short chunks” that belong to the “long chunks” that I have already pre-selected, using the “short chunks” embedding space. And I use these guys instead to try to come up with a solution.
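The two-step retrieval above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the author’s actual code: `embed()` is a toy deterministic stand-in for a real embedding model, the chunk sizes mirror the ~4k/~1k character values mentioned, and the general-vs-specific classifier is abstracted away into an `is_specific` flag.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy deterministic embedding so the sketch runs end to end.
    Replace with a real embedding model in practice."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Vectors from embed() are unit-normalized, so the dot product
    # is the cosine similarity.
    return float(a @ b)

def chunk(text: str, size: int) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

def retrieve(question: str, document: str, is_specific: bool,
             long_size: int = 4000, short_size: int = 1000) -> str:
    q = embed(question)
    # Step 1: search the "long chunks" space to pre-select a region.
    long_chunks = chunk(document, long_size)
    best_long = max(long_chunks, key=lambda c: cosine(q, embed(c)))
    if not is_specific:
        # General question: answer from the long chunk directly.
        return best_long
    # Step 2: second search over the short chunks *inside* the
    # pre-selected long chunk, in the "short chunks" embedding space.
    short_chunks = chunk(best_long, short_size)
    return max(short_chunks, key=lambda c: cosine(q, embed(c)))
```

In a real system you would of course keep the two chunk indexes precomputed and return the top-k chunks rather than a single best one; the point here is just the narrowing from the long-chunk space to the short-chunk space.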

It works reasonably well. I still feel that there are further innovations that might help in this regard. Hope that helps!! :slight_smile: