The length of the embedding contents

@wfhbrian is fully right here. The problem with chunking is that you lose context that might later be needed to retrieve the chunk. This context can be classified as global context (context that refers to the whole document the chunk belongs to) or local context (context that lives in the next/previous chunks, but is still relevant to the current one). Let me give you an actual example:

  • Imagine that you have the whole transcript of a YT video that you want to embed so you can ask questions about it and produce accurate answers. Say the transcript breaks up into 10 chunks of 4000 characters each, and the name of the YT video is “How to use OpenAI’s API: my top-10 strategies”. Assume you split the transcript in the naive way, so you get 10 chunks with no additional context. Now a user asks the following question: “What is the third strategy that was mentioned in the YT video "How to use OpenAI’s API: my top-10 strategies"? How does it relate to the two previous ones?”

Let’s say that the answer to this question lies in your third chunk, in a fragment such as “Ok, folks, now let's go to the third one! One thing you need to consider is BLA BLA BLA...”, and in the first and second chunks (as the first and second strategies are thoroughly discussed there).

It is likely that your semantic search will not retrieve these chunks to answer the question, so you will not be able to respond accurately. This happens because of two different problems:

  • Your chunks lack global context about the doc they belong to. Since you only embedded a portion of the transcript in each of them and did not include any doc metadata, they do not know that they belong to the YT video “How to use OpenAI’s API: my top-10 strategies”. So the semantic search will not trigger a similarity with the YT video title.

  • Your chunks lack local context about the previous (and next) chunks. Your third chunk carries absolutely zero information about the previous ones. So even if it were retrieved, the model has no way to compare the third strategy with the two previous ones.

There are a lot of different strategies to address these problems, and it’s a very relevant (and hard) research topic. Here is something that works very well for me as a starting point:

  • Consider prepending some meaningful global metadata to every chunk that you’re embedding. Something like "Document's title: How to use OpenAI's API: my top-10 strategies.mp4. Document's author: <Youtuber_name>, etc..." will do the job. This way, you’re giving global context to all your embedded chunks (see the first sketch after this list).
  • Consider adding local context about the content of the previous chunks. For example, you can create a rolling-window summary of all the previous content and propagate it through the chunks, so every chunk starts with some additional info about everything that came before it in the document (see the second sketch after this list). This is the tricky part, as the summary needs to be short enough while still keeping meaningful info about all the previous fragments.
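
To make the first idea concrete, here is a minimal Python sketch of the metadata-prepending step, assuming the official `openai` Python client. The chunk contents, metadata fields, and model choice are all illustrative placeholders:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Global metadata shared by every chunk of this document (illustrative values).
doc_metadata = (
    "Document's title: How to use OpenAI's API: my top-10 strategies.mp4. "
    "Document's author: <Youtuber_name>."
)

# Placeholder chunks; in practice these are your ~4000-character fragments.
chunks = [
    "Welcome back, folks! Today I'm sharing my top-10 strategies...",
    "The second strategy builds directly on the first one...",
    "Ok, folks, now let's go to the third one! One thing you need to consider is...",
]

# Prepend the same global metadata to every chunk before embedding it.
contextualized_chunks = [f"{doc_metadata}\n\n{chunk}" for chunk in chunks]

# Embed the contextualized chunks instead of the raw ones, so every vector
# carries the global context of the source document.
response = client.embeddings.create(
    model="text-embedding-3-small",  # model choice is an assumption
    input=contextualized_chunks,
)
vectors = [item.embedding for item in response.data]
```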
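
And here is a sketch of the second idea, the rolling-window summary. The `summarize` helper is a hypothetical function I’m implementing with a chat-completion call; the model, prompt, and summary length are assumptions you’d tune to your own token budget:

```python
from openai import OpenAI

client = OpenAI()

def summarize(text: str, max_words: int = 150) -> str:
    # Hypothetical helper: compress `text` into a short summary via a
    # chat-completion call. Model and prompt are assumptions.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Summarize the following in at most {max_words} words, "
                       f"keeping every key point:\n\n{text}",
        }],
    )
    return response.choices[0].message.content

def add_rolling_context(chunks: list[str]) -> list[str]:
    # Propagate a rolling summary of everything seen so far, so each chunk
    # starts with local context about the previous parts of the document.
    contextualized = []
    running_summary = ""
    for chunk in chunks:
        if running_summary:
            contextualized.append(
                f"Summary of the previous content: {running_summary}\n\n{chunk}"
            )
        else:
            contextualized.append(chunk)  # first chunk has nothing before it
        # Fold the current chunk into the summary for the next iteration.
        running_summary = summarize(f"{running_summary}\n\n{chunk}")
    return contextualized
```

You can then embed the output of `add_rolling_context(chunks)` (optionally combined with the global metadata from the previous sketch), so each vector carries both kinds of context.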

Hope that helps!! :slight_smile:
