Using the GPT-4 API to Semantically Chunk Documents

I wanted to share some real-life implications of the choice between “sliding window” and “atomic idea” chunking during raw text processing. Let me know what you think.

Background: at LAWXER (where I work), we developed the “atomic idea” approach (see my previous posts), where the raw text goes through a complex workflow that identifies pieces each containing only one idea, and a hierarchical tree is then built with parent-child relationships between those pieces. Depending on the application (in our case, legal document analysis), the tree is transformed into “embeddable objects” that are vectorized and stored in a database (we use Weaviate for its performance).
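
To make that concrete, here is a minimal sketch of what such a tree and its “embeddable objects” could look like. The class and field names are just my illustration for this post, not our production schema, and the actual embedding/Weaviate insertion step is left out:

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional

@dataclass
class IdeaNode:
    """One "atomic idea" extracted from the raw text."""
    text: str
    parent: Optional["IdeaNode"] = None
    children: list["IdeaNode"] = field(default_factory=list)

    def add_child(self, text: str) -> "IdeaNode":
        child = IdeaNode(text=text, parent=self)
        self.children.append(child)
        return child

def embeddable_objects(node: IdeaNode) -> Iterator[dict]:
    """Flatten the tree into objects ready to be vectorized and stored.

    Each object keeps a reference to its parent so the retriever can climb
    back to the containing article/section at query time.
    """
    yield {
        "text": node.text,
        "parent_text": node.parent.text if node.parent else None,
    }
    for child in node.children:
        yield from embeddable_objects(child)

# Example: an article containing two subsections, one idea each.
article = IdeaNode("Article 12 - Termination")
article.add_child("12.1 Either party may terminate with 30 days written notice.")
article.add_child("12.2 No termination for convenience during the first 12 months.")
for obj in embeddable_objects(article):
    print(obj)
```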

This approach is much “heavier” in development and API costs than “sliding window” chunking, where the text is cut into overlapping pieces of a given (often variable) length before being embedded into a vector database.
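
For comparison, a sliding window chunker is only a few lines. A character-based sketch (the window and overlap sizes below are arbitrary placeholders):

```python
def sliding_window_chunks(text: str, window: int = 1000, overlap: int = 200) -> list[str]:
    """Cut text into fixed-size, overlapping chunks.

    The cut points ignore sentence and clause boundaries entirely, which is
    where the problems discussed below come from.
    """
    if overlap >= window:
        raise ValueError("overlap must be smaller than the window")
    step = window - overlap
    return [text[i:i + window] for i in range(0, max(len(text) - overlap, 1), step)]
```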

Both approaches aim to create pieces of context that can be retrieved by vector similarity search, so that a relevant context can be selected for the LLM to use during result generation (RAG).

Now, one of our competitors, https://www.termscout.com/, offers a contract analysis solution similar to the one we are about to put on the market. I went through their demos to analyze their RAG engine’s performance (based on the videos they shared, I know they use a “sliding window” approach).

I know a lot will depend on the “embeddable object” structure you use for RAG, but still, some trends can be spotted right up front:

  1. Sliding window does not guarantee that the “top” match actually contains sufficient context to produce the desired result, because it cannot guarantee that the whole “idea” is present in the selected window, nor that the selected chunk is free of “noise” (text located close to the searched piece but not part of it).
    How do I know? TermScout’s solution often referred to chunks as the “basis for the answer” where only partial context was present at either the beginning or the end of the chunk. Because the chunking mechanism cut the context in the wrong spot, the produced result was wrong: the full context needed for a correct answer simply wasn’t there.
    Why does it happen? The vector match is less precise when several ideas are present in a chunk, and depending on the surrounding text and where the cuts fall, the search may simply fail to find the necessary parts of the context because they are farther from the query than other parts of that context sitting in different chunks. For example, if you search for an element with a query describing that element (naturally close to the title vector), the selected chunks are likely to be short and to contain the title, with no solid guarantee of containing the whole body (see the first sketch after this list).

  2. The sliding window approach is also less precise than “atomic idea” because several chunks are likely to contain the same parts of text mixed with other parts of the document, and there are too many factors that can affect the vectors (what I call noise). In the “atomic idea” approach, you get one whole item per chunk, and if the searched item is a “container” of chunks (say, an article containing several subsections), then with our current “embeddable object” structure you get the parent item plus the closest matches among its children on the first run; if that is not enough, you can select all children of the parent either on the first run or in a subsequent query (see the second sketch below).
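
To make point 1 tangible, here is a toy example built on the sliding_window_chunks helper sketched above (the clause text, window and overlap sizes are made up for the illustration):

```python
def sliding_window_chunks(text, window, overlap):  # same helper as above
    step = window - overlap
    return [text[i:i + window] for i in range(0, max(len(text) - overlap, 1), step)]

clause = (
    "12.1 Either party may terminate this agreement with thirty (30) days "
    "written notice, except that neither party may terminate for convenience "
    "during the first twelve (12) months of the term."
)
boilerplate = "This section intentionally repeats standard boilerplate. " * 10
document = boilerplate + clause + boilerplate

chunks = sliding_window_chunks(document, window=200, overlap=40)

# Which chunks carry the notice period, and which carry the carve-out?
for i, chunk in enumerate(chunks):
    has_notice = "thirty (30) days" in chunk
    has_carveout = "twelve (12) months" in chunk
    if has_notice or has_carveout:
        print(i, has_notice, has_carveout)

# With these sizes the cuts fall inside the clause: no single chunk contains
# both the notice period and the twelve-month carve-out, so whichever chunk
# ranks first at query time hands the model an incomplete rule.
```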
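
And for point 2, here is a rough in-memory illustration of the first-pass retrieval I described (top matches plus their parent and closest sibling children). The object layout and function names are mine for the example; this is not Weaviate’s API, which does the ranking for us in practice:

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retrieve(query_vec, objects, k=2, children_per_parent=2):
    """First pass: top-k atomic ideas, each returned together with its parent
    and the best-matching siblings under that parent. If this context is not
    enough, a second pass can simply pull all children of the parent."""
    ranked = sorted(objects, key=lambda o: cosine(query_vec, o["vector"]), reverse=True)
    context = []
    for hit in ranked[:k]:
        siblings = [o for o in objects if o["parent"] == hit["parent"] and o is not hit]
        siblings.sort(key=lambda o: cosine(query_vec, o["vector"]), reverse=True)
        for text in [hit["parent"], hit["text"]] + [s["text"] for s in siblings[:children_per_parent]]:
            if text not in context:
                context.append(text)
    return context

# Toy usage with 2-dimensional stand-in vectors (real ones come from an embedding model):
objects = [
    {"text": "12.1 Termination with 30 days notice.", "vector": [0.9, 0.1],
     "parent": "Article 12 - Termination"},
    {"text": "12.2 No termination for convenience in the first 12 months.", "vector": [0.8, 0.3],
     "parent": "Article 12 - Termination"},
]
print(retrieve([1.0, 0.0], objects, k=1))
```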

So, what do you guys think?
