Using gpt-4 API to Semantically Chunk Documents

An “ideal chunk”, if we can call it that, is more or less “atomic”: it contains one idea at a time, so that your RAG pipeline wins in precision when matching the vector of a usually short query (often a one-sentence question) against the vector of a chunk. If chunks are long, they tend to contain multiple ideas, and you lose the precision you need.

Why do chunks longer than 3-5 paragraphs tend to contain multiple ideas? Because humans lose their thought map at around 3-5 paragraphs, and their minds start wandering, pulling in a bunch of less related ideas.

So 3-5 paragraphs is comfortably under the token limit. From my experience I would add that chunks (especially in legal documents) are often closer to 1 paragraph than 3.
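For illustration, a first pass along these lines can be as simple as splitting on blank lines and capping the number of paragraphs per chunk (a minimal sketch; the function name and the cap are my own choices, not a definitive implementation):

```python
# Minimal sketch of paragraph-level chunking: split on blank lines and
# group at most `max_paragraphs` consecutive paragraphs per chunk.
def chunk_by_paragraphs(text: str, max_paragraphs: int = 3) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [
        "\n\n".join(paragraphs[i:i + max_paragraphs])
        for i in range(0, len(paragraphs), max_paragraphs)
    ]
```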

The approach I promote first gets the chunks, then analyzes their purpose, and only then starts building hierarchical relations, as opposed to building the relationships first and then splitting into chunks.

With my approach I never run into the problems that come with a narrow token window limit. It also gains in speed, because I get the chunks very early and the rest of the work then runs on them in parallel.
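To make the order of operations concrete, here is a rough sketch of what I mean (the prompt, helper names, and thread pool are my own assumptions for illustration, not production code): chunks come first, each chunk's purpose is analyzed with an independent GPT-4 call in parallel, and the hierarchy is built afterwards from those results.

```python
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def analyze_purpose(chunk: str) -> str:
    """Ask the model for the single main purpose of one chunk (one idea per chunk)."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Summarize the single main purpose of this passage in one sentence."},
            {"role": "user", "content": chunk},
        ],
    )
    return response.choices[0].message.content


def process_document(text: str) -> list[dict]:
    # 1. Chunks come first (uses chunk_by_paragraphs() from the snippet above).
    chunks = chunk_by_paragraphs(text)
    # 2. Purpose analysis runs in parallel, since each chunk is independent.
    with ThreadPoolExecutor(max_workers=8) as pool:
        purposes = list(pool.map(analyze_purpose, chunks))
    # 3. Hierarchical relations would be built here as a final pass over the purposes.
    return [{"chunk": c, "purpose": p} for c, p in zip(chunks, purposes)]
```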
