Using gpt-4 API to Semantically Chunk Documents

An “ideal chunk”, if we can call it that, is more or less “atomic”: it contains one idea at a time, so that your RAG pipeline wins in precision when matching the vector of a usually short query (often a one-sentence question) against the vector of a chunk. If chunks are long, they tend to contain multiple ideas, and you lose the precision you need.

Why do chunks longer than 3-5 paragraphs tend to contain multiple ideas? Because humans lose their thought map at around 3-5 paragraphs, and their minds start wandering, pulling in a bunch of less related ideas.

So 3-5 paragraphs is comfortably under the token limit. From my experience I would add that chunks (especially in legal documents) are often closer to 1 paragraph than 3.
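For illustration, a first pass along these lines can be as simple as splitting on blank lines and capping the number of paragraphs per chunk (a minimal sketch; the function name and the cap are my own choices, not a definitive implementation):

```python
# Minimal sketch of paragraph-level chunking: split on blank lines and
# group at most `max_paragraphs` consecutive paragraphs per chunk.
def chunk_by_paragraphs(text: str, max_paragraphs: int = 3) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [
        "\n\n".join(paragraphs[i:i + max_paragraphs])
        for i in range(0, len(paragraphs), max_paragraphs)
    ]
```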

The approach I promote first gets the chunks, then analyzes their purpose, and only then starts building hierarchical relations, as opposed to building the relationships first and then splitting into chunks.

With my approach I never run into the problems that come with a narrow token window limit. It also gains in speed, because I get the chunks very early and the rest of the work then runs on them in parallel.
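To make the order of operations concrete, here is a rough sketch of what I mean (the prompt, helper names, and thread pool are my own assumptions for illustration, not production code): chunks come first, each chunk's purpose is analyzed with an independent GPT-4 call in parallel, and the hierarchy is built afterwards from those results.

```python
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def analyze_purpose(chunk: str) -> str:
    """Ask the model for the single main purpose of one chunk (one idea per chunk)."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Summarize the single main purpose of this passage in one sentence."},
            {"role": "user", "content": chunk},
        ],
    )
    return response.choices[0].message.content


def process_document(text: str) -> list[dict]:
    # 1. Chunks come first (uses chunk_by_paragraphs() from the snippet above).
    chunks = chunk_by_paragraphs(text)
    # 2. Purpose analysis runs in parallel, since each chunk is independent.
    with ThreadPoolExecutor(max_workers=8) as pool:
        purposes = list(pool.map(analyze_purpose, chunks))
    # 3. Hierarchical relations would be built here as a final pass over the purposes.
    return [{"chunk": c, "purpose": p} for c, p in zip(chunks, purposes)]
```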
