Using gpt-4 API to Semantically Chunk Documents

I spent some time looking at various chunking methods being promoted. These are some particularly good videos I found:

James Briggs' discussion of his version of Semantic Chunking

Chunking Strategies

The 5 levels of text splitting

It’s been difficult wrapping my head around these various strategies, but the key one appears to involve splitting a document into sentences, using an embedding model to measure the cosine similarity between them, and then grouping (“chunking”) the ones that are most similar.
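To make sure I'm picturing that strategy correctly, here is a rough sketch of it as I understand it, not the code from any of those videos. It assumes that neighbouring sentences are the ones being compared. The `embed` function is a toy bag-of-words stand-in for a real embedding model, and the function names and the `threshold` value are my own placeholders.

```python
import math
import re
from collections import Counter


def embed(sentence):
    # ASSUMPTION: toy stand-in for a real embedding model.
    # It returns a bag-of-words count vector instead of a dense embedding.
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine_similarity(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def semantic_chunks(text, threshold=0.2):
    # Keep adding sentences to the current chunk while each one is
    # similar enough to the sentence before it; otherwise start a new chunk.
    sentences = split_sentences(text)
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if cosine_similarity(embed(prev), embed(curr)) >= threshold:
            chunks[-1].append(curr)
        else:
            chunks.append([curr])
    return [" ".join(chunk) for chunk in chunks]


if __name__ == "__main__":
    sample = (
        "Cats purr when happy. Cats purr when they sleep. "
        "Interest rates rose sharply. Interest rates rose again today."
    )
    for chunk in semantic_chunks(sample):
        print(chunk)
```

Note that the chunks come back as bare text: nothing in them records which section of the document they came from.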

I assume that, once embedded, these chunks give the best response available from the document on any particular question. However, as someone else mentioned earlier, what if two sentences are similar but come from completely different sections of the document? And, more importantly, when you return the chunk to the LLM, how does it figure out how to cite the specific document sections referenced?

In my applications, I always list the references with links to those specific areas in the document so the user can cross-check in real time. Of course, most users won’t, but they skip the check at their own peril.

Because of this, I prefer my own “layout-aware” chunking approach, which I have outlined in this thread. If I’ve done my embeddings correctly, a cosine similarity search will find similar ideas wherever they appear, and those ideas can be referenced, with links, back to the specific areas in the document(s) where they are found.
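As a rough illustration of the idea (not my production code), the sketch below keeps a section title and a link alongside each chunk, so whatever the similarity search returns can be cited. The `Chunk` fields, the `doc.html#...` links, and the sample text are all made up for the example, and `embed` is again a toy bag-of-words stand-in for a real embedding model.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    section: str  # hypothetical: heading the chunk sits under
    link: str     # hypothetical: anchor pointing back into the source document
    text: str


def embed(text):
    # ASSUMPTION: toy stand-in for a real embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, chunks, top_k=2):
    # Score every chunk against the query and return the best matches,
    # each still carrying its section title and link for citation.
    q = embed(query)
    scored = [(cosine_similarity(q, embed(c.text)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(score, c) for score, c in scored[:top_k] if score > 0]


if __name__ == "__main__":
    chunks = [
        Chunk("1. Termination", "doc.html#s1",
              "Either party may terminate the agreement with thirty days notice."),
        Chunk("2. Payment", "doc.html#s2",
              "Invoices are payable within sixty days of receipt."),
        Chunk("7. Renewal", "doc.html#s7",
              "The agreement renews unless a party gives notice to terminate."),
    ]
    for score, chunk in search("how to terminate the agreement", chunks):
        print(f"{score:.2f}  {chunk.section}  ({chunk.link})")
```

Here the two matching chunks come from sections 1 and 7, far apart in the document, and each result still says exactly where it came from.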

Not to mention that the strategies mentioned above require you to load and maintain a bunch of libraries. My approach uses just two prompts. So long as those prompts return data in the specific structures as instructed, the rest of the code will work perfectly, now and 5 years from now.