Using gpt-4 API to Semantically Chunk Documents

SomebodySysop · June 6, 2024, 3:55am

I spent some time looking at various chunking methods being promoted. These are some particularly good videos I found:

James Briggs’ discussion of his version of Semantic Chunking

Chunking Strategies

https://www.youtube.com/watch?v=pIGRwMjhMaQ&ab_channel=MervinPraison

The 5 levels of text splitting

https://youtu.be/8OJC21T2SL4?si=Wv1HjWQr2USmyiP-

It’s been difficult wrapping my head around these various strategies, but it appears that the key one involves splitting a document into sentences, using an embedding model to compute the cosine similarity between them, and then “chunking” together the ones that are most similar.
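To make that concrete, here is a rough sketch of one common variant of the idea: compare each sentence with the one before it and start a new chunk whenever the similarity drops below a threshold. This is my own illustration, not the code from any of the videos. `toy_embed` is a hypothetical stand-in (a plain word-count vector) for a real embedding model, and the `threshold` value is an arbitrary assumption you would need to tune.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Hypothetical stand-in for a real embedding model: a word-count vector."""
    return Counter(re.findall(r"[a-z0-9']+", sentence.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def semantic_chunks(text, threshold=0.2, embed=toy_embed):
    """Group consecutive sentences whose similarity stays at or above threshold."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    prev_vec = embed(sentences[0])
    for sentence in sentences[1:]:
        vec = embed(sentence)
        if cosine_similarity(prev_vec, vec) >= threshold:
            chunks[-1].append(sentence)   # similar enough: extend current chunk
        else:
            chunks.append([sentence])     # similarity dropped: start a new chunk
        prev_vec = vec
    return [" ".join(chunk) for chunk in chunks]
```

Because this version only compares neighbouring sentences, it never merges text from distant parts of the document. Variants that cluster all sentences globally behave differently.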

I assume this has the effect, as an embedding, of giving the best response that is available from the document on any particular question. However, as someone else mentioned earlier, what if two sentences are similar, but from completely different sections of the document? And, more importantly, when you return the chunk to the LLM, how does it figure out how to cite the specific document sections referenced?

In my applications, I always list the references with links to those specific areas in the document so the user can cross-check in real time. Of course, most users won’t – but they do this at their own peril.

Because of this, I prefer my own “layout-aware” chunking approach I have outlined in this thread. If I’ve done my embeddings correctly, a cosine similarity search will find similar ideas wherever they appear – and those ideas can be referenced, with links, back to the specific areas in the document(s) they are found.
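As a minimal sketch of the retrieval side of that idea (not my actual implementation): each chunk carries the section it came from plus a link, so whatever the similarity search returns can be cited directly. Everything named here is an assumption for illustration: the `Chunk` fields, the example.com URLs, the sample text, and `toy_embed`, which is a word-count stand-in for a real embedding model.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    section: str  # hypothetical: heading of the section the chunk came from
    url: str      # hypothetical: deep link to that section
    text: str     # the chunk content that gets embedded


def toy_embed(text):
    """Hypothetical stand-in for a real embedding model: a word-count vector."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, chunks, top_k=2, embed=toy_embed):
    """Return the top_k (score, chunk) pairs, highest similarity first."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(c.text)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def format_references(results):
    """Render each hit as 'section (link)' so the answer can cite its source."""
    return [f"{chunk.section} ({chunk.url})" for _, chunk in results]


# Made-up sample data, one chunk per document section.
sample_chunks = [
    Chunk("1. Definitions", "https://example.com/doc#definitions",
          "Terms used in this agreement are defined here."),
    Chunk("4. Termination", "https://example.com/doc#termination",
          "Either party may end the agreement with thirty days written "
          "notice of termination."),
    Chunk("7. Payment", "https://example.com/doc#payment",
          "Invoices are due within fifteen days of receipt."),
]
```

Calling `format_references(search("termination notice", sample_chunks, top_k=1))` gives back the section label and link alongside the match, which is the piece the user needs in order to cross-check.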

Not to mention the fact that you’ve got to load and maintain a bunch of libraries with those strategies mentioned above. In my approach, there are two prompts. So long as those prompts return data in the specific structures as instructed, the rest of the code will work perfectly – now, and 5 years from now.
