Using GPT-4 API to Semantically Chunk Documents

Alright, I spent a lot of my time this weekend testing semantic search approaches that build on our logic. A few lessons learned from my work so far, acknowledging that more work is still required.

The semantic search was done in the context of my work on comparative analysis between two regulatory documents, which I will just refer to as document 1 and document 2 for the remainder of this text.

I initially created document outlines for both documents using our approach, which successfully captured the individual articles/paragraphs containing specific regulatory requirements in each document.

My goal was then to identify, for each article/paragraph in document 1, the relevant content from document 2 so I could compare and contrast the regulatory requirements. The tricky part is that document 2, while addressing the same topic at the top level, has a very different logical flow, with relevant content being more spread out and sitting in different places.

I tested variations of the following four approaches for semantic search using cosine similarity as the distance metric to see what yielded the most accurate matches from document 2 for a given document 1 article/paragraph (a sketch of the basic embedding comparison follows the list):

  1. Comparing vector embeddings of document 1 and document 2 paragraphs using the actual paragraph text

  2. Comparing vector embeddings of document 1 and document 2 paragraphs using a summary of the paragraph text

  3. Extracting the key requirements / topics covered in a paragraph in the form of a simple comma-separated list, converting the list into a vector embedding, and using that as the basis for comparison

  4. Comparing vector embeddings of document 1 paragraphs using the actual paragraph text vis-a-vis vector embeddings of individual semantic units of the paragraphs in document 2
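
For concreteness, here is a minimal sketch of the basic embedding comparison underlying approach (1), assuming the OpenAI Python SDK; the embedding model name and the top-k default are illustrative placeholders rather than fixed choices.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    # One API call per batch; each item in resp.data carries one embedding.
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def top_k_matches(doc1_paragraphs: list[str], doc2_paragraphs: list[str], k: int = 10) -> np.ndarray:
    e1 = embed(doc1_paragraphs)
    e2 = embed(doc2_paragraphs)
    # Normalize rows so the dot product equals cosine similarity.
    e1 /= np.linalg.norm(e1, axis=1, keepdims=True)
    e2 /= np.linalg.norm(e2, axis=1, keepdims=True)
    sims = e1 @ e2.T  # shape (len(doc1), len(doc2))
    # For each document 1 paragraph, the indices of the k most similar
    # document 2 paragraphs, best match first.
    return np.argsort(-sims, axis=1)[:, :k]
```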

Going into the exercise, I was reasonably confident that (1) and (2) (or a combination of both) would yield solid results. While the results were not poor, analysis showed that a relevant paragraph from document 2 was often omitted from the identified matches, even when returning a larger number of top matches (e.g. the top 10). I attribute that to the fact that while a paragraph is in principle a semantic unit to me, there are cases where the content covered in a paragraph - even for paragraphs of similar size - is more heterogeneous. It was in those cases that the results were flawed.

This is what led me to test approaches (3) and (4). Option (3) already improved performance and resulted in fewer omissions. Option (4), however, has so far achieved the most accurate results, with the previously omitted content now being included. Of course, once you start matching at such a granular level, you always risk losing some context. So in my intended approach, I will consider not only the identified semantic unit but also the full paragraph it is part of (see the sketch below). Hence, when the model later needs to perform an analysis on the content, it has sufficient context available.
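
To illustrate the context-preservation idea, below is a rough sketch of how a matched semantic unit could carry a reference back to its parent paragraph; the class and field names are hypothetical and purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class SemanticUnit:
    text: str              # the individual semantic unit that was embedded
    parent_paragraph: str  # full paragraph the unit belongs to
    article_id: str        # e.g. "Doc2 Art. 12(3)"

def expand_matches(matches: list[SemanticUnit]) -> list[str]:
    # Deduplicate parent paragraphs while preserving match order, so the
    # later analysis prompt receives each paragraph once, with full context,
    # even when several units from the same paragraph were matched.
    seen, contexts = set(), []
    for m in matches:
        if m.parent_paragraph not in seen:
            seen.add(m.parent_paragraph)
            contexts.append(m.parent_paragraph)
    return contexts
```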

With these preliminary findings in mind, I am now leaning towards a refined approach whereby I will evaluate sections/paragraphs identified through the hierarchy outline process for further breakdown. However, instead of doing this on the basis of the size of the identified section/paragraph, I'll do it based on content heterogeneity or semantic coherence (for lack of a better term). My initial idea is to create a simple classification approach (using either a fine-tuned model or embeddings-based classification) to evaluate whether a section contains more than one semantic idea. If it does, I will apply further semantic chunking and use the resulting semantic chunks as the basis for the semantic search. A sketch of one possible coherence check follows below.
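
One simple embeddings-based way to approximate such a coherence check is to measure the average pairwise cosine similarity of the sentence embeddings within a section and flag low-coherence sections for further chunking. The sketch below assumes sentence embeddings produced as in the earlier snippet; the 0.75 threshold is a placeholder that would need tuning on real sections.

```python
import numpy as np

def needs_chunking(sentence_embeddings: np.ndarray, threshold: float = 0.75) -> bool:
    n = len(sentence_embeddings)
    if n < 2:
        return False  # a single sentence is trivially coherent
    e = sentence_embeddings / np.linalg.norm(sentence_embeddings, axis=1, keepdims=True)
    sims = e @ e.T
    # Mean of the off-diagonal similarities; the diagonal (self-similarity)
    # sums to exactly n after normalization, so subtract it out.
    mean_sim = (sims.sum() - n) / (n * (n - 1))
    # Low average intra-section similarity suggests more than one idea.
    return mean_sim < threshold
```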
