Using gpt-4 API to Semantically Chunk Documents

In this paragraph, you have answered why this is not possible.

However, this is precisely what RAG was created to do - train models on private data. You can refresh your embeddings as often as necessary and ensure that “undesired” content is always eliminated.

1 Like

I was not aware that OpenAI had a dynamic “vector store” as I am thinking of it – the type of customizable databases provided by Weaviate and PineCone. The OpenAI VS, as I understand it, is limited to the Assistants API, which means you can only use this API to search content uploaded to it.

Furthermore, you do not have control over metadata or search filtering in the Assistants API other than what is provided in the prompt.

Perhaps there have been some advancements in that API I’m not aware of, but from what I know, it gives you RAG but without any of the tools to fine-tune your RAG implementations.

1 Like

Drupal is far more “enterprise-ready” than WordPress. Besides, in my use case, it’s the solution that matters far more than the platform.

I agree with your views. That said, reading in between the lines, it seems there is a lot of thinking with regards to chunking and the specific features of the vector store on OpenAI’s side and I would not be surprised if we saw some form of convergence down the road towards what we are trying to do here.

1 Like

This may not be as relevant to you but one of the things I noticed as part of ongoing testing is that for the creation of the initial document outline / hierarchy I noticed some performance variability. Particularly, when the document contained a detailed table of contents, this would sometimes affect the identification of line numbers such that the model would occasionally reference the line numbers associated with the table of contents instead of the line numbers where the section starts in the document’s body. This has led me to further refine my prompt for creating the hierarchical document outline in the past few days to more specifically exclude any references to the table of contents. For now this has addressed the problem.

1 Like

Thanks for the heads up. Haven’t tested a document with TOC yet, so this is good to know.

As a result of comments by @egils , I’ve been looking at pdf to markdown extraction options. Right now testing PyMuPDF4LLM PyMuPDF4LLM — PyMuPDF4LLM documentation which I like a LOT in terms of speed of extraction. Not as clean or efficient as Marker GitHub - VikParuchuri/marker: Convert PDF to markdown quickly with high accuracy, but WAY faster.

Anyway, hoping that this might be a way to exclude TOCs (along with page numbers, headers and footers) automatically while getting better representation of tables.

1 Like

So I thought I share some details on something related that I have been wrapping my mind around and that I spent most of this weekend developing and testing.

One of my longer-term goals has been to develop an approach for automating comparative analysis (benchmarking) of regulatory requirements from across different regulatory documents on the same subject (i.e. documents on the same subject issued by two different regulatory bodies).

I believe I am now close to a first alpha version of the approach based on testing with two short and relatively contained regulatory documents from two regulatory bodies in two different jurisdictions.

My current approach involves the following sequence of steps:

  1. I use the tried and tested semantic chunking to initially create an outline of each regulatory document down to the lowest hierarchical level, whereby the lowest level typically reflects a paragraph with specific regulatory requirements. To maintain the parent - child relationships and preserve contextual information, I create “title paths” for each section. Finally, I also assign a unique identifier to each paragraph.

  2. For the text in each identified paragraph, I use a GPT-model to create a summary of the essence of the requirements captured in the paragraph and subsequently embeddings for each summary.

  3. I then perform initially a bottom-up mapping of paragraphs across the two documents based on a similarity search using cosine similarity, using the embeddings of the generated summaries. For this step, I currently use one of the two documents as the reference, i.e. for each identified paragraph in document 1, I identify the top-k closest matches from document 2 and then map the associated text of document 2 to document 1 paragraphs. I also capture any paragraphs from document 2 for which there was no match to document 1 identified.

  4. Using this mapping as a basis, I subsequently use again a GPT-4 model to define a more manageable set of categories (around 10-12). As part of this step, the model is required to perform a mapping of the paragraphs to the newly defined categories. For the mapping, a paragraph from document 1 and the identified matches from document 2 are treated as one unit. A given category may include a variable number of these units. I currently execute this in two steps: After the model creates an initial categorization and mapping, I perform a second API call during which I ask the model to validate the initial draft for completeness etc.

  5. Based on the mapping I then programmatically create a final JSON that re-inserts all the original information from the document outline created under step 1. That is, the JSON includes for each top-level category the paragraph titles and original texts from both documents.

  6. In a final step - which I have yet to execute - I then perform the actual comparative analysis of regulatory requirements by main category, using a GPT-4 model with the original text from each each regulatory document as input. The results of the analysis for each category are then documented in a structured form in a Word or PDF document.

For development and testing purposes I currently still execute the scripts for each step individually but give or take I currently look at an execution time of about 10-15 for the process once it is fully automated (based on the two documents I tested).

The two key challenges that I am s still working on to get fully right are:

(1) Getting the model to identify the most logical categories for comparison through essentially a “self-discovery” process that is heavily informed by the initial semantic chunking and mapping.

(2) Getting the matching of paragraphs right. So far, doing the matching at the granular paragraph level has yielded promising results but need to do a more detailed assessment.

Sharing it here to see if anyone has thought about something similar before and/or maybe has views on this.

1 Like

I have been studying up on “agentic rag” approaches for summarization and more comprehensive queries, and have also began thinking along the same lines of comparison queries as I deal with lots of documents which address the same or similar ideas.

I get your steps 1 and 2, but you begin to lose me at step 3: I understand going through the document 1 paragraphs to identify the closest matches in document 2, but what mechanism are you using to physically “map” one to the other? Like a key file that uses the embedding object identifiers of document 1 in one column and associated document 2 object identifiers in the second column?

And, how does this work if we’re talking about more than 2 documents?

Finally, is this a dynamic process we’re talking about, where the user submits a query and this comparison mechanism is executed on the fly? Or, is this a permanent embedding where the results are stored for recall down the line?

1 Like

Yes, I need to better articulate some of the steps.

Essentially, how I approach step 3 currently is as follows:

During step 1 and 2, I create a JSON for each document with the outline which also includes the actual text of the identified paragraphs as well as the newly generated summaries for each paragraph and their embeddings.

During step 3 I then perform a similarity search for each paragraph in document 1. The basis for search are the embeddings of the paragraph summaries. That is, for every summary embedding in document 1, I traverse the JSON for document 2 to find the closest associated embeddings there. I then programmatically create a new consolidated JSON that includes the paragraphs of document 1 and the top identified matches from document 2 including the ID, title, text and summaries for these matched paragraphs.

This newly created JSON then forms the basis for the subsequent steps. By maintaining the original title and IDs throughout the process steps, I am available to easily reconcile information from the original document outline as and when needed.

As indicated before, currently I use document 1 as a reference for the similarity search. I am also considering an alternative approach, whereby first a detailed list of categories is identified on the basis of the documents’ outlines and the mapping is then performed against these detailed categories (or a summary thereof). In a subsequent step, the categories would then still be aggregated into higher-level categories for improved manageability and analysis. In my view, this would however only work if the initially defined categories were at a similar granularity than the paragraphs. That said, this approach might be more appropriate once you start having a larger set of documents.

For the time being, I am just trying to get to a working process for two documents. Once that is up and running, I will look into scaling. Fundamentally though, I would expect the steps to remain very similar (with the caveat made above regarding the categorization process).

In the immediate future, I would be looking at ad-hoc analyses whereby the embeddings would not be permanently stored.

That said, I am also working on a more comprehensive information platform that offers analyses capabilities and for certain documents I would likely want to store the outputs of these steps more permanently, as they might then serve for a variety of different analyses. This is certainly something I have to think through in greater detail over the coming months.

1 Like

Thanks for the explanation. I think I get it now.

You’re basically creating the ability to compare two documents by creating a map of their similar texts. So any query you execute on document 1, for example, will also bring back (or have the ability to bring back) the related chunks in document 2. Or, something to that effect.

Keep us posted on your methodology. It is certainly something I’d like to investigate more.

Just to clarify: The current process is intended to produce a complete comparison of the regulatory requirements across both documents.

The idea is that a user might be interested in understanding how regulatory requirements across two jurisdictions/regulatory bodies on a given topic differ. Through the process I am designing, he/she would receive a holistic comparative analysis covering all requirements. After submission of the documents, the process would be executed entirely automatically and the end product would be a report with the findings.

The process is not yet designed/optimized for individual requests - although I am assuming you could apply a similar logic for the execution of individual requests.

1 Like

You do have control over the meta data. They offer several extra fields.

Though it is true that a Vector Store can only be attached to one Assistant, I think Files can be attached to any number of Vector Stores.

1 Like

Yes, but can you filter queries by this metadata? Do you see that documented anywhere?

No, not explicitly.

But isn’t that function implied in including Meta Data in the first place, the ability to use it as a filterable search term, I mean?

It’s sounding like data in a OpenAI Vector Store has several types of analysis applied to it whenever you add a file or make a call to it. Surely GPT4 is smart enough to know to create a filter if you asked it to when parsing the VS via that meta data.

And, even if you can’t search by the meta data today, I’ll bet dollars-to-doughnuts that will be a feature in the future. It’s clearly WAY too important to Chunking to be left out.

Another thing I don’t see, which I bet will be added, is Meta Data to actual Files. So I wonder if you could include a meta field as a comma separated text blurb at the top of one of the files of any meta data and embeddings you’d want read first.

Unknowns. The purpose of this thread is how to semantically chunk documents for RAG embeddings. OpenAI VS doesn’t, as far as I can see, allow you any control over your chunking outside of auto and static:

And both of these options are “sliding window” approaches, not “atomic ideas”, which is our goal here.

You’ll need to ask those questions in a thread more focused on the Assistants API.

OK, after 2 months, I’ve got a fully functional system up and running in real time.

This is the process:

  1. export the pdf (or whatever) document to txt.
    1. I am set up to use: AWS Textract, PdfToText, Solr (Tika), PyMuPDF and Marker
  2. run code to prepend linenoxxxx:
  3. send this numbered file to model along with instructions to create JSON hierarchy file
  4. process the JSON file with code to:
    1. add end_line numbers
    2. add token_count totals for each element
  5. run code on modified JSON output to create the chunks.
    1. semantically sub-chunk chunks that are > x tokens
  6. add chunks to your embedding JSON to be uploaded to vector store.


The weakest link in this system is the model API call to create the JSON hierarchy:

  1. the returned JSON file could exceed the 8K model output token limit
  2. model sometimes gets creative and doesn’t return strict JSON

The actual language of your hierarchal chunk prompt will change depending upon your document types. This is to be expected. The semantic chunk prompt I am using, however, appears to work in most cases.

Needless to say, because these are working for me doesn’t mean they will work for you. You will need to modify as appropriate for your use cases.

After almost 2 years of resisting the use of Python, I finally gave in in order to be able to take advantage of the Marker and PyMuPDF markdown extractors. I run all the Python code in Docker containers.

I also needed to modify my system to allow for queued processing of text extractions as Marker markdown can take anywhere from a few minutes to over an hour.

I have discussed why I believe this process is superior to most existing methodologies here: Using gpt-4 API to Semantically Chunk Documents - #112 by SomebodySysop

This could very well change in the future, but for now, I’m pleased with the results.

I’ve already posted several examples of test inputs and outputs in this thread. Moving forward, I will continue to post examples. It is currently installed as part of my embedding pipeline, so I will get a very good picture of what works and what doesn’t work.

Many thanks to all who have contributed to this discussion, which as been a tremendous help in getting to this point.


Have you considered using JSON mode? I believe it is now also available in Gemini if that is the model you are currently using. It’s certainly available for the GPT-4 models and I have been using it for my outlines, which has helped to create stability in the output.

Having used “my version” of the outline creating / chunking process now a few weeks in production as a backbone for a number of different processes, I also agree that I am seeing good and fairly consistent results.

In retrospect, realizing that we can entirely rely on start lines identification in the API call for the hierarchical segmentation, had a profound positive impact on the approach.

In any case, congrats. :slight_smile:

I continue to work on my latest project on comparative analysis as we speak and hopefully will share more details on what that looks like in practice in the coming 1-2 weeks.


Ran into my first issue. Sometimes I get PDFs that have lines of text crossed out like this:

This is my output:

I’m not sure how, or even if there is a way to handle. PdfToText and PyMuPDF create garbled text. Marker appears to remove text, but it also removed the table. Solr seems to be the only extractor that interprets the text, but then that changes the meaning of the embedding where what we really want to say is that this text is no longer valid.

Any suggestions?

I did try it with Gemini early on. Not sure why I stopped. But, that’s a good idea. Thanks!

1 Like

Ouf! That’s an interesting one. No immediate idea springs to mind but I want to give it some thought. Is your intention to keep the crossed out text for reference or do you want to eliminate it?