OK, after 2 months, I’ve got a fully functional system up and running in real time.
This is the process:
- export the PDF (or other source) document to txt
  - I am set up to use: AWS Textract, PdfToText, Solr (Tika), PyMuPDF and Marker
- run code to prepend a line-number prefix (linenoxxxx:) to each line (see the sketch after this list)
- send this numbered file to the model along with instructions to create a JSON hierarchy file
- process the JSON file with code to:
  - add end_line numbers
  - add token_count totals for each element
- run code on the modified JSON output to create the chunks
- semantically sub-chunk any chunks that are > x tokens
- add the chunks to your embedding JSON to be uploaded to the vector store
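To make the middle steps concrete, here is a minimal Python sketch of the line-number prefixing and the JSON post-processing (adding end_line and token_count, then building chunks). The hierarchy shape (a flat list of elements with start_line and title fields), the linenoxxxx: prefix format and the use of tiktoken for counting are assumptions for illustration only; adapt to whatever your model actually returns.

```python
import tiktoken


def prepend_line_numbers(src_path: str, dst_path: str) -> None:
    """Prefix every line with a zero-padded marker, e.g. 'lineno0001:'."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for i, line in enumerate(src, start=1):
            dst.write(f"lineno{i:04d}:{line}")


def postprocess_hierarchy(hierarchy: list[dict], lines: list[str]) -> list[dict]:
    """Add end_line and token_count to each element the model returned.
    Assumes each element carries a 1-based 'start_line'; end_line is inferred
    from the next element's start_line (or the end of the file)."""
    enc = tiktoken.get_encoding("cl100k_base")
    for idx, el in enumerate(hierarchy):
        next_start = (hierarchy[idx + 1]["start_line"]
                      if idx + 1 < len(hierarchy) else len(lines) + 1)
        el["end_line"] = next_start - 1
        text = "".join(lines[el["start_line"] - 1 : el["end_line"]])
        el["token_count"] = len(enc.encode(text))
    return hierarchy


def build_chunks(hierarchy: list[dict], lines: list[str]) -> list[dict]:
    """Turn each hierarchy element into a chunk record ready for embedding.
    Chunks over your token ceiling go on to the semantic sub-chunk step."""
    return [
        {
            "title": el.get("title", ""),
            "text": "".join(lines[el["start_line"] - 1 : el["end_line"]]),
            "token_count": el["token_count"],
        }
        for el in hierarchy
    ]
```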
Notes:
The weakest link in this system is the model API call to create the JSON hierarchy:
- the returned JSON file could exceed the 8K model output token limit
- the model sometimes gets creative and doesn’t return strict JSON (a defensive parse, sketched below, helps here)
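One way to soften that second failure mode is a best-effort parse before retrying the API call. This is only a generic sketch (the function name and regexes are mine, not part of the pipeline above): it strips markdown code fences and, failing that, grabs the outermost JSON array or object from the response.

```python
import json
import re


def parse_model_json(raw: str):
    """Best-effort parse of the model's hierarchy response."""
    # Strip surrounding whitespace and any ```json ... ``` fences.
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Fall back to the outermost array/object; if this also fails,
        # let the exception propagate so the caller can re-call the API.
        match = re.search(r"(\[.*\]|\{.*\})", cleaned, re.DOTALL)
        if match:
            return json.loads(match.group(1))
        raise
```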
The actual language of your hierarchical chunk prompt will change depending upon your document types. This is to be expected. The semantic chunk prompt I am using, however, appears to work in most cases.
- Hierarchical Chunk example prompt: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/hierarchalChunk.txt
- Semantic Chunk example prompt: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/semanticChunk.txt
Needless to say, just because these are working for me doesn’t mean they will work for you. You will need to modify them as appropriate for your use cases.
After almost 2 years of resisting the use of Python, I finally gave in so that I could take advantage of the Marker and PyMuPDF markdown extractors. I run all the Python code in Docker containers.
I also needed to modify my system to allow for queued processing of text extractions, as Marker markdown extraction can take anywhere from a few minutes to over an hour (a bare-bones illustration of the queuing idea follows).
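The queuing itself doesn’t have to be elaborate. Below is a generic sketch of the idea (a single worker thread draining a queue); the run_marker stub and the file path are placeholders, not the actual Marker invocation or my implementation.

```python
import queue
import subprocess
import threading

extraction_queue: "queue.Queue[str]" = queue.Queue()


def run_marker(doc_path: str) -> None:
    # Placeholder: however you invoke Marker (CLI, library call, or a
    # request to its Docker container) would go here.
    subprocess.run(["echo", f"extracting {doc_path}"], check=True)


def extraction_worker() -> None:
    """Drain the queue one job at a time so a slow extraction
    (minutes to over an hour) never blocks the rest of the pipeline."""
    while True:
        doc_path = extraction_queue.get()
        try:
            run_marker(doc_path)
        finally:
            extraction_queue.task_done()


threading.Thread(target=extraction_worker, daemon=True).start()

# Callers enqueue and return immediately; results land asynchronously.
extraction_queue.put("/data/incoming/report.pdf")
extraction_queue.join()  # only block here if you need every job finished
```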
I have discussed why I believe this process is superior to most existing methodologies here: Using gpt-4 API to Semantically Chunk Documents - #112 by SomebodySysop
This could very well change in the future, but for now, I’m pleased with the results.
I’ve already posted several examples of test inputs and outputs in this thread. Moving forward, I will continue to post examples. It is currently installed as part of my embedding pipeline, so I will get a very good picture of what works and what doesn’t work.
Many thanks to all who have contributed to this discussion, which has been a tremendous help in getting to this point.