OK, after 2 months, I’ve got a fully functional system up and running in real time.
This is the process:
- export the PDF (or other source) document to txt
  - I am set up to use: AWS Textract, PdfToText, Solr (Tika), PyMuPDF and Marker
- run code to prepend a line-number prefix (linenoxxxx:) to each line (see the sketch after this list)
- send this numbered file to the model along with instructions to create a JSON hierarchy file
- process the JSON file with code to:
  - add end_line numbers
  - add token_count totals for each element
- run code on the modified JSON output to create the chunks
- semantically sub-chunk any chunks that are > x tokens
- add the chunks to your embedding JSON to be uploaded to the vector store
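To make the middle steps concrete, here is a minimal Python sketch of the line-number prefixing and the JSON post-processing (adding end_line and token_count, then building chunks). The hierarchy shape (a flat list of elements with start_line and title fields), the linenoxxxx: prefix format and the use of tiktoken for counting are assumptions for illustration only; adapt to whatever your model actually returns.

```python
import tiktoken


def prepend_line_numbers(src_path: str, dst_path: str) -> None:
    """Prefix every line with a zero-padded marker, e.g. 'lineno0001:'."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for i, line in enumerate(src, start=1):
            dst.write(f"lineno{i:04d}:{line}")


def postprocess_hierarchy(hierarchy: list[dict], lines: list[str]) -> list[dict]:
    """Add end_line and token_count to each element the model returned.
    Assumes each element carries a 1-based 'start_line'; end_line is inferred
    from the next element's start_line (or the end of the file)."""
    enc = tiktoken.get_encoding("cl100k_base")
    for idx, el in enumerate(hierarchy):
        next_start = (hierarchy[idx + 1]["start_line"]
                      if idx + 1 < len(hierarchy) else len(lines) + 1)
        el["end_line"] = next_start - 1
        text = "".join(lines[el["start_line"] - 1 : el["end_line"]])
        el["token_count"] = len(enc.encode(text))
    return hierarchy


def build_chunks(hierarchy: list[dict], lines: list[str]) -> list[dict]:
    """Turn each hierarchy element into a chunk record ready for embedding.
    Chunks over your token ceiling go on to the semantic sub-chunk step."""
    return [
        {
            "title": el.get("title", ""),
            "text": "".join(lines[el["start_line"] - 1 : el["end_line"]]),
            "token_count": el["token_count"],
        }
        for el in hierarchy
    ]
```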
Notes:
The weakest link in this system is the model API call to create the JSON hierarchy:
- the returned JSON file could exceed the 8K model output token limit
- the model sometimes gets creative and doesn’t return strict JSON (a defensive parse, sketched below, helps here)
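One way to soften that second failure mode is a best-effort parse before retrying the API call. This is only a generic sketch (the function name and regexes are mine, not part of the pipeline above): it strips markdown code fences and, failing that, grabs the outermost JSON array or object from the response.

```python
import json
import re


def parse_model_json(raw: str):
    """Best-effort parse of the model's hierarchy response."""
    # Strip surrounding whitespace and any ```json ... ``` fences.
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Fall back to the outermost array/object; if this also fails,
        # let the exception propagate so the caller can re-call the API.
        match = re.search(r"(\[.*\]|\{.*\})", cleaned, re.DOTALL)
        if match:
            return json.loads(match.group(1))
        raise
```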
The actual language of your hierarchical chunk prompt will change depending upon your document types. This is to be expected. The semantic chunk prompt I am using, however, appears to work in most cases.
- Hierarchical Chunk example prompt: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/hierarchalChunk.txt
- Semantic Chunk example prompt: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/semanticChunk.txt
Needless to say, just because these are working for me doesn’t mean they will work for you. You will need to modify them as appropriate for your use cases.
After almost 2 years of resisting the use of Python, I finally gave in so that I could take advantage of the Marker and PyMuPDF markdown extractors. I run all the Python code in Docker containers.
I also needed to modify my system to allow for queued processing of text extractions, as Marker markdown extraction can take anywhere from a few minutes to over an hour (a bare-bones illustration of the queuing idea follows).
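The queuing itself doesn’t have to be elaborate. Below is a generic sketch of the idea (a single worker thread draining a queue); the run_marker stub and the file path are placeholders, not the actual Marker invocation or my implementation.

```python
import queue
import subprocess
import threading

extraction_queue: "queue.Queue[str]" = queue.Queue()


def run_marker(doc_path: str) -> None:
    # Placeholder: however you invoke Marker (CLI, library call, or a
    # request to its Docker container) would go here.
    subprocess.run(["echo", f"extracting {doc_path}"], check=True)


def extraction_worker() -> None:
    """Drain the queue one job at a time so a slow extraction
    (minutes to over an hour) never blocks the rest of the pipeline."""
    while True:
        doc_path = extraction_queue.get()
        try:
            run_marker(doc_path)
        finally:
            extraction_queue.task_done()


threading.Thread(target=extraction_worker, daemon=True).start()

# Callers enqueue and return immediately; results land asynchronously.
extraction_queue.put("/data/incoming/report.pdf")
extraction_queue.join()  # only block here if you need every job finished
```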
I have discussed why I believe this process is superior to most existing methodologies here: Using gpt-4 API to Semantically Chunk Documents - #112 by SomebodySysop
This could very well change in the future, but for now, I’m pleased with the results.
I’ve already posted several examples of test inputs and outputs in this thread. Moving forward, I will continue to post examples. It is currently installed as part of my embedding pipeline, so I will get a very good picture of what works and what doesn’t work.
Many thanks to all who have contributed to this discussion, which has been a tremendous help in getting to this point.