Using gpt-4 API to Semantically Chunk Documents

I tested a bit more in the past 1-2 hours but I'm still getting nowhere near the point where this appears to work with regular text extraction libraries. So for the time being I think this option is off the table.

If one were to use a vision model for the task, then I would try to see if this could be combined with the document outline creation step somehow, again using an approach based on identifying the line numbers where strikethrough text starts and ends and including a specific flag for that in the JSON. It seems to me like a waste of tokens (and money) to submit the full document twice.

~~This logic makes no sense actually.~~ Unintentional pun to have a post with strikethrough text in a discussion about strikethrough text…

1 Like

I agree, this is very problematic. I can't think of a way to combine the two calls, as you need the text extraction first in order to obtain the line numbers, which the model in turn needs to identify the precise location of the chunk segments.
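The line-numbering step itself is trivial once you have the extracted text, for what it's worth. A minimal sketch of what I have in mind (the helper name and the 1-based numbering scheme are my own assumptions):

```python
# Prefix each line of the already-extracted text with a 1-based line number,
# so the model can reference exact line ranges in its JSON answer.
def numbered_text(text: str) -> str:
    lines = text.splitlines()
    width = len(str(len(lines)))
    return "\n".join(f"{i:>{width}}: {line}" for i, line in enumerate(lines, start=1))

print(numbered_text("ARTICLE 1\nScope of Agreement\nThis paragraph was deleted."))
```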

I found one Stack Overflow post where someone was trying this. I probably didn’t take note because it seems like an overly complicated process (on top of an existing complicated process).

Right now, only a model (LLM) or a human can do this.

Using GPT-4o or Claude Sonnet, yes. But using Gemini 1.5 Flash?

$0.35 / 1 million tokens (for prompts up to 128K tokens)
$0.70 / 1 million tokens (for prompts longer than 128K)

Even with our new automated Semantic Chunking process, we still initially employ the manual Semantic Chunking methodology I described a year ago: https://www.youtube.com/watch?v=w_veb816Asg&ab_channel=SwingingInTheHood

So all of our documents will fit quite comfortably in the 128K token range. Not to mention Flash being one of the fastest models available today.

Also, we have identified the documents which will most likely have strikethrough text: in our case, Memorandums of Agreement (“MOAs”), so we can easily assign them a different embedding configuration in our pipeline.

All that to say that using a model as a text extraction tool, at least in our case, isn’t as prohibitive as it might seem.

2 Likes

It took me almost two weeks to finally get something working. Apparently, you can NOT upload PDF files to Gemini through the Google AI Studio API. Only through the Vertex AI API.

So, I created a prompt and modified it a gazillion times to try to get Gemini 1.5 Flash to consistently extract text EXCLUDING strikeout text, and it just wouldn't do it. Gemini 1.5 Pro, on the other hand, will recognize the strikeout text and follow the prompt commands consistently.

Here is the PDF source: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/2022_Local_161_MOA_09.pdf

And this is the output from Gemini 1.5 Pro: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/2022_Local_161_MOA_09_gemini_pro01.txt
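For anyone who wants to reproduce this, the Vertex AI call looks roughly like the sketch below. The project, location, bucket path and prompt wording are placeholders, not the exact ones I used:

```python
# Rough sketch of the Vertex AI call; project, bucket path and prompt
# wording are placeholders, not the exact ones used for the output above.
import vertexai
from vertexai.generative_models import GenerativeModel, Part

vertexai.init(project="my-gcp-project", location="us-central1")

model = GenerativeModel("gemini-1.5-pro")
pdf_part = Part.from_uri(
    "gs://my-bucket/2022_Local_161_MOA_09.pdf",
    mime_type="application/pdf",
)
prompt = (
    "Extract the full text of this PDF. "
    "Exclude any text rendered with strikethrough formatting. "
    "Preserve tables."
)

response = model.generate_content([pdf_part, prompt])
print(response.text)
print(response.usage_metadata)  # prompt / candidates / total token counts
```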

Now, I’m fairly certain that GPT-4o will also do it consistently, but here’s the rub:

Prompt Token Count: 1163
Candidates Token Count: 1380
Total Token Count: 2543

Gemini 1.5 Pro Pricing

$3.50 / 1M input tokens (for prompts up to 128K tokens)
$10.50 / 1M output tokens (for prompts up to 128K tokens)

OpenAI GPT-4o Pricing

$5.00 / 1M input tokens
$15.00 / 1M output tokens
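Running the token counts above through both price sheets (per-1M-token rates, prompts under 128K) puts the per-document cost in perspective:

```python
# Back-of-the-envelope cost of the extraction call above,
# using the per-1M-token rates quoted for prompts under 128K.
prompt_tokens, output_tokens = 1163, 1380

def cost(input_rate, output_rate):
    return (prompt_tokens * input_rate + output_tokens * output_rate) / 1_000_000

print(f"Gemini 1.5 Pro: ${cost(3.50, 10.50):.4f}")  # ~$0.0186
print(f"GPT-4o:         ${cost(5.00, 15.00):.4f}")  # ~$0.0265
```

Fractions of a cent either way for a document this size, so the difference only really matters at volume.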

I was going to look at GPT-3.5-turbo, but there still hasn’t been a response to this: Can you upload PDF files directly to OpenAI's GPT-3.5 model?

And then there is the 16K total token context and 4K output token limits (Gemini’s output limit is 8K).

So, my PDF to text extraction pipeline options are now:

  1. AWS Textract
  2. PdfToText
  3. Solr (tika)
  4. PyMuPDF (markdown)
  5. Marker (markdown)

and soon to be added:

  1. LLM (Gemini|GPT-4o)

Pretty impressive, if I must say so myself.

And, speaking of impressive, I did find an API that uses LLMs to extract text from PDFs: LlamaParse: Convert PDF (with tables) to Markdown (youtube.com)

I tried it, and it works, but I could not get it to exclude strikethrough text, which is why I ended up going with Gemini. I'm sure there is (or soon will be) a way to do it, but I couldn't figure it out.

Once I get this new extractor added to the pipeline, I think that's going to be it. I will have my Hierarchical|Semantic Chunking pipeline, as discussed in this long thread, completed. Will post here once it's done.

P.S. Unfortunately, in order to get Textract and Vertex AI (and PyMuPDF and Marker) working, I had to go all in with Python. The good news is that everything is installed in a Docker container, so I've built a template that will go in and execute the tools I need as necessary. I still wish I could have done it all in PHP, but it's not too bad of a setup.
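The "template" is conceptually nothing more than shelling out into the container and running whichever extractor the document needs, roughly along these lines (container name, commands and paths are placeholders, not my actual setup):

```python
# Conceptual sketch only: run an extraction tool inside the Docker container
# from the host. Container name, commands and paths are placeholders.
import subprocess

EXTRACTORS = {
    # poppler-utils pdftotext with layout preservation
    "pdftotext": ["pdftotext", "-layout", "/data/in.pdf", "/data/out.txt"],
    # additional wrappers (Textract, PyMuPDF, Marker, Vertex AI) would go here
}

def run_extractor(name: str, container: str = "chunking-tools") -> None:
    cmd = ["docker", "exec", container] + EXTRACTORS[name]
    subprocess.run(cmd, check=True)

run_extractor("pdftotext")
```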

1 Like

I am attempting to extract tables from PDFs using GPT-3.5-turbo. Initially, instead of inputting the entire PDF as full text, I used the Python library pdfplumber to convert the PDF into text page by page and then fed that into the model. However, the model tended to create tables not only from actual tables but also from repeated text.

Therefore, I provided the full text and specified the desired parameters in the prompt, along with detailed formatting instructions for the output file. Despite this, the output was not entirely consistent, necessitating post-processing of the output file.
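For reference, the page-by-page extraction step I described is roughly this (the filename is a placeholder, and pdfplumber's own table detector is shown only as an illustration):

```python
# Page-by-page extraction with pdfplumber; each page's text (and any tables
# pdfplumber detects itself) can then be fed to the model.
import pdfplumber

with pdfplumber.open("contract.pdf") as pdf:  # placeholder filename
    for page_no, page in enumerate(pdf.pages, start=1):
        text = page.extract_text() or ""
        tables = page.extract_tables()  # list of row lists per detected table
        print(f"--- page {page_no}: {len(tables)} table(s) detected ---")
        print(text[:200])
```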

I am very interested in this issue as well and look forward to sharing useful information in the future. I found your information very helpful. Thank you.

Have you tried the Marker markdown library? GitHub - VikParuchuri/marker: Convert PDF to markdown quickly with high accuracy

It’s pretty good at extracting tables. Theoretically, so is AWS Textract (though I’ve not used that feature).

Also, in my sample extraction using the actual model, https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/2022_Local_161_MOA_09_gemini_pro01.txt, note that it can be prompted to preserve tables as well.

I began this thread out of the need to try and find an automated approach to Hierarchical | Semantic chunking of documents for embeddings. I conceived the concept of “Semantic Chunking” back in early 2023, which I documented in a video: https://youtu.be/w_veb816Asg?si=NXAdb0lULG_-Y1l4

While I developed code to break down extracted text from PDFs hierarchically into a document hierarchy header file (explained here: https://youtu.be/w_veb816Asg?si=hx7vo4x2vep-Muuj&t=386), it was still a fairly manual process, especially if I decided to use PDFs instead of extracted text. And, if the hierarchical chunks were larger than my chunk size limit, I still needed to break them down into smaller “sub-chunks”. For this, I continued to rely on the “rolling window” chunk methodology, which essentially cuts the document text into overlapping segments of a specific length.
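The "rolling window" step itself is the simplest piece of all of this. A bare-bones version, with character counts standing in for whatever size and overlap settings a given pipeline actually uses:

```python
# Bare-bones "rolling window" chunker: fixed-size segments with overlap.
# The sizes here are illustrative, not my production settings.
def rolling_window_chunks(text: str, size: int = 2000, overlap: int = 200) -> list[str]:
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks
```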

What I wanted to do was develop a more automated approach which would not only preserve the document hierarchy in the embedding, but would also make sure the chunks did not exceed the chunk limit, and that those resulting “sub-chunks” were semantically organized to preserve their “atomic idea” in the resulting embeddings. The “atomic idea”, as @sergeliatko puts it, is the “ideal chunk”: one containing only a single idea, which will always, at least theoretically, match a similar idea posed in a RAG prompt in cosine similarity searches.
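For anyone newer to the thread, "matching in cosine similarity" just means comparing the embedding vectors of the chunk and the query, e.g.:

```python
# Cosine similarity between two embedding vectors: the closer to 1.0,
# the closer the chunk's "atomic idea" is to the idea in the query.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```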

I basically refer to this as “Semantic Chunking”, and my feeling was that we should be able to use GPT-4o to accomplish it.

This thread explored three approaches to “Semantic Chunking”. The first, introduced by @sergeliatko, focused on capturing the “atomic idea” at the sentence level and then building out from there. I call it the “inside out” approach. The second, mine, used a “layout aware” approach that analyzes the hierarchy of the document and then drills down the hierarchical levels to capture the “atomic ideas” at the lowest levels. I call it the “outside in” approach. Finally, the @jr.2509 approach appears to fall somewhere between the other two. This is my overview summary; please feel free to offer more detailed explanations in comments to this post so that others can understand the potential benefits of each approach.

So, after much discussion and many valuable ideas and insights contributed, I was finally able to come up with a process which I have implemented in my embedding pipeline.

As a failsafe measure, if at any point in this process there is a failure, the pipeline will automatically fall back to the “rolling window” approach mentioned earlier.
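Structurally, the failsafe is just a wrapper around the semantic pass, roughly like this (reusing the rolling_window_chunks helper sketched above, with the semantic stage left as a stub):

```python
import logging

log = logging.getLogger("pipeline")

def semantic_chunks(text: str) -> list[str]:
    """Stub for the LLM-driven hierarchical/semantic chunking pass."""
    raise NotImplementedError

def chunk_document(text: str) -> list[str]:
    # Failsafe: any failure in the semantic pass falls back to the
    # plain rolling-window chunker (rolling_window_chunks from the
    # earlier sketch).
    try:
        return semantic_chunks(text)
    except Exception as err:
        log.warning("Semantic chunking failed (%s); falling back to rolling window", err)
        return rolling_window_chunks(text)
```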

My embedding pipeline allows for customized configurations based upon the document classification, so this approach should work for me long-term as I can easily modify prompts, chunk sizes, extraction scripts, etc…
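To make the "customized configurations" idea concrete, it is essentially a per-classification lookup; the field names below are illustrative, not my actual schema:

```python
# Illustrative per-classification pipeline config: documents classified as
# MOAs get the strikethrough-aware LLM extractor and their own chunk sizes.
PIPELINE_CONFIG = {
    "MOA": {
        "extractor": "gemini-1.5-pro",
        "prompt": "extract_skip_strikethrough",
        "chunk_size": 1500,
        "chunk_overlap": 150,
    },
    "default": {
        "extractor": "pymupdf-markdown",
        "prompt": None,
        "chunk_size": 2000,
        "chunk_overlap": 200,
    },
}

def config_for(doc_class: str) -> dict:
    return PIPELINE_CONFIG.get(doc_class, PIPELINE_CONFIG["default"])
```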

The one thing that would help tremendously is if OpenAI (or Google) would increase the output token limits. Right now, if I have a 750-page document, I have to write a custom script to create the first-line semantic chunking described in the video. If the LLM could return more than 8K tokens, I could easily have it return a JSON file that could then be used with a default script to always be able to create document hierarchy header files. So, until then, that still has to be a manual process. But for smaller (less than 100-page) documents, it's fully automated, Baby!
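One way around the cap, and I stress this is a sketch of the idea rather than my actual script, is to ask for the outline in page batches and stitch the returned JSON fragments back together:

```python
# Sketch: request the hierarchy outline in page batches so each response
# stays under the output token cap, then merge the JSON fragments.
# The batch size and the JSON shape are assumptions, not the real pipeline.
import json

def outline_in_batches(pages: list[str], model_call, batch_size: int = 50) -> list[dict]:
    outline: list[dict] = []
    for start in range(0, len(pages), batch_size):
        batch_text = "\n".join(pages[start:start + batch_size])
        fragment = model_call(batch_text)   # expected to return a JSON array
        outline.extend(json.loads(fragment))
    return outline
```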

If there is no further discussion on this particular topic, I will mark this as the solution for me. I know @jr.2509 has mentioned expanding this out to document comparisons, but I think we should start another thread for that discussion, as I would also like to discuss how to achieve more comprehensive results from RAG queries.

1 Like