Using the GPT-4 API to Semantically Chunk Documents

I tested a bit more in the past 1-2 hours but am still nowhere near the point where this works with regular text extraction libraries. So for the time being I think this option is off the table.

If one were to use a vision model for the task, then I would try to see if this could be combined with the document outline creation step somehow, and again use an approach based on identifying the line numbers where strikethrough text starts and ends, including a specific flag for that in the JSON. It seems to me like a waste of tokens (and money) to submit the full document twice.

This logic makes no sense actually. Unintentional pun to have a post with strikethrough text in a discussion about strikethrough text…

1 Like

I agree, this is very problematic. I can’t think of a way to combine the two calls, as you need the text extraction first in order to obtain the line numbers, which the model in turn needs to identify the precise location of the chunk segments.

I found one Stack Overflow post where someone was trying this. I probably didn’t take note because it seems like an overly complicated process (on top of an existing complicated process).

Right now, only a model (LLM) or a human can do this.

Using GPT-4o or Claude Sonnet, yes. But using Gemini 1.5 Flash?

$0.35 / 1 million tokens (for prompts up to 128K tokens)
$0.70 / 1 million tokens (for prompts longer than 128K)

Even with our new automated Semantic Chunking process, we still initially employ the manual Semantic Chunking methodology I described a year ago: https://www.youtube.com/watch?v=w_veb816Asg&ab_channel=SwingingInTheHood

So all of our documents will fit quite comfortably in the 128K token range. Not to mention Flash being one of the fastest models available today.

Also, we have identified the documents most likely to contain strikethrough text: in our case, Memorandums of Agreement (“MOAs”), so we can easily assign them a different embedding configuration in our pipeline.

All that to say that using a model as a text extraction tool, at least in our case, isn’t as prohibitive as it might seem.

2 Likes

It took me almost two weeks to finally get something working. Apparently, you can NOT upload PDF files to Gemini through the Google AI Studio API. Only through the Vertex AI API.

So, I created a prompt, and modified it a gazillion times to try and get Gemini 1.5 Flash to consistently extract text EXCLUDING strikeout text, and it just wouldn’t do it. Gemini 1.5 Pro will recognize the strikeout text and follow the prompt commands consistently.

Here is the PDF source: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/2022_Local_161_MOA_09.pdf

And this is the output from Gemini 1.5 Pro: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/2022_Local_161_MOA_09_gemini_pro01.txt

Now, I’m fairly certain that GPT-4o will also do it consistently, but here’s the rub:

Prompt Token Count: 1163
Candidates Token Count: 1380
Total Token Count: 2543

Gemini Pro Pricing

$3.50 / 1M input tokens (for prompts up to 128K tokens)
$10.50 / 1M output tokens (for prompts up to 128K tokens)

OpenAI GPT-4o Pricing

$5.00 / 1M input tokens
$15.00 / 1M output tokens
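
To put that in per-document numbers, here is a back-of-the-envelope calculation for the sample call above, reading the Gemini Pro figures as input/output pricing:

```python
# Cost of the sample call above (1,163 prompt + 1,380 output tokens),
# using the per-million-token prices listed for each model.
prompt_tokens, output_tokens = 1163, 1380

gemini_pro_cost = prompt_tokens / 1e6 * 3.50 + output_tokens / 1e6 * 10.50
gpt_4o_cost = prompt_tokens / 1e6 * 5.00 + output_tokens / 1e6 * 15.00

print(f"Gemini 1.5 Pro: ${gemini_pro_cost:.4f}")  # ~$0.0186
print(f"GPT-4o:         ${gpt_4o_cost:.4f}")      # ~$0.0265
```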

I was going to look at GPT-3.5-turbo, but there still hasn’t been a response to this: Can you upload PDF files directly to OpenAI's GPT-3.5 model?

And then there is the 16K total token context and 4K output token limits (Gemini’s output limit is 8K).

So, my PDF to text extraction pipeline options are now:

  1. AWS Textract
  2. PdfToText
  3. Solr (tika)
  4. PyMuPdf (markdown)
  5. Marker (markdown)

and soon to be added:

  1. LLM (Gemini|GPT-4o)

Pretty impressive, if I must say so myself.

And, speaking of impressive, I did find an API that uses LLMs to extract text from PDFs: LlamaParse: Convert PDF (with tables) to Markdown (youtube.com)

I tried it, it works – but I could not get it to exclude strikethrough text, which is why I ended up going with Gemini. I’m sure there is (or will be soon) a way to do it, but I couldn’t figure it out.

Once I get this new extractor added to the pipeline, I think that’s going to be it. I will have my Hierarchical|Semantic Chunking pipeline, as discussed in this long thread, completed. Will post here once it’s done.

p.s. Unfortunately, in order to get Textract and Vertex AI (and PyMuPdf and Marker) working, I had to go all in with Python. The good news is that everything is installed in a Docker container, so I’ve built a template that will go in and execute the tools I need as necessary. Still wish I could have done it all in PHP, but it’s not too bad of a setup.

1 Like

I am attempting to extract tables from PDFs using GPT-3.5-turbo. Initially, instead of inputting the entire PDF as full text, I used the Python library pdfplumber to convert the PDF into text page by page and then fed that into the model. However, it tends to create tables not only from actual tables but also from repeated text.

Therefore, I provided the full text and specified the desired parameters in the prompt, along with detailed formatting instructions for the output file. Despite this, the output was not entirely consistent, necessitating post-processing of the output file.
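
For reference, a minimal sketch of the page-by-page pdfplumber extraction described above (the file name is a placeholder); pdfplumber’s built-in extract_tables() may also be worth comparing against asking the model to rebuild tables from raw text:

```python
import pdfplumber

pages_text, pages_tables = [], []
with pdfplumber.open("document.pdf") as pdf:           # placeholder path
    for page in pdf.pages:
        pages_text.append(page.extract_text() or "")   # raw text per page
        pages_tables.append(page.extract_tables())     # pdfplumber's own table detection
```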

I am very interested in this issue as well and look forward to sharing useful information in the future. I found your information very helpful. Thank you.

Have you tried the Marker markdown library? GitHub - VikParuchuri/marker: Convert PDF to markdown quickly with high accuracy

It’s pretty good at extracting tables. Theoretically, so is AWS Textract (though I’ve not used that feature).

Also, in my sample extraction using the actual model, https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/2022_Local_161_MOA_09_gemini_pro01.txt, note that it can be prompted to preserve tables as well.

1 Like

I began this thread out of the need to try and find an automated approach to Hierarchical | Semantic chunking of documents for embeddings. I conceived the concept of “Semantic Chunking” back in early 2023, which I documented in a video: https://youtu.be/w_veb816Asg?si=NXAdb0lULG_-Y1l4

While I developed code to break down extracted text from PDFs hierarchically into a document hierarchy header file (explained here: https://youtu.be/w_veb816Asg?si=hx7vo4x2vep-Muuj&t=386), it was still a fairly manual process, especially if I decided to use PDFs instead of extracted text. And, if the hierarchical chunks were larger than my chunk size limit, I still needed to break them down into smaller “sub-chunks”. For this, I continued to rely on the “rolling window” chunking methodology, which essentially cuts the document text into overlapping segments of a specific length.
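
For concreteness, a minimal sketch of that rolling-window fallback (character-based here; a token-based version works the same way):

```python
def rolling_window_chunks(text: str, window: int = 2000, overlap: int = 200) -> list[str]:
    """Cut the text into overlapping fixed-length segments."""
    step = window - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + window])
        if start + window >= len(text):
            break
    return chunks
```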

What I wanted to do was develop a more automated approach which would not only preserve the document hierarchy in the embedding, but would also make sure the chunks did not exceed the chunk limit, and that those resulting “sub-chunks” were semantically organized to preserve their “atomic idea” in the resulting embeddings. The “atomic idea”, as @sergeliatko puts it, being the “ideal chunk” containing only one idea which will always match, at least theoretically, in cosine similarity searches, a similar idea posed in a RAG prompt.

I basically refer to this as “Semantic Chunking”, and my feeling was that we should be able to use GPT-4o to accomplish it.

This thread explored three approaches to “Semantic Chunking”. The first, introduced by @sergeliatko, focused on capturing the “atomic idea” at the sentence level and then building out from there. I call it the “inside out” approach. The second, mine, used a “layout aware” approach that would analyze the hierarchy of the document and then drill down the hierarchical levels to capture the “atomic ideas” at the lowest levels. I call it the “outside in” approach. Finally, the @jr.2509 approach appears to fall somewhere between the other two. This is my overview summary – please feel free to offer more detailed explanations in comments to this post so that others can understand the potential benefits of each approach.

So, after much discussion and many valuable ideas and insights contributed, I was able to finally come up with a process which I have implemented in my embedding pipeline.

As a failsafe measure, if at any point in this process there is a failure, it will automatically revert back to the “rolling window” approach mentioned earlier.

My embedding pipeline allows for customized configurations based upon the document classification, so this approach should work for me long-term as I can easily modify prompts, chunk sizes, extraction scripts, etc…
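
As an illustration only (the classification names, keys and values below are hypothetical, not my actual settings), a per-classification configuration along these lines might look like:

```python
# Hypothetical per-classification pipeline configuration.
PIPELINE_CONFIG = {
    "MOA": {                                  # documents likely to contain strikethrough text
        "extractor": "gemini-1.5-pro",        # LLM extraction, strikeouts excluded via prompt
        "extraction_prompt": "extract_exclude_strikeout.txt",
        "chunk_size_tokens": 1000,
        "chunk_overlap_tokens": 100,
    },
    "default": {
        "extractor": "pymupdf-markdown",
        "extraction_prompt": None,
        "chunk_size_tokens": 800,
        "chunk_overlap_tokens": 80,
    },
}

def config_for(classification: str) -> dict:
    """Unknown classifications fall back to the default configuration."""
    return PIPELINE_CONFIG.get(classification, PIPELINE_CONFIG["default"])
```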

The one thing that would help tremendously is if OpenAI (or Google) would increase the output token limits. Right now, if I have a 750-page document, I have to write a custom script to create the first-line semantic chunking described in the video. If the LLM could return more than 8K tokens, I could easily have it return a JSON file that could then be used with a default script to always be able to create the document hierarchy header files. So, until then, that still has to be a manual process. But for smaller (less than 100-page) documents, it’s fully automated, baby!

If there is no further discussion on this particular topic, I will mark this as the solution for me. I know @jr.2509 has mentioned expanding this out to document comparisons, but I think we should start another thread for that discussion, as I would also like to discuss how to achieve more comprehensive results from RAG queries.

5 Likes

With a different approach you might get away with 4k max tokens limit (my case).

Approach:

  1. Code: sanitize input string
  2. Code: regex to split the text on paragraph ends (sentence end + end of line)
  3. Code: stuff chunks with as many paragraphs as fit within the chunk limit (2K tokens for better LLM focus)
  4. LLM: fix paragraphs and separate titles from paragraphs (so there is a double line return between items)
  5. LLM: parse formatted text to identify titles and paragraphs
  6. Code: build sections based on titles and paragraphs
  7. LLM: name items within each section for outlines
  8. LLM: identify hierarchy by names inside each section
  9. Code: build hierarchical JSON based on LLM’s hierarchy from the previous step
  10. LLM: build sections hierarchy using their names
  11. Code: build the final document hierarchy JSON

All LLM steps use fine-tuned GPT-3.5. A rough sketch of the code steps is included below.
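
A rough sketch of the code-only steps (1-3), with a simple word count standing in for a real tokenizer; the LLM steps are fine-tuned GPT-3.5 calls and are only indicated in comments:

```python
import re

def sanitize(text: str) -> str:
    """Step 1: normalize line endings and trim stray whitespace."""
    return re.sub(r"[ \t]+\n", "\n", text.replace("\r\n", "\n")).strip()

def split_paragraphs(text: str) -> list[str]:
    """Step 2: split on paragraph ends (sentence end followed by end of line)."""
    return [p.strip() for p in re.split(r"(?<=[.!?:])\n+", text) if p.strip()]

def pack_paragraphs(paragraphs: list[str], limit: int = 1500) -> list[str]:
    """Step 3: stuff as many whole paragraphs as fit into each chunk."""
    chunks, current, count = [], [], 0
    for p in paragraphs:
        words = len(p.split())
        if current and count + words > limit:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(p)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Steps 4-5, 7-8 and 10 are fine-tuned GPT-3.5 calls; steps 6, 9 and 11 assemble
# their outputs into sections and the final document hierarchy JSON.
```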

1 Like

Oh how I’ve missed this thread.

Any reason you are not using JSON mode to get a JSON object with key-value pairs returned here? Seems to make more sense to me as that would be easier to process downstream. On the other hand, if it works reliably, it works reliably!

I am having some trouble with my current implementation that tries to do this. Mine works 90% of the time, but sometimes fails to put it back together into a valid JSON object. If you could provide a high-level overview of your approach, that would be awesome.
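
For reference, a minimal sketch of what I mean by JSON mode with the OpenAI Python SDK (the model, prompt and chunk_text are placeholders); it guarantees syntactically valid JSON, though not that the exact schema you asked for is respected:

```python
from openai import OpenAI

client = OpenAI()
chunk_text = "…document text…"   # placeholder

# JSON mode constrains the model to emit a syntactically valid JSON object;
# the word "JSON" must appear in the messages for the request to be accepted.
response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Return the outline of the text as a JSON object "
                                      "with the keys 'title' and 'sections'."},
        {"role": "user", "content": chunk_text},
    ],
)
outline_json = response.choices[0].message.content
```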

Just an idea I had: you could do a cosine similarity search on chunks and experiment with a threshold cutoff for the level of detail to include. E.g. if a cosine similarity search returns 0.91, 0.87, 0.83 and 0.53, set the threshold to 0.8 and have the answer include all chunks with a similarity score of 0.8+. So sometimes it includes more or fewer chunks to provide different levels of detail. You could even go as far as setting the threshold not to an absolute number, but to a % drop compared to the most relevant (or previous) chunk.
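
A minimal sketch of that filtering idea, combining the absolute cutoff with a relative-drop cutoff (the 0.15 drop is just an example value):

```python
def filter_by_similarity(results: list[tuple[str, float]],
                         min_score: float = 0.8,
                         max_drop: float = 0.15) -> list[str]:
    """Keep chunks scoring above an absolute threshold AND within a relative
    drop of the best hit; `results` is (chunk, score) sorted best-first."""
    if not results:
        return []
    best = results[0][1]
    return [chunk for chunk, score in results
            if score >= min_score and (best - score) / best <= max_drop]

hits = [("A", 0.91), ("B", 0.87), ("C", 0.83), ("D", 0.53)]
print(filter_by_similarity(hits))   # -> ['A', 'B', 'C']
```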

I think this isn’t necessarily true. A Word file is essentially a zip file containing a set of XML files. Not sure how this works for PDF, but I can see it being comparable. You probably have to go one layer deeper and locate and exclude the strikethrough text programmatically before extracting the text (which strips all formatting).
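
For the Word case specifically this is doable today with python-docx, since each run carries its own formatting; a sketch (the PDF case is a different story):

```python
# Works for .docx only: skip struck-through runs before the text reaches the chunker.
# Requires python-docx (pip install python-docx).
from docx import Document

def docx_text_without_strikethrough(path: str) -> str:
    doc = Document(path)
    kept = []
    for paragraph in doc.paragraphs:
        text = "".join(run.text for run in paragraph.runs if not run.font.strike)
        if text.strip():
            kept.append(text)
    return "\n\n".join(kept)
```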

My backend just sort of happened to be Node (I use @cyber2024/pdf-parse-fixed - npm for text extraction btw, best I’ve found) and I can totally relate. A customer found a way to break my LLM limit with a resume with 35 jobs on it, and I am on the verge of redoing the whole thing in Python. It seems to me like Python libraries and tooling for LLMs are 1-2 months ahead of what’s available for JS.

I use Weaviate, which handles the results clustering automatically, so all I have to do is specify how many clusters of chunks I want. So far I have never had to grab more than 3 clusters in any of my apps (the chunking algo helps a lot). So I just set that option by default and forget about it.

What I was referring to is a slightly different thing. Despite the close similarity between the query and found chunks, some of the chunks still do not contain or participate directly in the final answer. And I prefer filtering those out from my prompt to get better quality of the answer.

This combined approach helps me get both: wider range of found items and high quality of context while not touching the standard settings of the RAG engine itself.

Right. How do you currently apply the filtering? I assume you provide the returned chunks and query those with an LLM? Seems like prompt engineering is the way forward, no?

There’s a reason why no library has been able to implement complete strikethrough detection successfully (to the best of my knowledge), which has to do with how the PDF language is implemented. That “one level deeper” just happens to be a little bit too deep. Always happy to be proven wrong.

I use Weaviate also, didn’t know about the clustering option – will need to look into that.

What I’ve been doing are two things:

  1. Small-to-Big Retrieval, where I programmatically retrieve x chunks before and after each chunk returned by the cosine similarity search (see the sketch after this list): Advanced RAG 01: Small-to-Big Retrieval | by Sophia Yang, Ph.D. | Towards Data Science
  2. Chunk Retrieval Rating: I rate (0-10) each retrieved chunk on its relevance to the submitted query. I remove the chunks with low ratings and only return to the model those with the highest likelihood of responding to the query. This process is neither as time-consuming nor as expensive as I originally thought it would be.
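
A minimal sketch of the Small-to-Big expansion in point 1 (the rating step in point 2 is simply a separate LLM call per retrieved chunk asking for a 0-10 relevance score, with low scorers dropped):

```python
def expand_small_to_big(all_chunks: list[str], hit_indices: list[int], x: int = 1) -> list[str]:
    """For each chunk returned by the similarity search, also pull x chunks
    before and after it from the document's original chunk order."""
    keep = set()
    for i in hit_indices:
        keep.update(range(max(0, i - x), min(len(all_chunks), i + x + 1)))
    return [all_chunks[j] for j in sorted(keep)]
```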

These two methodologies, along with the Hierarchical/Semantic chunking process discussed here, and Weaviate using the OpenAI text-embedding-3-large embedding model, are giving me the best responses I’ve ever received.

1 Like

I use line numbers in the original text document to always reliably retrieve all content for the final JSON output. Here is my high level diagram:

Here is more or less the way I approach the thing: How to confirm that you got the correct value from a text other than repeating the same prompt over and over - #4 by sergeliatko
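
A minimal illustration of the line-number idea (not the actual implementation): number the lines before sending the text to the model, have it return line ranges, and slice the original text back out so the model never has to reproduce the content verbatim:

```python
def number_lines(text: str) -> str:
    """Prefix each line with its 1-based number before sending it to the model."""
    return "\n".join(f"{i}: {line}" for i, line in enumerate(text.splitlines(), start=1))

def slice_lines(text: str, start: int, end: int) -> str:
    """Retrieve original lines start..end (1-based, inclusive) untouched,
    using the line ranges the model returned."""
    return "\n".join(text.splitlines()[start - 1:end])
```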

1 Like

I was talking about the autocut additional parameter: Additional operators | Weaviate - Vector Database
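
For anyone following along, here is roughly how that looks with the v4 Python client, where (to my understanding) the GraphQL autocut operator is exposed as auto_limit; the collection name is a placeholder and the parameter name is worth verifying against your client version:

```python
import weaviate

client = weaviate.connect_to_local()
chunks = client.collections.get("DocumentChunk")   # placeholder collection name

response = chunks.query.near_text(
    query="termination clause notice period",
    limit=20,        # hard ceiling on returned objects
    auto_limit=1,    # autocut: keep only the first cluster of closely scored hits
)
for obj in response.objects:
    print(obj.properties)

client.close()
```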

Have you compared the behavior of the large vs. small embedding models? Personally, I did not find a big enough difference to justify the vector size increase (especially if using Weaviate Cloud Services).

1 Like

I have not. I just went with large because I figured it would be the best. I was skeptical at first, but so far pleased with the results.

Thanks!

If there is one thing worth mentioning from what I’ve learned about AI: always run tests to see if a simpler approach/model would do the job with the same result.

Doesn’t take long to test but always pays back one way or another.

1 Like