Using gpt-4 API to Semantically Chunk Documents

SomebodySysop · July 1, 2024, 8:11am

It took me almost two weeks to finally get something working. Apparently, you can NOT upload PDF files to Gemini through the Google AI Studio API; you can only do it through the Vertex AI API.

So, I created a prompt and modified it a gazillion times to try to get Gemini 1.5 Flash to consistently extract text EXCLUDING strikeout text, and it just wouldn’t do it. Gemini 1.5 Pro, on the other hand, will recognize the strikeout text and follow the prompt commands consistently.

Here is the PDF source: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/2022_Local_161_MOA_09.pdf

And this is the output from Gemini 1.5 Pro: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/2022_Local_161_MOA_09_gemini_pro01.txt
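For anyone who wants to try the same thing, here is a minimal sketch of sending a PDF to Gemini 1.5 Pro through the Vertex AI Python SDK. It is not my exact code: the prompt wording, the `build_prompt` and `extract_pdf_text` names, the project ID, and the region are all placeholders, and it assumes the `google-cloud-aiplatform` package is installed and credentials are already configured.

```python
def build_prompt() -> str:
    """Hypothetical prompt; the real one went through many revisions."""
    return (
        "Extract all of the text from the attached PDF. "
        "Omit any text that is struck through (strikeout text). "
        "Return only the extracted text."
    )


def extract_pdf_text(pdf_path: str, project: str,
                     location: str = "us-central1",
                     model_name: str = "gemini-1.5-pro"):
    """Send a local PDF to Gemini via Vertex AI and return (text, token usage)."""
    # Imported lazily so this module still loads where the SDK is not installed.
    import vertexai
    from vertexai.generative_models import GenerativeModel, Part

    vertexai.init(project=project, location=location)
    model = GenerativeModel(model_name)

    # Read the PDF bytes and attach them as an inline document part.
    with open(pdf_path, "rb") as f:
        pdf_part = Part.from_data(data=f.read(), mime_type="application/pdf")

    response = model.generate_content([pdf_part, build_prompt()])

    # These are the same counters reported in the post as
    # Prompt / Candidates / Total Token Count.
    usage = response.usage_metadata
    counts = {
        "prompt": usage.prompt_token_count,
        "candidates": usage.candidates_token_count,
        "total": usage.total_token_count,
    }
    return response.text, counts
```

Calling `extract_pdf_text("2022_Local_161_MOA_09.pdf", project="my-project")` would return the extracted text plus the token counts, where `my-project` stands in for a real Google Cloud project ID.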

Now, I’m fairly certain that GPT-4o will also do it consistently, but here’s the rub:

Prompt Token Count: 1163
Candidates Token Count: 1380
Total Token Count: 2543

Gemini Pro Pricing

$3.50 / 1 million input tokens (for prompts up to 128K tokens)
$10.50 / 1 million output tokens (for prompts up to 128K tokens)

OpenAI GPT-4o Pricing

$5.00 / 1M input tokens
$15.00 / 1M output tokens
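To put numbers on the rub, here is a rough per-request cost calculation using the token counts above. It reads the two Gemini prices as the input and output rates, and it assumes GPT-4o would use about the same number of tokens for the same document, which is only an approximation because the two models tokenize differently. The `Pricing` class and `request_cost` function are just illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_million: float   # USD per 1M prompt tokens
    output_per_million: float  # USD per 1M output tokens


# Prices as quoted in this post (July 2024).
PRICING = {
    "gemini-1.5-pro": Pricing(3.50, 10.50),
    "gpt-4o": Pricing(5.00, 15.00),
}


def request_cost(model: str, prompt_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the quoted rates."""
    p = PRICING[model]
    return (prompt_tokens * p.input_per_million
            + output_tokens * p.output_per_million) / 1_000_000


if __name__ == "__main__":
    # Token counts reported by Gemini for the test PDF.
    for model in PRICING:
        print(f"{model}: ${request_cost(model, 1163, 1380):.4f}")
```

At these rates the test document works out to roughly $0.0186 with Gemini 1.5 Pro versus $0.0265 with GPT-4o, so GPT-4o costs a bit over 40% more per document.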

I was going to look at GPT-3.5-turbo, but there still hasn’t been a response to this question: “Can you upload PDF files directly to OpenAI's GPT-3.5 model?”

And then there are GPT-3.5-turbo’s 16K total token context and 4K output token limits (Gemini’s output limit is 8K).
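A quick way to sanity-check a document against those limits, as a sketch: it assumes the 16K/4K figures are GPT-3.5-turbo's, uses round numbers rather than the exact published limits, and `fits_limits` is just an illustrative helper name.

```python
from typing import Optional


def fits_limits(prompt_tokens: int, output_tokens: int,
                context_limit: Optional[int] = None,
                output_limit: Optional[int] = None) -> bool:
    """Check a request against an output cap and a total-context cap.

    A limit of None means "do not check this limit".
    """
    if output_limit is not None and output_tokens > output_limit:
        return False
    if context_limit is not None and prompt_tokens + output_tokens > context_limit:
        return False
    return True


if __name__ == "__main__":
    # Test PDF from this post against rounded GPT-3.5-turbo limits.
    print(fits_limits(1163, 1380, context_limit=16_000, output_limit=4_000))
    # A longer extraction that needs 5,000 output tokens.
    print(fits_limits(1163, 5000, context_limit=16_000, output_limit=4_000))
    print(fits_limits(1163, 5000, output_limit=8_000))
```

The test PDF fits easily either way; the output cap is what bites first on longer documents, since extracted text is roughly as long as the source.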

So, my PDF to text extraction pipeline options are now:

AWS Textract
PdfToText
Solr (tika)
PyMuPDF (markdown)
Marker (markdown)

and soon to be added:

LLM (Gemini|GPT-4o)
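One way to wire up a set of interchangeable extractors like this is a simple name-to-function registry. This is a sketch, not my actual pipeline code: only two of the extractors are shown, the function and key names are made up for illustration, and it assumes the `pdftotext` command-line tool and the `pymupdf4llm` package are what sit behind the PdfToText and PyMuPDF (markdown) options.

```python
import subprocess
from typing import Callable, Dict


def extract_with_pdftotext(pdf_path: str) -> str:
    """Run the pdftotext CLI; "-" as the output file writes to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def extract_with_pymupdf(pdf_path: str) -> str:
    """Convert the PDF to markdown (assumes pymupdf4llm is installed)."""
    import pymupdf4llm  # lazy import: only needed when this extractor is used
    return pymupdf4llm.to_markdown(pdf_path)


# Registry of available extractors; new ones (Textract, Tika, Marker,
# an LLM-based extractor) would be added here under their own keys.
EXTRACTORS: Dict[str, Callable[[str], str]] = {
    "pdftotext": extract_with_pdftotext,
    "pymupdf": extract_with_pymupdf,
}


def extract_text(pdf_path: str, method: str) -> str:
    """Dispatch to the named extractor, failing loudly on unknown names."""
    try:
        extractor = EXTRACTORS[method]
    except KeyError:
        raise ValueError(
            f"Unknown extractor {method!r}; choose from {sorted(EXTRACTORS)}"
        ) from None
    return extractor(pdf_path)
```

With this layout, adding the LLM option is one more function and one more dictionary entry, and the rest of the chunking pipeline only ever calls `extract_text`.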

Pretty impressive, if I must say so myself.

And, speaking of impressive, I did find an API that uses LLMs to extract text from PDFs: LlamaParse: Convert PDF (with tables) to Markdown (youtube.com)

I tried it, it works – but I could not get it to exclude strikethrough text, which is why I ended up going with Gemini. I’m sure there is (or will be soon) a way to do it, but I couldn’t figure it out.

Once I get this new extractor added to the pipeline, I think that’s going to be it. I will have my Hierarchical|Semantic Chunking pipeline, as discussed in this long thread, completed. Will post here once it’s done.

P.S. Unfortunately, in order to get Textract and Vertex AI (and PyMuPDF and Marker) working, I had to go all in with Python. The good news is that everything is installed in a Docker container, so I’ve built a template that will go in and execute the tools I need as necessary. Still wish I could have done it all in PHP, but it’s not too bad of a setup.
