It took me almost two weeks to finally get something working. Apparently, you can NOT upload PDF files to Gemini through the Google AI Studio API. Only through the Vertex AI API.
So, I created a prompt and modified it a gazillion times to try to get Gemini 1.5 Flash to consistently extract text EXCLUDING strikeout text, and it just wouldn’t do it. Gemini 1.5 Pro, on the other hand, will recognize the strikeout text and follow the prompt commands consistently.
Here is the PDF source: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/2022_Local_161_MOA_09.pdf
And this is the output from Gemini 1.5 Pro: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/2022_Local_161_MOA_09_gemini_pro01.txt
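For anyone wanting to try the same thing, here is a rough sketch of what a Vertex AI call along these lines might look like in Python. It is not my exact code: the prompt wording, the project ID, the region, and the model name string are all placeholders (assumptions), and it assumes the `google-cloud-aiplatform` package is installed and credentials are already configured.

```python
# Assumed prompt wording, for illustration only; the real prompt went
# through many revisions and is not reproduced here.
PROMPT = (
    "Extract all text from the attached PDF in reading order. "
    "Do NOT include any text that is struck through (strikeout text). "
    "Return plain text only."
)


def extract_pdf_text(pdf_path, project, location="us-central1",
                     model_name="gemini-1.5-pro"):
    """Send a local PDF to Gemini via Vertex AI and return (text, usage).

    project, location and model_name are placeholders (assumptions);
    substitute whatever your own Vertex AI setup uses.
    """
    # Imported lazily so the module loads even without the SDK installed.
    import vertexai
    from vertexai.generative_models import GenerativeModel, Part

    vertexai.init(project=project, location=location)
    model = GenerativeModel(model_name)

    # Read the PDF and attach it as an inline part of the request.
    with open(pdf_path, "rb") as fh:
        pdf_part = Part.from_data(data=fh.read(), mime_type="application/pdf")

    response = model.generate_content([pdf_part, PROMPT])

    # usage_metadata carries the token counts that the pricing is based on.
    usage = response.usage_metadata
    counts = {
        "prompt": usage.prompt_token_count,
        "candidates": usage.candidates_token_count,
        "total": usage.total_token_count,
    }
    return response.text, counts
```

Calling something like `extract_pdf_text("2022_Local_161_MOA_09.pdf", project="my-project")` (hypothetical project name) would return the extracted text along with the token counts.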
Now, I’m fairly certain that GPT-4o will also do it consistently, but here’s the rub:
Prompt Token Count: 1163
Candidates Token Count: 1380
Total Token Count: 2543
Gemini Pro Pricing
$3.50 / 1 million input tokens (for prompts up to 128K tokens)
$10.50 / 1 million output tokens (for prompts up to 128K tokens)
OpenAI GPT-4o Pricing
$5.00 / 1M input tokens
$15.00 / 1M output tokens
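To put those rates in per-document terms, here is a quick back-of-the-envelope calculation. It assumes the first Gemini rate is for input and the second for output, and that GPT-4o would use the same token counts for this document. The second assumption is rough, since the two models tokenize differently, so treat the result as an estimate only.

```python
def cost_usd(input_tokens, output_tokens, input_rate, output_rate):
    """Cost in USD, with rates given in dollars per 1 million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Token counts reported by Gemini for the test PDF above.
PROMPT_TOKENS = 1163
OUTPUT_TOKENS = 1380

gemini_cost = cost_usd(PROMPT_TOKENS, OUTPUT_TOKENS, 3.50, 10.50)
# Assumption: GPT-4o would consume the same number of tokens.
gpt4o_cost = cost_usd(PROMPT_TOKENS, OUTPUT_TOKENS, 5.00, 15.00)

print(f"Gemini 1.5 Pro: ${gemini_cost:.4f} per document")
print(f"GPT-4o:         ${gpt4o_cost:.4f} per document")
print(f"Gemini / GPT-4o cost ratio: {gemini_cost / gpt4o_cost:.2f}")
```

That works out to roughly 1.9 cents versus 2.7 cents per document, so Gemini 1.5 Pro comes in at about 70% of the GPT-4o cost under these assumptions. Small per file, but it adds up over thousands of documents.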
I was going to look at GPT-3.5-turbo, but there still hasn’t been a response to this question: “Can you upload PDF files directly to OpenAI's GPT-3.5 model?”
And then there are GPT-3.5-turbo’s limits: 16K tokens of total context and 4K output tokens (Gemini’s output limit is 8K).
So, my PDF to text extraction pipeline options are now:
- AWS Textract
- PdfToText
- Solr (tika)
- PyMuPdf (markdown)
- Marker (markdown)
and soon to be added:
- LLM (Gemini|GPT-4o)
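Conceptually, picking between these extractors is just a lookup by name. The following is a minimal sketch of that idea, not my actual pipeline code: the registry, the `register` decorator, and the `extract` function are hypothetical names, and only the pdftotext entry is filled in (it assumes the `pdftotext` command-line tool from poppler-utils is on the PATH).

```python
import subprocess
from typing import Callable, Dict

# Hypothetical registry mapping an extractor name to a function that
# takes a PDF path and returns the extracted text.
EXTRACTORS: Dict[str, Callable[[str], str]] = {}


def register(name: str):
    """Decorator that adds an extractor function to the registry."""
    def wrap(fn: Callable[[str], str]) -> Callable[[str], str]:
        EXTRACTORS[name] = fn
        return fn
    return wrap


@register("pdftotext")
def pdftotext_extract(path: str) -> str:
    # "-" as the output file makes pdftotext write to stdout;
    # "-layout" tries to preserve the physical page layout.
    result = subprocess.run(
        ["pdftotext", "-layout", path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def extract(path: str, method: str) -> str:
    """Run the named extractor on a PDF, failing loudly on unknown names."""
    try:
        extractor = EXTRACTORS[method]
    except KeyError:
        raise ValueError(f"Unknown extractor: {method!r}") from None
    return extractor(path)
```

The other options (Textract, Tika, PyMuPdf, Marker, and eventually the LLM extractor) would each get their own registered function, so adding a new one does not touch the calling code.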
Pretty impressive, if I must say so myself.
And, speaking of impressive, I did find an API that uses LLMs to extract text from PDFs: LlamaParse: Convert PDF (with tables) to Markdown (youtube.com)
I tried it, it works – but I could not get it to exclude strikethrough text, which is why I ended up going with Gemini. I’m sure there is (or will be soon) a way to do it, but I couldn’t figure it out.
Once I get this new extractor added to the pipeline, I think that’s going to be it. I will have my Hierarchical|Semantic Chunking pipeline, as discussed in this long thread, completed. Will post here once it’s done.
p.s. Unfortunately, in order to get Textract and Vertex AI (and PyMuPdf and Marker) working, I had to go all in with Python. The good news is that everything is installed in a Docker container, so I’ve built a template that will go in and execute the tools I need as necessary. Still wish I could have done it all in PHP, but it’s not too bad of a setup.