I concur. However, I’ve been successful using Gemini Pro 1.5 for both tasks as described here: Using gpt-4 API to Semantically Chunk Documents - #166 by SomebodySysop.
Unfortunately, Gemini 1.5 Pro and Flash struggle mightily when it comes to strikethrough text:
This PDF: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/2022_Local_161_MOA_09.pdf
Extracts to:
https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/2022_Local_161_MOA_09_gemini_pro01.txt
Pretty darned good. However, when the strikethroughs are in the titles:
This PDF: https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/2022_Local_161_MOA_10.pdf
Extracts to:
https://s3.us-west-2.amazonaws.com/docs.scbbs.com/docs/test/2022_Local_161_MOA_10.txt
Gemini somehow can’t see that:
Should extract as: “ARTICLE 9. Sick Leave”.
gpt-4o mini and gpt4o do see it, albeit
a. I have to upload PDF pages as individual images.
b. The cost for processing the images, in my opinion, is excessive.
And, in my use case, efficiently removing strikethrough text is critical.
UPDATE: Got code working with Claude Sonnet 3.5 that eliminates all strikethrough text (so far) in images without issue.