Batch OCR and convert thousands of PDF files to text


I have a large number of scanned documents that are currently image-only PDF files with no OCR text layer. I need to extract the text from each one and save it in some indexable format (txt, html, markdown, etc.). Ideally, each result would be saved as a separate file in the same directory structure as the current files (either in the exact same folders or in a parallel set of folders for the converted files).
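
To make the layout I have in mind concrete, here is a minimal sketch of the path mapping in Python. The function name and the `scans`/`text` folder names are hypothetical, made up for illustration, and this only computes where each output file would go; it does not do any OCR.

```python
from pathlib import Path


def mirrored_text_path(pdf_path: Path, source_root: Path, output_root: Path) -> Path:
    """Return the .txt path under output_root that mirrors pdf_path under source_root.

    All names here are hypothetical; this is only the path arithmetic.
    """
    # Position of the PDF relative to the top of the scanned-documents tree.
    relative = pdf_path.relative_to(source_root)
    # Same relative position under the output tree, with the extension swapped.
    return (output_root / relative).with_suffix(".txt")


if __name__ == "__main__":
    example = mirrored_text_path(
        Path("scans/2019/letters/letter01.pdf"), Path("scans"), Path("text")
    )
    print(example)
```

Passing the same folder for `source_root` and `output_root` would give the "exact same folders" variant, with each `.txt` file sitting next to its PDF.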

The end goal is to load this data into a private GPT so that I can search across all of the documents with AI assistance.

When I upload a single file from this batch to ChatGPT Pro and ask for a conversion to text, it does a very good job. So what I think I need is a way to run that same conversion over the whole batch and save each result as a separate file. Is there a way to do this?
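
In case it helps to show what I mean by batching, this is roughly the loop I picture, as a sketch only. The `ocr` argument is a placeholder for whatever actually does the conversion (a local OCR tool, an API call, etc.); I have not chosen one, so it is passed in as a plain function, and `convert_tree` is a name I made up.

```python
from __future__ import annotations

from pathlib import Path
from typing import Callable


def convert_tree(
    source_root: Path, output_root: Path, ocr: Callable[[Path], str]
) -> list[Path]:
    """Run `ocr` on every PDF under source_root and write mirrored .txt files.

    `ocr` is a placeholder: any function that takes a PDF path and returns
    its text. Returns the list of text files written on this run.
    """
    written = []
    # rglob walks all subfolders; note "*.pdf" will miss ".PDF" on
    # case-sensitive file systems.
    for pdf_path in sorted(source_root.rglob("*.pdf")):
        target = (output_root / pdf_path.relative_to(source_root)).with_suffix(".txt")
        if target.exists():
            # Already converted on an earlier run, so the batch can be resumed.
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(ocr(pdf_path), encoding="utf-8")
        written.append(target)
    return written
```

Skipping files that already have output is deliberate: with thousands of PDFs I would expect the run to be interrupted at some point, and I would rather restart it than redo finished files.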

Once that is done, my understanding is that custom GPTs are currently limited to only ten files, which would not work for this project (unless perhaps I combined all of these text files into a few large ones, but that seems like a kludge). One option that I’ve seen is Private GPT, which seems to be something that I could connect to my ChatGPT Pro account in order to extend its data and training to this content. Is that correct?

I’m sure that many others have dealt with similar circumstances, and I’d like to know which methods folks would recommend.

Thanks in advance!