I’m currently extracting the text content of PDF files and passing it to the API as plain text. I’m using gpt-3.5-turbo as I don’t have access to GPT-4 yet. I’m also using the chat prompt format, having the model read the PDF text and then asking it questions.
I’m somewhat satisfied with how the model analyzes the PDF, but there are times when it’s painfully inaccurate.
What can I do to train the model to accurately read the PDF file?
From my limited understanding, I could create embeddings from the PDF file, store them in a vector database, and then ask questions about the PDF.
You need to:
1. Parse your PDF into meaningful chunks (e.g. paragraphs) and create embeddings of those chunks, keeping track of which embedding corresponds to which paragraph on which page (useful for source checking).
2. Create an embedding of your prompt.
3. Perform a semantic search and rank-order the results.
4. Include the most similar paragraphs in the chat prompt until X tokens are reached.

Voila!
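Here is a minimal sketch of that pipeline, assuming the legacy openai Python package (v0.x) with OPENAI_API_KEY set in the environment and text-embedding-ada-002 as the embedding model; the chunking, in-memory storage, and characters-per-token estimate are placeholders you’d replace with real ones:

```python
# Minimal sketch of the chunk -> embed -> search -> prompt pipeline.
# Assumes the legacy openai package (v0.x); chunking, storage, and token
# counting are simplified placeholders for illustration.
import numpy as np
import openai

EMBED_MODEL = "text-embedding-ada-002"

def embed(texts):
    """Return one embedding vector per input string."""
    resp = openai.Embedding.create(model=EMBED_MODEL, input=texts)
    return [np.array(item["embedding"]) for item in resp["data"]]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# 1. Chunks from your PDF-to-text step, with page numbers for source checking.
paragraphs = [
    {"page": 1, "text": "First paragraph of the PDF..."},
    {"page": 2, "text": "Another paragraph..."},
]

# 2. Embed every chunk once; in production these would go into a vector database.
vectors = embed([p["text"] for p in paragraphs])

# 3. Embed the question and rank chunks by similarity.
question = "What does the document say about X?"
q_vec = embed([question])[0]
ranked = sorted(zip(paragraphs, vectors),
                key=lambda pv: cosine(q_vec, pv[1]), reverse=True)

# 4. Pack the best chunks into the prompt until a rough token budget is hit
#    (~4 characters per token is a crude estimate; use tiktoken for real counts).
context, budget, used = [], 2000, 0
for para, _ in ranked:
    cost = len(para["text"]) // 4
    if used + cost > budget:
        break
    context.append(f"[page {para['page']}] {para['text']}")
    used += cost

messages = [
    {"role": "system", "content": "Answer using only the provided excerpts."},
    {"role": "user",
     "content": "\n\n".join(context) + "\n\nQuestion: " + question},
]
answer = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
print(answer["choices"][0]["message"]["content"])
```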
Like @panayi067, I am also looking for a good PDF-to-text converter. I am following the tutorial mentioned by @SomebodySysop and applying it to research publications. It is not working well, and I believe part of the issue is that footers and headers are read along with the text. Furthermore, references to images, datasets, and in-text citations also contribute to odd source choices by the vector search. I am now looking for a solid PDF-to-text converter that can at least recognize headers and footers and remove them (currently, I am testing pdf3f). Optionally, it would be amazing if it could also add metadata, for example marking whether a text chunk is part of the abstract, methods section, bibliography, etc. But maybe for those extras, PDF-to-markdown might be necessary.
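In the meantime, one heuristic worth trying (my own sketch, not a feature of any particular converter) is to drop lines that repeat verbatim across most pages, since headers and footers usually do and body text rarely does:

```python
# Sketch of a repeated-line heuristic for headers/footers (an assumption,
# not tied to any converter). Numbered footers like "Page 3" won't repeat
# verbatim, so those would need a fuzzier match.
from collections import Counter

def strip_repeated_lines(pages, threshold=0.6):
    """pages: one text string per page. Drop lines seen on > threshold of pages."""
    counts = Counter(line for page in pages for line in set(page.splitlines()))
    cutoff = threshold * len(pages)
    repeated = {line for line, n in counts.items() if n > cutoff and line.strip()}
    return ["\n".join(l for l in page.splitlines() if l not in repeated)
            for page in pages]
```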
I use the ABBYY FineReader Engine. It definitely removes headers and footers. I’ve found their support to be excellent, and they have an open ear to enhancements.
I do the PDF-to-markdown (adding metadata) myself using regex; a rough sketch of that approach follows below. But I’m going to ask if they would be open to something like this: https://youtu.be/w_veb816Asg, which is similar to what you are asking.
I know the link says HTML5, but in reality it covers much more than that.
See my related StackOverflow answer for a different question that gives more details.
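For reference, here is a rough sketch of the regex approach; the section names and patterns are illustrative assumptions, and real papers need patterns tuned to the journals you work with:

```python
# Rough sketch of regex-based section tagging on converted text. The
# section names and patterns below are illustrative, not a general solution.
import re

SECTION_PATTERNS = {
    "abstract":     re.compile(r"^\s*abstract\b", re.IGNORECASE),
    "methods":      re.compile(r"^\s*(methods?|materials and methods)\b", re.IGNORECASE),
    "bibliography": re.compile(r"^\s*(references|bibliography)\b", re.IGNORECASE),
}

def tag_chunks(lines):
    """Yield (section, line) pairs, carrying the last matched heading forward."""
    current = "body"
    for line in lines:
        for name, pattern in SECTION_PATTERNS.items():
            if pattern.match(line):
                current = name
                break
        yield current, line
```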
Years ago I started parsing PDF files and quickly learned that they are not data files but programs based on PostScript, and that a PDF file is better understood as an archive, think ZIP file, containing a directory of PostScript, resources, etc. Anyway, halfway through the project I found some open-source software that would at least correctly extract all of the data I needed, though it still had to undergo a transformation step that I had to create.
Over the years, it seems the true test of any PDF-to-text tool is getting tables to convert correctly. Remember that many tables are not text but some form of PostScript, and it seems there are as many ways to make a table with PostScript as there are beverage options in a grocery store.
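As a quick stress test, something like pdfplumber (a third-party library, just one option among many) will show you how well a given table survives extraction; whether it works depends entirely on how the table was drawn:

```python
# Quick table-extraction stress test with pdfplumber. Each extracted table
# is a list of rows; cells that could not be recovered come back as None.
import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    for page_number, page in enumerate(pdf.pages, start=1):
        for table in page.extract_tables():
            print(f"Table on page {page_number}:")
            for row in table:
                print(row)
```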
If you need a more robust solution, you can try Superinsight.
You can upload multiple PDF files and search across all of them.
You also have the option to use GPT-3, GPT-4, and other open-source models.
Hi, in case you would still like another option, you can try the “File Converter” on the Soffos.ai platform Playground: platform.soffos.ai/playground/file-converter
There you can upload very large PDFs and convert them to text by dragging and dropping, or use the API to integrate the conversion into an app if you need that.
I’ve heard that ChatPDF has released API access; you might want to take a look at that.
Otherwise, you can check out PDF processors and pre-process your PDF file into plain text before feeding it into the GPT API.
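For example, a minimal pre-processing step with pypdf (one of several extractors; output quality varies a lot by document) looks like this:

```python
# Minimal PDF-to-plain-text pre-processing with pypdf.
from pypdf import PdfReader

reader = PdfReader("document.pdf")
text = "\n\n".join(page.extract_text() or "" for page in reader.pages)
print(text[:500])  # inspect the first few hundred characters before sending
```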