Hello everyone, I have uploaded a PDF file of approximately 500 pages and would like ChatGPT to help me with some analysis. After receiving multiple errors, GPT-4 provided me with an answer but indicated, “I’ve read through approximately 37% of the document. Due to the time constraints, I couldn’t complete the entire document.”
It seems that there is a time limit on processing each question. I would like to understand where the bottleneck occurs. For example, is it during PDF to text conversion, text reading, analysis, or generating the answer? Would it be helpful if I pre-converted the PDF into plain text?
More likely, it doesn’t have a large enough context window to manage that much data, meaning it ran out of token context. I have also noticed that plain text is easier for it to manage, so I would recommend .txt over .pdf files.
Size is most definitely your bottleneck regardless. Chunk the data into smaller segments to make it easier for the AI to work with. The smaller, the better.
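For anyone who wants to try the chunking suggestion, here is a minimal sketch in Python. It assumes the document has already been converted to plain text and that paragraphs are separated by blank lines. The function name `chunk_text` and the `max_chars` budget are made up for illustration; character count is only a rough stand-in for tokens, so pick a budget that suits the model you are using.

```python
def chunk_text(text: str, max_chars: int = 8000) -> list[str]:
    """Split text into chunks of at most max_chars characters.

    Chunks break on blank-line paragraph boundaries where possible.
    max_chars is an illustrative budget, not an official limit.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue

        # A single paragraph longer than the budget is hard-split.
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]

        # Try to add the paragraph to the chunk being built.
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # It does not fit: close the current chunk and start a new one.
            chunks.append(current)
            current = paragraph

    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be pasted or uploaded separately, asking for a summary per chunk and combining the summaries at the end.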
I agree, converting large PDFs into plain text can significantly improve the model’s analysis performance. A simplified format such as .txt is in fact better suited to its current capabilities and to the volume of text it can handle.
A lot of techniques already exist to help:
- Use PDF-to-Text Conversion Tools
- Optical Character Recognition (OCR) for Scanned PDFs
- Scripting and Automation
- Selective Text Extraction
- Compression and Cleanup
- Use Command Line Tools
- Consider Formatting
The effectiveness of these techniques can vary with the original PDF’s quality and formatting. PDFs with complex layouts and images will be constrained by their heavy formatting and by the token/context window available. The original poster will therefore have to adjust the conversions manually to ensure the text files are readable and accurate. Including the images from the PDFs as well is a whole new monster.
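As a rough illustration of the cleanup step, here is a small Python sketch that tidies text after it has been extracted from a PDF by whatever tool you prefer. The function name `clean_extracted_text` is hypothetical, and the rules are assumptions about typical extraction debris: lines holding only a short number are treated as page numbers, and a hyphen at the end of a line is treated as a split word. Both guesses can be wrong for some documents, so check the output by hand.

```python
import re


def clean_extracted_text(raw: str) -> str:
    """Tidy plain text extracted from a PDF (heuristic, not lossless)."""
    # Normalize Windows and old Mac line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")

    # Assumption: a line containing only 1-4 digits is a page number.
    # This would also drop a genuine numeric line.
    lines = [
        line for line in text.split("\n")
        if not re.fullmatch(r"\s*\d{1,4}\s*", line)
    ]
    text = "\n".join(lines)

    # Assumption: "word-\nrest" is one word split across a line break.
    # This also joins real hyphenated compounds that happen to wrap.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)

    # Remove spaces around line breaks.
    text = re.sub(r" *\n *", "\n", text)

    # Collapse three or more newlines into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)

    return text.strip()
```

Fewer stray characters means fewer tokens spent on noise, which matters when the context window is the limiting factor.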
From my own use of it, I find that it specifically does not do that. It doesn’t quote or cite any passage longer than 3 to 5 words from my text. My texts are mostly novels from a hundred years ago; today I tested a Shakespeare drama and, unsurprisingly, it failed me.
I created a GPT model for use as a knowledge base, using source material from a 385-page technical manual from an ebook.
I uploaded knowledge files in various plain text file configurations: a single large text file, multiple text files, files with descriptive names, markdown, etc., but always ran into performance and/or accuracy issues.
In the end, I used a PDF export of the original ebook. It’s flawless… fast and accurate. I assume there is some kind of index/cache engine for PDF knowledge files.
Another thing to be aware of: ChatGPT will often use code interpreter to read files if this is turned on. The tool for knowledge retrieval search doesn’t give the AI any idea of what it will find by doing a search, or what the AI doesn’t know that it needs to obtain, and you cannot add your own description to this tool, so it often gets overlooked by the AI. An upload also goes to the Python sandbox, which is an inviting thing for the AI to use when it gets a list of files in the mount point.