My PDFs are OCR’d and don’t have any sort of password protection on them.
However, about one in five of them fail for no apparent reason when uploaded to a vector store. I just tried building a store with only one of the problem documents: the error message is simply ‘An Internal Server Error Occurred’, and the store is still created with the failed file inside.
I have absolutely no idea how to solve this, or even how to begin working out why they are failing. I’ve tried saving to a different format and then back to PDF, but that didn’t work either.
The scarier thing is that some of these files actually worked last week in other vector stores, and now they don’t. This is very frustrating and is putting my project in critical condition.
If the PDF has been converted to include searchable text, you can extract that text yourself with code and libraries. This can be a programmatic pipeline, like the one OpenAI runs (poorly), or you can produce plain text files where you optimize the text for AI understanding and control how the vector store might chunk it (even placing small snippets of knowledge, each under the chunk size, into individual files).
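As a rough sketch of that do-it-yourself route, something like the following could work. It assumes the third-party pypdf package for reading the text layer. The file names, the `max_chars` limit, and the helper names are all made up for illustration. Character count is only a crude stand-in for the token-based chunk size a vector store actually uses, so tune it for your own setup.

```python
from pathlib import Path


def extract_pdf_text(pdf_path):
    """Return the PDF's text layer as one string per page.

    Assumes the third-party pypdf package (pip install pypdf). It is
    imported here so the helpers below still work without it.
    """
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() can come back empty for image-only pages.
    return [page.extract_text() or "" for page in reader.pages]


def split_into_snippets(text, max_chars=2000):
    """Group blank-line-separated paragraphs into snippets of at most max_chars.

    A single paragraph longer than max_chars is kept whole rather than
    being cut mid-sentence. max_chars is an illustrative value only.
    """
    snippets, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            snippets.append(current)
            current = para
        else:
            current = candidate
    if current:
        snippets.append(current)
    return snippets


def write_snippets(snippets, out_dir, stem):
    """Write each snippet to its own UTF-8 .txt file and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, snippet in enumerate(snippets, start=1):
        path = out / f"{stem}_{i:03d}.txt"
        path.write_text(snippet, encoding="utf-8")
        paths.append(path)
    return paths


if __name__ == "__main__":
    # "problem.pdf" and "txt_out" are hypothetical names.
    pages = extract_pdf_text("problem.pdf")
    if not any(p.strip() for p in pages):
        print("No text layer found; the PDF may be image-only.")
    else:
        # Pages are joined with blank lines, so each page acts as a paragraph.
        files = write_snippets(split_into_snippets("\n\n".join(pages)), "txt_out", "problem")
        print(f"Wrote {len(files)} text files")
```

The resulting .txt files can then be uploaded in place of the PDF, which sidesteps the server-side PDF parsing entirely and lets you read exactly what text the store will see.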
Uploading to code interpreter, retrieval, or now vector storage has been a minefield of nonsense ever since this endpoint was added. “Structured data” refused if the inspected text contents look like CSV or JSON, JSON rejected unless converted to a non-validating form, PDFs simply ignored with no warning that they are image-based, files refused for no apparent reason…