PDFs fail to upload to a vector store for no apparent reason

My PDFs are OCR'd and don’t have any sort of password protection on them.

However, about one in five of them fails without explanation when uploaded to a vector store. I just tried building a store with only one of the problem documents; the error message is simply ‘An Internal Server Error Occurred’, and the store is still created with the failed file inside.

I have absolutely no idea how to solve this, or even how to begin to understand why they are failing. I’ve tried saving to a different format and then back to PDF, but that didn’t work either.

The scarier thing is that some of these files worked last week in other vector stores, and now they don’t. This is very frustrating and is putting my project in critical condition.

Thanks in advance!

If the PDF has a searchable text layer, you can extract the text yourself with code and libraries. This can be a programmatic task, like the one OpenAI does poorly, or you can prepare plain text files by hand, optimizing the text for AI understanding and controlling how the vector store might chunk it (even placing small snippets of knowledge, each under the chunk size, into individual files).
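A minimal sketch of that do-it-yourself route, in Python. It assumes the third-party pypdf package for reading the text layer; the function names, the output file naming, and the 2000-character limit are hypothetical choices for illustration. Note that the limit here is counted in characters, not tokens, so it is only a rough stand-in for the vector store's chunk size.

```python
from pathlib import Path


def extract_pdf_text(pdf_path: str) -> str:
    """Return the text layer of a PDF (assumes the third-party pypdf package)."""
    from pypdf import PdfReader  # imported lazily; only this step needs it

    reader = PdfReader(pdf_path)
    # extract_text() can come back empty for image-only pages
    return "\n\n".join((page.extract_text() or "") for page in reader.pages)


def split_into_snippets(text: str, max_chars: int = 2000) -> list[str]:
    """Greedily pack whole paragraphs into snippets of at most max_chars."""
    snippets, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            snippets.append(current)
        # Hard-split a single oversized paragraph so nothing exceeds the limit
        while len(para) > max_chars:
            snippets.append(para[:max_chars])
            para = para[max_chars:]
        current = para
    if current:
        snippets.append(current)
    return snippets


def write_snippets(snippets: list[str], out_dir: str, stem: str) -> list[Path]:
    """Write each snippet to its own UTF-8 .txt file and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, snippet in enumerate(snippets):
        path = out / f"{stem}_{i:04d}.txt"
        path.write_text(snippet, encoding="utf-8")
        paths.append(path)
    return paths
```

Each resulting .txt file can then be uploaded in place of the original PDF, which sidesteps the service's own PDF parsing entirely.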

Uploading to code interpreter, retrieval, or now vector storage has always been a minefield of nonsense since this endpoint was added. “Structured data” refused if inspected text contents look like a CSV or JSON, JSON rejected unless converted to a non-validating form, PDFs plain ignored without warning that they are image-based, files refused for no apparent reason…

I actually solved it. I was using the highest chunk settings (4096 and 2048), and that was causing ~20% of the files to fail (with no pattern that I could see).

I halved the chunk sizes, and now all 3000 files are working fine.
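For anyone hitting the same thing, here is a rough sketch of what that change looks like as a static chunking configuration. It assumes the two numbers above are the maximum chunk size and the chunk overlap, both in tokens, and that the documented limits apply (chunk size between 100 and 4096, overlap at most half the chunk size). The helper names are hypothetical; only the dictionary shape is meant to match the API's chunking_strategy parameter.

```python
def static_chunking(max_chunk_size_tokens: int, chunk_overlap_tokens: int) -> dict:
    """Build a static chunking strategy dict, checking the assumed limits."""
    if not 100 <= max_chunk_size_tokens <= 4096:
        raise ValueError("max_chunk_size_tokens must be between 100 and 4096")
    if not 0 <= chunk_overlap_tokens <= max_chunk_size_tokens // 2:
        raise ValueError("chunk_overlap_tokens must be at most half the chunk size")
    return {
        "type": "static",
        "static": {
            "max_chunk_size_tokens": max_chunk_size_tokens,
            "chunk_overlap_tokens": chunk_overlap_tokens,
        },
    }


def halved(strategy: dict) -> dict:
    """Return a new strategy with the chunk size and overlap cut in half."""
    old = strategy["static"]
    new_max = max(100, old["max_chunk_size_tokens"] // 2)
    # Keep the overlap within half of the new chunk size
    new_overlap = min(old["chunk_overlap_tokens"] // 2, new_max // 2)
    return static_chunking(new_max, new_overlap)


failing = static_chunking(4096, 2048)  # the settings that failed for ~20% of files
working = halved(failing)              # 2048-token chunks with 1024-token overlap
```

The resulting dict is what would be passed as chunking_strategy when attaching a file to the vector store; the exact client call depends on the SDK version in use.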
