Certain files result in FileParsingErrorCode.EMPTY_FILE when the thread is run, leaving the thread permanently broken

We have an application that lets users upload files, including PDFs, for incorporation into their chat. It is built on the Assistants API (v1), called through the REST API. This setup works fine for most files.

The problem we have started seeing is that if the user uploads a file from which no text can be extracted (e.g. a PDF containing only images), the file uploads fine and can be added to a message within a thread just fine, but when we then create a run for that thread, it fails in a very problematic way.
The run fails with the error buried as serialized JSON inside its error.message string (see details below).

More importantly, the thread is then unable to process any new messages: every subsequent message fails with the same error about the file. So we can’t really know whether the user’s file is going to violate this (or potentially other) requirements until we run the thread with it, and if it fails, that thread is hosed forever (as far as I can tell).

Step 1. Upload the “empty” file

curl https://api.openai.com/v1/files \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F purpose="assistants" \
  -F file="@myfile.pdf"

Step 2. Create a new thread

curl https://api.openai.com/v1/threads \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "OpenAI-Beta: assistants=v1" \
  -d ''

Step 3. Create a message on that thread (with above file listed in its file_ids)

curl https://api.openai.com/v1/threads/thread_abc123/messages \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "OpenAI-Beta: assistants=v1" \
  -d '{
      "role": "user",
      "content": "Please summarize this",
      "file_ids":["file-abc123"]
    }'

Step 4. Create a Run

curl https://api.openai.com/v1/threads/thread_abc123/runs \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -H "OpenAI-Beta: assistants=v1" \
  -d '{
    "assistant_id": "asst_abc123"
  }'

Response:

{
  "error": {
    "message": "Failed to index file: Error extracting text from file 8c5bf288-3263-4a69-b825-a4a66ea777f8 detail_str=', detail: Extracted content contained only whitespace.' self.error_code=<FileParsingErrorCode.EMPTY_FILE: 'file_empty'>...",
    "type": "invalid_request_error",
    "param": null,
    "code": null
  }
}

(and as mentioned above, this error is returned for ALL future messages and runs on the thread)
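Since the useful detail is buried inside the message string, one way to react to it programmatically is to pattern-match the text. The following is only a sketch based on the single message shown above: the message format is not documented and may change, and the function name `parse_file_parsing_error` is made up for illustration. Note that the UUID in the message appears to be an internal reference, not the `file-...` ID that was attached to the message.

```python
import re

# Patterns derived only from the one error message shown above (assumption:
# other parsing errors follow the same "<FileParsingErrorCode.NAME: 'value'>" shape).
_PARSING_ERROR = re.compile(r"<FileParsingErrorCode\.(?P<name>\w+): '(?P<value>\w+)'>")
_FILE_REF = re.compile(r"from file (?P<ref>[0-9a-f-]{36})")


def parse_file_parsing_error(message):
    """Return the error name, value and internal file reference, or None if no match."""
    code = _PARSING_ERROR.search(message)
    if code is None:
        return None
    ref = _FILE_REF.search(message)
    return {
        "name": code.group("name"),
        "value": code.group("value"),
        "file_ref": ref.group("ref") if ref else None,
    }


sample = (
    "Failed to index file: Error extracting text from file "
    "8c5bf288-3263-4a69-b825-a4a66ea777f8 detail_str=', detail: Extracted "
    "content contained only whitespace.' "
    "self.error_code=<FileParsingErrorCode.EMPTY_FILE: 'file_empty'>..."
)
print(parse_file_parsing_error(sample))
```

A non-match should be treated as "unknown error" rather than "no error", since the wording is not guaranteed to stay the same.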


While it is a bug, you can prescreen the documents yourself for text.

I made up some rules and fed them to a bot.

To test, I made two files with a PDF creator: one printed directly from a browser, and one printed from a screenshot of the browser.

import PyPDF2

def extract_text_from_pdf(pdf_path):
    # Concatenate the text layer of every page; image-only pages contribute nothing.
    text = ""
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        for page in reader.pages:
            text += page.extract_text() or ""
    return text

def is_meaningful_text_extracted(text, min_word_count=20, min_period_count=2):
    # Made-up heuristic: require a minimum number of words and of periods.
    words = text.split()
    period_count = text.count(".")
    return len(words) >= min_word_count and period_count >= min_period_count

def main():
    for pdf_path in ["ex-w-text.pdf", "ex-no-text.pdf"]:
        text = extract_text_from_pdf(pdf_path)
        print(text[:80])  # preview of whatever was extracted
        if is_meaningful_text_extracted(text):
            print("Meaningful text successfully extracted from the document.")
        else:
            print("Unable to extract meaningful text from the document.")

if __name__ == "__main__":
    main()

Result of the two extractions:

Overview Documentation API reference
Log in
Sign up
Search K
Get started
Int
Meaningful text successfully extracted from the document.

Unable to extract meaningful text from the document.


Thanks for the suggestion, @_j. That’s definitely a partial workaround that mitigates the problem a bit, but since we support other formats in addition to pdf (docx, txt, ppt, etc.) and those all exhibit the same problem, we’d have to implement that pre-screen step for each file format. It would also (approximately) duplicate the extraction already being done by the Assistants API, with no assurance of exact equivalence, so we might still end up in a situation where their extraction yields no text even though ours did.
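To make the per-format cost concrete, here is a rough sketch of what such a pre-screen dispatcher could look like using only the Python standard library. It covers just .txt and .docx; the names `EXTRACTORS` and `has_extractable_text` are hypothetical, and nothing here is guaranteed to agree with whatever extraction the Assistants API performs internally.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# WordprocessingML namespace used by the <w:t> text elements in a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def text_from_txt(data: bytes) -> str:
    # Assumption: plain-text uploads are UTF-8; undecodable bytes are dropped.
    return data.decode("utf-8", errors="ignore")


def text_from_docx(data: bytes) -> str:
    # A .docx is a zip archive; the body text lives in word/document.xml.
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    return " ".join(node.text or "" for node in root.iter(W_NS + "t"))


# One extractor per supported extension; every new format needs its own entry.
EXTRACTORS = {".txt": text_from_txt, ".docx": text_from_docx}


def has_extractable_text(filename: str, data: bytes):
    """Return True/False when the format can be checked, None when it cannot."""
    extractor = EXTRACTORS.get(Path(filename).suffix.lower())
    if extractor is None:
        return None
    return bool(extractor(data).strip())


print(has_extractable_text("notes.txt", b"   \n\t"))
print(has_extractable_text("slides.ppt", b""))
```

Returning None for unknown formats keeps "we could not check" distinct from "we checked and found nothing", which matters if unscreened files are still allowed through.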

I’m fine with getting an error from the Assistants API for these problematic files. What I’m hoping for is some way to keep it from permanently breaking the thread and leaving it unrecoverable.

Messages can be deleted. The vector store can be cleaned. Whether that is enough would depend on whether some tool return stored within the thread, which you can’t affect, keeps causing the error.
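As a sketch of the "messages can be deleted" route: the snippet below only builds a DELETE request for a single message and does not send it. It assumes the message-delete endpoint follows the same URL pattern as the curl calls earlier in the thread (`/v1/threads/{thread_id}/messages/{message_id}`); the helper name is made up, and which `OpenAI-Beta` value the endpoint accepts is something to check against the API reference, so it is left as an explicit argument.

```python
import urllib.request

API_BASE = "https://api.openai.com/v1"


def build_delete_message_request(thread_id, message_id, api_key, beta_header):
    """Build (but do not send) a DELETE request for one message in a thread."""
    return urllib.request.Request(
        f"{API_BASE}/threads/{thread_id}/messages/{message_id}",
        method="DELETE",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            # Assumption: pass whichever Assistants beta version your app uses.
            "OpenAI-Beta": beta_header,
        },
    )


# Placeholder IDs in the style of the curl examples above.
request = build_delete_message_request(
    "thread_abc123", "msg_abc123", "sk-placeholder", "assistants=v2"
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Whether removing the offending message is enough to un-break the thread is exactly the open question here, so it is worth trying on a throwaway thread first.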