We have an application that lets users upload files, including PDFs, for use in their chat. It is built on the Assistants API (v1), called directly over REST. This setup works fine for most files.
The problem we have started seeing is that if a user uploads a file from which no text can be extracted (e.g. a PDF containing only images), the upload succeeds and the file can be attached to a message in a thread without complaint. But when we then create a run for that thread, it fails in a very problematic way.
First, the run fails with the error buried as serialized JSON inside its error.message string (see details below).
More importantly, the thread is then unable to process any new messages: all subsequent messages fail with the same error about the file. So we can't know whether the user's file is going to violate this (or potentially other) requirements until we run the thread with it, and if it fails, that thread is hosed forever (as far as I can tell).
Step 1. Upload the “empty” file
curl https://api.openai.com/v1/files \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-F purpose="assistants" \
-F file="@myfile.pdf"
Step 2. Create a new thread
curl https://api.openai.com/v1/threads \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-H "OpenAI-Beta: assistants=v1" \
-d ''
Step 3. Create a message on that thread (with the file from Step 1 listed in its file_ids)
curl https://api.openai.com/v1/threads/thread_abc123/messages \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-H "OpenAI-Beta: assistants=v1" \
-d '{
"role": "user",
"content": "Please summarize this",
"file_ids":["file-abc123"]
}'
Step 4. Create a Run
curl https://api.openai.com/v1/threads/thread_abc123/runs \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-H "Content-Type: application/json" \
-H "OpenAI-Beta: assistants=v1" \
-d '{
"assistant_id": "asst_abc123"
}'
Response:
{
"error": {
"message": "Failed to index file: Error extracting text from file 8c5bf288-3263-4a69-b825-a4a66ea777f8 detail_str=', detail: Extracted content contained only whitespace.' self.error_code=<FileParsingErrorCode.EMPTY_FILE: 'file_empty'>...",
"type": "invalid_request_error",
"param": null,
"code": null
}
}
(As mentioned above, this same error is returned for ALL future messages and runs on the thread.)
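For anyone else hitting this: the only machine-readable signal I can see is the code embedded in the message text, so here is a minimal sketch of how it could be pulled out in a shell script. The function name extract_file_error_code is made up for illustration, and the pattern is based purely on the one message shown above; I don't know whether the format is stable or what other codes exist.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of any OpenAI tooling): extract the parsing
# error code, e.g. file_empty, from the free-form error.message string.
# ASSUMPTION: the message contains "FileParsingErrorCode.NAME: 'code'",
# as in the single example in this post.
extract_file_error_code() {
  # $1: the error.message string.
  # Prints the quoted code if the pattern is found, otherwise prints nothing.
  printf '%s\n' "$1" |
    sed -n "s/.*FileParsingErrorCode\.[A-Z_]*: '\([a-z_]*\)'.*/\1/p"
}

# The message from the response above, copied as-is (including the trailing "...").
msg="Failed to index file: Error extracting text from file 8c5bf288-3263-4a69-b825-a4a66ea777f8 detail_str=', detail: Extracted content contained only whitespace.' self.error_code=<FileParsingErrorCode.EMPTY_FILE: 'file_empty'>..."

extract_file_error_code "$msg"
```

This only helps with recognizing the failure after the fact; it does nothing about the thread being stuck afterwards.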