"unsupported_file" error when creating vector store with certain plain-text/markdown files

I have found and tried to diagnose an error that prevented around 50% of my documents from being added to a vector store. Specifically, certain plain-text and Markdown files fail with the following error when added to a vector store: “The file type is not supported”.

However, simple (often one-character) changes can circumvent this error. Below is a working example that reproduces the error, followed by one that differs by a single character and does not trigger it.

MWE1 (produces error)

from time import sleep  # needed for the sleep(1) call below

from openai import OpenAI

client = OpenAI(...)

content = """Lorem ipsum dolor sit amet, consectetur adipiscing elit. Proin ac porttitor eros. Etiam quis neque nisi. Proin in turpis augue. Vivamus ullamcorper lobortis enim, a bibendum urna mollis feugiat. Phasellus tortor justo, laoreet non elementum et, blandit faucibus lacus. Etiam luctus convallis massa vel facilisis. Sed congue in nibh bibendum sodales. Cras in diam vel ligula molestie imperdiet gravida eleifend urna. Proin eu lectus et erat lacinia mollis in at turpis. Nam sodales orci neque, vitae feugiat risus rutrum quis. Donec porttitor egestas eros, sed pulvinar eros. Aenean lacinia orci lorem. Curabitur sem augue, interdum a cursus ut, blandit nec felis. Ut bibendum eros tempus tellus imperdiet interdum. Nunc commodo sodales mattis.
Here is some data:
17
0.32
0.26"""

# Upload file
file_id = client.files.create(
    file=("mymarkdown.md", content.encode("utf-8"), "text/markdown"),
    purpose="assistants",  # purpose="user_data"
).id

# Create vector store
vector_store = client.vector_stores.create(file_ids=[file_id])

# Wait for file to be processed
sleep(1)

# List files in vector store
files = client.vector_stores.files.list(vector_store_id=vector_store.id)

if files.data[0].last_error:
    print("Error:", files.data[0].last_error)
else:
    print("Success")

Output:

Error: LastError(code='unsupported_file', message='The file type is not supported.')
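A fixed one-second wait may not always be long enough for processing to finish, so a polling loop is one way to make the check less timing-dependent. The following is only a sketch: `wait_until_processed` is a hypothetical helper (not part of the openai package), and it assumes the vector-store file object exposes a `status` field that stays at `in_progress` while the file is being processed. The demonstration uses a stand-in object so that it runs without network access.

```python
import time
from types import SimpleNamespace


def wait_until_processed(fetch, timeout=60.0, interval=1.0):
    """Call fetch() until the returned object leaves the 'in_progress' state.

    fetch is any zero-argument callable returning an object with a `status`
    attribute. Hypothetical helper for illustration only.
    """
    deadline = time.monotonic() + timeout
    while True:
        item = fetch()
        if item.status != "in_progress":
            return item
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still in progress after {timeout:.0f} s")
        time.sleep(interval)


# Offline demonstration: a stand-in object replaces the real API response.
_states = iter(["in_progress", "in_progress", "failed"])
result = wait_until_processed(
    lambda: SimpleNamespace(status=next(_states)), timeout=5.0, interval=0.01
)
print(result.status)  # failed
```

With the client from the script above, `fetch` could be something like `lambda: client.vector_stores.files.retrieve(file_id=file_id, vector_store_id=vector_store.id)`, after which `last_error` can be inspected on the returned object.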

MWE2 (does not produce error)

content = content.replace("0.32", "0.3")
... # remaining code same as above

Output:

Success

Simply removing this one character (the 2 in 0.32) circumvents the error.
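To double-check that the two payloads really differ by a single character and nothing else, a quick diff can be run locally before uploading. This is a minimal sketch using only the standard library; the `failing` string is a shortened stand-in for the full content used in MWE1, not the original text.

```python
import difflib

# Shortened stand-in for the MWE1 content (assumption: the tail is what matters).
failing = "Here is some data:\n17\n0.32\n0.26"
passing = failing.replace("0.32", "0.3")

matcher = difflib.SequenceMatcher(None, failing, passing, autojunk=False)

# Keep only the opcodes that describe an actual change.
changes = [
    (tag, failing[i1:i2], passing[j1:j2])
    for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    if tag != "equal"
]
print(changes)

# Size difference of the encoded payloads, in bytes.
byte_delta = len(failing.encode("utf-8")) - len(passing.encode("utf-8"))
print(byte_delta)
```

If the list contains anything beyond a single deletion of "2", some other difference (for example a changed line ending) has slipped in.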

Thanks in advance for looking into this issue.

We are facing the exact same issue:
We have two text files (.txt), one a copy of the other with its last character (a number) deleted. The first one returns the unsupported_file error, while the other one works.

This makes the API very unreliable: there is no clear reason why a plain-text file fails, and the error can occur with any file at any time.

We have tried the methods below (although, judging by the example above, the issue is most likely not related to encoding or special characters):

  • Decoding and re-encoding the file (as utf-8)
  • Uploading the file and waiting a few seconds before adding it to the vector store
  • Unicode normalization
  • Stripping any special characters with a regex
  • Checking the hex content of the file to make sure no special character was deleted along with the number mentioned above
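For anyone who wants to repeat these checks, the steps above can be sketched roughly as follows. This is not the exact code we ran: the `sanitize` helper is hypothetical, NFC is an assumed choice of normalization form, and "special characters" is interpreted here as ASCII control characters other than tab, newline, and carriage return.

```python
import re
import unicodedata

# ASCII control characters except \t, \n and \r (assumed meaning of "special").
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def sanitize(raw: bytes) -> bytes:
    """Decode, normalize, strip control characters, and re-encode as UTF-8."""
    text = raw.decode("utf-8", errors="replace")
    text = unicodedata.normalize("NFC", text)
    text = CONTROL_CHARS.sub("", text)
    return text.encode("utf-8")


sample = "0.32\n0.26".encode("utf-8")
cleaned = sanitize(sample)

print(cleaned == sample)  # True: nothing in this sample needed cleaning
print(sample.hex(" "))    # 30 2e 33 32 0a 30 2e 32 36
```

For input like this, the sanitized bytes are identical to the original and the hex dump shows only digits, dots, and a newline, which is consistent with the problem not being an encoding or special-character issue.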