API Problem with file processing, file uploaded but not found in vector store

I upload a file; the upload succeeds and the file shows in the dashboard.

When I try to run a vector search, I receive the following:

ERROR dependencies Error processing file temp_files/ATACAMA_-_DIAGNOSTICO_CONECTIVIDAD_VIAL.pdf: Error code: 404 - {'error': {'message': "No file found with id 'file-CJwRqNLWQyHyBRhEqcpd9M' in vector store 'vs_68ff71beecb48191b9897f30744c8a70'.", 'type': 'invalid_request_error', 'param': None, 'code': None}}

and an additional Exception from my code:

httpx.HTTPStatusError: Client error '404 Not Found' for url 'https://api.openai.com/v1/vector_stores/vs_68ff71beecb48191b9897f30744c8a70/files/file-CJwRqNLWQyHyBRhEqcpd9M'

The problem was first detected this morning. When I check the dashboard, I see the same error for all files, including those from last week, when searches succeeded and the API worked fine.

Any thoughts?

Best.

Please document: are you using the OpenAI SDK’s upload-and-poll method?

I suggest instead writing your own status poller that retrieves the vector store file by ID after an attachment: sleep a few seconds, treat a 404 as a non-terminal status for several retries, and keep waiting until the status key changes from “in_progress”.

Yes, I’m using the Python library’s vector_stores.files.create_and_poll. However, other posts report similar problems, and the same error also shows in the API dashboard. It seems to me that the link between the vectors and the file has been lost, so your suggestion may not apply. Besides, create_and_poll in Python already waits until the embedding process ends. My system has worked this way for months without problems until today.

Maybe I will work on generating the embeddings and running the search myself with FAISS or something similar… but what concerns me is that this problem has persisted for at least 12 hours and OpenAI doesn’t seem to be aware of it.

This is not a guess. This is a solution.

The behavior and performance of the vector store endpoint have changed, and the URL containing the file path is not immediately ready. Thus, the SDK polls immediately and fails.

Mine won’t:

'''error-tolerant file uploader and vector store attachment'''
from time import sleep
from openai import NotFoundError, OpenAI

cl = OpenAI()  # reads OPENAI_API_KEY from the environment

def upload_attach_poll(vs: str, file: str, suffix: str = "") -> str:
    # step 1: upload the file to OpenAI storage
    with open(file, "rb") as fh:  # close the handle once the upload finishes
        file_id = cl.files.create(
            file=fh,
            purpose="user_data",
            #expires_after={"anchor": "created_at", "seconds": 3600},
        ).id

    # step 2: attach the uploaded file to the vector store with attributes
    vs_file_id = cl.vector_stores.files.create(
        vector_store_id=vs,
        file_id=file_id,
        attributes={"name": file + suffix},
    ).id

    # step 3: poll the attachment status until it leaves 'in_progress'
    # - tolerate initial 404s before first observed status
    # - give it a minute - or be like openai and loop forever?
    max_retries = 30
    seen_status = False

    sleep(1)  # initial delay before the first retrieve

    for attempt in range(1, max_retries + 1):
        if attempt == max_retries:
            # On the final attempt, do not catch any error so the SDK error surfaces
            status = cl.vector_stores.files.retrieve(
                vector_store_id=vs,
                file_id=file_id,
            ).status
            print(f"Polling gave status {status}...")
            if status != "in_progress":
                return file_id
            break  # fall through to failure after max retries

        try:
            status = cl.vector_stores.files.retrieve(
                vector_store_id=vs,
                file_id=file_id,
            ).status
            seen_status = True
            if status != "in_progress":
                return file_id
        except NotFoundError:
            print("Polling gave NotFoundError - Retrying...")
            # Tolerate early 404s only until the first successful status retrieval
            if seen_status:
                # Do not tolerate a regression back to 404 after seeing a status
                raise

        sleep(2)  # fixed delay between retries

    # If we reach here, status remained 'in_progress' after max retries
    raise TimeoutError(f"file {file_id} remained 'in_progress' after {max_retries} retries")

# demo: make a vector store, use your local file, demo metadata, success uploading
filename = "vs_metadata_demo.py"
vs_id = cl.vector_stores.create(name="vstest").id
file1_id = upload_attach_poll(vs_id, file=filename, suffix="")
file2_id = upload_attach_poll(vs_id, file=filename, suffix="--second")
print("-- retrieving vector store file listing, looking for metadata")
sleep(2)  # this also is acting slow to create a complete listing
vs_files = cl.vector_stores.files.list(vector_store_id=vs_id)
for item in vs_files.data:
    print(item.model_dump())
    cl.files.delete(item.id)  # clean up the uploaded demo file
cl.vector_stores.delete(vector_store_id=vs_id)

The function takes your vector store ID, a local file, and an optional suffix appended to the filename used as metadata, and still gets the file uploaded and attached despite the early 404s.

Running this version prints:

Polling gave NotFoundError - Retrying...
Polling gave NotFoundError - Retrying...
-- retrieving vector store file listing, looking for metadata
{'id': 'file-QSfDDBSARoreLFEiiS1ZZP', 'created_at': 1761590207, 'last_error': None, 'object': 'vector_store.file', 'status': 'completed', 'usage_bytes': 1146, 'vector_store_id': 'vs_68ffbbb190fc8191a6f83381afa81f38', 'attributes': {'name': 'vs_metadata_demo.py--second'}, 'chunking_strategy': {'static': {'chunk_overlap_tokens': 400, 'max_chunk_size_tokens': 800}, 'type': 'static'}}
{'id': 'file-1P3Sg5pEzWYGXgiq2ZzqgG', 'created_at': 1761590196, 'last_error': None, 'object': 'vector_store.file', 'status': 'completed', 'usage_bytes': 1146, 'vector_store_id': 'vs_68ffbbb190fc8191a6f83381afa81f38', 'attributes': {'name': 'vs_metadata_demo.py'}, 'chunking_strategy': {'static': {'chunk_overlap_tokens': 400, 'max_chunk_size_tokens': 800}, 'type': 'static'}}
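
Both files above show status 'completed', but note that the function returns as soon as the status is anything other than 'in_progress', which would also include 'failed' or 'cancelled'. A minimal sketch of a follow-up check, assuming the object has the id, status and last_error fields seen in the listing; ensure_completed is a hypothetical helper name, and the stand-in objects are only there so it runs offline:

```python
from types import SimpleNamespace


def ensure_completed(vs_file) -> str:
    """Return the file ID if indexing completed; raise otherwise."""
    if vs_file.status == "completed":
        return vs_file.id
    if vs_file.status == "in_progress":
        raise TimeoutError(f"file {vs_file.id} is still in_progress")
    # any other status (e.g. 'failed', 'cancelled'): surface last_error if present
    detail = getattr(vs_file.last_error, "message", None) or "no error detail"
    raise RuntimeError(
        f"file {vs_file.id} ended with status {vs_file.status!r}: {detail}"
    )


# offline stand-ins shaped like the vector store file objects printed above
ok = SimpleNamespace(id="file-abc", status="completed", last_error=None)
bad = SimpleNamespace(
    id="file-def",
    status="failed",
    last_error=SimpleNamespace(code="invalid_file", message="could not parse"),
)

print(ensure_completed(ok))
```

With the real client you would pass the result of cl.vector_stores.files.retrieve(...) instead of the stand-ins.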

You can adapt it if you just want to pass a file_id to such a function rather than a file to upload and then attach, or if you don’t need the filename metadata. Crank up the poll time as needed.

The vector store listing is also slow to update after a success status, so the demo of this function that uploads and shows different filenames as search query “attributes” also gets a sleep().
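
For the "just pass a file_id" and "crank up the poll time" adaptations, here is a rough sketch rather than a definitive version. It assumes you hand it a zero-argument callable that retrieves the status, so the same loop can be tried offline; wait_for_attachment, the NotFound stand-in class and the doubling delay are my own illustrative choices, not part of the SDK:

```python
import time
from typing import Callable


class NotFound(Exception):
    """Stand-in for the SDK's 404 error (openai.NotFoundError in the code above)."""


def wait_for_attachment(
    retrieve_status: Callable[[], str],
    not_found: type = NotFound,
    max_retries: int = 30,
    first_delay: float = 1.0,
    max_delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the status leaves 'in_progress'; return the final status."""
    delay = first_delay
    seen_status = False
    for attempt in range(1, max_retries + 1):
        try:
            status = retrieve_status()
            seen_status = True
            if status != "in_progress":
                return status
        except not_found:
            # tolerate early 404s only; re-raise on regression or last attempt
            if seen_status or attempt == max_retries:
                raise
        if attempt < max_retries:
            sleep(delay)
            delay = min(delay * 2, max_delay)  # double the wait, capped
    raise TimeoutError(f"still 'in_progress' after {max_retries} retries")


# offline demo: two 404s, one 'in_progress', then 'completed'
_responses = iter([NotFound(), NotFound(), "in_progress", "completed"])


def _fake_retrieve() -> str:
    item = next(_responses)
    if isinstance(item, Exception):
        raise item
    return item


print(wait_for_attachment(_fake_retrieve, sleep=lambda s: None))
```

Against the real API, the call would presumably be wait_for_attachment(lambda: cl.vector_stores.files.retrieve(vector_store_id=vs, file_id=file_id).status, not_found=NotFoundError), reusing the same retrieve call as the function above.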

Hi! We’re aware of this issue and looking into it. @iperich I checked the file and vector store in your description and if you try retrieving it now it should work – the file finished processing right after your initial retrieval request. We’re working on fixing both the playground bug and the delay in file status update that @_j points out. Thanks for your patience!

Interesting! I will try this. In my case, create_and_poll took about 5-10 minutes to crash, which made me think it was doing something similar to what you did. That’s why I thought the problem was not the timing but the loss of the link between file and vectors.

I tried your approach and it worked!

Anyway, it’s strange that create_and_poll seems not to have been doing the “polling” part for the last 2 or 3 days. If this change was in some changelog, I wasn’t aware of it.

Thanks a lot!
