We have hundreds of thousands of files uploaded for our RAG application. We discovered that many of these are duplicates and so we wrote some code to walk all of the files and remove duplicates. But we’ve run into a problem.
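Here's roughly what our cleanup walker looks like, stripped down for this post (assuming the official `openai` npm package, v4+, with OPENAI_API_KEY in the environment; the filename-plus-size duplicate key is a stand-in for our real content comparison):

```ts
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Walk every file in ascending order, deleting any file whose key we have
// already seen. The key (filename + byte size) is only a placeholder here.
async function removeDuplicates() {
  const seen = new Set<string>();
  // The SDK's async iterator auto-paginates through the whole list.
  for await (const file of client.files.list({ limit: 100, order: "asc" })) {
    const key = `${file.filename}:${file.bytes}`;
    if (seen.has(key)) {
      await client.files.del(file.id); // `files.delete` in newer SDK releases
    } else {
      seen.add(key);
    }
  }
}

removeDuplicates();
```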
The problem shows up with these two calls:

```ts
files.list({ limit: 100, order: 'desc' })
files.list({ limit: 100, order: 'asc' })
```
The response to the first call is a list of 100 files, as you would expect. The response to the second is an empty list, but with has_more=true!
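In case it helps anyone reproduce this, here's a standalone version of those two calls (same assumptions as the sketch above):

```ts
import OpenAI from "openai";

const client = new OpenAI();

async function main() {
  const desc = await client.files.list({ limit: 100, order: "desc" });
  const asc = await client.files.list({ limit: 100, order: "asc" });

  console.log("desc page size:", desc.data.length); // 100, as expected
  console.log("asc page size: ", asc.data.length);  // 0, yet the raw response
  // still reports has_more=true (see the curl output below)
}

main();
```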
You can see this by calling curl directly:
curl "https://api.openai.com/v1/files?order=asc&purpose=assistants&limit=10" -H "Authorization: Bearer sk-proj-xxxxxxxxxxxx"
{
"object": "list",
"data": [],
"has_more": true,
"first_id": null,
"last_id": null
}
We have already deleted thousands of duplicates by walking these lists in ascending order. I suspect OpenAI may have a bug where files.list is still iterating over the already-deleted files: it filters them out of the results but still sets has_more=true. I was hoping that if I waited a while they would actually be purged and listing would start working again, but I've now been waiting 8 hours and it still returns no items.
Is anyone else aware of this issue? I realize I can walk the list in the opposite order instead, but I'm afraid I'll hit the same problem on a repeat run as soon as a complete page of the list is filled with deleted items.
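For now, the defensive version I'm considering looks like this: walk in descending order with the `after` cursor and treat an empty `data` array as end-of-list, even when `has_more` claims otherwise. (Again a sketch: the duplicate key is a placeholder, and I'm assuming the SDK passes the API's `after` parameter through.)

```ts
import OpenAI from "openai";

const client = new OpenAI();

async function removeDuplicatesDefensively() {
  const seen = new Set<string>();
  let after: string | undefined;

  while (true) {
    const page = await client.files.list({ limit: 100, order: "desc", after });

    // Defensive stop: an empty page means we're done, even if the API
    // still reports has_more=true (the state we're stuck in with order=asc).
    if (page.data.length === 0) break;

    for (const file of page.data) {
      const key = `${file.filename}:${file.bytes}`; // placeholder duplicate key
      if (seen.has(key)) {
        await client.files.del(file.id); // `files.delete` in newer SDK releases
      } else {
        seen.add(key);
      }
    }

    // Paginate manually so the empty-page guard above stays in our control.
    after = page.data[page.data.length - 1].id;
  }
}

removeDuplicatesDefensively();
```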