I have several different assistants, each with a vector store attached, and a total of around 12,000 files uploaded.
When I try to list all my files, however, it seems to only list the 10,000 most recently uploaded. I assumed at first that this was a pagination issue and that it would be possible to list the next 10,000 files, but files.list() doesn’t seem to take any useful args (such as before/after) other than ‘purpose’. All my files have purpose=assistants, so how can I access the file_ids of my older files to attach them to a new vector store?
I tried using extra_query to pass in args that work at other endpoints, but nothing I put into extra_query throws an error or changes the files.list() response.
The limit you report on the file list return is undocumented.
This seems like a case where you must (anyway) be keeping a local database of what you’ve uploaded and the purpose, as there is no metadata and no query you can perform besides filtering by purpose. (Not that I didn’t raise that need for metadata, on which a sub-search could be done, with OpenAI months ago…)
Trying some random query parameters like the assistants endpoints support; like you say, no difference:
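A minimal sketch of the kind of thing I tried, relying on the Python SDK’s extra_query pass-through; the pagination params are guesses borrowed from other list endpoints:

```python
from openai import OpenAI

client = OpenAI()

# Baseline: 'purpose' is the only documented filter on this endpoint.
baseline = client.files.list(purpose="assistants")
print(len(baseline.data), getattr(baseline, "has_more", None))

# Smuggle pagination-style params in via extra_query. They work on other
# endpoints (e.g. vector store file lists), but here they neither error
# nor change the response.
attempt = client.files.list(
    purpose="assistants",
    extra_query={"limit": 100, "after": baseline.data[-1].id, "order": "desc"},
)
print(len(attempt.data), getattr(attempt, "has_more", None))
```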
The strange thing is that the raw API response finishes with ‘has_more=True’, but the pagination args that work with other endpoints do not affect the response.
Definitely possible to work around with a separate db, and yes, I have been keeping one for metadata, but it had been syncing against what was confirmed to be uploaded via files.list().
I have experienced the same limitation. The 10,000 limit is documented for vector stores, but I could not find it officially documented for files outside of a vector store. The workaround is to create a different project.
Re: metadata, there seems to be no limit on the length of file names, and the API has no trouble accepting a JSON document as a file name. It’s terrible, but it… works. I actually use URI-encoded strings for mine and store things like a hash for quick comparison with the files I have locally when syncing up.
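Roughly like this; a sketch rather than my exact code, and the metadata fields are just examples:

```python
import hashlib
import json
from pathlib import Path
from urllib.parse import quote, unquote

def metadata_filename(path: Path) -> str:
    """Pack a small JSON blob into the name the file is uploaded under."""
    meta = {
        "name": path.name,
        "md5": hashlib.md5(path.read_bytes()).hexdigest(),  # for quick local/remote diffs
    }
    # URI-encode so the JSON survives as a plain filename string.
    return quote(json.dumps(meta, separators=(",", ":")))

def parse_metadata_filename(filename: str) -> dict:
    """Recover the metadata from a filename returned by files.list()/retrieve()."""
    return json.loads(unquote(filename))
```

Then pass the encoded name as the filename when uploading (e.g. file=(metadata_filename(p), p.read_bytes())) and decode it on the way back when syncing.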
I think I once did hit a filename length limit, though I don’t remember what the limit actually was. But hashing JSON filenames is a good idea.
Also worth clarifying: you can definitely have more than 10k files in a single project, you just can’t list them all. I did realize, though, that you can list all the file IDs associated with a vector store and then look up the filenames via those IDs one at a time.
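For anyone wanting to do the same, a rough sketch with the Python SDK (the vector store ID is a placeholder; iterating the returned page auto-paginates):

```python
from openai import OpenAI

client = OpenAI()
VECTOR_STORE_ID = "vs_..."  # placeholder

# Vector store file lists paginate properly; iterating the page object
# fetches successive pages automatically.
file_ids = [
    vsf.id
    for vsf in client.beta.vector_stores.files.list(
        vector_store_id=VECTOR_STORE_ID, limit=100
    )
]

# Then look the filenames up one at a time via the files endpoint.
filenames = {fid: client.files.retrieve(fid).filename for fid in file_ids}
```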
I’m interested in how everyone is managing that many files.
I’ve recently been working on some architecture to wrap the API, handling the assistant, vector store, vector store file, and file objects that we have in production, and figuring out how to manage them against the local file system and other sources. It’s been a challenge, and I’m wondering if anyone has any suggestions?
I’ve started managing everything per vector store now. The file list for a vector store supports pagination, so it seems like I can always see all file IDs through this method, as long as the files are connected to a vector store.
I am also noticing that I must have > 10,000 files even though client.files.list() always returns 10,000 files.
The reason I dug into this issue today is that I just received this error when trying to create a vector store:
openai.BadRequestError: Error code: 400 - {'error': {'message': 'You have exceeded your file storage quota for assistants. Read more about increasing your quota: https://help.openai.com/en/articles/8550641-assistants-api-v2-faq', 'type': 'invalid_request_error', 'param': None, 'code': None}}
I asked OpenAI to increase my storage limit beyond the standard “file storage quota of 100 GB” as recommended here, and I’m waiting to hear back.
In the meantime, I started deleting some of my 2,000 vector stores, their associated vector store files, and the files themselves by using client.beta.vector_stores.files.delete and client.files.delete on the appropriate IDs. However, that was taking a long time, so I listed all the files, found that I had exactly 10,000, and started just deleting files. After deleting dozens and receiving the proper confirmations, such as FileDeleted(id='file-5NwFJYwI67H4XTiyZWarjII3', deleted=True, object='file'), I still have 10,000 files.
I can now create vector stores again, but I don’t know for how long because I can’t tell how many total files I have or my total storage amount. I just added up all the “bytes” data for the 10,000 files that I can list, and it’s about 40 GB but that is just a partial storage total. I don’t know if vector store file “usage_bytes” count towards the storage limit, but my total there is only about 3 GB.
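For reference, this is roughly how I’m adding things up; the file total is only a lower bound because of the 10,000-file listing cap:

```python
from openai import OpenAI

client = OpenAI()

# Listable files only (capped at the 10,000 most recent), so a lower bound.
files_page = client.files.list(purpose="assistants")
file_bytes = sum(f.bytes for f in files_page.data)
print(f"{len(files_page.data)} listable files, ~{file_bytes / 1e9:.1f} GB")

# Vector store storage is reported separately as usage_bytes.
vs_bytes = sum(vs.usage_bytes for vs in client.beta.vector_stores.list(limit=100))
print(f"vector stores: ~{vs_bytes / 1e9:.1f} GB")
```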
If I were you, I’d list all vector store IDs, retrieve all the files from each vector store, and create a local db of what is uploaded to OpenAI. Maybe run that overnight, since it will definitely take a long time.
Then you can sort through that and delete whatever you don’t need.
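Something along these lines should work, assuming the Python SDK; the one-retrieve-per-file step is what makes it slow, and the output path is just an example:

```python
import json
from openai import OpenAI

client = OpenAI()

inventory = {}
# Walk every vector store, then every file attached to it.
for vs in client.beta.vector_stores.list(limit=100):
    entries = []
    for vsf in client.beta.vector_stores.files.list(vector_store_id=vs.id, limit=100):
        f = client.files.retrieve(vsf.id)  # one call per file: slow, run it overnight
        entries.append({"file_id": f.id, "filename": f.filename, "bytes": f.bytes})
    inventory[vs.id] = {"name": vs.name, "files": entries}

with open("openai_inventory.json", "w") as fh:
    json.dump(inventory, fh, indent=2)
```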
Also, are you uploading PDFs directly? Just curious how you are getting such high storage; it might be better to process them into Markdown files first.
Thanks, good idea. I saved local JSON files listing all my vector stores, all my vector store files (~11,500), and the 10,000 files I can list.
Since the vector store files are small, the files are large, and deleting files doesn’t seem to impact vector stores or assistants, I deleted many of my files today. After multiple file list and delete API calls, I am finally below 10,000 files and file listing now works as expected.
Yes, the majority of my files are PDFs. I considered converting them to text first, but I figured OpenAI might do a better job than me at parsing them. If I knew their exact methods, I’d be more willing to do it myself.
For me, file management is done through a git repo organized by vector store. When uploading to an assistant, I include the md5 of the file in the file name, so my process is to sync metadata from the vector store, diff with local files, delete extra documents, and upload new documents. This process can be tied to a CI/CD pipeline on a commit hook.
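In case it’s useful, the diff step looks roughly like this; the &lt;md5&gt;__&lt;name&gt; convention and the folder layout are just the shape I use, simplified here:

```python
import hashlib
from pathlib import Path
from openai import OpenAI

client = OpenAI()
VECTOR_STORE_ID = "vs_..."      # placeholder
LOCAL_DIR = Path("docs/store")  # the repo folder backing this vector store

def tagged_name(path: Path) -> str:
    """Prefix the filename with the content md5 so renames and edits both show up."""
    return f"{hashlib.md5(path.read_bytes()).hexdigest()}__{path.name}"

local = {tagged_name(p): p for p in LOCAL_DIR.glob("*.md")}
remote = {
    client.files.retrieve(vsf.id).filename: vsf.id
    for vsf in client.beta.vector_stores.files.list(vector_store_id=VECTOR_STORE_ID)
}

# Delete remote documents that no longer match any local file...
for name, file_id in remote.items():
    if name not in local:
        client.beta.vector_stores.files.delete(vector_store_id=VECTOR_STORE_ID, file_id=file_id)
        client.files.delete(file_id)

# ...and upload/attach anything new or changed locally.
for name, path in local.items():
    if name not in remote:
        f = client.files.create(file=(name, path.read_bytes()), purpose="assistants")
        client.beta.vector_stores.files.create(vector_store_id=VECTOR_STORE_ID, file_id=f.id)
```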
About using PDFs: the best results I have had so far have been with Markdown documents. PDFs may contain tables or irregular flows, which makes them unpredictable. Obviously, this depends on the nature of your PDFs.
Also, keep in mind that the RAG process only sees small windows (800-token windows by default) of the document and quickly loses focus on what is around them. When converting to Markdown, I repeat context info so that it gets captured within every 800-token window (e.g. chapter title, section name). This has dramatically improved the quality of my results.
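To illustrate the idea, a simplified sketch; I’m using tiktoken for the token counting, and the window size and heading format are just examples:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def with_repeated_context(body: str, chapter: str, section: str, window: int = 800) -> str:
    """Re-emit the chapter/section heading at the start of every ~window-token slice,
    so each retrieval chunk carries its own context."""
    header = f"## {chapter} / {section}\n\n"
    step = max(window - len(enc.encode(header)), 1)
    tokens = enc.encode(body)
    pieces = [header + enc.decode(tokens[i : i + step]) for i in range(0, len(tokens), step)]
    return "\n\n".join(pieces)
```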
I have been doing a ‘RAPTOR’-ish thing with long docs, where I include an ~800-token LLM-generated summary at the start of the doc so that the broader concepts can get picked up as chunks, but throwing in at least the document title every 800 tokens is a great idea.
Regarding parsing PDFs into md, try LlamaParse if you haven’t already. There are also some open-source libraries around that do this. If you are hitting OpenAI limits, this could save you $300 a month in storage, haha.
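For reference, basic LlamaParse usage is roughly this (per their docs; it needs the llama-parse package and a LLAMA_CLOUD_API_KEY, and the file names are just examples):

```python
from llama_parse import LlamaParse

# Convert the PDF to Markdown before uploading, instead of sending the raw PDF.
parser = LlamaParse(result_type="markdown")   # picks up LLAMA_CLOUD_API_KEY from the env
documents = parser.load_data("./report.pdf")  # example input file

with open("report.md", "w") as fh:
    fh.write("\n\n".join(doc.text for doc in documents))
```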
It’s a bit crazy, IMO, that the API doesn’t enable a pagination approach here. At scale, I’d imagine it becomes very difficult to ensure there aren’t unnecessary/unused files sitting in my OpenAI storage. I am syncing metadata, but it’s not a foolproof approach, and I generally try to avoid having multiple sources of truth.
I have like 300,000+ files where I can’t figure out what is going on because there is no pagination. I am trying to avoid uploading duplicate content but alas… at least we can delete the entire container.