Is there any reason to keep a file around after it has been converted into a vector store? I ask because there is a 100GB limit on the number of files that an organization can upload but there is no stated limit that I could find on vector stores. So, in order to prevent exceeding the file limit, I would like to delete the files after creating vector stores from them. Is there any reason not to do this?
Whether to keep files after converting them into a vector store depends on your specific use case and future needs. Here are the main points to consider:
- Re-indexing and Regeneration: If your vector store ever becomes corrupted, or if you need to change the embedding model, you would need the original files to recreate the vector store. Deleting them could mean losing this ability unless you back up the files elsewhere.
- Metadata and Context: Vector stores typically retain the vectorized representation of the data (e.g., numerical embeddings) but might not preserve full document metadata, structure, or non-text content. If that metadata or context is needed for future tasks (like formatting or specific structure), keeping the original files may be important.
- File Updates: If your files are subject to updates or revisions, the vector store won’t reflect these changes unless you reprocess the new versions. Keeping the originals around makes it easier to update your vector store as needed.
- Legal and Compliance Needs: Some industries or organizations require the retention of original documents for legal, regulatory, or auditing purposes. Deleting these files could cause issues if you later need to retrieve the original content for compliance checks.
If storage space is your primary concern and these risks don’t apply to your use case, you might be safe deleting the files once the vector store is created. However, maintaining a backup or alternative storage solution for the original files might be a good middle-ground approach to avoid potential issues later.
Thanks for your response. None of those situations really apply. However, after some testing I have found that when I delete a file that is attached to a vector store, the file_search tool stops working with respect to that vector store, so there is some kind of hidden dependency there. This is unfortunate because it prevents the creation of a multi-user app where many different users are potentially creating many assistants, and I think the 100GB limit could quickly be exceeded.
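The test described above can be sketched roughly as follows. This is a minimal sketch, assuming the `openai` Python SDK and placeholder ids from your own uploads; the exact `beta.vector_stores` path has shifted between SDK versions, so treat this as the shape of the check rather than a definitive reproduction:

```python
def check_file_search_dependency(api_key: str, vector_store_id: str, file_id: str) -> str:
    """Delete the uploaded source file, then inspect how the vector store
    still reports its entry for that file. Ids are hypothetical placeholders."""
    from openai import OpenAI  # assumed installed: pip install openai

    client = OpenAI(api_key=api_key)

    # Delete the source file from org-wide file storage (the part that
    # counts toward the 100GB limit).
    client.files.delete(file_id)

    # The vector store may still list an entry for the file, but in the
    # testing described above, file_search queries against this store stop
    # returning results from it once the source file is gone.
    entry = client.beta.vector_stores.files.retrieve(
        file_id, vector_store_id=vector_store_id
    )
    return entry.status
```

The point of running something like this is to confirm the hidden dependency before building a deletion step into a multi-user workflow.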
That’s why pasted bot answers are no better than asking the AI ourselves: it doesn’t know either.
- You can’t “change the embedding model”.
- There is no rich metadata, only extracted chunks.
- Pointless: if there are new versions, the old files don’t matter.
- For the majority of compliance issues, such as zero-retention, you want the AI to LOSE the files, not persist them.
You can’t get the files back out, so the other blanks the AI tries to fill in are also pointless.
The storage space is almost never a concern as the bot concludes - you get 100GB of free storage for the uploaded files.
You get billed for vector store usage, though. Why would OpenAI cap that storage billing at $0.10 per GB of vector storage per day when it’s just sitting there with no AI expense? No limit is documented, and you can put the same files in multiple vector stores, incurring overlapping billing on the same source data.
There is a functional reason against deleting: the original file name is lost in the UI, as you can see when exploring vector stores after deleting the source file. The AI still receives the file name, though, even after the PDF has been deleted from storage.
You might consider that it is simply easier to leave the files there instead of re-uploading, in case you want to assemble different combinations of files into another vector store, or want to try a new chunking strategy that needs the source file to be added again. Just don’t lose track: you can’t list more than 10,000 of them.
I thought the question was about deleting the original files and the reasons why they should or should not be deleted. Am I wrong?
Your understanding of the question is not wrong. The reasoning in your response for or against, though, seems to have no basis in actual knowledge of how the Assistants API functions.
Hey _j, I see where you’re coming from. Let me explain more clearly:
- Re-indexing: I was thinking that if the vector store got messed up or needed to be rebuilt, keeping the files might help. But since you can’t change the embedding model later, there’s really no point in keeping them unless you’re completely redoing the store.
- Metadata: I talked about keeping files for metadata reasons, but you’re right: vector stores don’t keep that rich metadata. So unless someone really needs the original structure, there’s no need to hang on to the files.
- File Updates: I brought up updates because if files are revised, you’d need to reprocess them. But if the data isn’t changing, you’re correct: keeping them doesn’t add value.
- Compliance: You’re spot on with compliance. If you’ve got a zero-retention policy, deleting the files is the way to go. I was thinking of cases where legal rules require keeping them, but those are specific cases.
- Storage: It’s true that storage space isn’t really the issue; the cost of vector store usage adds up over time. If you don’t need the files anymore, getting rid of them makes sense.
Thanks for the clarification…
Here are the links I used without the OpenAI references:
- LangChain Indexing API and Syncing Vector Stores (LangChain Blog).
- Zilliz Guide on Vectorizing and Querying Data (Zilliz Vector Hub).
The only reason I care about deleting the files is because the 100GB limit will not work for my business model. I would rather leave them there and pay for what I’m using. The problem is I can’t possibly know beforehand how much file space I will need. If my product is successful, the 100GB limit will be exceeded and I’ll be in a bit of a pickle. So it’s kind of a deal breaker at the moment.
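The "can't know beforehand" concern can at least be bounded with a rough capacity check against the 100GB limit discussed above. The per-user figure here is a made-up assumption for illustration:

```python
# How many users fit under the org-wide 100 GB file limit, assuming a
# hypothetical average upload size per user.
ORG_LIMIT_GB = 100
AVG_GB_PER_USER = 0.25  # assumption: each user uploads ~250 MB of documents

max_users = int(ORG_LIMIT_GB / AVG_GB_PER_USER)
print(max_users)  # 400 users before the org limit is hit
```

Running this against your own expected per-user upload size shows how quickly the ceiling arrives, which is exactly the deal-breaker scenario described.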