Was Choosing OpenAI’s Internal Tools a Mistake? Issues with Vector Store in Legal Use Case

I’m facing a serious issue with OpenAI’s vector store.
I work with a Saudi law firm, and we’re trying to build RAG-like agents to assist with legal tasks.

The problem is when I upload .md files to the vector store:

  • It takes a very long time for the store to be ready to use.

  • Some files fail to upload, and their status shows as “failed”.

All the files are in the correct format (.md), each with a maximum of ~800 tokens, and we’re working with around 3,000 files. Despite that, the upload process is inconsistent and very difficult to manage.

On top of that, I find it frustrating that there’s no way to directly reach a real person in OpenAI’s help center to discuss these problems in depth.

Was my decision to rely on OpenAI’s internal tools the wrong choice?

Note: I am not an OpenAI employee and do not use the OpenAI API.

If the code you are using is posted in a GitHub repository, then try raising the issue there.

I mention this because, with CFG, we have been encouraged to use the GitHub repository.


Ingesting a corpus of 3,000 files (note: where you also set the chunk size very high when attaching them, to avoid splitting) means that 3,000 files have to go through the following steps (a minimal sketch of the flow follows this list):

  • be uploaded to the files endpoint: 3,000 files × ~800 tokens × 3-4 characters per token, roughly 7-10 MB of text;
  • have a vector store ID created with an API call;
  • be “attached” to that vector store container with many more API calls;
  • internally, have document extraction run on each file, with the type detector;
  • have chunking populate a database with file pieces (with real tokenization, not a size guess?);
  • have 3,000 calls to an AI embeddings model run to obtain a tensor for each chunk;
  • have the strings and the vector results put into a database.
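
A minimal sketch of that flow, assuming the official openai Python SDK; the store name, the corpus directory, and the oversized chunk setting are illustrative, and on older SDK versions the vector store calls live under client.beta.vector_stores:

from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1) Create the vector store container once.
store = client.vector_stores.create(name="legal-corpus")

for path in Path("corpus").glob("*.md"):
    # 2) Upload the raw file to the files endpoint.
    uploaded = client.files.create(file=path.open("rb"), purpose="assistants")

    # 3) Attach it to the store; a large max chunk size keeps ~800-token files unsplit.
    client.vector_stores.files.create(
        vector_store_id=store.id,
        file_id=uploaded.id,
        chunking_strategy={
            "type": "static",
            "static": {"max_chunk_size_tokens": 4096, "chunk_overlap_tokens": 400},
        },
    )
    # 4)-6) Extraction, chunking, embedding, and indexing then run server-side.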

Building your own vector database would still require the same number of embeddings API calls and the same transfer of text in those calls. The difference is that you can proactively fire those calls in parallel, on demand, for reliability, but it will still take time.

Using OpenAI’s solution is not the best choice: in the last 30 days alone there have been at least four or five cumulative days of outage in this service. It is much better to pay up front for the embeddings; then you permanently have the embedding values in your own database and can reuse them whenever the same file chunk hash is encountered again.
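
As a sketch of that pay-once idea, assuming the openai Python SDK, a local SQLite cache, and the text-embedding-3-small model (all illustrative choices, not something prescribed here):

import hashlib
import json
import sqlite3

from openai import OpenAI

client = OpenAI()
db = sqlite3.connect("embeddings_cache.db")
db.execute("CREATE TABLE IF NOT EXISTS cache (hash TEXT PRIMARY KEY, text TEXT, vector TEXT)")

def embed_chunk(text: str, model: str = "text-embedding-3-small") -> list[float]:
    """Return the embedding for a chunk, reusing the cached vector when the content hash matches."""
    chunk_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    row = db.execute("SELECT vector FROM cache WHERE hash = ?", (chunk_hash,)).fetchone()
    if row:
        return json.loads(row[0])  # already paid for this chunk once; reuse it
    vector = client.embeddings.create(model=model, input=text).data[0].embedding
    db.execute(
        "INSERT INTO cache (hash, text, vector) VALUES (?, ?, ?)",
        (chunk_hash, text, json.dumps(vector)),
    )
    db.commit()
    return vector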

(Also note: when Arabic text is encoded to tokens, the token and byte consumption per unit of meaning is much higher than for English. Make sure you are using tiktoken to verify your assumptions about size.)
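
A quick way to check the ~800-tokens-per-file assumption with tiktoken; the corpus directory is a placeholder, and cl100k_base is the encoding used by the current embeddings models (swap in o200k_base for newer chat models):

from pathlib import Path

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for path in Path("corpus").glob("*.md"):
    text = path.read_text(encoding="utf-8")
    n_tokens = len(enc.encode(text))
    if n_tokens > 800:
        print(f"{path.name}: {n_tokens} tokens (over the assumed 800)")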


Thanks a lot for your helpful explanation; it really clarified many things.

But I’m still facing an issue — my vector store has been stuck for about two days now with no progress:

Status: in_progress  
Created: 1757539623  
Total Files: 3953  
Completed: 2205  
Failed: 1227  
In Progress: 521  
Storage: 6,965,098 bytes

Do you think this means there’s something wrong with my data and I should just create a new vector store and start re-uploading everything? Or would that actually slow down both processes and maybe even risk violating the service terms or adding unexpected costs?

Thanks again for taking the time to share your insights, I really appreciate it!

Thanks for jumping in!
I actually used a very simple script, and the upload process itself worked fine; the real issue started after the upload.

Sometimes I notice similar problems even when uploading files manually through the OpenAI platform, so it’s not just the script. What’s strange now is that the process has been stuck for days, whereas before it usually took only minutes or at most a few hours.

I think it might be better if you created the vector store first, and then iterated through attaching each file to it.

You can launch those attachments in larger parallel groups of async API calls; you aren’t rate-limited here, but you should keep the amount of concurrency in progress bounded and reasonable for the endpoint, so you don’t directly provoke failures.

That way you’d have a processor that is resilient against individual failures: it keeps a database of the files being worked through and can retry a few times when one of them fails.

You’ll probably want to come up with a larger metadata format for processing: the file_id obtained by uploading (and confirmation that the upload succeeded), per-file metadata such as the chunking strategy, the attachment status and the number of retries attempted, and then, bigger picture, tracking by job and by customer. You’ll certainly also want a facility for long-term maintenance, i.e. removing and deleting files. (At that point you’re not far from running your own vector store.) A sketch of such a resilient attach loop follows.
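
A minimal sketch of that loop, assuming the async openai Python SDK; the store ID, directory, concurrency limit, and retry count are illustrative placeholders, and on older SDK versions the vector store calls live under client.beta.vector_stores:

import asyncio
from pathlib import Path

from openai import AsyncOpenAI

client = AsyncOpenAI()
VECTOR_STORE_ID = "vs_abcd"   # placeholder: your existing vector store
MAX_CONCURRENCY = 8           # keep in-flight requests bounded
MAX_RETRIES = 3

semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

async def upload_and_attach(path: Path) -> tuple[str, str]:
    """Upload one file, attach it to the store, and return (filename, final status)."""
    async with semaphore:
        for attempt in range(1, MAX_RETRIES + 1):
            try:
                uploaded = await client.files.create(file=path.open("rb"), purpose="assistants")
                await client.vector_stores.files.create(
                    vector_store_id=VECTOR_STORE_ID, file_id=uploaded.id
                )
                return path.name, "attached"
            except Exception as exc:  # broad catch for a sketch; narrow this in real code
                if attempt == MAX_RETRIES:
                    return path.name, f"failed: {exc}"
                await asyncio.sleep(2 ** attempt)  # simple exponential backoff before retrying

async def main() -> None:
    results = await asyncio.gather(*(upload_and_attach(p) for p in Path("corpus").glob("*.md")))
    for name, status in results:
        print(name, status)

asyncio.run(main())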

You can poll an individual file for its success status:

https://api.openai.com/v1/vector_stores/{vector_store_id}/files/{file_id}

{
  "id": "file-abc123",
  "object": "vector_store.file",
  "created_at": 1699061776,
  "usage_bytes": 1234,
  "vector_store_id": "vs_abcd",
  "status": "completed",
  "last_error": null
}

Just don’t try adding a file again if one is in progress for too long, as you might ultimately end up with duplicates. Instead, delete the vector store file ID that hangs without ever reaching “status”: “completed”. A small polling sketch follows.
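
A minimal polling sketch against the endpoint above, using plain requests; the timeout value is arbitrary, and the OpenAI-Beta header may or may not be needed depending on which API version you are on:

import os
import time

import requests

API_KEY = os.environ["OPENAI_API_KEY"]
HEADERS = {"Authorization": f"Bearer {API_KEY}", "OpenAI-Beta": "assistants=v2"}
BASE = "https://api.openai.com/v1/vector_stores"

def wait_for_file(vector_store_id: str, file_id: str, timeout_s: int = 600) -> str:
    """Poll until the file leaves 'in_progress', or delete it after timeout_s seconds."""
    url = f"{BASE}/{vector_store_id}/files/{file_id}"
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        status = requests.get(url, headers=HEADERS).json().get("status")
        if status in ("completed", "failed"):
            return status
        time.sleep(10)
    # Still hanging: remove it from the store rather than re-adding a duplicate.
    requests.delete(url, headers=HEADERS)
    return "deleted_after_timeout"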

You should be able to get this done in far less than two days.


There’s also a method to create mini-batches of files. You’d still have the problem of a batch of failures potentially taking a long time to resolve, and since the batch object only returns counts, you then have to individually discover which file IDs failed and deal with them. A sketch follows the endpoint below.

POST https://api.openai.com/v1/vector_stores/{vector_store_id}/file_batches
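
For example, a sketch of creating a batch from already-uploaded file IDs, polling it, and then listing only the failed files; the file IDs, the filter parameter, and the OpenAI-Beta header are assumptions to verify against the current API reference:

import os
import time

import requests

API_KEY = os.environ["OPENAI_API_KEY"]
HEADERS = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
    "OpenAI-Beta": "assistants=v2",
}
BASE = "https://api.openai.com/v1/vector_stores"

def attach_batch(vector_store_id: str, file_ids: list[str]) -> None:
    # Create the batch from file IDs that were already uploaded to the files endpoint.
    batch = requests.post(
        f"{BASE}/{vector_store_id}/file_batches",
        headers=HEADERS,
        json={"file_ids": file_ids},
    ).json()

    # Poll until the batch finishes; the batch object itself only reports counts.
    while batch.get("status") == "in_progress":
        time.sleep(10)
        batch = requests.get(
            f"{BASE}/{vector_store_id}/file_batches/{batch['id']}", headers=HEADERS
        ).json()

    # Counts aren't enough: list the batch's files filtered to failures to find
    # which individual file IDs need to be retried.
    failed = requests.get(
        f"{BASE}/{vector_store_id}/file_batches/{batch['id']}/files",
        headers=HEADERS,
        params={"filter": "failed"},
    ).json()
    for item in failed.get("data", []):
        print("failed:", item["id"], item.get("last_error"))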
