Was Choosing OpenAI’s Internal Tools a Mistake? Issues with Vector Store in Legal Use Case

I’m facing a serious issue with OpenAI’s vector store.
I work with a Saudi law firm, and we’re trying to build RAG-like agents to assist with legal tasks.

The problem is when I upload .md files to the vector store:

  • It takes a very long time for the store to be ready to use.

  • Some files fail to upload, and their status shows as “failed”.

All the files are in the correct format (.md), each with a maximum of ~800 tokens, and we’re working with around 3,000 files. Despite that, the upload process is inconsistent and very difficult to manage.

On top of that, I find it frustrating that there’s no way to directly reach a real person in OpenAI’s help center to discuss these problems in depth.

Was my decision to rely on OpenAI’s internal tools the wrong choice?

Note: I am not an OpenAI employee and do not use the OpenAI API.

If the code you are using is posted in a GitHub repository, then try raising the issue there.

I mention this because, with CFG, we have been encouraged to use the GitHub repository.


Ingesting a corpus of 3,000 files (note: where you also set the chunk size very high when attaching them, to avoid splitting) means that 3,000 files have to go through the following steps (a minimal sketch of the flow follows this list):

  • be uploaded to the files endpoint: 3,000 files × ~800 tokens × 3-4 characters per token, roughly 7-10 MB of text;
  • have a vector store ID created with an API call;
  • be “attached” to that vector store container with many more API calls;
  • internally, have document extraction run on each file, with the type detector;
  • have chunking populate a database with file pieces (with real tokenization, not a size guess?);
  • have 3,000 calls to an AI embeddings model run to obtain a tensor for each chunk;
  • have the strings and the vector results put into a database.
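
A minimal sketch of that flow, assuming the official openai Python SDK; the store name, the corpus directory, and the oversized chunk setting are illustrative, and on older SDK versions the vector store calls live under client.beta.vector_stores:

from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1) Create the vector store container once.
store = client.vector_stores.create(name="legal-corpus")

for path in Path("corpus").glob("*.md"):
    # 2) Upload the raw file to the files endpoint.
    uploaded = client.files.create(file=path.open("rb"), purpose="assistants")

    # 3) Attach it to the store; a large max chunk size keeps ~800-token files unsplit.
    client.vector_stores.files.create(
        vector_store_id=store.id,
        file_id=uploaded.id,
        chunking_strategy={
            "type": "static",
            "static": {"max_chunk_size_tokens": 4096, "chunk_overlap_tokens": 400},
        },
    )
    # 4)-6) Extraction, chunking, embedding, and indexing then run server-side.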

Building your own vector database would still require the same number of embeddings API calls and the same transfer of text in those calls. The difference is that you can proactively fire those calls in parallel, on demand, for reliability, but it will still take time.

Using OpenAI’s solution is not the best choice: in the last 30 days alone there have been at least four or five cumulative days of outage in this service. It is much better to pay up front for the embeddings; then you permanently have the embedding values in your own database and can reuse them whenever the same file chunk hash is encountered again.
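
As a sketch of that pay-once idea, assuming the openai Python SDK, a local SQLite cache, and the text-embedding-3-small model (all illustrative choices, not something prescribed here):

import hashlib
import json
import sqlite3

from openai import OpenAI

client = OpenAI()
db = sqlite3.connect("embeddings_cache.db")
db.execute("CREATE TABLE IF NOT EXISTS cache (hash TEXT PRIMARY KEY, text TEXT, vector TEXT)")

def embed_chunk(text: str, model: str = "text-embedding-3-small") -> list[float]:
    """Return the embedding for a chunk, reusing the cached vector when the content hash matches."""
    chunk_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    row = db.execute("SELECT vector FROM cache WHERE hash = ?", (chunk_hash,)).fetchone()
    if row:
        return json.loads(row[0])  # already paid for this chunk once; reuse it
    vector = client.embeddings.create(model=model, input=text).data[0].embedding
    db.execute(
        "INSERT INTO cache (hash, text, vector) VALUES (?, ?, ?)",
        (chunk_hash, text, json.dumps(vector)),
    )
    db.commit()
    return vector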

(Also note: when Arabic text is encoded to tokens, the token and byte consumption per unit of meaning is much higher than for English. Make sure you are using tiktoken to verify your assumptions about size.)
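
A quick way to check the ~800-tokens-per-file assumption with tiktoken; the corpus directory is a placeholder, and cl100k_base is the encoding used by the current embeddings models (swap in o200k_base for newer chat models):

from pathlib import Path

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for path in Path("corpus").glob("*.md"):
    text = path.read_text(encoding="utf-8")
    n_tokens = len(enc.encode(text))
    if n_tokens > 800:
        print(f"{path.name}: {n_tokens} tokens (over the assumed 800)")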


Thanks a lot for your helpful explanation; it really clarified many things.

But I’m still facing an issue — my vector store has been stuck for about two days now with no progress:

Status: in_progress  
Created: 1757539623  
Total Files: 3953  
Completed: 2205  
Failed: 1227  
In Progress: 521  
Storage: 6,965,098 bytes

Do you think this means there’s something wrong with my data and I should just create a new vector store and start re-uploading everything? Or would that actually slow down both processes and maybe even risk violating the service terms or adding unexpected costs?

Thanks again for taking the time to share your insights, I really appreciate it!

Thanks for jumping in!
I actually used a very simple script, and the upload process itself worked fine; the real issue started after the upload.

Sometimes I notice similar problems even when uploading files manually through the OpenAI platform, so it’s not just the script. What’s strange now is that the process has been stuck for days, whereas before it usually took only minutes or at most a few hours.

I think it might be better if you created the vector store first, and then iterated through attaching each file to it.

You can launch those attachments in larger parallel groups of async API calls; you aren’t rate-limited here, but you should keep the amount of concurrency in progress bounded and reasonable for the endpoint, so you don’t directly provoke failures.

That way you’d have a processor that is resilient against individual failures: it keeps a database of the files being worked through and can retry a few times when one of them fails.

You’ll probably want to come up with a larger metadata format for processing: the file_id obtained by uploading (and confirmation that the upload succeeded), per-file metadata such as the chunking strategy, the attachment status and the number of retries attempted, and then, bigger picture, tracking by job and by customer. You’ll certainly also want a facility for long-term maintenance, i.e. removing and deleting files. (At that point you’re not far from running your own vector store.) A sketch of such a resilient attach loop follows.
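
A minimal sketch of that loop, assuming the async openai Python SDK; the store ID, directory, concurrency limit, and retry count are illustrative placeholders, and on older SDK versions the vector store calls live under client.beta.vector_stores:

import asyncio
from pathlib import Path

from openai import AsyncOpenAI

client = AsyncOpenAI()
VECTOR_STORE_ID = "vs_abcd"   # placeholder: your existing vector store
MAX_CONCURRENCY = 8           # keep in-flight requests bounded
MAX_RETRIES = 3

semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

async def upload_and_attach(path: Path) -> tuple[str, str]:
    """Upload one file, attach it to the store, and return (filename, final status)."""
    async with semaphore:
        for attempt in range(1, MAX_RETRIES + 1):
            try:
                uploaded = await client.files.create(file=path.open("rb"), purpose="assistants")
                await client.vector_stores.files.create(
                    vector_store_id=VECTOR_STORE_ID, file_id=uploaded.id
                )
                return path.name, "attached"
            except Exception as exc:  # broad catch for a sketch; narrow this in real code
                if attempt == MAX_RETRIES:
                    return path.name, f"failed: {exc}"
                await asyncio.sleep(2 ** attempt)  # simple exponential backoff before retrying

async def main() -> None:
    results = await asyncio.gather(*(upload_and_attach(p) for p in Path("corpus").glob("*.md")))
    for name, status in results:
        print(name, status)

asyncio.run(main())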

You can poll an individual file for its success status:

https://api.openai.com/v1/vector_stores/{vector_store_id}/files/{file_id}

{
  "id": "file-abc123",
  "object": "vector_store.file",
  "created_at": 1699061776,
  "usage_bytes": 1234,
  "vector_store_id": "vs_abcd",
  "status": "completed",
  "last_error": null
}

Just don’t try adding a file again if one is in progress for too long, as you might ultimately end up with duplicates. Instead, delete the vector store file ID that hangs without ever reaching “status”: “completed”. A small polling sketch follows.
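
A minimal polling sketch against the endpoint above, using plain requests; the timeout value is arbitrary, and the OpenAI-Beta header may or may not be needed depending on which API version you are on:

import os
import time

import requests

API_KEY = os.environ["OPENAI_API_KEY"]
HEADERS = {"Authorization": f"Bearer {API_KEY}", "OpenAI-Beta": "assistants=v2"}
BASE = "https://api.openai.com/v1/vector_stores"

def wait_for_file(vector_store_id: str, file_id: str, timeout_s: int = 600) -> str:
    """Poll until the file leaves 'in_progress', or delete it after timeout_s seconds."""
    url = f"{BASE}/{vector_store_id}/files/{file_id}"
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        status = requests.get(url, headers=HEADERS).json().get("status")
        if status in ("completed", "failed"):
            return status
        time.sleep(10)
    # Still hanging: remove it from the store rather than re-adding a duplicate.
    requests.delete(url, headers=HEADERS)
    return "deleted_after_timeout"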

You should be able to get this done in far less than two days.


There’s also a method to create mini-batches of files. You’d still have the problem of a batch of failures potentially taking a long time to resolve, and since the batch object only returns counts, you then have to individually discover which file IDs failed and deal with them. A sketch follows the endpoint below.

POST https://api.openai.com/v1/vector_stores/{vector_store_id}/file_batches
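
For example, a sketch of creating a batch from already-uploaded file IDs, polling it, and then listing only the failed files; the file IDs, the filter parameter, and the OpenAI-Beta header are assumptions to verify against the current API reference:

import os
import time

import requests

API_KEY = os.environ["OPENAI_API_KEY"]
HEADERS = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
    "OpenAI-Beta": "assistants=v2",
}
BASE = "https://api.openai.com/v1/vector_stores"

def attach_batch(vector_store_id: str, file_ids: list[str]) -> None:
    # Create the batch from file IDs that were already uploaded to the files endpoint.
    batch = requests.post(
        f"{BASE}/{vector_store_id}/file_batches",
        headers=HEADERS,
        json={"file_ids": file_ids},
    ).json()

    # Poll until the batch finishes; the batch object itself only reports counts.
    while batch.get("status") == "in_progress":
        time.sleep(10)
        batch = requests.get(
            f"{BASE}/{vector_store_id}/file_batches/{batch['id']}", headers=HEADERS
        ).json()

    # Counts aren't enough: list the batch's files filtered to failures to find
    # which individual file IDs need to be retried.
    failed = requests.get(
        f"{BASE}/{vector_store_id}/file_batches/{batch['id']}/files",
        headers=HEADERS,
        params={"filter": "failed"},
    ).json()
    for item in failed.get("data", []):
        print("failed:", item["id"], item.get("last_error"))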
