I am using the built-in OpenAI Files and Vector Stores. My vector store is only around 500 KB, yet a search takes anywhere between 45 seconds and 1 minute. How can we improve search speed?
import OpenAI from "openai";
const openai = new OpenAI();
const response = await openai.responses.create({
  model: "gpt-5-mini",
  tools: [{
    type: "file_search",
    vector_store_ids: ["vs_1234567890"],
    max_num_results: 20
  }],
  input: "What are the attributes of an ancient brown dragon?",
});
A 45–60 second latency isn’t normal for a 500 KB vector store.
Typical causes include:
– file_search running inside the LLM call (performing vector lookup before generation)
– larger or slower embedding models
– high max_num_results causing reranking overhead
– model-specific tool routing latency
– network region delays
Try running the vector search before the LLM call, reducing the result count, switching to text-embedding-3-small, or testing a different model such as gpt-4o-mini.
With a 500 KB store, sub-second to a few seconds is typical, not a full minute.
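One way to restructure this is to run the retrieval step yourself against the vector store search endpoint, then make a plain model call with the hits inlined as context, so generation never blocks on tool routing. A rough sketch, assuming the `vectorStores.search` method available in recent versions of the Node SDK (the store ID and question are taken from the post above):

```javascript
import OpenAI from "openai";

const openai = new OpenAI();
const question = "What are the attributes of an ancient brown dragon?";

// 1. Retrieve first: direct vector store search, no model in the loop.
const hits = await openai.vectorStores.search("vs_1234567890", {
  query: question,
  max_num_results: 5, // lower than 20 to cut reranking overhead
});

// 2. Then a single plain model call with the chunks inlined as context.
const context = hits.data
  .map((h) => h.content.map((c) => c.text).join("\n"))
  .join("\n---\n");

const response = await openai.responses.create({
  model: "gpt-5-mini",
  input: `Answer from this context:\n${context}\n\nQuestion: ${question}`,
});
console.log(response.output_text);
```

Timing the two steps separately also tells you whether the latency lives in retrieval or in generation.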
You can set max_num_results, but that alone usually isn’t the true bottleneck.
If the latency is still 45–60 seconds even with a small vector store, then the slowdown is likely coming from:
– running the vector search inside the model call instead of before it
– the embedding model you used to build the store
– tool-routing overhead from gpt-5.2
If you want, share just the vector store size and the embedding model you used; that's enough to narrow it down. No full code needed.
I didn't understand this statement: "make the vector store a permanent asset not being modified".
The current workflow is this: the user uploads files, we attach them to a vector store created for him, and if he adds more files, the vector store is updated with the new file IDs, keeping the vector store up to date with his files. With gpt-4.1-mini, we can't pass a conversation ID to keep the context, right?
If these are on-demand user files (and not part of your application's knowledge), then you'd do the second part of what I said: give the user interface an "upload file" feature. As soon as it is used, upload the file to OpenAI and attach it to the session-based vector store ID that will be used in the conversation. That way you don't have to wait for document extraction, as you would if you made all the requests at once only when "send" is pressed.
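A sketch of that upload-time flow (method names assume a recent Node SDK; `createAndPoll` waits for the file to finish processing, and the IDs and path here are placeholders):

```javascript
import fs from "fs";
import OpenAI from "openai";

const openai = new OpenAI();

// Runs the moment the user picks a file, not when "send" is pressed.
async function attachUserFile(vectorStoreId, path) {
  // 1. Upload the raw file.
  const file = await openai.files.create({
    file: fs.createReadStream(path),
    purpose: "assistants",
  });
  // 2. Attach it to the session's vector store and wait for chunking
  //    and embedding to finish, so it is searchable by the next message.
  await openai.vectorStores.files.createAndPoll(vectorStoreId, {
    file_id: file.id,
  });
  return file.id;
}
```

By the time the user has typed their message, extraction is usually already done.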
The conversation ID can be used with any model with the Responses API to maintain a server-side conversation state. It is an endpoint feature, not a model feature.
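So for the gpt-4.1-mini question above: you chain turns with `previous_response_id`, and it works the same for any model on the Responses endpoint. A minimal sketch:

```javascript
import OpenAI from "openai";

const openai = new OpenAI();

// First turn: nothing special.
const first = await openai.responses.create({
  model: "gpt-4.1-mini",
  input: "My favorite color is teal.",
});

// Later turn: point at the previous response ID and the server
// restores the conversation state; you resend no transcript yourself.
const second = await openai.responses.create({
  model: "gpt-4.1-mini",
  previous_response_id: first.id,
  input: "What is my favorite color?",
});
console.log(second.output_text);
```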
For a use case where search results are time sensitive, you probably want to implement a vector embedding based search system locally or wherever your server is. There are some decent libraries you can use like chroma and pgvector. You can bring it to under 1 second or maybe even a couple milliseconds with those approaches.
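At its core, what those local systems do is nearest-neighbor search over embedding vectors; chroma and pgvector add real indexes (e.g. HNSW) on top for scale. A brute-force sketch of the idea, with made-up 3-dimensional vectors standing in for real embeddings:

```javascript
// Cosine similarity between two equal-length vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Score every document against the query vector, best matches first.
function topK(queryVec, docs, k) {
  return docs
    .map((d) => ({ id: d.id, score: cosine(queryVec, d.vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy vectors in place of real embeddings:
const docs = [
  { id: "dragon", vec: [0.9, 0.1, 0.0] },
  { id: "goblin", vec: [0.1, 0.9, 0.0] },
  { id: "troll",  vec: [0.0, 0.2, 0.9] },
];
console.log(topK([1, 0, 0], docs, 2)); // highest-scoring ids first
```

Running in-process like this, with no network hop or tool routing, is why those libraries come in at milliseconds rather than seconds.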
45 sec to 1 min is the average speed I have encountered in all of these "promised" RAG environments. To get around that, I had to quit using these vector stores, where you have no control, and move to my own setup, where I control the storage and indexing.
I have used Weaviate on my laptop, calling it via code (Node.js), and it's instant speed-wise. Big fan of it. I've heard good things about Milvus. Most of my clients are on M365 for their NAS, so I use the Azure index (Azure AI Search) for a PaaS RAG. Dunno if this helps, but that's what I know.