Vector search results are too slow

I am using the built-in OpenAI Files and vector stores. My vector store is around 500 KB. My search is taking anywhere between 45 sec and 1 min. How can we improve search?

import OpenAI from "openai";

const openai = new OpenAI();

const response = await openai.responses.create({
  model: "gpt-5-mini",
  tools: [{
    type: "file_search",
    vector_store_ids: ["vs_1234567890"],
    max_num_results: 20
  }],
  input: "What are the attributes of an ancient brown dragon?",
});

console.log(response);

console.log(response);


A 45–60 second latency isn’t normal for a 500 KB vector store.

Typical causes include:

– file_search running inside the LLM call (performing vector lookup before generation)

– larger or slower embedding models

– high max_num_results causing reranking overhead

– model-specific tool routing latency

– network region delays

Try running the vector search before the LLM call, reducing result count, switching to text-embedding-3-small, or testing a different model like gpt-4o-mini.

With a 500 KB store, sub-second to a few seconds is typical, not a full minute.
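As a concrete version of the "run the search before the LLM call" suggestion, here is a hedged sketch. It assumes a recent openai Node SDK that exposes openai.vectorStores.search() (the /v1/vector_stores/{id}/search endpoint), and the hit shape used below (hit.content[].text) is an assumption from that API, not taken from the original post:

```javascript
// Sketch: vector search first, then a plain model call with the retrieved text.

// Pure helper: flatten search hits into a numbered plain-text context block.
function buildContext(hits) {
  return hits
    .map((hit, i) => `[${i + 1}] ${hit.content.map((c) => c.text).join("\n")}`)
    .join("\n\n");
}

async function searchThenAnswer(vectorStoreId, question) {
  // Lazy-loaded so buildContext above stays dependency-free.
  const { default: OpenAI } = await import("openai");
  const openai = new OpenAI();

  // 1) Vector search on its own -- no LLM in the loop yet.
  const search = await openai.vectorStores.search(vectorStoreId, {
    query: question,
    max_num_results: 5, // fewer hits = less reranking work
  });

  // 2) Plain model call with only the retrieved chunks, no file_search tool.
  const response = await openai.responses.create({
    model: "gpt-5-mini",
    input: `Answer using only this context:\n\n${buildContext(search.data)}\n\nQuestion: ${question}`,
  });
  return response.output_text;
}

// Usage (requires OPENAI_API_KEY):
if (process.env.OPENAI_API_KEY) {
  searchThenAnswer("vs_1234567890", "What are the attributes of an ancient brown dragon?")
    .then(console.log)
    .catch(console.error);
}
```

The point is that you can time step 1 and step 2 separately, which tells you immediately whether retrieval or generation is the slow half.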

I tried something like this too; still no luck, it takes the same time.

const requestParams: any = {
  model: 'gpt-5.2',
  instructions: getPrompt(),
  conversation: conversationId,
  input: [
    {
      role: 'user',
      content: query,
    },
  ],
  tools: [
    {
      type: 'file_search',
      vector_store_ids: [vectorStoreId],
    },
  ],
};
  1. Should we set max_num_results?
  2. I am using a conversationId.
  3. Do you have an example?

You can set max_num_results, but that alone usually isn’t the true bottleneck.
If the latency is still 45–60 seconds even with a small vector store, then the slowdown is likely coming from:

  1. Running the vector search inside the model call instead of before it

  2. The embedding model you used to build the store

  3. Tool-routing overhead from gpt-5.2

If you want, share just the vector store size and the embedding model you used; that's enough to narrow it down. No full code needed.
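On the example question: a minimal sketch of the request object with max_num_results set, assuming the Responses API spelling `conversation` for the conversation parameter (conversationId and vectorStoreId are your own variables; the instructions field from your snippet is omitted here):

```javascript
// Sketch only: caps retrieval at 5 results and keeps server-side context
// via the `conversation` parameter. All values are placeholders.
function buildRequestParams(query, conversationId, vectorStoreId) {
  return {
    model: "gpt-5.2",
    conversation: conversationId, // Responses API server-side conversation state
    input: [{ role: "user", content: query }],
    tools: [
      {
        type: "file_search",
        vector_store_ids: [vectorStoreId],
        max_num_results: 5, // small cap to limit reranking overhead
      },
    ],
  };
}
```

A small cap like 5 is usually plenty for a 500 KB store; raise it only if answers start missing context.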

I am not using any model for the vector store.

  1. I first upload files via https://api.openai.com/v1/files
  2. Next I create a vector store via https://platform.openai.com/docs/api-reference/vector-stores and associate the file IDs that I created in step 1.
  3. Now I directly use the Responses API file_search with the vector store ID to search the documents.

You are using a model: vector stores are always built with an embedding model behind the scenes, even if you don't manually specify one.

The next useful step is checking which embedding model was used automatically.

You can see it in the vector store details (under “model”).

Once we know that, it’s easier to understand why the latency is so high.

If you want, you can share just that single field, nothing else needed.

Here’s a reference; the vector store uses the same embedding models described here:

https://platform.openai.com/docs/guides/embeddings

You don’t need to change anything on that page; it’s just background info if you want to understand how vector search works internally.
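If it helps, here is a hedged sketch of pulling the store's metadata via the SDK (assuming a recent openai Node SDK; whether an embedding-model field appears in the returned object may depend on the API version, so log the raw object and check):

```javascript
// Pure helper: summarize the fields most relevant to latency questions.
function summarizeStore(store) {
  return {
    name: store.name,
    kb: Math.round(store.usage_bytes / 1024),
    completedFiles: store.file_counts ? store.file_counts.completed : undefined,
  };
}

async function inspectVectorStore(vectorStoreId) {
  // Lazy-loaded so summarizeStore stays testable offline.
  const { default: OpenAI } = await import("openai");
  const openai = new OpenAI();

  // GET /v1/vector_stores/{id} -- returns the store's metadata object.
  const store = await openai.vectorStores.retrieve(vectorStoreId);
  console.log(store); // inspect the raw object for any model field
  console.log(summarizeStore(store));
  return store;
}
```

usage_bytes and file_counts are documented fields on the vector store object, so the summary above is a quick sanity check that the store really is as small as you think.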


If you want faster output from an AI use:

  • a non-reasoning AI
  • a reasoning AI with reasoning effort set lower
  • a mini model with faster token generation rate

If you want faster use of vector store:

  • make the vector store a permanent asset not being modified
  • upload and attach files dynamically in a UI, so they are ready before the user even types their message.

Recommendation currently: Use the API model gpt-4.1-mini for your question-answering when the answer is going to be in searchable documents.


I don't see an option in the OpenAI platform to see the model used by the vector store. Do you have an example API call to retrieve it?

I didn't understand this statement: "make the vector store a permanent asset not being modified".

Currently the workflow is this: the user uploads files, and we attach them to a vector store created for him; if he adds more files, the vector store is updated with the new file IDs, keeping the vector store up to date with his files. With gpt-4.1-mini, we can't pass a conversationId to keep the context, right?

If it is on-demand user files (and not part of your application’s knowledge), then you’d do the second part of what I said: Give the user interface an “upload file” feature. As soon as that is used, upload the file to openai, and then attach to the session-based vector store ID that will be used in the conversation. Then you don’t have to wait for document extraction as you would if making all the requests at once only when “send” is pressed.
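The upload-on-selection flow described above could look roughly like this; a sketch, assuming a recent openai Node SDK where vectorStores.files.createAndPoll exists (it polls until the file is chunked and embedded):

```javascript
// Called the moment the user picks a file in the UI, before they hit send.
async function attachUserFile(vectorStoreId, filePath) {
  const { default: OpenAI } = await import("openai");
  const fs = await import("node:fs");
  const openai = new OpenAI();

  // 1) Upload the raw file.
  const file = await openai.files.create({
    file: fs.createReadStream(filePath),
    purpose: "assistants",
  });

  // 2) Attach it to the session's vector store and poll until indexed,
  // so extraction/embedding is already done when the first question arrives.
  return openai.vectorStores.files.createAndPoll(vectorStoreId, {
    file_id: file.id,
  });
}
```

By the time the user finishes typing, indexing has usually completed, so the file_search call itself no longer pays the extraction cost.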

The conversation ID can be used with any model with the Responses API to maintain a server-side conversation state. It is an endpoint feature, not a model feature.


If your vector store is 500 KB, why not just run that locally?


For a use case where search results are time sensitive, you probably want to implement a vector embedding based search system locally or wherever your server is. There are some decent libraries you can use like chroma and pgvector. You can bring it to under 1 second or maybe even a couple milliseconds with those approaches.
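For a corpus this small (~500 KB), even a brute-force in-process search is effectively instant; libraries like Chroma and pgvector add persistence and indexing on top. A self-contained sketch, assuming the chunk embeddings were precomputed with any embedding model:

```javascript
// Cosine similarity between two equal-length vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Brute-force top-k search over an in-memory list of embedded chunks.
function topK(queryVec, docs, k) {
  return docs
    .map((d) => ({ id: d.id, score: cosine(queryVec, d.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy usage with 2-dimensional placeholder embeddings:
const docs = [
  { id: "a", embedding: [1, 0] },
  { id: "b", embedding: [0, 1] },
  { id: "c", embedding: [0.9, 0.1] },
];
console.log(topK([1, 0], docs, 2)); // "a" ranks first, then "c"
```

With a few hundred chunks this runs in well under a millisecond; HNSW-style indexes only start paying off at much larger scales.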


45 sec to 1 min is the average speed I have encountered in all these "promised" RAG environments. To circumvent that, I had to quit using these vector stores, where you have no control, in favor of my own, where I have control and can do the indexing.


I have used Weaviate on my laptop, calling it via code (Node.js), and speed-wise it's instant. Big fan of it. I've heard good things about Milvus. Most of my clients are on M365 for their NAS, so I use the Azure index for a PaaS RAG. Dunno if this helps, but that's what I know.


I’m getting milliseconds response on “my own” cloud infra using pgvector and HNSW index but of course there are many variables.
