I am using the built-in OpenAI Files and Vector Stores. My vector store is only around 500 KB, yet a search takes anywhere between 45 seconds and 1 minute. How can we improve search speed?
import OpenAI from "openai";
const openai = new OpenAI();
const response = await openai.responses.create({
  model: "gpt-5-mini",
  tools: [{
    type: "file_search",
    vector_store_ids: ["vs_1234567890"],
    max_num_results: 20
  }],
  input: "What are the attributes of an ancient brown dragon?",
});
A 45–60 second latency isn’t normal for a 500 KB vector store.
Typical causes include:
– file_search running inside the LLM call (performing vector lookup before generation)
– larger or slower embedding models
– high max_num_results causing reranking overhead
– model-specific tool routing latency
– network region delays
Try running the vector search before the LLM call, reducing the result count, switching to text-embedding-3-small, or testing a different model such as gpt-4o-mini.
With a 500 KB store, sub-second to a few seconds is typical, not a full minute.
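One way to restructure this is to run the retrieval step yourself against the vector store search endpoint, then make a plain model call with the hits inlined as context, so generation never blocks on tool routing. A rough sketch, assuming the `vectorStores.search` method available in recent versions of the Node SDK (the store ID and question are taken from the post above):

```javascript
import OpenAI from "openai";

const openai = new OpenAI();
const question = "What are the attributes of an ancient brown dragon?";

// 1. Retrieve first: direct vector store search, no model in the loop.
const hits = await openai.vectorStores.search("vs_1234567890", {
  query: question,
  max_num_results: 5, // lower than 20 to cut reranking overhead
});

// 2. Then a single plain model call with the chunks inlined as context.
const context = hits.data
  .map((h) => h.content.map((c) => c.text).join("\n"))
  .join("\n---\n");

const response = await openai.responses.create({
  model: "gpt-5-mini",
  input: `Answer from this context:\n${context}\n\nQuestion: ${question}`,
});
console.log(response.output_text);
```

Timing the two steps separately also tells you whether the latency lives in retrieval or in generation.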
You can set max_num_results, but that alone usually isn’t the true bottleneck.
If the latency is still 45–60 seconds even with a small vector store, then the slowdown is likely coming from:
– running the vector search inside the model call instead of before it
– the embedding model you used to build the store
– tool-routing overhead from gpt-5.2
If you want, share just the vector store size and the embedding model you used; that's enough to narrow it down. No full code needed.
I didn't understand this statement: "make the vector store a permanent asset not being modified".
The current workflow is this: the user uploads files, we attach them to a vector store created for him, and if he adds more files, the vector store is updated with the new file IDs, keeping the vector store up to date with his files. With gpt-4.1-mini, we can't pass a conversation ID to keep the context, right?
If these are on-demand user files (and not part of your application's knowledge), then you'd do the second part of what I said: give the user interface an "upload file" feature. As soon as it is used, upload the file to OpenAI and attach it to the session-based vector store ID that will be used in the conversation. That way you don't have to wait for document extraction, as you would if you made all the requests at once only when "send" is pressed.
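A sketch of that upload-time flow (method names assume a recent Node SDK; `createAndPoll` waits for the file to finish processing, and the IDs and path here are placeholders):

```javascript
import fs from "fs";
import OpenAI from "openai";

const openai = new OpenAI();

// Runs the moment the user picks a file, not when "send" is pressed.
async function attachUserFile(vectorStoreId, path) {
  // 1. Upload the raw file.
  const file = await openai.files.create({
    file: fs.createReadStream(path),
    purpose: "assistants",
  });
  // 2. Attach it to the session's vector store and wait for chunking
  //    and embedding to finish, so it is searchable by the next message.
  await openai.vectorStores.files.createAndPoll(vectorStoreId, {
    file_id: file.id,
  });
  return file.id;
}
```

By the time the user has typed their message, extraction is usually already done.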
The conversation ID can be used with any model with the Responses API to maintain a server-side conversation state. It is an endpoint feature, not a model feature.
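So for the gpt-4.1-mini question above: you chain turns with `previous_response_id`, and it works the same for any model on the Responses endpoint. A minimal sketch:

```javascript
import OpenAI from "openai";

const openai = new OpenAI();

// First turn: nothing special.
const first = await openai.responses.create({
  model: "gpt-4.1-mini",
  input: "My favorite color is teal.",
});

// Later turn: point at the previous response ID and the server
// restores the conversation state; you resend no transcript yourself.
const second = await openai.responses.create({
  model: "gpt-4.1-mini",
  previous_response_id: first.id,
  input: "What is my favorite color?",
});
console.log(second.output_text);
```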
For a use case where search results are time sensitive, you probably want to implement a vector embedding based search system locally or wherever your server is. There are some decent libraries you can use like chroma and pgvector. You can bring it to under 1 second or maybe even a couple milliseconds with those approaches.
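At its core, what those local systems do is nearest-neighbor search over embedding vectors; chroma and pgvector add real indexes (e.g. HNSW) on top for scale. A brute-force sketch of the idea, with made-up 3-dimensional vectors standing in for real embeddings:

```javascript
// Cosine similarity between two equal-length vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Score every document against the query vector, best matches first.
function topK(queryVec, docs, k) {
  return docs
    .map((d) => ({ id: d.id, score: cosine(queryVec, d.vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy vectors in place of real embeddings:
const docs = [
  { id: "dragon", vec: [0.9, 0.1, 0.0] },
  { id: "goblin", vec: [0.1, 0.9, 0.0] },
  { id: "troll",  vec: [0.0, 0.2, 0.9] },
];
console.log(topK([1, 0, 0], docs, 2)); // highest-scoring ids first
```

Running in-process like this, with no network hop or tool routing, is why those libraries come in at milliseconds rather than seconds.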
45 sec to 1 min is the average speed I have encountered in all of these "promised" RAG environments. To get around that, I had to quit using these vector stores, where you have no control, and move to my own setup, where I control the storage and indexing.
I have used Weaviate on my laptop, calling it via code (Node.js), and it's instant speed-wise. Big fan of it. I've heard good things about Milvus. Most of my clients are on M365 for their NAS, so I use the Azure index (Azure AI Search) for a PaaS RAG. Dunno if this helps, but that's what I know.