Hi all,
I am building a simple RAG-based voicebot to be deployed for a car dealership. For this I am using Azure AI Search as the vector index and the GPT-4 Turbo model as the LLM.
The vector search is taking 2.5 to 3 seconds.
And the GPT-4 Turbo response time is anywhere between 3 and 5 seconds.
I am thinking of switching the vector DB; would alternatives like Pinecone or Weaviate improve the speed? If so, which one would be best?
I am also thinking of switching to Llama 2 70B; what inference time can I expect from it?
My goal is to reduce the latency to 3 to 4 seconds.
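Before swapping out components, it may be worth measuring exactly where the time goes, since retrieval and generation have very different remedies. A minimal timing sketch (the `retrieve` and `generate` functions are hypothetical stand-ins; plug in the real Azure AI Search and GPT-4 Turbo calls):

```python
import time

def timed(label, fn, *args, **kwargs):
    """Run fn, print its wall-clock time, and return (result, elapsed)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f}s")
    return result, elapsed

# Hypothetical pipeline stages -- replace with the real search and LLM calls.
def retrieve(query):
    time.sleep(0.01)           # stand-in for the hybrid search round trip
    return ["doc1", "doc2"]

def generate(query, docs):
    time.sleep(0.01)           # stand-in for the LLM completion round trip
    return "answer"

query = "What SUVs do you have in stock?"
docs, t_search = timed("vector search", retrieve, query)
answer, t_llm = timed("llm generation", generate, query, docs)
print(f"total: {t_search + t_llm:.3f}s")
```

Logging the two stages separately makes it clear whether the 3–4 second budget should be spent on the index, the model, or both.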
That’s slow - how many records are you dealing with?
What is the database platform?
There are around 10K records.
I have stored them in an Azure AI Search index and am running a hybrid query (vector search + semantic reranking) to retrieve the relevant records.
And yes, even I'm surprised at how slow it is.
Ah, but you are doing a hybrid search; that may make it slower.
Using pgvector on PSQL with 150k records, I’m getting split-second ordered matches on a tiny VPS with only 4GB.
But that is using pure vector (semantic) search only, with no reranking step. Perhaps you should consider it for performance reasons.
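For a sense of scale: at 10K–150K records, even a brute-force in-memory cosine scan is well under a second, so the similarity search itself should not be the bottleneck. A rough NumPy sketch, using random vectors as a stand-in for real embeddings (the 1536 dimension assumes an ada-002-style embedding model):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
n, dim = 10_000, 1536                    # corpus size like the one above

# Random unit vectors standing in for stored document embeddings.
corpus = rng.standard_normal((n, dim)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

query = rng.standard_normal(dim).astype(np.float32)
query /= np.linalg.norm(query)

start = time.perf_counter()
scores = corpus @ query                  # cosine similarity via dot product
top10 = np.argsort(scores)[-10:][::-1]   # indices of the 10 best matches
elapsed = time.perf_counter() - start
print(f"{elapsed:.3f}s for brute-force top-10 over {n} vectors")
```

pgvector with an appropriate index avoids even this linear scan, which is consistent with the split-second results reported above; the slow part of the Azure pipeline is more likely the reranking and network round trips than the vector math.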
Yes, a split-second response would be highly beneficial.
Thanks in advance