A few days ago I tried LlamaIndex (for vector indexing and semantic search) plus the GPT-3.5 API for question answering over PDF text.
The problem was latency: more than 30 s to get an answer.
Since I need to put this app into a production environment, latencies like that are unacceptable for the customer.
I tried using SentenceTransformer for the embeddings plus the OpenAI API and was able to cut latency to 20 s, but that is still too much for production.
Of those 20 s, 15 are due to the OpenAI API call.
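For reference, this is roughly how I measured where the time goes: wrap each stage of the pipeline in a timer. The sketch below uses stub functions in place of the real calls (in the actual app, `embed` is `SentenceTransformer.encode`, `retrieve` is the vector-index lookup, and `complete` is the OpenAI chat completion), so the function names here are placeholders, not real library APIs:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, timings):
    """Record the wall-clock time of the enclosed block under `label`."""
    start = time.perf_counter()
    yield
    timings[label] = time.perf_counter() - start

# Placeholder stages: in the real app these would be
# SentenceTransformer.encode, the vector-index query, and the
# OpenAI chat-completion request, respectively.
def embed(question):
    return [0.0] * 384

def retrieve(query_vec):
    return ["relevant PDF chunk"]

def complete(question, chunks):
    return "stub answer"

def answer(question):
    timings = {}
    with timed("embed", timings):
        vec = embed(question)
    with timed("retrieve", timings):
        chunks = retrieve(vec)
    with timed("llm", timings):
        reply = complete(question, chunks)
    return reply, timings

reply, timings = answer("What does the PDF say about X?")
print(sorted(timings))  # → ['embed', 'llm', 'retrieve']
```

In my case the `llm` bucket dominates (about 15 of the 20 seconds), which is why cutting the embedding time only got me so far.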
Is there a way to reduce this time? I would like to get to 3-7 s latency…