What is the range of latencies in RAG-based applications?

I am using hybrid search and re-ranking for retrieval and GPT-4 for generation. My business expects a response latency of less than 2 seconds. Is this even possible with RAG-based applications? I would like to know from the community what range of latencies you are seeing in RAG-based applications, and what techniques you use to reduce latency.

This depends on a lot of factors, including the amount and nature of the content you are looking to inject into your context. The retrieval part can often be executed extremely quickly; it’s the response generation time that usually has the more material impact on the overall execution time.
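One way to check where the time goes in your own pipeline is to time each stage separately. The following is only a minimal sketch: `hybrid_search`, `rerank`, and `generate` are hypothetical stand-ins that sleep for made-up durations, so the numbers it prints say nothing about real systems. Replace them with your actual calls.

```python
import time
from contextlib import contextmanager

timings = {}  # stage name -> elapsed seconds


@contextmanager
def timed(stage):
    """Record the wall-clock time spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start


# Hypothetical stand-ins; the sleep durations are placeholders, not measurements.
def hybrid_search(query):
    time.sleep(0.05)
    return ["doc-a", "doc-b", "doc-c"]


def rerank(query, docs):
    time.sleep(0.10)
    return docs[:2]


def generate(query, docs):
    time.sleep(0.30)
    return f"answer to {query!r} using {len(docs)} passages"


def answer(query):
    with timed("retrieval"):
        docs = hybrid_search(query)
    with timed("rerank"):
        top_docs = rerank(query, docs)
    with timed("generation"):
        text = generate(query, top_docs)
    return text


result = answer("What is our refund policy?")
for stage, seconds in timings.items():
    print(f"{stage:<12}{seconds * 1000:8.1f} ms")
```

With real calls in place of the stubs, the per-stage breakdown shows whether retrieval, re-ranking, or generation is the part worth optimizing first.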

Does the 2 seconds refer only to the time until the first token when streaming the response?

Yes, it is 2 seconds from asking the question to the first token.
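For reference, time to first token can be measured by starting a clock when the question is asked and stopping it when the first streamed chunk arrives. This is a rough sketch only: `fake_token_stream` is a hypothetical generator standing in for a streaming model response, and its delays are invented. The 2-second budget is the one stated above.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple

BUDGET_S = 2.0  # target from the question: first token within 2 seconds


def fake_token_stream(prompt: str) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model response."""
    time.sleep(0.2)  # invented delay before the first token
    for token in ["RAG ", "latency ", "varies."]:
        yield token
        time.sleep(0.05)  # invented delay between tokens


def measure_ttft(
    stream: Iterable[str], started_at: float
) -> Tuple[Optional[float], float, str]:
    """Return (time to first token, total time, full text) for a stream."""
    first_token_at = None
    parts = []
    for token in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter() - started_at
        parts.append(token)
    total = time.perf_counter() - started_at
    return first_token_at, total, "".join(parts)


start = time.perf_counter()
# Retrieval and re-ranking would run here, so their time counts toward TTFT.
ttft, total, text = measure_ttft(fake_token_stream("example question"), start)

print(f"time to first token: {ttft:.3f} s (budget {BUDGET_S} s)")
print(f"total time:          {total:.3f} s")
print("within budget" if ttft is not None and ttft <= BUDGET_S else "over budget")
```

Because the clock starts before retrieval, the measured value covers the whole path from question to first token, which is the quantity the 2-second requirement refers to.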