What is the range of latency in RAG-based applications?

joyasree78 · April 9, 2024, 5:27am

I am using hybrid search and re-ranking for retrieval and gpt-4 for generation. My business expects response latency of less than 2 seconds. Is this even possible with RAG-based applications? I would like to know from the community what range of latencies you are seeing in RAG-based applications, and what techniques you use to reduce latency.

jr.2509 · April 9, 2024, 5:39am

This depends on a lot of factors, including the amount and nature of the content you are looking to inject into your context. The retrieval part can often be executed extremely quickly; it's the response generation time that has the more material impact on overall execution time.

Does the 2 sec refer only to the time until the first token when streaming the response?

joyasree78 · April 9, 2024, 6:02am

Yes, it is 2 sec from asking the question to the time to first token.
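
To check a budget like this, it can help to time retrieval separately from time to first token, since the thread suggests generation is usually the larger share. Below is a minimal sketch, not a definitive implementation: `fake_retrieve` and `fake_generate` are hypothetical stand-ins (with made-up sleep times) for the hybrid search plus re-ranking step and the streaming model call, and `measure_ttft` is an illustrative helper name. Swap in your own retriever and any iterator that yields streamed text chunks.

```python
import time
from typing import Callable, Dict, Iterator, List


def fake_retrieve(query: str) -> List[str]:
    """Hypothetical stand-in for hybrid search plus re-ranking."""
    time.sleep(0.02)  # made-up delay, not a measured figure
    return ["passage one", "passage two"]


def fake_generate(prompt: str) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM call that yields text chunks."""
    for chunk in ["RAG ", "latency ", "varies."]:
        time.sleep(0.01)  # made-up delay per chunk
        yield chunk


def measure_ttft(
    query: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], Iterator[str]],
) -> Dict[str, object]:
    """Time retrieval, time to first token, and total time for one question."""
    start = time.perf_counter()

    passages = retrieve(query)
    retrieved_at = time.perf_counter()

    prompt = "\n".join(passages) + "\n\nQuestion: " + query

    first_token_at = None
    chunks = []
    for chunk in generate(prompt):
        if first_token_at is None:
            # The clock started when the question was asked, so this
            # includes retrieval time as well as model latency.
            first_token_at = time.perf_counter()
        chunks.append(chunk)
    end = time.perf_counter()

    return {
        "retrieval_s": retrieved_at - start,
        "ttft_s": None if first_token_at is None else first_token_at - start,
        "total_s": end - start,
        "text": "".join(chunks),
    }


if __name__ == "__main__":
    result = measure_ttft("What latency should I expect?", fake_retrieve, fake_generate)
    for key in ("retrieval_s", "ttft_s", "total_s"):
        print(f"{key}: {result[key]:.3f}")
```

Running this against real components over a sample of representative questions shows how much of the 2 sec is spent before the model even receives the prompt.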
