Responses API (with RAG) generation performance data

We use the Responses API in combination with file search (RAG) in a chat application that we provide to customers, with gpt-4o-mini as the model. I have seen (and shared) concerns about the performance of both the Assistants and Responses APIs, so I thought some of you might welcome specific data to compare against your own experience.

Shown below are screenshots from our dashboard showing response times for generate requests. The first is from today; the second is from yesterday. Each vertical line represents one generate request: the height of the bar shows how long it took to get the complete response, and the dot on each line marks when the response started streaming. The different colors represent different customers, since we serve many of them.
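
For context, here is roughly how those two numbers can be measured per request. This is a minimal sketch rather than our production code; the vector store ID and the question are placeholders:

```python
# Sketch: time a streamed Responses API call with file search.
# VECTOR_STORE_ID and the input text are placeholders, not real values.
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
VECTOR_STORE_ID = "vs_..."  # placeholder vector store ID

start = time.monotonic()
first_token_at = None

stream = client.responses.create(
    model="gpt-4o-mini",
    input="Customer question goes here",
    tools=[{"type": "file_search", "vector_store_ids": [VECTOR_STORE_ID]}],
    stream=True,
)

for event in stream:
    # The first text delta marks when streaming starts (the dot on our chart).
    if event.type == "response.output_text.delta" and first_token_at is None:
        first_token_at = time.monotonic() - start

total = time.monotonic() - start  # bar height on our chart
if first_token_at is not None:
    print(f"started streaming after {first_token_at:.1f}s, completed in {total:.1f}s")
```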

Our experience is that some days are much better than others. Today has been reasonable. Yesterday was worse. Other days have been much worse.

Our goal with OpenAI has been for streaming to start within 7.5s and for the response to complete within 15s of the request start, and we want to hit that 95% of the time or more. As you can see, we are not achieving those goals. But the bigger problem is the bad days, when performance gets really “choppy”.
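
In other words, these are p95 targets. A quick sketch of the check we run, with made-up sample timings standing in for real data:

```python
# Check latency samples against our p95 targets (7.5s to first token,
# 15s to completion). The sample lists here are stand-ins for real data.
import statistics

def p95(samples: list[float]) -> float:
    # statistics.quantiles with n=20 yields 19 cut points; index 18 is p95
    return statistics.quantiles(samples, n=20)[18]

ttft_samples = [3.2, 6.1, 9.8, 4.4, 12.0]      # time to first token, seconds
total_samples = [8.5, 13.9, 21.2, 10.1, 18.7]  # time to completion, seconds

print("TTFT p95:", p95(ttft_samples), "target: <= 7.5")
print("Total p95:", p95(total_samples), "target: <= 15")
```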

I should mention that we also generate responses to the same questions in parallel using Google. There we don’t even bother to stream, because 90% of responses arrive within 1s and 99% within 3s.
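
The fan-out itself is straightforward. Here is a sketch of the pattern using asyncio; ask_google() is a stub for whichever Google client you use:

```python
# Sketch: send the same question to both providers in parallel.
# ask_google() is a placeholder; substitute your Google client call.
import asyncio
from openai import AsyncOpenAI

openai_client = AsyncOpenAI()

async def ask_openai(question: str) -> str:
    response = await openai_client.responses.create(
        model="gpt-4o-mini", input=question
    )
    return response.output_text

async def ask_google(question: str) -> str:
    ...  # call your Google GenAI client here

async def ask_both(question: str):
    # Both requests run concurrently; results come back as a pair.
    return await asyncio.gather(ask_openai(question), ask_google(question))
```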

You OpenAI folks should really focus on performance if you don’t want to be wiped out of the API market.


I think sama recently tweeted that they want to talk to anyone with 100k+ GPU “packs”… so I’m sure performance is something they’re working on. It must be crazy keeping ChatGPT and everything else moving.

Good to see some hard numbers, though. Thanks for sharing!