Stateful Responses API Much Slower Than Chat Completions

I posted this as indirectly related in another thread, but the issue seems specific enough to warrant its own thread.

On Chat Completions, GPT-5 (minimal reasoning) averages about 5-7 seconds for me, even with a 50+ message history and almost 100k tokens. On the Responses API with the same setup and the same number of messages and tokens, it averages 11 seconds or longer. I made sure the reasoning level, verbosity, etc. are the same as well.

It seems to be related to the use of previous_response_id. If I don’t use it, I can get response times back down to about 2-3 sec. With it, however, even just a few messages in the chain and 10k tokens quickly push the latency past 10 sec.
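
To make the chained setup concrete, here is a minimal sketch of timing one turn with and without previous_response_id. The helper name timed_chained_turn and its arguments are hypothetical; it assumes an OpenAI-style Python client whose responses.create accepts previous_response_id and returns an object with an id attribute.

```python
import time
from typing import Any, Optional, Tuple


def timed_chained_turn(client: Any, user_text: str,
                       previous_id: Optional[str] = None,
                       model: str = "gpt-5") -> Tuple[str, float]:
    """Send one turn and return (response id, wall-clock seconds).

    Hypothetical helper: `client` is assumed to expose responses.create().
    """
    kwargs = {"model": model, "input": user_text}
    # Only chain to the stored conversation when an id is supplied.
    if previous_id is not None:
        kwargs["previous_response_id"] = previous_id
    start = time.perf_counter()
    response = client.responses.create(**kwargs)
    elapsed = time.perf_counter() - start
    return response.id, elapsed
```

Calling it in a loop and feeding each returned id into the next call gives per-turn latencies for the chained case; leaving previous_id as None for every call gives the unchained baseline.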

I have the same issue. Chat Completions is way faster, and the Responses API is too slow to handle even a few conversations. It’s at least twice as slow. I don’t even pass in previous_response_id. Latency is the same with or without a conversation id.

Wow, that’s amazing. I’ve been using it pretty much exclusively on medium reasoning + high verbosity, with highly complex context windows requesting a wide variety of tool calling responses, often with about 100-150k tokens, and I wait anywhere from 2-10 minutes for a response!!

You can reduce the time to under a minute by setting minimal reasoning and low verbosity. That is beside the point. The issue for me is that with the same settings (minimal reasoning + low verbosity), the Responses API with previous_response_id set takes almost double the time of the Chat Completions API. Since the docs all say Responses should be what all new projects use, it’s concerning that it seems much less performant.

Hey all! Steve here from the OpenAI API Eng team. We hear you on the latency issues when using previous_response_id. We are working to optimize our database to make this as fast as possible. For the fastest possible latency, we’d recommend using store: false. In this mode, you’d roundtrip all items (like chat completions), and we skip the database so there is no latency hit looking up a previous response.
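
As a rough illustration of the store: false mode described above, the sketch below keeps the conversation history on the client and resends it on every call. The helper name ask_stateless is hypothetical, and it assumes an OpenAI-style Python client whose response exposes output_text. It is a simplification: only the assistant's text is carried forward, so any reasoning or tool-call items are dropped.

```python
from typing import Any, Dict, List


def ask_stateless(client: Any, history: List[Dict[str, str]],
                  user_text: str, model: str = "gpt-5") -> str:
    """Send the full history with store=False and record the reply locally.

    Hypothetical helper: `history` is a plain list of role/content dicts
    owned by the caller; nothing is looked up on the server side.
    """
    history.append({"role": "user", "content": user_text})
    response = client.responses.create(
        model=model,
        input=list(history),  # round-trip every prior item
        store=False,          # skip server-side storage
    )
    reply = response.output_text
    # Simplification: keep only the text of the assistant turn.
    history.append({"role": "assistant", "content": reply})
    return reply
```

The trade-off is the same as with Chat Completions: the request payload grows with each turn, but no previous response has to be fetched before generation starts.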

Thank you @stevecoffey. The latency is also there when previous_response_id is not used but store is true. Is this a known problem as well? I just want to understand whether the optimization you are working on applies only when previous_response_id is passed, or also when it is not.

Also, can you explain why we need to roundtrip all items, especially when passing in a conversation id? I think the whole point of the conversation API is to have state managed within OpenAI servers, right?

Responses API (AzureOpenAI) is significantly slower on average than the Chat Completions endpoint. Occasionally some Responses requests have extreme latency outliers (requests appear to get congested). Please investigate performance/regression of the Responses endpoint vs Chat Completions.

Statistics: Store = True

Responses: mean=4.268s median=2.349s min=1.421s max=21.711s stdev=4.903s
Chat : mean=1.354s median=1.298s min=0.902s max=2.385s stdev=0.330s

Statistics: Store = False

Responses: mean=2.901s median=2.264s min=1.476s max=6.520s stdev=1.530s
Chat : mean=1.257s median=1.203s min=0.891s max=1.813s stdev=0.286s

Code snippets

from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI
import time, random, statistics
import matplotlib.pyplot as plt

tp = get_bearer_token_provider(DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default")
client = AzureOpenAI(azure_endpoint="https://aoai-eastus2-0001.openai.azure.com/",
                     azure_ad_token_provider=tp, api_version="2025-04-01-preview")

N = 20
store = False # True
resp_times = []
chat_times = []

for i in range(N):
    n = random.randint(1, 1000000)
    # Vary the prompt slightly on each run to avoid hitting the cache
    prompt_r = f"{n} Hello, one-sentence bedtime story about a unicorn."
    prompt_c = f"{n} Hiii, one-sentence bedtime story about a unicorn."

    t0 = time.perf_counter()
    r = client.responses.create(model="gpt-5", input=prompt_r, store=store,
                                reasoning={"effort":"minimal"}, text={"verbosity":"low"})
    resp_times.append(time.perf_counter() - t0)

    t0 = time.perf_counter()
    cc = client.chat.completions.create(model="gpt-5",
                messages=[{"role":"user","content":prompt_c}], store=store,
                reasoning_effort="minimal", verbosity="low")
    chat_times.append(time.perf_counter() - t0)

# --- statistics ---
def stats(a):
    return {
        "n": len(a),
        "mean": statistics.mean(a),
        "median": statistics.median(a),
        "stdev": statistics.pstdev(a) if len(a)>1 else 0.0,
        "min": min(a),
        "max": max(a),
    }

sr = stats(resp_times)
sc = stats(chat_times)

print("\nResponses:", f"mean={sr['mean']:.3f}s median={sr['median']:.3f}s min={sr['min']:.3f}s max={sr['max']:.3f}s stdev={sr['stdev']:.3f}s")
print("Chat      :", f"mean={sc['mean']:.3f}s median={sc['median']:.3f}s min={sc['min']:.3f}s max={sc['max']:.3f}s stdev={sc['stdev']:.3f}s")

# --- plot
plt.figure(figsize=(10,4))
plt.plot(range(1, N+1), resp_times, label="Responses")
plt.plot(range(1, N+1), chat_times, label="Chat Completions")
plt.xlabel("Run")
plt.ylabel("Time (s)")
plt.title(f"Responses mean {sr['mean']:.2f}s vs Chat mean {sc['mean']:.2f}s")
plt.legend()
plt.tight_layout()
plt.show()  # display the figure when run as a script
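
Since the mean and stdev reported above are pulled around by a few extreme requests, a tail percentile can make the outliers easier to compare. This is a small sketch using only the standard library; the helper name tail_summary is hypothetical, and it needs at least two samples.

```python
import statistics
from typing import Dict, Sequence


def tail_summary(samples: Sequence[float]) -> Dict[str, float]:
    """Return median, 95th percentile, and max of latency samples."""
    ordered = sorted(samples)
    # n=20 yields 19 cut points at 5% steps; the last one is the 95th percentile.
    cuts = statistics.quantiles(ordered, n=20, method="inclusive")
    return {
        "median": statistics.median(ordered),
        "p95": cuts[18],
        "max": ordered[-1],
    }
```

With the script above, tail_summary(resp_times) and tail_summary(chat_times) could be printed next to the existing stats.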

Thanks for this. @stevecoffey, can you please look into this?