GPT 5.2 Extended thinking webchat has unworkably slow token generation (~4 tps)

Here was my prompt:

I have run into a problem, recently the token generation is unworkably slow in chat for GPT 5.2 Extended thinking. I don’t want to use other thinking levels. Is there a reason for this?

The thinking process didn’t seem inappropriately slow; it was the final output generation that took a very long time. This is unworkable for larger code outputs. E.g.:

This took 60 seconds to generate. It’s 219 tokens, so about 4 tps:

Yes. What you’re describing is a common combination of (a) how Extended thinking is designed to behave and (b) a few factors that can make it feel dramatically slower than it used to.

Why GPT-5.2 “Extended thinking” can get very slow

  • Extended thinking deliberately spends more compute before and during the answer. In the ChatGPT UI, Extended (and Heavy for some plans) is explicitly the “more time for deeper reasoning” setting, so slower replies are an expected tradeoff.

  • Long chats slow down every new turn. GPT-5.2 Thinking supports a much larger context window (up to 196K tokens for paid tiers). That’s great for continuity, but it also means the model may be rereading and reasoning over a lot more history each time, which can slow both the start of the response and the streaming speed.

  • Peak load can reduce throughput even without a declared outage. OpenAI’s own troubleshooting notes that some users see slowness during peak hours.

  • Browser/UI issues can look like “slow token generation.” If the page is bogged down …

I don’t do anything sketchy: no browser automation, no particularly heavy usage (though I do use extended thinking a lot, as it is quite superior in its results; I’m nowhere near the 3,000-per-week limit, however).

I’ve been a subscriber since pretty much day one of the GPT-3.5 release (I even contributed PRs to the GitHub repo to get beta access! Those were the days).

This is the first time I’ve ever encountered any serious issues with the service. It has been like this for several days now. I asked on the r/OpenAI subreddit, but there wasn’t much help, so I’m hoping I might get some more responses here.


I also posted on Twitter and got a few responses. https://x.com/QRDL/status/2017696108169081072

That you mention “extended” confirms this is about ChatGPT, the web chatbot.

There, you don’t get any metric of how many internal tokens the AI model generated thinking about the task, or performing actual work such as using Python or its shell to verify answers programmatically.

What you receive can be just a fraction of the actual tokens produced by the model, across multiple iterative turns, calling tools, exploring the web and sites, etc. Thus you cannot inspect the actual token production rate the way you can on the API, where you get a report of the tokens you were billed for.

Yes, it’s dang slow to finish. “Test time compute” is what that’s called. A model that figures things out by deliberations in its internal writings. The subscription product is what they decide to deliver you and what you decide you can tolerate for performance; you are the one asking for extended thinking time yourself.

And just maybe it took some work.

Better than what I just got after the thing stewed for 20+ minutes.

Yes, I mentioned chat in the title. I’ll adjust it to make it more clear.

As I mentioned in my post, I’m not talking about thinking. That’s fine. It’s the output generation that’s the problem, particularly for code generation.

This is unworkably slow and new behavior. I’ve been using ChatGPT since it first released.

To add some terminology: time to first token (i.e. the “thinking,” which seems normal) is not the problem here. It’s the streaming speed after that, which is around 4 tps or lower.

Follow-up to “stopped thinking” (which I’ve gotten many times trying ChatGPT; the AI dead-ends, thinking its deliverable has been delivered or something).

I’ve gotten 60k+ tokens of reasoning output on the API before, with much less delivered as the final output. Figure gpt-5 internally goes 25–50 TPS (50 is the Enterprise service-level guarantee with “priority”). Do some math, and that explains the time taken to see your product.
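To do that math concretely, here’s a minimal sketch (treating the quoted 25–50 figure as tokens per second; both the rate and the 60k-token reasoning trace are rough assumptions, since neither is observable inside ChatGPT):

```python
def wall_clock_minutes(total_tokens: int, tokens_per_second: float) -> float:
    """Minutes needed to stream total_tokens at a steady rate."""
    return total_tokens / tokens_per_second / 60.0

# A hypothetical 60,000-token hidden reasoning trace:
print(wall_clock_minutes(60_000, 50))  # 20.0 minutes at the fast end
print(wall_clock_minutes(60_000, 25))  # 40.0 minutes at the slow end
```

Which lines up with the 20+ minute waits reported above.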

Dangit: I should have copied the code while it was visible; now the chat output is gone, and a refresh won’t bring it back. A silly waste of an hour-plus.

Thank you for the reply, but this is not about the API; it’s about the token streaming rate after the first token appears (after thinking). 4 tps is quite slow.

Some other people have been posting about this, e.g. a Reddit thread titled “Is the response generation also very slow for you? (ChatGPT)”, where one poster reports feeling 1/3 of the usual pace with the default reasoning effort.

No problems are reported, but the thing certainly seems problematic. The slow throughput means you can barely even classify whether there is really an issue, since it’s almost built to slow down any conversation you have. No need for a daily limit when the rate at which you can chat about hard work is two exchanges per hour.

The normal thinking “gimme the code” doesn’t even understand that I provided the wheel and the mount point data file and the installation instructions.

# Seed offline tiktoken cache if user followed the described setup.
_OFFLINE_CACHE_DIR = "/mnt/data"
_OFFLINE_CACHE_FILE = "/mnt/data/fb374d419588a4632f3f557e76b4b70aebbca790"

@_j Can you do a test? Try this prompt in a new chat:

”I have run into a problem, recently the token generation is unworkably slow in chat for GPT 5.2 Extended thinking. I don’t want to use other thinking levels. Is there a reason for this?”

I’d be curious, after the first token starts, how long it takes to generate the next 200 tokens; i.e., what your tokens-per-second rate is for the response generation.

Note that this is GPT 5.2 Extended Thinking. GPT 5.2 Auto and Instant are very fast, as expected (tens of tps). GPT 5.2 Thinking has the same problem.
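For anyone repeating the test, the calculation itself is just this (a minimal sketch; `streaming_tps` is a hypothetical helper name, the token count is whatever estimate you have, and time-to-first-token is deliberately excluded):

```python
def streaming_tps(token_count: int, stream_seconds: float) -> float:
    """Tokens per second over the streaming window only (TTFT excluded)."""
    if stream_seconds <= 0:
        raise ValueError("stream_seconds must be positive")
    return token_count / stream_seconds

# The numbers from my original example: 219 tokens streamed over 60 s.
print(round(streaming_tps(219, 60.0), 2))  # 3.65, i.e. ~4 tps
```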

I can do you better and empower you.

Receiving the output took about 28 seconds more; 1562 tokens of code.

How close can you get to hitting 5 seconds?

import time
import tkinter as tk
from tkinter import ttk
from tkinter import font as tkfont


def format_hhmmss_tenths(seconds: float) -> str:
    if seconds < 0:
        seconds = 0.0
    tenths_total = int(seconds * 10 + 1e-9)
    d = tenths_total % 10
    total_seconds = tenths_total // 10
    s = total_seconds % 60
    m = (total_seconds // 60) % 60
    h = total_seconds // 3600
    return f"{h:02d}:{m:02d}:{s:02d}.{d}"


class StopwatchApp(tk.Tk):
    def __init__(self) -> None:
        super().__init__()

        self.title("Task Segment Stopwatch")
        self.minsize(720, 520)

        self._running = True
        self._start = time.monotonic()
        self._last_mark = self._start
        self._lap = 0
        self._refresh_ms = 25

        self._elapsed_var = tk.StringVar(value="00:00:00.0")
        self._segment_var = tk.StringVar(value="00:00:00.0")

        self._build_ui()
        self.protocol("WM_DELETE_WINDOW", self._on_close)

        self.after(0, self._tick)

    def _build_ui(self) -> None:
        style = ttk.Style(self)
        try:
            style.theme_use("clam")
        except tk.TclError:
            pass

        base = ttk.Frame(self, padding=14)
        base.grid(row=0, column=0, sticky="nsew")
        self.grid_rowconfigure(0, weight=1)
        self.grid_columnconfigure(0, weight=1)

        base.grid_columnconfigure(0, weight=1)
        base.grid_rowconfigure(2, weight=1)

        fixed = tkfont.nametofont("TkFixedFont")
        title_font = fixed.copy()
        title_font.configure(size=max(11, fixed.cget("size") + 1), weight="bold")
        elapsed_font = fixed.copy()
        elapsed_font.configure(size=max(18, fixed.cget("size") + 10), weight="bold")
        segment_font = fixed.copy()
        segment_font.configure(size=max(34, fixed.cget("size") + 24), weight="bold")

        header = ttk.Frame(base)
        header.grid(row=0, column=0, sticky="ew")
        header.grid_columnconfigure(0, weight=1)

        ttk.Label(header, text="Elapsed (since start)", font=title_font).grid(
            row=0, column=0, sticky="w"
        )
        self._elapsed_label = ttk.Label(
            header, textvariable=self._elapsed_var, font=elapsed_font
        )
        self._elapsed_label.grid(row=1, column=0, sticky="w", pady=(2, 0))

        ttk.Separator(base).grid(row=1, column=0, sticky="ew", pady=12)

        mid = ttk.Frame(base)
        mid.grid(row=2, column=0, sticky="nsew")
        mid.grid_columnconfigure(0, weight=1)
        mid.grid_rowconfigure(1, weight=1)

        segment_row = ttk.Frame(mid)
        segment_row.grid(row=0, column=0, sticky="ew")
        segment_row.grid_columnconfigure(0, weight=1)

        ttk.Label(segment_row, text="Current segment (since last press)", font=title_font).grid(
            row=0, column=0, sticky="w"
        )
        self._segment_label = ttk.Label(
            segment_row, textvariable=self._segment_var, font=segment_font
        )
        self._segment_label.grid(row=1, column=0, sticky="w", pady=(2, 8))

        button_row = ttk.Frame(mid)
        button_row.grid(row=0, column=1, sticky="ne", padx=(12, 0))
        self._mark_btn = ttk.Button(button_row, text="Mark / Next Segment", command=self._mark)
        self._mark_btn.grid(row=0, column=0, sticky="e")
        self.bind("<space>", lambda _e: self._mark())
        self.bind("<Return>", lambda _e: self._mark())

        log_frame = ttk.Labelframe(mid, text="Segments log", padding=(10, 10, 10, 8))
        log_frame.grid(row=1, column=0, columnspan=2, sticky="nsew", pady=(8, 0))
        log_frame.grid_rowconfigure(0, weight=1)
        log_frame.grid_columnconfigure(0, weight=1)

        columns = ("lap", "duration", "total")
        self._tree = ttk.Treeview(
            log_frame,
            columns=columns,
            show="headings",
            height=10,
            selectmode="browse",
        )
        self._tree.heading("lap", text="#")
        self._tree.heading("duration", text="Segment duration")
        self._tree.heading("total", text="Elapsed at press")
        self._tree.column("lap", width=60, anchor="e", stretch=False)
        self._tree.column("duration", width=180, anchor="w", stretch=True)
        self._tree.column("total", width=180, anchor="w", stretch=True)

        vsb = ttk.Scrollbar(log_frame, orient="vertical", command=self._tree.yview)
        self._tree.configure(yscrollcommand=vsb.set)

        self._tree.grid(row=0, column=0, sticky="nsew")
        vsb.grid(row=0, column=1, sticky="ns")

        footer = ttk.Frame(base)
        footer.grid(row=3, column=0, sticky="ew", pady=(10, 0))
        footer.grid_columnconfigure(0, weight=1)
        hint = "Tip: press Space or Enter to mark. (Monotonic time; keeps running while open.)"
        ttk.Label(footer, text=hint).grid(row=0, column=0, sticky="w")

    def _tick(self) -> None:
        if not self._running:
            return
        now = time.monotonic()
        self._elapsed_var.set(format_hhmmss_tenths(now - self._start))
        self._segment_var.set(format_hhmmss_tenths(now - self._last_mark))
        self.after(self._refresh_ms, self._tick)

    def _mark(self) -> None:
        now = time.monotonic()
        seg = now - self._last_mark
        total = now - self._start
        self._last_mark = now

        self._lap += 1
        iid = f"lap{self._lap}"
        self._tree.insert(
            "",
            "end",
            iid=iid,
            values=(self._lap, format_hhmmss_tenths(seg), format_hhmmss_tenths(total)),
        )
        self._tree.see(iid)
        self._tree.selection_set(iid)

    def _on_close(self) -> None:
        self._running = False
        self.destroy()


if __name__ == "__main__":
    StopwatchApp().mainloop()

(You can infer I’m asking a bit more when it thinks significantly longer)


Claude Sonnet 4.5: ~20 seconds, free (not $200/mo)


So, that sounds like 1562/28 ≈ 55 tokens per second for the streamed response, correct? That’s versus the 4 tps I am seeing. I am on a Plus plan.

Sounds like you’re getting a “steamed” response.
Try a hard refresh of the browser, with adblock and extensions off, to make sure the app and local data get updated and synced. It could simply be that tokens crawl due to a client-side software issue (and OpenAI also has a fake-stream simulator that they like to use on gpt-4.5).

Yes, though GPT 5.2 Auto and Instant are both in the range of 50 tokens per second. So, they’d have to be doing something quite different with Thinking and Extended Thinking.

I also observe this same slowness on 5.2 thinking on the Mac Desktop app. I’m also a subscriber.


Note that I also have all personalization / chat history turned off. I found it was mirroring me too much, and I couldn’t get fresh, unbiased insights from ChatGPT, as it would just suggest things I had suggested to it before. I also have “Improve model using my data” turned off. I have some brief custom instructions, nothing fancy, just about 20 tokens or so.

I’ve been having the exact same problem with 5.2 Thinking for several days now. I used your prompt for testing:

The result: 20 seconds of thinking and a crazy 140 seconds of typing (719 tokens) after the first word. So it’s ~5 tps of pure “typing.” I’ve never had anything like this before.

Also a Plus subscriber.


If it’s a rate limit, that’s fine, but tell us what it is so we can avoid triggering it. I had tried out a bunch of complex math questions I was curious about, and they frequently hit the 15-minute thinking limit.

If that was a problem, I really didn’t need to do them. It was mostly just curiosity (and a Kaggle contest, but I realized afterward there was a better way to verify them). I wasn’t aware there was some “hidden rate limit.”

Undocumented traffic shaping like this is not a good look. It also makes the unit economics of OpenAI very suspect if they have to resort to deceptive practices like this (assuming it’s not a bug).

It’s a bit of a shame. I was one of the last holdouts on r/singularity telling people that Google taking out OpenAI would be a bad thing. I dunno anymore.

I suppose it’s karma. I used to sit here and shake my head at all the people who complained about the OpenAI service. “First they came for the Communists, and I did not speak out as I was not a Communist.” Indeed!

I’m experiencing something very similar with GPT-5/5.2 Thinking.
It’s not just latency at the start; token generation itself feels sluggish, like ~4 tokens/sec in many sessions.
This is noticeably slower than normal and affects my workflow.
Curious whether this pattern is limited to certain accounts/regions or specific UI states, because not many people have reported it here yet.
Is anyone else seeing the same slow inference speed on Extended Thinking?


I am also affected, in Germany, with a Business subscription. Codex CLI runs fine, but the web and Android apps generate tokens extremely slowly. I have already submitted a ticket to OpenAI, but unfortunately there is no solution in sight yet. Strangely enough, temporary chats work. Therefore, I rule out peering problems.

Benchmarks seem a bit silly at this point if they’re going to nerf things dynamically. Makes you realize how risky it is relying on this stuff.

Honestly, adjusting compute downwards without telling people on these models is Peak Evil (if it can peak), imho. People think they are imagining things, and taking advantage of that for bait and switch is one of the darkest things I’ve ever seen capital do. This could be a temporary outage, ofc.


Full disclosure: I am now seeing about 14 tokens per second on the prompt above, which is noticeably faster.

Whether the problem has just shuffled to some other aspect (see above), I can’t say right now.

But the persistent issue of slow streaming token response generation (4 tps after TTFT) that I had been seeing for the last 5 days is not happening at this time.