30K Tokens Per Minute limit vs 128K+ Context Models – Is Long-Context Usage Actually Possible via API?

I’m developing an open-source conversational client (Tether) designed for sustained dialog with long-term relational memory, using GPT-4o and GPT-5.1 via the /v1/responses endpoint.

As you know, these models support context windows of 128K+ tokens, which is a major feature. However, the current TPM (tokens-per-minute) limit is 30K, even on paid accounts.

That means a single request containing 110K tokens of history (well within the model's context capacity) is rejected, because the TPM limit is lower than the size of the request itself.

In practice, one request (every few hours) is made of:

  • Long-term curated memory: ~8–10K tokens

  • Rolling live context: ~100–120K tokens

  • Prompt instructions & developer guidance: ~1K+ tokens

Total: ~115–130K tokens per request, which the models' context capacity supports…
…but the API's throughput settings do not allow.
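For concreteness, here is a rough sketch (illustrative only, not Tether's actual code) of how such a request measures up against a 30K TPM allocation, assuming tiktoken's o200k_base encoding for the GPT-4o family:

```python
# Illustrative only: measure the assembled request against a 30K TPM allocation.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # assumed encoding for GPT-4o-family models

curated_memory = "..."   # placeholder: ~8-10K tokens of long-term memory
rolling_context = "..."  # placeholder: ~100-120K tokens of conversation history
instructions = "..."     # placeholder: ~1K tokens of developer guidance

prompt = "\n".join([instructions, curated_memory, rolling_context])
request_tokens = len(enc.encode(prompt))

TPM_LIMIT = 30_000
print(f"request: {request_tokens} tokens, TPM limit: {TPM_LIMIT}")
if request_tokens > TPM_LIMIT:
    print("The single request already exceeds the per-minute token budget "
          "and will be rejected before the context window is ever relevant.")
```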

Questions

Are there any API usage patterns or recommended architectures that allow long-context usage without exceeding TPM limits? (One candidate pattern is sketched after these questions.)

Is this a temporary constraint, or an intentional limitation (e.g. cost / resource protection)?

Is the Assistants/Threads API currently able to bypass this (e.g. server-side context caching), or does it face the same TPM restriction on run execution?

Would enterprise “Scale Tier” be the only way to realistically use full 128K context conversationally?
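To make the first question concrete, here is a minimal sketch of one obvious workaround, under stated assumptions: the rolling context is kept as a plain list of turn strings, and o200k_base is the relevant encoding. The oldest turns are dropped until the request fits a token budget, which of course sacrifices exactly the long-context behaviour the project is built around:

```python
# Hypothetical mitigation sketch, not an official OpenAI pattern: trim the
# oldest rolling turns until the whole request fits inside the TPM budget.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

def fit_to_budget(fixed_parts: list[str], rolling_turns: list[str], budget: int) -> list[str]:
    """Drop the oldest turns until fixed parts + remaining turns fit within budget."""
    fixed_cost = sum(count_tokens(p) for p in fixed_parts)
    turns = list(rolling_turns)
    while turns and fixed_cost + sum(count_tokens(t) for t in turns) > budget:
        turns.pop(0)  # discard the oldest turn first
    return turns

# Placeholder data standing in for the real memory and history.
instructions = "developer guidance ..."
curated_memory = "long-term relational memory ..."
conversation_turns = ["user: ...", "assistant: ...", "user: ..."]

trimmed = fit_to_budget([instructions, curated_memory], conversation_turns, budget=30_000)
```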

Clarification

I’m fully aware that:

  • Higher TPM can be allocated under business/enterprise agreements,

  • Long-context inputs are expensive to process,

  • Sequential rate limits protect system reliability.

However, the current default setup leaves roughly three-quarters of the available context unusable for sustained dialog, which seems contradictory to the promotion of extended-context models.

This is like having a racing engine limited to 20 mph.

Are we missing an architectural instruction? Or is full long-context usage intentionally reserved for enterprise tiers only?

Any precise guidance or clarification would be appreciated, especially on whether pay-per-token usage should eventually enable larger TPM allocation without enterprise pricing.

Thanks in advance to anyone with technical or policy insight on this.


Hi,

Welcome to the OpenAI Developer Forum.

As I understand it…

https://platform.openai.com/settings/organization/limits

Rate limits increase based on Tier Level (I am Tier 4)

https://platform.openai.com/docs/guides/rate-limits

Tier Level is determined by how much you have historically spent and how much time has passed since your first deposit.

A single request also cannot exceed your TPM limit: if your tier gives you 30K TPM, any request larger than ~30K tokens is rejected outright. Even if the model supports a 128K context window, your TPM limit still caps the size of any single request.
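One way to see exactly what your organization is granted for a given model: every API response carries rate-limit headers. A minimal sketch with the requests library (the model name and input are placeholders):

```python
# Read the rate-limit headers the API returns on every response.
import os
import requests

resp = requests.post(
    "https://api.openai.com/v1/responses",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={"model": "gpt-4o", "input": "ping"},
)

print("TPM limit:       ", resp.headers.get("x-ratelimit-limit-tokens"))
print("tokens remaining:", resp.headers.get("x-ratelimit-remaining-tokens"))
print("tokens reset in: ", resp.headers.get("x-ratelimit-reset-tokens"))
```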


Thanks a lot for your detailed reply. I checked the tier description, and it seems I need to be at least "Tier 2" to get a TPM limit of 450,000 tokens with GPT-4o and GPT-5.1, which would let me use the full context of these LLMs. I have been experimenting with the API for a couple of weeks, but have only spent a few dollars on it, so I am still Tier 1. Therefore, I just need to make another deposit to bring my total spend over $50 to make it work. It seems the solution is not in the code but in the credit card. Thanks for the explanation.


GPT-5 and GPT-5.1 have a somewhat more generous 500,000 tokens-per-minute rate limit at Tier 1 than prior models; with gpt-4o at Tier 1 you might not even be able to make a single full-context API call. This limit was recently boosted.

You can scroll to the bottom of a particular model's page and see the limits per tier.

https://platform.openai.com/docs/models/gpt-5.1

For tier elevation, don't pay right away: the second payment needs to come after 7 days have elapsed since the first, and if you already made the minimum initial payment, just $45 more will bring your cumulative lifetime payments above $50.
