Avoiding throttling during peak hours?

Lately at peak hours (like right now), there’s so much strain on OpenAI’s servers that it renders the API unusable.

My stats from yesterday:

  • 500+ calls to gpt-5
  • 0 timeouts
  • $6.08 in charges

My stats from the last 20 min (I haven’t changed my code whatsoever):

  • 24 calls to gpt-5
  • Around 50% timed out (I set a 3-minute limit)
  • $2.33 in charges

Since server strain doesn’t show up in the API status dashboard, is there somewhere else we can monitor it? It would be very helpful to know what times of day and week are overloaded like this, so I can avoid doing research/engineering at these times.

I’m also making use of prompt caching: I prepend a large document to the beginning of each prompt, so it’s possible that this service is degraded. The way this works is that OpenAI detects when multiple prompts share the same large prefix, automatically caches that prefix on their end, and purges the cache if you don’t hit it again within some period of time (e.g. 5 minutes). This is automatic and managed entirely by OpenAI; users have no direct control over it.
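For reference, roughly the pattern I’m using, as a minimal sketch (the file path and model name are placeholders, and the usage field access may need adjusting for your SDK version):

```python
from openai import OpenAI

client = OpenAI()

# The large document is kept as a byte-identical prefix across calls so the
# automatic prompt cache can apply; only the question at the end varies.
big_document = open("reference_doc.txt").read()           # placeholder path

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-5",                                    # placeholder model name
        messages=[
            {"role": "system", "content": big_document},  # stable prefix
            {"role": "user", "content": question},        # varying suffix
        ],
    )
    # cached_tokens shows how much of the input was served from the cache
    # (field layout per recent SDK versions; verify against yours).
    print(response.usage.prompt_tokens_details.cached_tokens)
    return response.choices[0].message.content
```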

First: devise a performance test:

See the specified maximum output tokens being reached at the bottom of the screenshot (but an empty assistant message…). Launch a few trials in parallel and measure the streaming rate and the initial latency to first token. You can inject different starting text to break caching, but that is only so you don’t get variation in follow-ups, not a test of whether the cache is down.
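A rough sketch of such a harness (model name, prompt, and trial count are placeholders, not a definitive implementation); rerun it with service_tier="priority" for the comparison described below:

```python
import asyncio, time, uuid
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def one_trial(service_tier: str | None = None):
    # A per-trial nonce at the start of the prompt breaks prefix caching,
    # so results don't vary just because some trials hit the cache.
    nonce = uuid.uuid4().hex
    extra = {"service_tier": service_tier} if service_tier else {}

    start = time.monotonic()
    first_token_at = None
    chunks = 0

    stream = await client.chat.completions.create(
        model="gpt-5",                                   # placeholder model name
        messages=[{"role": "user",
                   "content": f"[{nonce}] Write ~300 words about prompt caching."}],
        stream=True,
        **extra,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.monotonic()
            chunks += 1

    if first_token_at is None:                           # e.g. an empty assistant message
        return float("nan"), 0.0
    total = time.monotonic() - start
    ttft = first_token_at - start
    rate = chunks / max(total - ttft, 1e-9)              # chunks/sec as a rough proxy for tokens/sec
    return ttft, rate

async def main(n: int = 4, service_tier: str | None = None):
    results = await asyncio.gather(*(one_trial(service_tier) for _ in range(n)))
    for ttft, rate in results:
        print(f"first token: {ttft:.2f}s   streaming rate: {rate:.1f} chunks/s")

asyncio.run(main())          # rerun with main(service_tier="priority") to compare
```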

Then, perform the same evaluation after setting “service_tier”: “priority” and paying roughly twice as much.

See if it isn’t the case that, instead of being offered a discount for low-priority “flex” processing as on o3, you are now defaulted to poor performance and have to pay double just to get output that isn’t halved in production speed.

Use the “prompt_cache_key” API parameter to improve routing to a server that holds your cached prefix, but don’t overwhelm it with parallel calls, or you’ll be rotated out to other servers.
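For example (a sketch; the key value is arbitrary, the path and model are placeholders, and prompt_cache_key requires a reasonably recent SDK version):

```python
from openai import OpenAI

client = OpenAI()
big_document = open("reference_doc.txt").read()           # the shared prefix

response = client.chat.completions.create(
    model="gpt-5",                                        # placeholder model name
    prompt_cache_key="research-batch-01",                 # arbitrary stable key for this workload
    messages=[
        {"role": "system", "content": big_document},      # identical prefix across calls
        {"role": "user", "content": "Summarize section 3."},
    ],
)
```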

Welcome to the dev forum, @carlson.

Currently, there’s no interface from OpenAI for viewing peak server-usage times.

However, if guaranteed timely responses are a requirement, I’d recommend looking into the priority service tier. Once you have it for your org, you can simply set the service_tier param to priority to ensure the requests get processed in a timely manner.
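For example, a minimal sketch (model and prompt are placeholders; priority processing is billed at higher per-token rates, so check the pricing page):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-5",                         # placeholder model name
    service_tier="priority",               # billed at the higher priority-tier rates
    messages=[{"role": "user", "content": "ping"}],
)
print(response.service_tier)               # reported tier (available in recent API versions)
```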


Thanks @sps, this helped, with a few caveats. Sharing here to help others:

  1. First of all, the OpenAI docs say the priority service tier is only available to enterprise customers. You can see this on the pricing page you shared and in the FAQ. However, as I was reading through the docs, I noticed that this page doesn’t mention anything about an enterprise restriction, so I tried running the sample code anyway and it worked, even though I don’t have an enterprise account. So it seems OpenAI has expanded access without announcing it.
  2. Secondly, I investigated my usage stats more closely: in the API usage dashboard, choose Group by > Line Item to see charges broken down by input and output tokens. For reasons only OpenAI can know, last week roughly 50% of my cost per API call was input tokens vs. 50% output tokens (which makes sense given the program), and that ratio has now shifted to about 20% input vs. 80% output for the same API calls. In other words: the API calls are suddenly taking 4-5x longer and costing 4-5x more on the output side, simply because the model is doing a lot more thinking to complete the same tasks (rough arithmetic below). I can also see a steady trend over the last week, with the model’s output tokens per task creeping up each day.
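
A quick sanity check on that ratio shift, with purely illustrative numbers and assuming the per-call input cost really did stay flat:

```python
# Illustrative arithmetic for the 50/50 -> 20/80 cost-ratio shift (arbitrary units).
input_cost = 1.0                      # assumed constant per call (prompt caching intact)

output_before = input_cost            # 50% input / 50% output  ->  output == input
output_now = 4 * input_cost           # 20% input / 80% output  ->  output == 4x input

print(output_now / output_before)     # 4.0x more output spend (and roughly 4x longer generation)
print((input_cost + output_now) /
      (input_cost + output_before))   # ~2.5x higher total cost per call
```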

@_j unfortunately my use case requires parallel calls. However, per the above, you were right: I can confirm prompt caching is working, because my input tokens have remained constant; rather, we’re forced to pay double to get back to output that isn’t halved in speed.

Oh, and one more thing: OpenAI team, if you’re reading this, I strongly encourage you to make sure your models have accurate knowledge of the OpenAI API and dev tools. It’s such low-hanging fruit and would make it so much easier for people to adopt your product.

One thing to note: a context-window cache isn’t available to further calls until the first call completes, and perhaps a bit longer than that.

You should build a scheduler that sends one “warm-up” call for the batch and awaits its completion before proceeding with the rest.

OpenAI says “about 15 parallel calls per minute” is the point where you’d be distributed to other fulfillment servers, but that likely also depends on job run time. I would schedule no more than 5 async workers for batching the rest.
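A rough sketch of that scheduler (the semaphore size, cache key, path, and prompts are placeholders, and it assumes the shared-prefix pattern from earlier in the thread):

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()
big_document = open("reference_doc.txt").read()           # shared prefix for the whole batch

async def call(question: str):
    return await client.chat.completions.create(
        model="gpt-5",                                    # placeholder model name
        prompt_cache_key="research-batch-01",             # keep the batch on the same cache
        messages=[
            {"role": "system", "content": big_document},
            {"role": "user", "content": question},
        ],
    )

async def run_batch(questions: list[str], workers: int = 5):
    # 1. One warm-up call, awaited to completion, so the prefix gets cached first.
    first = await call(questions[0])

    # 2. The rest of the batch, throttled to a handful of concurrent workers.
    sem = asyncio.Semaphore(workers)

    async def limited(q: str):
        async with sem:
            return await call(q)

    rest = await asyncio.gather(*(limited(q) for q in questions[1:]))
    return [first, *rest]

results = asyncio.run(run_batch([
    "Summarize section 1.",
    "Summarize section 2.",
    "Summarize section 3.",
]))
```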

This info is not specific enough to be actionable. Can you cite where in the docs you’re getting this from? For example:

  • How much time do you need to wait after making the first call for the prompt prefix to be cached?
  • How long does the context window remain cached without getting any further calls? I have heard anywhere from 5 minutes to 1 hour anecdotally.

I’m running batches of 5 parallel calls right now but planning to bump up to 10 and see what happens. If what you’re saying is true then the costs will really start to add up :sleepy_face:

It is actionable enough: if you blast out 50 calls in parallel and they all get ingested at the same time, the cache hit rate across them is going to be zero. Each one will find no hit for its prefix hash, have nothing to retrieve, and will independently build a full-price KV store from processing its input, and that input processing is exactly what caching discounts.
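To put rough numbers on that, with a hypothetical prefix size and an assumed cached-input rate (check the current pricing page for the real discount):

```python
# Hypothetical illustration: 50 calls that all share a 20k-token prefix.
prefix_tokens = 20_000
calls = 50
full_rate = 1.0        # relative cost per uncached input token
cached_rate = 0.1      # assumed cached-input rate; verify against current pricing

all_at_once  = calls * prefix_tokens * full_rate                                # nothing hits the cache
warmed_first = prefix_tokens * full_rate + (calls - 1) * prefix_tokens * cached_rate

print(all_at_once / warmed_first)   # how many times more you pay for input when nothing hits the cache
```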

So don’t do that.

Do what is logical for success.

Your docs are here.