I’m currently integrating Background Mode with webhooks in a production system, and I’m trying to understand how stable and scalable it is under heavy usage. Are there expected delays when many requests are in flight at once?
My questions:
Is Background Mode stable for high-volume workloads?
Is it still somewhat “best effort” compared to synchronous or streaming requests? I’m noticing that some background requests take longer than expected to complete, while others return very quickly. Since our service is customer-facing, these occasional long delays make me concerned about the overall user experience.
Is there any recommended architecture to avoid lost or stuck jobs?
We’re relying on webhooks to handle the GPT API responses so that the front-end reads only from our database. This ensures that even if the user closes the application or the app crashes unexpectedly, the request will still be processed and the data will still be stored in the database (a simplified sketch of this flow is at the end of this post). Is this a stable approach?
Any limits or best practices for systems that may run thousands of background requests per hour?
I want to ensure I’m not misusing the feature or expecting guarantees that Background Mode isn’t designed to provide.
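For context, here is a simplified sketch of our current flow. The framework, table schema, and event shape are illustrative, and signature verification (which we do per the webhook docs) is elided here:

```python
import sqlite3
from flask import Flask, request
from openai import OpenAI

app = Flask(__name__)
client = OpenAI()
db = sqlite3.connect("jobs.db", check_same_thread=False)
db.execute(
    "CREATE TABLE IF NOT EXISTS responses (id TEXT PRIMARY KEY, status TEXT, output TEXT)"
)

@app.post("/openai/webhook")
def openai_webhook():
    # Signature verification is elided in this sketch.
    event = request.get_json()
    if event.get("type") == "response.completed":
        response_id = event["data"]["id"]
        # The webhook only references the response; fetch the full object.
        resp = client.responses.retrieve(response_id)
        db.execute(
            "INSERT OR REPLACE INTO responses (id, status, output) VALUES (?, ?, ?)",
            (response_id, resp.status, resp.output_text),
        )
        db.commit()
    # Acknowledge quickly; the front-end only ever reads from our database.
    return "", 200
```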
You have a good question, and the answer must always start here:
"background": true in Responses will not change the priority of your AI task or how it is run, and there should be no delays except as are inherent in polling or in separately transmitting a webhook.
Unlike the Batch API, or the “service_tier” parameter, you are not requesting or receiving any different level of service or reliability. You are simply allowing the generation to continue after the initial connection closes, with a retrieval method now offered instead of simply a bill. Under-performance that you can specifically attribute to this mode should be reported.
You need not use webhooks at all. You can poll the responses endpoint with a response ID, provided you also set the necessary “store”: true, and receive a status or output instead of waiting on a delivery that isn’t guaranteed. Polling can also be your fallback: record the response object returned when you create the call, and go after any response ID that has sat an unreasonable time with an unresolved webhook, i.e. don’t act exclusively on successful hook transmission. Decide whether you’ll dispatch another parallel retry for satisfaction.
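A sketch of that polling fallback; the deadline and interval are assumptions you would tune, and deciding which IDs count as “webhook unresolved” is left to your own store:

```python
import time
from openai import OpenAI

client = OpenAI()

def resolve_by_polling(response_id: str, deadline_s: float = 300, interval_s: float = 5):
    """Poll until the response reaches a terminal status, or give up at the deadline."""
    start = time.monotonic()
    while time.monotonic() - start < deadline_s:
        resp = client.responses.retrieve(response_id)
        if resp.status in ("completed", "failed", "cancelled", "incomplete"):
            return resp
        time.sleep(interval_s)
    return None  # still queued/in_progress; escalate or retry the job
```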
Architecturally, you can anticipate that the client may abandon a stream or a listener for your own “hook” on any call, and still offer users their chat session when they revisit; it sounds as though you’ve already thought this out.
The webhook system can have faults, or new bugs, like anything else. For example, an image generation request that failed on safety policy would leave the webhook unsent and your user unanswered, an issue that took OpenAI an extended period to fix.
The limits are the same as for any model call from your organization: you can be rejected for tokens-per-minute, a failure that should be delivered as an immediate error on submission, so it is unrelated to “background”. The API call limits for polling are also high, until Cloudflare decides you are running a DoS attack. Just don’t run yourself out of ports or resources, and don’t miss transmissions. I’ve never seen a maximum-pending-webhooks spec, and if one exists, it had better manifest as a refusal at the edge rate limiter.
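Since the rate-limit rejection arrives at submission time rather than during background processing, you can handle it right where you enqueue work. A sketch, with retry counts and backoff as assumptions:

```python
import time
from openai import OpenAI, RateLimitError

client = OpenAI()

def submit_background(input_text: str, retries: int = 5):
    """Create a background response, backing off if the org is over its rate limit."""
    for attempt in range(retries):
        try:
            return client.responses.create(
                model="gpt-4.1", input=input_text, background=True
            )
        except RateLimitError:
            time.sleep(2 ** attempt)  # immediate 429 on submission; back off and resubmit
    raise RuntimeError("still rate limited after retries")
```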
For load distribution when a cache hit is not anticipated, or for encouraging cache hits and faster generation, you can investigate varying the “prompt_cache_key” parameter as a performance technique.
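A sketch of one such scheme; the per-tenant key and the instruction text are illustrative assumptions, the idea being that requests sharing a long stable prefix and the same key are more likely to land on the same cache:

```python
from openai import OpenAI

client = OpenAI()
shared_instructions = "You are the support assistant for ..."  # long, stable prefix

def submit_for_tenant(tenant_id: str, user_message: str):
    return client.responses.create(
        model="gpt-4.1",
        instructions=shared_instructions,
        input=user_message,
        background=True,
        prompt_cache_key=f"tenant-{tenant_id}",  # group traffic that shares this prefix
    )
```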