Wfr_ errors in Responses API background mode + correlation with code_interpreter-heavy workflows?

I’ve been running into this issue over the past day (or two more, don’t know how long it’s been around today but started noticing it last night), and it’s been super weird:

“An error occurred while processing your request. You can retry your request,
or contact us through our help center at help.openai.com if the error persists.
Please include the request ID wfr_… in your message.”

This fires in a way that’s not rate-limited (no 429, no rate_limit_exceeded code, no usage warning headers). There’s no error.code field identifying the failure class — just the wfr_ prefix on the request ID as the only distinguishing signal. I can’t find it in the API docs, error codes guide, or release notes.

What I’m observing

Failures correlate very strongly with workflows that invoke code_interpreter heavily. Specifically: single turns where the model makes 10+ Code Interpreter calls (e.g., generating a spreadsheet with multiple tabs, reading back values, writing PDFs via LibreOffice). Simple turns — plain text, web_search, other stateless tools — rarely hit this.

Retrying the same turn often produces a fresh wfr_ ID with the same generic error. So it’s not transient noise on a single request — something persistent in OpenAI’s backend is rejecting the work for as long as ~60 seconds at a time, then clearing.

Things I’ve ruled out

  • Rate limiting (distinct error class, no 429)
  • Our request shape (same shape works sometimes, fails others; not a payload issue)
  • Transient network (we have retry + responses.retrieve() recovery)
  • SDK version (seen across recent openai python SDK versions)
  • Context size alone (happens even with moderate context)

What I’ve tried

  • Exponential backoff retry, up to 5 attempts with ~60s total window
  • Lower compact_threshold values for server-side compaction
  • responses.retrieve() recovery after stream drops
  • Patching orphaned function_call items (no “tool output not found” errors)

Questions for the community

  1. Anyone else seeing wfr_ clustering specifically on code_interpreter-heavy turns?
  2. Has anyone gotten a straight answer from OpenAI on what wfr_ actually means or where it originates in their pipeline?
  3. Any known workarounds beyond “retry and hope”?
  4. Any public guidance on container allocation / sandbox limits for code_interpreter? Wondering if this is a container-allocation issue surfacing as a generic error.
  5. If this is tied to code_interpreter infrastructure specifically, are there tier differences in container availability?