Codex Cloud: Persistent Session Failures, Container Errors & 502 Infrastructure Issues

Hi all,

I’m seeing repeated failures when working with Codex Cloud sessions, and they appear to be an infrastructure-level issue rather than something in my code or workflow. I wanted to document the behaviour here in case others are experiencing the same thing, or in case the OpenAI team needs a report to reproduce from.

What I’m Seeing

Across multiple attempts to run or resume Codex Cloud sessions, I receive variations of the following:

  • “Error: Failed to read output from session ‘shell’. This session may be corrupt. Please start a new session.”

  • When I try to reopen the session, Codex assumes it might still be alive, then fails to connect.

  • Starting a new session also fails intermittently.

  • Codex repeatedly speculates about session limits, container-tool recovery, or infrastructure availability, but every retry ends in the same failure.

  • Heartbeat checks and feed_chars attempts don’t revive the session.

  • Switching to a different session name (e.g., “session1”) fails as well.

  • The system sometimes reports 502 CAAS errors, blocking any container interactions.

  • When trying fallback strategies (minimal scripts, subprocess checks, specifying ports, waiting between retries), the environment consistently fails to start.
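
For anyone who wants to reproduce the "waiting between retries" fallback outside of Codex, here is a minimal Python sketch of retrying with exponential backoff. The names are hypothetical: retry_with_backoff is my own helper, and operation stands in for whatever call starts or resumes a session. None of this is a Codex or CAAS API.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds or attempts run out.

    `operation` is a placeholder for the real session-start call (an assumption).
    The wait doubles after each failure, is capped at `max_delay`, and gets
    up to 10% random jitter added on top.
    """
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            last_error = exc
            if attempt == max_attempts - 1:
                break  # no point sleeping after the final attempt
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

In my case, spacing out retries like this made no difference, which is part of why I suspect the backend rather than the client side.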

Observed Pattern

  • Sessions appear to become corrupt or unreachable.

  • New sessions fail to initialise.

  • The container environment intermittently reports infrastructure errors.

  • No code changes can be made because the tool never becomes available.

  • Codex itself ultimately concludes it cannot continue due to environment failure.

Impact

  • No commits or changes can be pushed.

  • Tests cannot be run.

  • Any operation requiring a container session effectively stalls.

Can someone from the OpenAI team confirm whether:

  1. There is a known outage or degradation affecting Codex Cloud sessions or CAAS infrastructure?

  2. There are new session limits or behavioural changes that we should be aware of?

  3. There are recommended recovery steps beyond the standard “start a new session” flow?

Happy to provide timestamps or additional logs if helpful.

Thanks in advance. Keen to resume work once the environment settles.