The following OpenAI postmortem report is very interesting for developers because it reveals a few architectural details (thank you, OpenAI!) that you may or may not have seen before in other reports.
- OpenAI's “primary” database is PostgreSQL, provided by a “cloud service” (presumably Azure).
- OpenAI uses PgBouncer for PostgreSQL DB connection pooling.
- OpenAI's priorities during an incident: /v1/completions and authentication traffic come first.
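For context on the second bullet: routing application traffic through PgBouncer usually just means pointing the client at the pooler's port (conventionally 6432) instead of Postgres directly (5432). A minimal sketch, assuming hypothetical host and database names (not OpenAI's actual configuration):

```python
# Sketch: an application connects to PgBouncer (port 6432 by convention)
# instead of PostgreSQL directly (port 5432). Host, user, and database
# names are hypothetical placeholders, not OpenAI's real setup.

def build_dsn(host: str, port: int, dbname: str, user: str) -> str:
    """Assemble a libpq-style connection string."""
    return f"host={host} port={port} dbname={dbname} user={user}"

# Direct connection to the primary (bypasses pooling):
direct_dsn = build_dsn("pg-primary.internal", 5432, "app", "api_service")

# Pooled connection via PgBouncer -- the app sees one stable endpoint
# while PgBouncer multiplexes many clients onto few server connections:
pooled_dsn = build_dsn("pgbouncer.internal", 6432, "app", "api_service")
```

The point of the indirection is that the database only ever sees a small, bounded number of connections from the pooler, no matter how many application processes are running.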
Full OpenAI Postmortem Report (linked above and in the references at the end)
Starting at 11:15pm PST on Feb 20, 2023, we suffered a major outage across all endpoints and models of our service. /v1/completions traffic was restored by 2:05 am PST on Feb 21. Ongoing database instabilities left some non-Completions services degraded until 1:30pm PST on Feb 21. The root cause was an unexpected database failure with compounding effects delaying full recovery.
Our primary Postgres database was scheduled for a routine automated maintenance by our cloud provider. During this window, no impact was expected because the primary would fail over to a hot standby to continue to serve traffic. This ran as expected. However, additional “read replica” databases were also scheduled for maintenance in the same time window but had no corresponding fail-over functionality. The read replicas unexpectedly failed to come back after their maintenance window. The result was that we were able to keep parts of the site up and running, but not enough database capacity to service all traffic.
Engineers immediately started two streams of work. One was to create new read replicas. The other was to rebalance the available database connections towards the most critical API endpoints. Completions and authentication traffic were prioritized. This brought traffic back to the main /v1/completions endpoints by 2:05 am PT on Feb 21. Recovery was prolonged by existing read replicas unexpectedly getting stuck in a recovery loop and being unable to reconnect to the primary database. This was further compounded by delays in spinning up new replicas due to unexpected dependencies on non-on-call staff in the middle of the night on a holiday weekend and during a holiday-Monday on-call shift change.
Throughout the night we continued to see database instabilities; however, isolation procedures were able to keep /v1/completions online. Unfortunately, due to the traffic volume of ChatGPT and DALL·E, these products remained degraded by the database instabilities.
We use PgBouncer to pool database connections; unfortunately, in the post-recovery database configuration we identified previously unknown slow queries hogging the pools and preventing other queries from running. Database instabilities were further aggravated by new read replicas inadvertently bypassing PgBouncer and exceeding database connection limits. This caused an additional brief outage from 10:43 am to 11:04 am PT.
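The two failure modes described here (slow queries monopolizing a pool, and clients connecting around the pooler) map to well-known PgBouncer settings. A hedged sketch of the relevant knobs; the values are illustrative, not OpenAI's actual configuration:

```ini
; Illustrative pgbouncer.ini fragment -- values are examples only.
[pgbouncer]
pool_mode = transaction        ; return server connections to the pool per transaction
default_pool_size = 20         ; server connections per user/database pair
max_client_conn = 5000         ; clients PgBouncer will accept
query_timeout = 30             ; cancel queries running longer than 30s,
                               ; so one slow query cannot hog a pooled connection
server_idle_timeout = 60       ; close server connections idle this long
```

The bypass problem is typically closed on the database side as well, for example with a conservative `max_connections` in Postgres and network rules so that only the pooler can reach port 5432.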
By 1:30pm read replicas were fully online and ChatGPT and DALL·E returned to a fully operational status.
We are immediately executing on several action items as a result of this outage:
- We are working with our cloud provider to be better prepared for how future maintenance will affect read replicas, to minimize impact.
- Whenever possible, we will adjust maintenance windows to occur during normal working hours so more staff are readily available.
- We are adjusting caches to persist with longer TTLs if the database is unavailable, allowing some critical endpoints to continue functioning for longer.
- PgBouncer configuration is being tuned to account for slow queries.
- We are moving our on-call rotation to Tuesdays to avoid the scheduling confusion of 3-day holiday weekends.
- We are reviewing our on-call access policies and escalation channels to ensure that on-call staff have knowledge of, and access to, all dependencies necessary to remediate an outage.
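The cache-TTL item above is a "serve stale on error" pattern: keep expired entries around and fall back to them when the database read fails. A minimal in-process sketch, assuming hypothetical names (the real system presumably uses a shared cache such as Redis):

```python
import time

class StaleTolerantCache:
    """Serve fresh values when possible; fall back to expired ('stale')
    entries if the backing database read raises. Illustrative only."""

    def __init__(self, ttl: float, stale_ttl: float):
        self.ttl = ttl              # normal freshness window (seconds)
        self.stale_ttl = stale_ttl  # extended window used only on DB failure
        self._store = {}            # key -> (value, stored_at)

    def get(self, key, load_from_db):
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]                    # fresh cache hit
        try:
            value = load_from_db(key)          # refresh from the database
        except Exception:
            # Database unavailable: tolerate a stale entry for longer.
            if entry is not None and now - entry[1] < self.stale_ttl:
                return entry[0]
            raise
        self._store[key] = (value, now)
        return value

cache = StaleTolerantCache(ttl=5.0, stale_ttl=3600.0)
cache.get("user:42", lambda k: "row-from-db")  # populates the cache
```

The trade-off is serving data that may be up to `stale_ttl` old during an outage, which for many read paths beats returning errors.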
Longer term, we are reviewing our overall database strategy and planning towards solutions that are more resilient to individual server failures.
Hope this helps.