Service Account API Keys Disappearing

Hi everyone,

I’m running into a recurring issue where our OpenAI API keys keep disappearing.

The key is an API key created for a service account. It is stored only as a secret in Google Cloud and used exclusively by backend functions that have access to that secret. The key is not exposed anywhere else and is never embedded in client applications.

Despite this setup, the key periodically stops working because it appears to have been removed or invalidated. This is heavily impacting our production applications since services suddenly fail when the key disappears.

Has anyone experienced something similar or knows what might be causing this? Any guidance on where to look or how to prevent this would be greatly appreciated.

Thanks in advance.


Hi and welcome back!

The most straightforward explanation would be that someone with an admin key is revoking these keys.

That said, we recently had a similar report where this possibility was ruled out. The discussion also includes several suggestions on where to check for unexpected key deletion.


This just happened to all of the API keys in our account. Both older and newer keys are gone. I’m the account owner/admin, and I assure you that I did not delete any API keys. The keys are not exposed publicly in any way, and reside encrypted on our servers.


I have flagged this to the team. In the meantime, I am attempting to reproduce the issue.

Here’s the test: Make three keys. Never use or even copy out two of them.

If the one key in use disappears while the others remain, you are leaking it somewhere; the deletion might be triggered by automatic key-reporting done by one of the larger code hosts that have agreements with OpenAI.
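The test above could be automated with a small poller that records which key IDs vanish between snapshots. This is only a sketch: the helper names `list_project_key_ids` and `missing_keys` are my own, and it assumes an organization admin key that is permitted to list a project's API keys through the Admin API (the exact endpoint path is worth verifying against the current API reference).

```python
import json
import urllib.request

def list_project_key_ids(admin_key: str, project_id: str) -> set:
    """Fetch the IDs of a project's API keys via the Admin API (requires network)."""
    req = urllib.request.Request(
        f"https://api.openai.com/v1/organization/projects/{project_id}/api_keys",
        headers={"Authorization": f"Bearer {admin_key}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        data = json.load(resp)
    return {item["id"] for item in data["data"]}

def missing_keys(baseline: set, current: set) -> set:
    """Keys that existed at the baseline snapshot but have since vanished."""
    return baseline - current
```

Comparing a baseline snapshot against a later one with `missing_keys` tells you exactly which keys disappeared, without ever touching the unused canary keys.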

Hi all, thanks for the responses so far.

Just to clarify a few important points:

  • I am the only admin/owner of the account.

  • No one else has access to delete or rotate keys.

  • The keys are not exposed publicly in any way.

  • They are stored exclusively as secrets in Google Cloud using firebase functions:secrets:set <SECRET_NAME> and are only accessed by backend Cloud Functions.

  • They are never embedded in client apps, repos, CI logs or anywhere else.
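For what it's worth, secrets bound with `firebase functions:secrets:set` surface as environment variables inside the functions they are attached to, so a fail-fast loader makes a deleted or unbound key show up as an immediate, explicit error rather than a mysterious downstream auth failure. A minimal sketch; the secret name `OPENAI_API_KEY` and the helper name are placeholders:

```python
import os

def load_secret(name: str) -> str:
    """Read a bound secret from the function's environment, failing loudly if absent."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"secret {name!r} is not bound to this function")
    return value
```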

I’ve now tested this with both service account keys and non-service-account keys; the same issue occurs with both.

More generally, even if a leak were detected, automatically deleting a production API key without explicit confirmation from the account owner is extremely disruptive. At the very least there should be a toggle or policy setting that lets the account owner decide whether to:

  • auto-revoke immediately

  • notify only

  • or require manual confirmation

In our case, keys are disappearing without warning, which is heavily impacting production systems.

Is there any way to escalate this to support for a specific account investigation? We are on a non-enterprise plan, so we don’t have direct support access, but this is a production-impacting issue and we would really appreciate some clarity on what is happening.

Appreciate any help.

Audit logging. Once turned on, it is permanently on.

Then:

The one thing the event list doesn’t make explicit is whether api_key.deleted also covers admin keys, but you can record a key’s value, delete it yourself (along with a project API key), and confirm that the deletion then shows up in the logs.

The remaining mystery would be if future logs show “you” doing the deleting; at that point you would have to go into full lockdown, although an intrusion without any other detected abuse or billing activity would be a strange thing.
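To illustrate the check described above: the organization Audit Logs endpoint can be filtered by event type, so after deleting a throwaway key you can pull only the `api_key.deleted` events and see whether your deletion was recorded. A sketch, assuming an admin key; `fetch_audit_page` and `key_deletion_events` are hypothetical helper names, and the query parameters are worth double-checking against the current API reference.

```python
import json
import urllib.parse
import urllib.request

def fetch_audit_page(admin_key: str) -> dict:
    """Fetch one page of audit-log events, filtered server-side to key deletions."""
    query = urllib.parse.urlencode({"event_types[]": "api_key.deleted"})
    req = urllib.request.Request(
        f"https://api.openai.com/v1/organization/audit_logs?{query}",
        headers={"Authorization": f"Bearer {admin_key}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

def key_deletion_events(page: dict) -> list:
    """Extract api_key.deleted events from one audit-log page."""
    return [ev for ev in page.get("data", []) if ev.get("type") == "api_key.deleted"]
```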

The audit logs would be the first place to check. In the other topic I linked above, the OP mentioned they could not find any audit log entries related to the missing key.

Today, OpenAI published an update on the status page stating that the audit logs were affected by a bug and that the missing data is currently being backfilled.

That may help clarify what happened.


That’s a nicely formatted report. It still doesn’t let us fully understand:

  • whether admin keys and service keys are also logged;
  • how these keys are actually going missing.

Answering that requires flipping a permanent org switch to find out how comprehensive the audit logging is, though. So I suggest confirming discoverability first; or perhaps others can report the result and verbosity of deleting a temporarily created key and then retrieving the logs.

Audit logging may let you discover a nefarious sleepwalking alter ego that likes to delete keys? Or confirm that you are completely blameless, and that this points to impactful silent corruption of data by OpenAI.

The report did inspire me to demonstrate the application of AI to my own incident-report writing…


Hypothetical: Data Corruption Affecting a Subset of Multidomain API Keys

Root Cause

In March 2026, we experienced a data integrity incident affecting a small subset of customers whose organizations spanned multiple domains and sites in our multi-region Kubernetes deployment.

Our PostgreSQL cluster was deployed across multiple sites with storage layers that relied on distributed block devices. Under a specific failure mode involving cross-site replication timing and index page writes, certain index entries were written with invalid 64-bit file offsets. These offsets pointed beyond the valid 32 TB addressable file space configured for the underlying storage.
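As a sanity check on the numbers above: PostgreSQL addresses relation pages with a 32-bit block number, and with the default 8 KiB page size that gives 2^32 × 8 KiB = 32 TiB of addressable space, which is where the 32 TB bound comes from. A sketch of the kind of bounds check that was missing (the constant and function names are illustrative):

```python
BLOCK_SIZE = 8 * 1024    # PostgreSQL default page size in bytes
MAX_BLOCKS = 2 ** 32     # 32-bit block numbers

# 2^13 * 2^32 = 2^45 bytes = 32 TiB of addressable relation space
ADDRESSABLE_BYTES = BLOCK_SIZE * MAX_BLOCKS

def offset_in_bounds(offset: int) -> bool:
    """True if a 64-bit file offset falls inside the addressable space."""
    return 0 <= offset < ADDRESSABLE_BYTES
```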

From PostgreSQL’s perspective, these writes appeared successful at the system call level. However, the invalid offsets caused sparse file regions to be created in the data files. The underlying filesystem accepted the writes by creating holes (sparse segments) rather than rejecting the operation. As a result:

  • Some index pages referenced heap data that was never durably written.
  • Certain rows became logically unreachable through normal index scans.
  • In rare cases, subsequent VACUUM and autovacuum processes treated those rows as non-existent or removable.
  • Affected records appeared to “delete themselves” over time, particularly API keys belonging to multidomain organizational units.
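On Linux, holes of the kind described above can be detected cheaply, because `stat` reports allocated storage (`st_blocks`, in 512-byte units) separately from logical size: a file whose allocated bytes fall short of `st_size` contains sparse regions. A sketch with hypothetical helper names:

```python
import os

def looks_sparse_stats(st_size: int, st_blocks: int) -> bool:
    """A file with fewer allocated bytes than its logical size contains holes."""
    return st_blocks * 512 < st_size

def looks_sparse(path: str) -> bool:
    """Check a file on disk for sparse regions via its stat fields."""
    st = os.stat(path)
    return looks_sparse_stats(st.st_size, st.st_blocks)
```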

The corruption was limited in scope because it required a specific alignment of cross-site write timing, index updates, and storage layer behavior. It did not affect the majority of tenants or single-domain organizations.

Binary-level inspection of affected relation files confirmed the presence of sparse regions and index entries containing non-canonical block references. Some data pages contained partial or invalid content that could not be reliably reconstructed. For security reasons, if data appeared non-functional, unreachable, or semantically ambiguous, we chose not to restore it.

At the time of the incident, we lacked comprehensive guardrails to detect invalid block references at the storage boundary, and our monitoring did not include invariant checks for sparse file growth or out-of-range page offsets.


Incident Response

In early March 2026, customer reports indicated that API keys associated with certain organizational structures were disappearing without corresponding audit events.

Engineers began investigating application-layer logs and database-level audit trails. Initial hypotheses focused on application logic or key lifecycle automation. However, deeper inspection revealed inconsistencies between expected row counts and index traversal results.

We:

  • Performed integrity checks using PostgreSQL tools (amcheck, catalog validation).
  • Identified index corruption localized to specific relations tied to multidomain organization units.
  • Examined relation files at the filesystem level and discovered sparse file regions.
  • Correlated these regions with invalid block references embedded in index pages.

Once the corruption vector was understood, we:

  • Isolated affected clusters.
  • Disabled cross-site write paths implicated in the issue.
  • Rebuilt corrupted indexes.
  • Restored valid data from physical backups and logical replication streams where possible.

Where rows were provably intact in heap pages and could be reconciled against consistent snapshots, we restored them. Where page contents were partially written, referenced invalid offsets, or conflicted with security guarantees, we did not attempt reconstruction.


Data Recovery Status

The incident affected a small subclass of API organization units operating in multidomain configurations.

Impacted objects included:

  • API keys
  • Associated organization-unit metadata
  • Select index structures supporting key lookups

Recovered:

  • API keys and metadata fully present in validated backups
  • Keys reconstructable from logical replication logs
  • Organization records with intact heap pages

Not recoverable:

  • Rows whose heap pages were partially written into sparse regions
  • Records only referenced via corrupted index entries with invalid block offsets
  • Data segments that appeared unreachable and were subsequently vacuumed

For security and compliance reasons, ambiguous or partially corrupted API keys were not restored. Affected customers were required to regenerate new keys.


Our Response and What We’ve Changed

We implemented architectural and operational changes to prevent recurrence:

  1. Storage Safeguards

    • Enforced strict bounds validation on block-level writes.
    • Added runtime checks for out-of-range file offsets before commit acknowledgment.
    • Enabled filesystem-level alerts for unexpected sparse file growth.
  2. Database Hardening

    • Increased frequency of amcheck index validation.
    • Added background verification jobs for page-level integrity.
    • Enabled stronger checksum validation and verification during replication.
  3. Replication and Topology Changes

    • Simplified cross-site replication paths.
    • Reduced write acknowledgment complexity across sites.
    • Clarified failure domains in our Kubernetes storage orchestration.
  4. Monitoring Improvements

    • Implemented invariants for row-count drift between heap and index scans.
    • Added metrics for sparse file allocation deltas.
    • Added alerting on anomalous index page references.
  5. Operational Resilience

    • Improved documentation of database topology and recovery procedures.
    • Formalized knowledge transfer processes to reduce institutional knowledge loss.
    • Strengthened change management during organizational transitions.
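The row-count drift invariant in item 4 could be sketched as a simple tolerance check, with the counts supplied by whatever heap scan and index-only scan the monitoring job runs (the function name and default tolerance are illustrative):

```python
def drift_exceeded(heap_rows: int, index_rows: int, tolerance: float = 0.001) -> bool:
    """Flag when index-visible row counts diverge from heap row counts beyond tolerance."""
    if heap_rows == 0:
        # Any index-visible row with an empty heap is an invariant violation.
        return index_rows != 0
    return abs(heap_rows - index_rows) / heap_rows > tolerance
```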

We are also conducting a broader review of how distributed storage semantics interact with PostgreSQL’s assumptions about durable, bounded block devices.


Organizational Context

During this period, we experienced significant personnel turnover unrelated to the technical failure. This resulted in a loss of institutional knowledge around certain infrastructure decisions, including historical storage-layer tradeoffs.

While this did not directly cause the corruption, it increased time-to-diagnosis and slowed early-stage hypothesis validation. We recognize that operational continuity is a reliability concern, and we are investing in documentation, peer review, and cross-training to mitigate similar risks in the future.

We apologize for the instability and disruption this incident caused.


What This Means for You

If you were affected:

  • Some API keys may have become invalid without an explicit deletion event.
  • Regeneration of those keys was required.
  • No evidence suggests unauthorized third-party access as a result of this issue.
  • Corruption was structural, not adversarial.

If you maintain internal logs of API key provisioning or rotation, those records may help reconcile activity during the affected period.


We take data durability and integrity seriously. This incident exposed a rare but serious interaction between distributed storage behavior and database indexing assumptions. We have reinforced both technical controls and operational practices to ensure this class of failure is prevented and rapidly detected in the future.

Thank you for your patience and continued trust.