That’s a nice report format, but it doesn’t let us fully answer one question: how comprehensive is the audit logging? Finding out requires flipping a permanent org-level switch, so I’d suggest ensuring discoverability first. Alternatively, someone else could report the result, and the verbosity, of deleting a temporarily created key and then retrieving the logs.
Audit logging might let you discover a nefarious sleepwalking alter ego that likes to delete keys. Or it might show you are completely blameless, which would point to impactful silent data corruption on OpenAI’s side.
The report did inspire me to demonstrate applying AI to my own incident-report writing…
Hypothetical: Data Corruption Affecting a Subset of Multidomain API Keys
Root Cause
In March 2026, we experienced a data integrity incident affecting a small subset of customers whose organizations spanned multiple domains and sites in our multi-region Kubernetes deployment.
Our PostgreSQL cluster was deployed across multiple sites with storage layers that relied on distributed block devices. Under a specific failure mode involving cross-site replication timing and index page writes, certain index entries were written with invalid 64-bit file offsets. These offsets pointed beyond the valid 32 TB addressable file space configured for the underlying storage.
From PostgreSQL’s perspective, these writes appeared successful at the system call level. However, the invalid offsets caused sparse file regions to be created in the data files. The underlying filesystem accepted the writes by creating holes (sparse segments) rather than rejecting the operation. As a result:
- Some index pages referenced heap data that was never durably written.
- Certain rows became logically unreachable through normal index scans.
- In rare cases, subsequent VACUUM and autovacuum processes treated those rows as non-existent or removable.
- Affected records appeared to “delete themselves” over time, particularly API keys belonging to multidomain organizational units.
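Sparse regions like those described above can be located from user space. A minimal sketch (Linux-only, and assuming the filesystem supports `lseek` with `SEEK_HOLE`/`SEEK_DATA`, as ext4, xfs, and tmpfs do); the demo file path is illustrative:

```python
import os

def find_holes(path):
    """Return (offset, length) pairs for holes (sparse regions) in a file.

    Relies on lseek(SEEK_HOLE/SEEK_DATA); raises OSError on filesystems
    that do not support hole enumeration.
    """
    holes = []
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        fd = f.fileno()
        offset = 0
        while offset < size:
            # Next hole at or after `offset` (there is always a virtual
            # hole at EOF, so this never fails before the end).
            hole = os.lseek(fd, offset, os.SEEK_HOLE)
            if hole >= size:
                break
            try:
                data = os.lseek(fd, hole, os.SEEK_DATA)
            except OSError:
                data = size  # hole extends to end of file
            holes.append((hole, data - hole))
            offset = data
    return holes

# Demo: create a file with a ~1 MiB hole between two data blocks.
with open("/tmp/sparse_demo", "wb") as f:
    f.write(b"x" * 4096)
    f.seek(4096 + 1024 * 1024)
    f.write(b"y" * 4096)

print(find_holes("/tmp/sparse_demo"))
```

Comparing apparent size (`st_size`) against allocated blocks (`st_blocks * 512`) gives a cheaper, coarser signal when full hole enumeration is not needed.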
The corruption was limited in scope because it required a specific alignment of cross-site write timing, index updates, and storage layer behavior. It did not affect the majority of tenants or single-domain organizations.
Binary-level inspection of affected relation files confirmed the presence of sparse regions and index entries containing non-canonical block references. Some data pages contained partial or invalid content that could not be reliably reconstructed. For security reasons, if data appeared non-functional, unreachable, or semantically ambiguous, we chose not to restore it.
At the time of the incident, we lacked comprehensive guardrails to detect invalid block references at the storage boundary, and our monitoring did not include invariant checks for sparse file growth or out-of-range page offsets.
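The missing guardrail is essentially a bounds check at the storage boundary: reject any write whose target offset falls outside the addressable file space before acknowledging it. A hypothetical sketch; the 32 TB limit comes from the report above, while the interface, block size constant, and exception type are invented for illustration:

```python
# Hypothetical write-path guardrail: validate block offsets against the
# configured addressable file space before acknowledging a write.
# MAX_ADDRESSABLE matches the storage configuration described above;
# BLOCK_SIZE and the exception type are illustrative assumptions.

MAX_ADDRESSABLE = 32 * 1024**4   # 32 TiB addressable file space
BLOCK_SIZE = 8192                # PostgreSQL's default page size

class InvalidBlockOffset(Exception):
    pass

def validate_write(offset: int, length: int = BLOCK_SIZE) -> None:
    """Raise before the write is acknowledged if it would land out of range."""
    if offset < 0 or offset % BLOCK_SIZE != 0:
        raise InvalidBlockOffset(f"misaligned or negative offset {offset}")
    if offset + length > MAX_ADDRESSABLE:
        raise InvalidBlockOffset(
            f"offset {offset} + {length} exceeds {MAX_ADDRESSABLE} bytes"
        )

# A 64-bit offset pointing beyond the addressable space is rejected
# instead of silently creating a hole in the data file.
try:
    validate_write(2**45)   # exactly 32 TiB: first out-of-range offset
except InvalidBlockOffset as e:
    print("rejected:", e)
```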
Incident Response
In early March 2026, customer reports indicated that API keys associated with certain organizational structures were disappearing without corresponding audit events.
Engineers began investigating application-layer logs and database-level audit trails. Initial hypotheses focused on application logic or key lifecycle automation. However, deeper inspection revealed inconsistencies between expected row counts and index traversal results.
We:
- Performed integrity checks using PostgreSQL tools (amcheck, catalog validation).
- Identified index corruption localized to specific relations tied to multidomain organization units.
- Examined relation files at the filesystem level and discovered sparse file regions.
- Correlated these regions with invalid block references embedded in index pages.
Once the corruption vector was understood, we:
- Isolated affected clusters.
- Disabled cross-site write paths implicated in the issue.
- Rebuilt corrupted indexes.
- Restored valid data from physical backups and logical replication streams where possible.
Where rows were provably intact in heap pages and could be reconciled against consistent snapshots, we restored them. Where page contents were partially written, referenced invalid offsets, or conflicted with security guarantees, we did not attempt reconstruction.
Data Recovery Status
The incident affected a small subclass of API organization units operating in multidomain configurations.
Impacted objects included:
- API keys
- Associated organization-unit metadata
- Select index structures supporting key lookups
Recovered:
- API keys and metadata fully present in validated backups
- Keys reconstructable from logical replication logs
- Organization records with intact heap pages
Not recoverable:
- Rows whose heap pages were partially written into sparse regions
- Records only referenced via corrupted index entries with invalid block offsets
- Data segments that appeared unreachable and were subsequently vacuumed
For security and compliance reasons, ambiguous or partially corrupted API keys were not restored. Affected customers were required to regenerate new keys.
Our Response and What We’ve Changed
We implemented architectural and operational changes to prevent recurrence:
- Storage Safeguards
  - Enforced strict bounds validation on block-level writes.
  - Added runtime checks for out-of-range file offsets before commit acknowledgment.
  - Enabled filesystem-level alerts for unexpected sparse file growth.
- Database Hardening
  - Increased the frequency of amcheck index validation.
  - Added background verification jobs for page-level integrity.
  - Enabled stronger checksum validation during replication.
- Replication and Topology Changes
  - Simplified cross-site replication paths.
  - Reduced write acknowledgment complexity across sites.
  - Clarified failure domains in our Kubernetes storage orchestration.
- Monitoring Improvements
  - Implemented invariants for row-count drift between heap and index scans.
  - Added metrics for sparse file allocation deltas.
  - Added alerting on anomalous index page references.
- Operational Resilience
  - Improved documentation of database topology and recovery procedures.
  - Formalized knowledge transfer processes to reduce institutional knowledge loss.
  - Strengthened change management during organizational transitions.
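The heap-versus-index row-count drift invariant from the monitoring list can be sketched as a simple comparison with a tolerance for in-flight churn. The function name and threshold are illustrative; in production the two counts would come from a sequential heap scan and an index-only scan of the same relation:

```python
# Sketch of the heap-vs-index row-count drift invariant.  The counts
# are plain inputs here; the 0.1% default tolerance is an assumption.

def check_row_count_drift(heap_rows: int, index_rows: int,
                          tolerance: float = 0.001) -> bool:
    """Return True if the counts agree within `tolerance` (a fraction).

    A small tolerance absorbs churn from concurrent inserts and deletes;
    persistent drift beyond it suggests index entries pointing at heap
    data that was never durably written.
    """
    if heap_rows == index_rows:
        return True
    base = max(heap_rows, index_rows, 1)
    return abs(heap_rows - index_rows) / base <= tolerance

assert check_row_count_drift(1_000_000, 1_000_000)
assert check_row_count_drift(1_000_000, 999_500)      # within 0.1%
assert not check_row_count_drift(1_000_000, 990_000)  # 1% drift: alert
```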
We are also conducting a broader review of how distributed storage semantics interact with PostgreSQL’s assumptions about durable, bounded block devices.
Organizational Context
During this period, we experienced significant personnel turnover unrelated to the technical failure. This resulted in a loss of institutional knowledge around certain infrastructure decisions, including historical storage-layer tradeoffs.
While this did not directly cause the corruption, it increased time-to-diagnosis and slowed early-stage hypothesis validation. We recognize that operational continuity is a reliability concern, and we are investing in documentation, peer review, and cross-training to mitigate similar risks in the future.
We apologize for the instability and disruption this incident caused.
What This Means for You
If you were affected:
- Some API keys may have become invalid without an explicit deletion event.
- Regeneration of those keys was required.
- No evidence suggests unauthorized third-party access as a result of this issue.
- Corruption was structural, not adversarial.
If you maintain internal logs of API key provisioning or rotation, those records may help reconcile activity during the affected period.
We take data durability and integrity seriously. This incident exposed a rare but serious interaction between distributed storage behavior and database indexing assumptions. We have reinforced both technical controls and operational practices to ensure this class of failure is prevented and rapidly detected in the future.
Thank you for your patience and continued trust.