The batches basically does the job of running API calls for you. The job is sent to the API endpoint you specify manually. Just on OpenAI’s schedule, and to a magic endpoint that gives a discount.
Therefore, the 30 day API retention would apply to the model inputs of running the job.
You can delete the input and output file after your job is done. The metadata of it running likely remains forever. I could probably track some staff citation of how long files are retained, but you can imagine that blob storage has several rounds of versioning and backup so one person’s answer might not be the “can no longer be recovered by the government security letter” answer.
Thank you for the hints. My main question is how we make sure that all the data that we sent to the API is deleted.
We can cancel an ongoing batch by using client.batches.cancel(“batch_abc123”)
And we can delete a file by using: client.files.delete(“file-abc123”)
Would that be sufficient to ensure that nothing is retained in the server?`
The language of help and terms and the Data Processing Addendum (which you then submit to make binding) does not address “sent through the API” when what is sent is placed into a semi-permanent storage system…
OpenAI may securely retain API inputs and outputs for up to 30 days to provide the services and to identify abuse. After 30 days, API inputs and outputs are removed from our systems, unless we are legally required to retain them. You can also request zero data retention (ZDR) for eligible endpoints if you have a qualifying use-case. For details on data handling, visit our Platform Docs(opens in a new window) page.
This DPA shall remain in effect as long as OpenAI carries out Customer Data processing operations on Customer’s behalf or until the termination of the Agreement (and all Customer Data has been returned or deleted in accordance with this DPA). OpenAI will retain API Service Customer Data sent through the API for a maximum of thirty (30) days, after which it will be deleted, except where OpenAI is required to retain copies under applicable laws, in which case OpenAI will isolate and protect that Customer Data from any further processing except to the extent required by applicable laws. OpenAI will retain ChatGPT Enterprise Service Customer Data during the term of the Agreement, unless otherwise stated in the Agreement or Order Form. On the termination of the DPA, OpenAI will direct each Subprocessor to delete the Customer Data within thirty (30) days of the DPA’s termination, unless prohibited by law. For clarity, OpenAI may continue to process information derived from Customer Data that has been deidentified, anonymized, and/or aggregated such that the data is no longer considered Personal Data under applicable Data Protection Laws and in a manner that does not identify individuals or Customer to improve OpenAI’s systems and services.
The closest we get is enterprise privacy FAQ: “Data submitted to fine-tune a model is retained until the customer deletes the files.” where a batch JSONL follows a similar path.
PS: files to make GPTs are retained forever, and you sign over rights for OpenAI to do whatever they want with them.
Thank you. That also brings a question, which I will pose in a different post, which is the part that says:
“OpenAI may continue to process information derived from Customer Data that has been deidentified, anonymized, and/or aggregated such that the data is no longer considered Personal Data under applicable Data Protection Laws and in a manner that does not identify individuals or Customer to improve OpenAI’s systems and services.”
this seems in contradiction to " We do not train on your business data (data from ChatGPT Team, ChatGPT Enterprise, or our API Platform)"
In other words, OpenAI apparently DO train on our data, but it is deidentified first. This can cause a lot of problems with Legal.
“training” → language sent to knowledge workers for grading and inclusion in future AI models
from information derived from…
“keeping anonymized data” → “query: what percentage of use hit 4k output on gpt-4?”, or “how many calls were sent to Azure East?”, “percentile safety violation count by naferreira_7”.
I think the phrase in their clause is very vague (and perhaps intentionally):
“process information derived from Customer Data that has been deidentified, anonymized, and/or aggregated such that the data is no longer considered Personal Data”
What kind of information derived from Customer Data? It could be any kind and it can go beyond “percentage of use” or “how many calls”.
It’s also vague on how they will use this data “improve OpenAI’s systems and services”. This could well include model training.