Does OpenAI honor noindex and robots.txt when scraping for training data? How does it deal with scraped personal data?

Trying to understand more about the ongoing issue with the Italian Data Protection Authority, I’ve been searching for the specific information asked about in this topic’s title, but haven’t found any.

Does anybody know of a page on the OpenAI (or a related) website that clarifies the matter, and can link to it?

I asked ChatGPT directly about it, but I am not sure I can consider that response authoritative.

Can you be more specific by providing a link?

You are assuming that OpenAI is searching websites for data. Where did you get this, or why are you making this assumption?

IIRC the training data was based on The Pile or similar. I have seen the details noted a few times; they might even be in the docs here.


Here is a better breakdown of the training data. Remember that ChatGPT is based on GPT-3.5.

Also see this

Unfortunately we don’t know anymore.

This is why transparency is so important.
We used to know, but they have decided not to release their training data anymore.

Hopefully this situation causes OpenAI to go back to being “OpenAI”

I would be shocked if they ignored robots.txt though.
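
For context, here is a minimal sketch of what "honoring robots.txt" means for a crawler, using Python's standard `urllib.robotparser`. This is only an illustration, not a description of how OpenAI collects data: the robots.txt content, the `ExampleBot` user agent, the `example.com` URLs, and the `is_allowed` helper are all made up for the example. Note that `noindex` is a separate mechanism (a robots meta tag or an `X-Robots-Tag` header on the page itself), so a crawler only sees it after fetching the page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it
# from https://<site>/robots.txt before requesting any other page.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow: /drafts/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleBot", "https://example.com/private/page.html"),
        ("ExampleBot", "https://example.com/blog/post.html"),
        ("OtherBot", "https://example.com/drafts/idea.html"),
    ]:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

A well-behaved crawler runs a check like this before every request and skips any URL that is disallowed. Nothing enforces it technically, though, which is why a published policy from the crawler's operator matters.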

I was referring to the topic of the discussion we’re having, didn’t want to restate the question. I guess this back and forth between me and you made that aim pointless. :smiley:

Thanks for the pointers, I will have a look.

EDIT: well, I can’t say those articles actually answered my questions, though.

Releasing the training data would be great, but it isn’t necessary for what I am looking for. Does anything like the “Responsible AI use policy” that ChatGPT referred to in its response exist at all, or did it dream it up completely?

Hard to say. I tried looking (very briefly) through their documentation on safety without much success.

They seem to be continuously tuning ChatGPT without indicating in which areas or with what information. It could be that they decided to “give” it this knowledge. To be safe, I’d treat it as hallucinated until proven otherwise.

Unfortunately, I don’t think you’ll find any concrete answer. They have been very secretive lately.

Good luck in your search. If you are gathering evidence to fight the ChatGPT ban, I’m more than happy to help. Send me a PM if you’re interested.

Well, given the ongoing investigation in Italy, they will have to disclose at the very least the info I am looking for. To be clear, I find the letter of the Italian Data Protection Authority to be based on shaky grounds - heck, it literally begins by saying that OpenAI doesn’t provide any privacy notice, which is blatantly not true - but I am looking for some more evidence to back up some of my stances.

Completely agree. Their arguments are so frustratingly weak that, in my opinion, the authority is banning them first and then finding reasons afterwards to justify it.

Something that I saw in the other article that is very important is that Google was also sued for similar reasons. Their argument was that “they are a public utility”. I don’t think it’s far-fetched to say that ChatGPT is also a public utility.