Does OpenAI honor noindex and robots.txt when scraping for training data? How does it deal with scraped personal data?

Oibaf · April 1, 2023, 2:50pm

Trying to understand more of the ongoing issue with the Italian Data Protection Authority, I’m searching for the specific info referenced in this discussion’s topic, and haven’t found any.

Does anybody know and can reference a page on the OpenAI (or related) website that clarifies the matter?

I asked ChatGPT directly about it, but I am not sure I can consider that response authoritative.

EricGT · April 1, 2023, 3:03pm

Can you be more specific by providing a link?

You are assuming that OpenAI is crawling websites for data. Where did you get this, or why are you making this assumption?

IIRC the training data was based on The Pile or similar. I have seen the details noted a few times; they might even be in the docs here.

Update

Here is a better breakdown of the training data. Remember that ChatGPT is based on GPT-3.5.

Also see this

anon10827405 · April 1, 2023, 3:05pm

Unfortunately we don’t know anymore.

This is why transparency is so important.
We used to know, but they have decided not to release their training data anymore.

Hopefully this situation causes OpenAI to go back to being “OpenAI”

I would be shocked if they ignored robots.txt though.
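
For reference, here is a minimal sketch of what "honoring robots.txt" means for a crawler, using Python's standard `urllib.robotparser`. The bot name `ExampleTrainingBot`, the rules, and the `example.com` URLs are made up for illustration; this says nothing about how OpenAI's data collection actually behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the bot name and paths are assumptions for illustration.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /private/

User-agent: *
Disallow:
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A well-behaved crawler runs this check before every request.
    for url in ("https://example.com/private/profile",
                "https://example.com/blog/post"):
        print(url, may_fetch("ExampleTrainingBot", url))
```

Note that robots.txt is only a convention: it works if the crawler chooses to run a check like this, which is exactly why the question of whether a given company does so matters.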

Oibaf · April 1, 2023, 3:22pm

I was referring to the topic of the discussion we’re having, didn’t want to restate the question. I guess this back and forth between me and you made that aim pointless.

Thanks for the pointers, I will have a look.

EDIT: well, can’t say those articles actually gave responses to my questions, though.

Oibaf · April 1, 2023, 3:24pm

Releasing the training data would be great, but it isn't necessary for what I am looking for. Does anything like the “Responsible AI use policy” that ChatGPT referred to in its response exist at all, or has it dreamed it up completely?

anon10827405 · April 1, 2023, 3:28pm

Hard to say. I tried looking (very briefly) through their documentation on safety without much success.

They seem to be continuously tuning ChatGPT without indicating in which areas or with what information. It could be that they decided to “give” it this knowledge. To be safe, I’d say that it’s hallucinated until proven otherwise.

I don’t think you’ll find any concrete answer unfortunately. They have been very secretive lately.

Good luck in your search. If you are gathering evidence to fight the ChatGPT ban, I’m more than happy to help. Send me a PM if you’re interested.

Oibaf · April 1, 2023, 4:17pm

Well, given the ongoing investigation in Italy, they will have to disclose at the very least the info I am looking for. To be clear, I find the letter of the Italian Data Protection Authority to be based on shaky grounds - heck, it literally begins by saying that OpenAI doesn’t provide any privacy notice, which is blatantly not true - but I am looking for some more evidence to back up some of my stances.

anon10827405 · April 1, 2023, 4:21pm

Completely agree. Their arguments are so frustratingly weak that, in my opinion, the authority is banning them first and then looking for reasons to justify it afterwards.

Something that I saw in the other article that is very important is that Google was also sued for similar reasons. Their argument was that “they are a public utility”. I don’t think it’s far-fetched to say that ChatGPT is also a public utility.
