ChatGPT scraping news sites, UA and third parties

Hi, I work with several news sites and some of them are kind of panicking and blocking ChatGPT mostly to avoid content scraping. I have a couple of technical questions.

While it seems it is possible to scrape a website with ChatGPT (especially with the paid version), I want to understand exactly how third-party plugins work.

  1. ChatGPT's own web browser uses the ChatGPT-User token in its user-agent (UA). Will third parties share the same token in the UA, or do they need to use their own?
  2. Will third-party plugins also egress from the same IP range (23.98.142.176/28)?
  3. Can we block all third-party plugins except for the OpenAI web browser plugin?
  4. Is there any cloaking policy in place? For example, presenting different content to ChatGPT only (say, a special summary).
  5. Are there any statistics provided to see how much content ChatGPT-User is consuming on a site?

All of this will be very helpful for understanding how this works and making the right decisions :slight_smile:

Thanks in advance!

Sorry. I completely misread your post. Hope you find your answer. Don’t see any of this information in the docs.

Thinking further: you could identify ChatGPT by capturing the user agent. Using that, you can record the instance and deliver different content.
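For illustration only, here's a minimal Flask-style sketch of that idea. The route, content strings and logging are all hypothetical; the only thing taken from OpenAI's docs is the ChatGPT-User token in the User-Agent header:

```python
# Minimal sketch (hypothetical route and content): serve a different response
# when the request's User-Agent claims to be ChatGPT-User, and log the hit.
from flask import Flask, request

app = Flask(__name__)

FULL_ARTICLE = "Full article body for regular visitors..."
BOT_SUMMARY = "Short summary shown to ChatGPT-User requests."

@app.route("/article")
def article():
    ua = request.headers.get("User-Agent", "")
    if "ChatGPT-User" in ua:
        # Record the instance so you can later measure how often ChatGPT fetches pages.
        app.logger.info("ChatGPT-User request from %s", request.remote_addr)
        return BOT_SUMMARY
    return FULL_ARTICLE
```

Keep in mind the User-Agent string is trivially spoofable, so this only identifies traffic that *claims* to be ChatGPT.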

1 Like
  1. See my follow-up post. Not sure if this is documented anywhere. You could probably find it out by setting up a “honeypot” website and directing ChatGPT with the browser plugin enabled to go to your website. Third-party plugins may not follow any of the rules that OpenAI have set for themselves.
  2. Third-party plugins are hosted by the third party, so if they call out to the internet they will be coming from random IP addresses. For example, my plugin is hosted on AWS lambda and there are a variety of different IP addresses that it uses.
  3. Use the same techniques that you would use to block any other unauthorised scraping of your websites.
  4. If you can detect that someone is scraping your site then you could show them different content.
  5. No, the only information you would get is that something has accessed the page.

With things like headless browsers, it’s very difficult to tell the difference between a real person and a machine browsing your website.

1 Like

Actually, it is well documented:

To respect content creators and adhere to the web’s norms, our browser plugin’s user-agent token is ChatGPT-User and is configured to honor websites’ robots.txt files. This may occasionally result in a “click failed” message, which indicates that the plugin is honoring the website’s instruction to avoid crawling it. This user-agent will only be used to take direct actions on behalf of ChatGPT users and is not used for crawling the web in any automatic fashion. We have also published our IP egress ranges. Additionally, rate-limiting measures have been implemented to avoid sending excessive traffic to websites.
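Since the plugin honors robots.txt, a per-agent rule is enough to keep it out of specific paths. Here's a quick sketch (paths and domain are placeholders) using Python's standard urllib.robotparser to show how a compliant agent such as ChatGPT-User would interpret that kind of rule:

```python
# Sketch: how a robots.txt-honoring agent such as ChatGPT-User reads a
# per-agent Disallow rule. The paths and domain are placeholders.
from urllib import robotparser

robots_txt = """\
User-agent: ChatGPT-User
Disallow: /premium/

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("ChatGPT-User", "https://example.com/premium/story"))  # False
print(rp.can_fetch("ChatGPT-User", "https://example.com/free/story"))     # True
print(rp.can_fetch("SomeOtherBot", "https://example.com/premium/story"))  # True
```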

2 Likes

If you’re concerned about scraping, there are really two areas of concern:

  • ChatGPT with Browsing models, which, as @iamflimflam1 said, should be straightforward to restrict (either through robots.txt or IP blocking; a minimal IP-range check is sketched after this list)
  • Plugins that do their own browsing, like WebPilot or others. These plugins make requests from their own servers with their own code, so it's up to each plugin to respect robots.txt. Theoretically this is something OpenAI could have policies around for plugin approval, but realistically it's hard for them to enforce.
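For the IP-blocking option, here's a small sketch using Python's ipaddress module. It assumes the 23.98.142.176/28 range quoted earlier in the thread; check OpenAI's published egress ranges before relying on it, since they can change:

```python
# Sketch: check whether a request's source IP falls inside the egress range
# quoted in this thread (23.98.142.176/28). The authoritative list is
# OpenAI's published egress ranges, which can change over time.
import ipaddress

CHATGPT_BROWSER_RANGE = ipaddress.ip_network("23.98.142.176/28")

def from_chatgpt_browser(remote_ip: str) -> bool:
    """Return True if the address is inside the assumed browsing-plugin range."""
    try:
        return ipaddress.ip_address(remote_ip) in CHATGPT_BROWSER_RANGE
    except ValueError:
        return False

print(from_chatgpt_browser("23.98.142.180"))  # True, inside the /28
print(from_chatgpt_browser("203.0.113.7"))    # False
```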
1 Like

Hi, I saw that document. My concern is with third-party plugins. Can a user instruct ChatGPT to get the content using the OpenAI ChatGPT browser plugin and then use that content in a third-party plugin?

I’m trying to map all scenarios :slight_smile: Sorry if I’m asking dumb questions.

1 Like

A user could ask ChatGPT to summarize a URL with a given plug-in. That plug-in will scrape the site with its own code from its own server. There’s no global way to block specifically ChatGPT plug-in traffic (or even identify it).

1 Like

When you said ‘its own code from its own server’, do you mean a third party or the OpenAI browser plugin?

Bottom line, my concern is: if we ALLOW the IP address range for the OpenAI ChatGPT browser plugin, could that be used to scrape the site and feed the content to a third-party plugin?

1 Like

I see what you’re asking. In theory, I guess they could use the web browsing plugin on your site for a third-party plugin. It doesn’t really make sense, though: the user would need to combine the web browsing plugin with the other third-party plugin, and it would be the same as the user just copying and pasting directly from your website. As mentioned, any third-party plugin is just an API service, which can be hard to catch.

They could simply scrape the website using traditional tools, and that would be much more effective. So no, it’s not practical or even particularly effective.

1 Like

It may be helpful to take a step back and look at exactly what you are concerned about. Can a user use ChatGPT to grab a single page from your website and operate on it with GPT or another third-party plugin? Yes. But they can do this now without ChatGPT or Plugins just by copy/pasting.

Does ChatGPT “learn” from this browsed content and have it available as part of its long-term memory? No.

Can a user use ChatGPT to scrape your entire website in one command? No.

Are you concerned about GPT being trained on your website’s data? That’s a whole other topic.

2 Likes

Thank you for all the answers. The point is, in an environment where we have BOT CONTROL security running, the chances of getting our sites scraped are very low. While it’s not impossible, it’s hard. We have also disabled copy/paste for end users. So we don’t want to just go ahead and open an attack vector for bad actors to take advantage of.

My takeaways from this exchange are:

  1. allowing the OpenAI browser plugin does not open the same access for third parties
  2. the browser plugin behaves like an end user, so it cannot scrape an entire site
  3. using the OpenAI browser plugin to source information for a third-party plugin is possible but not effective / is expensive

Am I correct?

1 Like

Standard anti-web-scraping practices can be applied. Here is also an article that describes what you can do: How to Block ChatGPT From Using Your Website Content

  1. Yes. Third-party plugins do not use the Web Browsing plugin. If they are retrieving content from your server, they are scraping it using their own tools & servers

  2. Yes. The purpose of the web browsing plugin is to find and deliver information, not to scrape webpages. You can control what it is allowed to see, or even completely change the content.

  3. It wouldn’t happen, and shouldn’t be a concern.

No matter what you do, it’s always very easy to scrape a website.
Completely opinionated, but more and more people are relying on LLMs for information, and blocking them will be detrimental. It’d almost be like blocking the Google web crawler: technically it also scrapes your website, but it does so to deliver results that let people find your content. If you block the ChatGPT user agent outright, other news sources will get the traffic instead.

2 Likes
  1. Yes, but be aware: anyone can set their bot’s user agent to the same as OpenAI’s. So you would need to check the IP address range as well, but that is probably overkill.
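As a rough sketch of that combined check (same caveat as before: the /28 range comes from earlier in this thread, and OpenAI's published list should be treated as the source of truth):

```python
# Sketch: treat a request as the genuine OpenAI browsing plugin only when the
# User-Agent *and* the source IP agree; the UA string alone is trivially spoofed.
import ipaddress

OPENAI_RANGE = ipaddress.ip_network("23.98.142.176/28")  # range quoted in this thread

def classify(user_agent: str, remote_ip: str) -> str:
    claims_chatgpt = "ChatGPT-User" in user_agent
    in_range = ipaddress.ip_address(remote_ip) in OPENAI_RANGE
    if claims_chatgpt and in_range:
        return "chatgpt-browser"   # UA and IP agree
    if claims_chatgpt:
        return "spoofed-ua"        # someone borrowing the UA string
    return "other"

print(classify("Mozilla/5.0 ChatGPT-User", "23.98.142.177"))  # chatgpt-browser
print(classify("Mozilla/5.0 ChatGPT-User", "198.51.100.9"))   # spoofed-ua
```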
1 Like

Probably the right question they should be asking is how they can work with OpenAI to monetize it (related to point 5), but I do understand the concern. Following your point (3), blocking that information only harms these news sites’ business. Any traffic you see from the ChatGPT UI is the plugin driving people to their sites through links, which means these news sites are throwing away money when they block the content.

First, a plugin is merely a description of a remote API that the ChatGPT UI learns how to use. The plugin is an interface to a remote service where everything runs, and that service is outside of OpenAI and its IP range. These third parties are the source of many bots.

Second, the UI will display images if it is given the URL by the third-party plugin. The UI will also display links with metadata, which means it is pulling metadata tags from the news sites. This would be useful in content moderation. I can’t say what OpenAI does with this metadata. If you block anything, this is what you would be blocking.

1 Like

I know this is an old thread now, but most requests from scrapers will simply appear as standard URL web requests (so they could be human or machine). The returned page is then typically looped through to extract data, dates, etc., and converted into something like JSON or XML in a file that can be processed further. The chances of actually being able to stop/block this kind of thing are pretty remote without putting that “sensitive” information behind a login/subscription model.

The reality is also that if it’s in the “public domain”, it’s accessible to both AI and human. If you’re letting a human access it as public domain information, why not also an AI?

If the website has advertising, the expectation is that the time spent creating the layout, organizing the material and providing the resources is traded for views of that advertising.

Or the page could be an SEO attempt to lead a visitor to a main service (a blog post, for example).

AI visits provide no benefit. They take resources and authority without giving anything back.

Not to worry though. Free information will slowly disappear and be locked behind APIs for your AI to consume, and the video-game equivalent of DRM will show up in the form of more intrusive anti-bot security that ultimately hurts the average viewer.

ChatGPT hardly scrapes anything these days. Copilot, on the other hand, will literally scrape anything.