See my follow-up post. Not sure if this is documented anywhere. You could probably find it out by setting up a “honeypot” website and directing ChatGPT with the browser plugin enabled to go to your website. Third-party plugins may not follow any of the rules that OpenAI have set for themselves.
Third-party plugins are hosted by the third party, so if they call out to the internet their requests will come from unpredictable IP addresses. For example, my plugin is hosted on AWS Lambda, and it uses a variety of different IP addresses.
Use the same techniques that you would use to block any other unauthorised scraping of your websites.
If you can detect that someone is scraping your site then you could show them different content.
No, the only information you would get is that something has accessed the page.
With things like headless browsers, it’s very difficult to tell the difference between a real person and a machine browsing your website.
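The idea of showing different content to a detected client could be sketched roughly as below. This is only a minimal illustration: the function name, the token list, and the page strings are hypothetical, and the only token used is ChatGPT-User, which OpenAI states for its own browser plugin. A third-party plugin or headless browser can send any User-Agent it likes, so this check will not catch them.

```python
# Hypothetical list of User-Agent tokens to treat as automated clients.
# "ChatGPT-User" is the token OpenAI documents for its browser plugin;
# everything else in this sketch is an illustrative assumption.
KNOWN_BOT_TOKENS = ("ChatGPT-User",)


def select_content(user_agent: str, full_page: str, bot_page: str) -> str:
    """Return bot_page when the User-Agent contains a known bot token."""
    ua = (user_agent or "").lower()
    if any(token.lower() in ua for token in KNOWN_BOT_TOKENS):
        return bot_page
    return full_page


# Usage: a request carrying the ChatGPT-User token gets the alternate page.
print(select_content("Mozilla/5.0; ChatGPT-User/1.0", "full article", "summary only"))
```

In a real server the user_agent value would come from the request headers, and the same decision could be made in a reverse proxy or bot-control layer instead of application code.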
To respect content creators and adhere to the web’s norms, our browser plugin’s user-agent token is ChatGPT-User and is configured to honor websites’ robots.txt files. This may occasionally result in a “click failed” message, which indicates that the plugin is honoring the website’s instruction to avoid crawling it. This user-agent will only be used to take direct actions on behalf of ChatGPT users and is not used for crawling the web in any automatic fashion. We have also published our IP egress ranges. Additionally, rate-limiting measures have been implemented to avoid sending excessive traffic to websites.
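To see what a robots.txt rule for this user-agent would do, one option is to check it locally with Python's standard urllib.robotparser. The robots.txt content and the example.com URL below are assumptions for illustration, not a recommended policy, and the check only tells you how a client that honors robots.txt should behave.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt (an assumption for this sketch): block the ChatGPT
# browser plugin everywhere, allow every other user-agent.
ROBOTS_TXT = """\
User-agent: ChatGPT-User
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Report whether a robots.txt-compliant client may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "ChatGPT-User", "https://example.com/article"))
print(is_allowed(ROBOTS_TXT, "Mozilla/5.0", "https://example.com/article"))
```

With these rules the first call reports that ChatGPT-User is disallowed and the second that an ordinary browser user-agent is allowed.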
If you’re concerned about scraping, there are really two areas of concern:
ChatGPT with Browsing models, which, as @iamflimflam1 said, should be straightforward to restrict (either through robots.txt or IP blocking).
Plugins that do their own browsing, like WebPilot or others. These plugins make requests from their own servers with their own code, so it is up to each plugin to respect robots.txt. In theory OpenAI could have policies around this for plugin approval, but realistically they would be hard to enforce.
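For the IP-blocking option, a minimal sketch of matching a client address against a list of CIDR ranges with the standard ipaddress module might look like this. The ranges below are placeholders taken from the reserved documentation blocks, not OpenAI's real egress ranges, so the published list would have to be substituted. As noted above, this does nothing for third-party plugins, which call out from their own servers.

```python
import ipaddress

# Placeholder CIDR blocks from the reserved documentation ranges.
# These are NOT OpenAI's ranges; substitute the published egress list.
EXAMPLE_EGRESS_RANGES = ["192.0.2.0/24", "198.51.100.0/28"]


def in_ranges(client_ip: str, cidrs) -> bool:
    """Return True if client_ip falls inside any of the CIDR blocks."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)


# Usage: decide whether to block or rate-limit a request by source address.
print(in_ranges("192.0.2.17", EXAMPLE_EGRESS_RANGES))
print(in_ranges("203.0.113.5", EXAMPLE_EGRESS_RANGES))
```

In practice this kind of rule usually lives in a firewall, CDN, or bot-control product rather than in application code; the sketch only shows the matching logic.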
Hi, I saw that document. My concern is for third-party plugins. Can a user instruct ChatGPT to get the content using the OpenAI browser plugin and then use that content in a third-party plugin?
I’m trying to map all scenarios. Sorry if I’m asking dumb questions.
A user could ask ChatGPT to summarize a URL with a given plug-in. That plug-in will scrape the site with its own code from its own server. There’s no global way to block specifically ChatGPT plug-in traffic (or even identify it).
I see what you’re asking. In theory I guess they could use the web browsing plugin on your site to feed a third-party plugin. It doesn’t really make sense, though: the user would need to use the web browsing plugin together with the other third-party plugin. It’d be the same as the user just copying and pasting directly from your website. As mentioned, any third-party plugin is just an API service, and its traffic can be hard to catch.
They could just simply scrape the website using traditional tools and it would be much more effective. So no; it’s not practical, or even effective/possible.
It may be helpful to take a step back and look at exactly what you are concerned about. Can a user use ChatGPT to grab a single page from your website and operate on it with GPT or another third-party plugin? Yes. But they can do this now without ChatGPT or Plugins just by copy/pasting.
Does ChatGPT “learn” from this browsed content and have it available as part of its long-term memory? No.
Can a user use ChatGPT to scrape your entire website in one command? No.
Are you concerned about GPT being trained on your website’s data? That’s a whole other topic.
Thank you for all the answers. The point is, in an environment in which we have BOT CONTROL security running, the chances of getting our sites scraped are very low. While it is not impossible, it’s hard. We have also disabled copy/paste for end users. So we don’t want to just go ahead and open an attack vector for bad actors to take advantage of.
My understanding of this exchange is:
allowing the OpenAI browser plugin does not open the same feature to third parties
the browser plugin will behave like an end user, so it cannot scrape an entire site
using the OpenAI browser plugin to source information for a third-party plugin is possible but not effective / is expensive
Yes. Third-party plugins do not use the Web Browsing plugin. If they are retrieving content from your server, they are scraping it using their own tools & servers
Yes. The purpose of the web browsing plugin is to find and deliver information, not to scrape webpages. You can control what it is allowed to see, or even completely change the content.
It wouldn’t happen, and shouldn’t be a concern.
No matter what you do, it’s always very easy to scrape a website.
This is purely my opinion, but more and more people are relying on LLMs for information. Blocking them will be detrimental. It’d almost be like blocking the Google web crawler. Technically it does also scrape your website, but it does so to deliver results for people to find your content. If you block the ChatGPT user agent outright, then other news sources will be getting the traffic instead.
Probably the right question they should be asking is how they can work with OpenAI to monetize it (related to point 5) but I do understand the concern. If you follow your point (3), any information these news sites are choosing to block is only harming their business. Any traffic you see from ChatGPT UI is the plugin driving people to their sites through links. That means these news sites are throwing away money when they block the content.
First, a plugin is merely a description of a remote API that the ChatGPT UI learns how to use. The plugin is an interface to a remote service where everything runs. This is outside of OpenAI and its IP range. These 3rd parties are the source of many bots.
Second, the UI will display images if it is given the URL by the 3rd-party plugin. The UI will also display links with meta-data, which means it is pulling meta-data tags from the news sites. This would be useful in content moderation. I can’t say what OpenAI does with this meta-data. If you block anything, this is what you would be blocking.