GPTBot makes over 10,000 requests to my website

This endpoint is intended solely for file downloads and is backed by an S3 bucket, not a traditional webpage meant for indexing. However, we’ve observed that GPTBot has been sending an excessive number of requests to this endpoint, which has led to a significant increase in our bandwidth usage and incurred substantial traffic costs. These requests are unnecessary and harmful, as they access non-HTML content and place an unreasonable load on our infrastructure.

Escalated to OpenAI.


GPTBot user agent information is published at https://platform.openai.com/docs/bots, which you can use to restrict crawling in your robots.txt file. Alternatively you could email gptbot at openai.com and let us know which domain you’re referencing.
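For reference, a minimal robots.txt sketch that opts a whole site out of GPTBot crawling looks like this (the file has to be reachable at the root of the domain, e.g. https://example.com/robots.txt, for any crawler to honor it):

```
# Disallow OpenAI's GPTBot from the entire site
User-agent: GPTBot
Disallow: /
```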


Why is the crawler downloading binary files?
Is this expected?


Web crawlers generally just follow links and don’t know in advance what type of content is going to be served until they access it. If you’d like to disallow crawlers from visiting all or part of your site you can do so using the robots.txt file. OpenAI’s crawler user agents are listed here.
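If you only want to keep crawlers away from a download endpoint rather than the whole site, a path-level rule works too; the /downloads/ prefix below is just a placeholder for whatever path your S3-backed endpoint actually lives under:

```
# Keep all crawlers out of the download paths only
User-agent: *
Disallow: /downloads/
```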

If you have additional questions or concerns about GPTBot please feel free to email us at gptbot (at) openai.com.


The user agent list link is broken.

GPTBot is hammering our staging site with a request every 3 seconds. It’s found a calendar page and keeps making requests with different months and years in the URL parameters, some 800 years in the future. The staging site is not linked from anywhere and is only discoverable from the DNS settings. Altogether the antithesis of intelligence.

I’ve disallowed it in robots.txt but the requests keep coming.

This is a denial-of-service attack and should be shut down, using appropriate law enforcement as necessary.

Aside from simple misconfigurations, a common mistake is when sites (or their hosting providers) inadvertently fail to actually serve their robots.txt file to web crawlers. Please share more information, such as the domain name you’re referring to, to help us understand your issue. As mentioned above, you can email gptbot (at) openai.com if you prefer not to share more info here.
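As a quick sanity check that your robots.txt is actually reachable from outside and says what you think it says, a small sketch along these lines can help (the example.com URLs are placeholders for your own domain and a sample download URL):

```python
from urllib.robotparser import RobotFileParser

# Placeholder URLs; substitute your domain and a path GPTBot has been hitting.
ROBOTS_URL = "https://example.com/robots.txt"
TEST_URL = "https://example.com/downloads/some-file.bin"

parser = RobotFileParser()
parser.set_url(ROBOTS_URL)
parser.read()  # fetches and parses the live robots.txt over HTTP

# If the fetch 404s, the parser (like a real crawler) treats everything as allowed.
print("GPTBot allowed:", parser.can_fetch("GPTBot", TEST_URL))
```

If this prints True when you expect False, the crawler is most likely not seeing the same robots.txt you edited, for example because a CDN, proxy, or staging configuration is serving something different.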


I’ve sent an email with server logs and domain info. I am quite sure the problem is your end not mine. Please fix it.

Thanks Jake, confirmed GPTBot is no longer crawling the site now that it’s serving us content for robots.txt. I’ve replied to your email with more detail.

I just want to add to this thread that 10k requests over a few hours to a day is a very normal level of activity for a well-configured web crawler. GoogleBot does these kinds of numbers on many of my sites, and it’s a good thing: getting indexed for search means your content is discoverable.

If this is your first website and you are alarmed at the number of connections in a day, these kinds of numbers are quite typical. A denial-of-service attack is many thousands of requests in a matter of seconds, though even that is normal traffic for a busy website.

My recommendation is to let the crawlers do their thing if you wish to be found when people ask AI and search engines for answers.
