Not sure if anyone in the community has observed this, but GPTBot has recently been crawling my website pretty heavily, and most of its requests are for truncated URLs that return a 404 response (e.g. if the page’s URL is /example-1/, my Cloudflare logs show GPTBot requesting /examp).
These URLs aren’t linked anywhere else on the site (or on other websites) and I’m curious if anyone else has seen this.
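For anyone who wants to check their own logs for the same pattern, here is a minimal sketch in Python. It assumes you have already exported two lists yourself: the site's real paths and the paths that returned 404 to GPTBot. The function name and sample paths are hypothetical, and it only flags requests that are a strict prefix of a real path, which is the truncation described above.

```python
def find_truncated_requests(known_paths, requested_paths):
    """Map each requested path to the known paths it is a strict prefix of."""
    known = set(known_paths)
    flagged = {}
    for path in requested_paths:
        # Skip exact hits and the bare root, which is a prefix of every path.
        if path in known or len(path) <= 1:
            continue
        matches = sorted(p for p in known if p.startswith(path))
        if matches:
            flagged[path] = matches
    return flagged


# Hypothetical sample data; replace with paths taken from your own logs.
known_paths = ["/example-1/", "/example-2/", "/about/"]
requested_404s = ["/examp", "/abo", "/missing-page"]

result = find_truncated_requests(known_paths, requested_404s)
for requested, candidates in result.items():
    print(requested, "looks like a truncation of", candidates)
```

A path such as /missing-page is not flagged because no real path starts with it, so ordinary 404s are kept apart from truncated ones.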
Thank you for the bug report and for the additional information you provided in DM. We have fixed a bug in GPTBot’s link extraction from large HTML text nodes.
GPTBot is hammering some websites I maintain, with a request every 2-3 seconds. That amounts to a denial-of-service attack. It seems to ignore robots.txt. So I’ve had to add as the first line in index.php:
GPTBot does respect robots.txt directives. Aside from simple misconfigurations, a common cause is that sites (or their hosting providers) inadvertently fail to serve their robots.txt file to web crawlers at all. Please share more information, such as the domain name you’re referring to, to help us understand your issue. You can email gptbot (at) openai.com if you prefer not to share more info here.
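As a rough way to rule out a simple misconfiguration, the sketch below uses Python's standard `urllib.robotparser` to test whether a given robots.txt body disallows the `GPTBot` user-agent token. The robots.txt content and the example.com URL are made-up samples, and this parser is only an approximation: it is not necessarily how GPTBot itself interprets the file.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample robots.txt (an assumption for illustration, not from any real site).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

# Pass the bare token "GPTBot", not the full User-Agent header string.
gptbot_allowed = is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/example-1/")
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/example-1/")

print("GPTBot allowed:", gptbot_allowed)
print("Other bot allowed:", other_allowed)
```

To test the file your server actually returns rather than a pasted copy, the same class can fetch it with `set_url()` followed by `read()`. If that fetch fails or returns something unexpected, the file is probably not being served to crawlers.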