I suddenly noticed hundreds of errors in the log file of a website I created, all caused by OpenAI bots: HTTP_FROM says gptbot(at)openai.com and USER_AGENT is Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)
I looked into one example, and the requested path goes way out of bounds. This is the path: /Odonto-lieyectoresli-541.aspx/assets/js/plugins/Docs/Productos/assets/js/Docs/Productos/assets/js/assets/js/assets/js/vendor/images2021/Docs/Productos/Docs/Productos/assets/js/vendor/Docs/Productos/Docs/Productos/Docs/Menu/Odonto-Gomas-para-pulido-de-composite-tipo-Enhace-1815.aspx
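One hypothetical way such a path can grow (the domain, base path, and relative segment below are invented for illustration, and this mechanism is a guess, not anything OpenAI has confirmed): if a crawler resolves a relative asset link against the URL it just discovered, and then treats the result as another page carrying the same relative link, the path keeps getting longer instead of converging:

```python
from urllib.parse import urljoin

# Invented example URLs: resolving the same relative link against each
# newly "discovered" URL appends the segment again on every round.
base = "https://example.com/Odonto-lieyectoresli-541.aspx/"
rel = "assets/js/Docs/Productos/"

url = base
for _ in range(3):
    url = urljoin(url, rel)  # each round appends the segment once more

print(url)
# https://example.com/Odonto-lieyectoresli-541.aspx/assets/js/Docs/Productos/assets/js/Docs/Productos/assets/js/Docs/Productos/
```

That would also explain why the bad paths repeat the same directory fragments (assets/js, Docs/Productos) over and over.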
Before blocking the bot I am writing to find out what’s wrong with my website.
Same problem here, with very long URLs containing a seemingly endless chain of escaped & characters:
20.171.206.213 - - [29/Oct/2024:10:56:16 +0100] “GET /298-bois-chene-moderne?amp%3Bamp%3Border=product.price.desc&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bq=Style-Design&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bq=Style-Design&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bq=Style-Classique+Chic&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bq=Style-Design&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bq=Style-Design&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bq=Style-Design&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bq=Style-Classique+Chic&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bq=Style-Design&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bq=Style-Design&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bq=Style-Design&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bq=Style-Classique+Chic&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bq=Style-Design&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bq=Style-Design&%3Bamp%3Bq=Style-Design&%3Border=product.name.desc&order=product.name.desc HTTP/1.1” 200 25215 “-” “Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)”
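One plausible explanation for those chains (a sketch, not anything confirmed): if the crawler HTML-escapes a query string and percent-encodes it on every pass, each & becomes &amp; and the ; is then encoded as %3B, so the garbage snowballs with every crawl. The parameter names below are taken from the log line above; the loop itself is a guess.

```python
import html
from urllib.parse import quote

# Guessed mechanism: HTML-escape the query string, percent-encode the
# result, and feed it back in as if it were a freshly discovered URL.
query = "order=product.price.desc&q=Style-Design"
for _ in range(3):
    query = quote(html.escape(query), safe="=&%+.-")

print(query)
# order=product.price.desc&amp%3Bamp%3Bamp%3Bq=Style-Design
```

Three passes already produce the amp%3Bamp%3B… pattern seen in the log line.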
Same here. I took the time to come here and report the problem, but I fear the ChatGPT developers won't read us; there is no “BUG REPORT: your AI is broken, please fix this” ticketing system. If we get no answer quite fast, I'll also block OpenAI, via robots.txt or, better, a firewall rule dropping the whole IP range NetRange: 20.160.0.0 - 20.175.255.255
CIDR: 20.160.0.0/12
NetName: MSFT
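For reference, the robots.txt option is just the token OpenAI documents for this crawler:

```
User-agent: GPTBot
Disallow: /
```

The firewall alternative would be a rule such as `iptables -A INPUT -s 20.160.0.0/12 -j DROP`, with the caveat that this drops everything else hosted in that Azure range too, and it only helps if the bot actually honors robots.txt or keeps crawling from those addresses.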
In the end I had no choice: those GPTBot attacks were overloading my dedicated server, so I had to reject all GPTBot requests:
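A minimal sketch of such a rejection, assuming nginx (placed inside the server block; on Apache you would need mod_rewrite or BrowserMatch instead):

```nginx
# Reject any request whose User-Agent header mentions GPTBot.
if ($http_user_agent ~* "GPTBot") {
    return 403;
}
```

Returning 403 at the web server keeps the requests out of the application entirely, which is what matters when the crawler is the source of the load.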
On my side, the errors reported above are not showing up anymore. I don't know whether something was fixed or whether the crawler just isn't coming around anymore.
I have the same problem - GPTBot/1.2 is hitting nonexistent endpoints on my site thousands of times each day, filling up my logs and depleting my Rollbar credits.
whois says that block is owned by Microsoft, so I suspect OpenAI is running the bot on Azure:
NetRange: 20.160.0.0 - 20.175.255.255
CIDR: 20.160.0.0/12
NetName: MSFT
NetHandle: NET-20-160-0-0-1
Parent: NET20 (NET-20-0-0-0-0)
NetType: Direct Allocation
OriginAS:
Organization: Microsoft Corporation (MSFT)
RegDate: 2017-02-22
Updated: 2017-02-22
Apparently GPTBot scrapes that URL and issues an HTTP GET for it, which is meaningless in the context of the application. My Rails server was serving a 404 for it.
I’m not clear on why the bot decided to slam those forms with thousands of requests
Still having some issues with GPTBot, which keeps finding nonexistent URLs. The graph for this 11-year-old website shows logged requests; the spike is basically due to GPTBot.
I am hosting about 600MB of files on a VPS. Usually, the traffic is very low, no more than a few hundred MB per day.
However, last month, GPTBot started crawling the files I host very aggressively, resulting in 30TB of traffic, which is equivalent to loading the entirety of my files 50,000 times. I did not find out about it until I received the invoice from my hosting provider today.
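To catch figures like this before the invoice arrives, the access log can be summed by user agent. A minimal sketch, assuming the common Apache/nginx “combined” log format; the sample line is adapted from the log excerpt earlier in this thread:

```python
import re
from collections import defaultdict

# One "combined" format line: ip ident user [time] "request" status bytes
# "referer" "user-agent". We only need the bytes and user-agent fields.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (?P<bytes>\d+|-) '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def bytes_per_agent(lines):
    """Sum response bytes per user agent across access-log lines."""
    totals = defaultdict(int)
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines in other formats
        size = m.group("bytes")
        totals[m.group("agent")] += 0 if size == "-" else int(size)
    return totals

sample = [
    '20.171.206.213 - - [29/Oct/2024:10:56:16 +0100] '
    '"GET /298-bois-chene-moderne HTTP/1.1" 200 25215 "-" '
    '"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; '
    'compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
]
totals = bytes_per_agent(sample)
gptbot_bytes = sum(v for k, v in totals.items() if "GPTBot" in k)
print(gptbot_bytes)  # 25215 for the single sample line
```

Run against the real log (e.g. `bytes_per_agent(open("access.log"))`), this shows how much of the transfer volume each crawler accounts for.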