This endpoint is intended solely for file downloads and is backed by an S3 bucket, not a traditional webpage meant for indexing. However, we’ve observed that GPTBot has been sending an excessive number of requests to this endpoint, which has significantly increased our bandwidth usage and incurred substantial traffic costs. These requests are unnecessary and harmful, as they fetch non-HTML content and place an unreasonable load on our infrastructure.
Escalated to OpenAI.
GPTBot user agent information is published at https://platform.openai.com/docs/bots, which you can use to restrict crawling in your robots.txt file. Alternatively, you could email gptbot at openai.com and let us know which domain you’re referencing.
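For example, a robots.txt entry like the following (a minimal sketch, using the GPTBot user agent token documented at the page above) blocks compliant crawling site-wide:

```
# Disallow OpenAI's GPTBot from the entire site
User-agent: GPTBot
Disallow: /
```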
Why is the crawler downloading binary files?
Is this expected?
Web crawlers generally just follow links and don’t know in advance what type of content will be served until they access it. If you’d like to disallow crawlers from visiting all or part of your site, you can do so using the robots.txt file. OpenAI’s crawler user agents are listed at https://platform.openai.com/docs/bots.
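If the downloads live under a single path prefix, a narrower rule keeps the rest of the site crawlable. This is a sketch assuming a hypothetical /downloads/ prefix; substitute your actual endpoint path:

```
# Keep all crawlers out of the download endpoint only
User-agent: *
Disallow: /downloads/
```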
If you have additional questions or concerns about GPTBot, please feel free to email us at gptbot (at) openai.com.