AI crawlers run up a website's hosting bill

Read the Docs, a third-party documentation hosting service, has found its bandwidth bill skyrocketing due to AI crawlers.

Web crawlers for search have long used a surprising amount of bandwidth and sometimes CPU resources. Website owners want these crawls in most cases, however, because they are how discovery now works on the Internet. AI crawlers that train on the content, by contrast, really don't benefit the site owner and, in this case, cost them a lot of money.

AI crawlers are acting in a way that is not respectful to the sites they are crawling, and that is going to cause a backlash against AI crawlers in general. As a community-supported site without a large budget, AI crawlers have cost us a significant amount of money in bandwidth charges, and caused us to spend a large amount of time dealing with abuse.

AI crawler abuse

We have been seeing a number of bad crawlers over the past few months, but here are a couple illustrative examples of the abuse we're seeing:

73 TB in May 2024 from one crawler

One crawler downloaded 73 TB of zipped HTML files in May 2024, with almost 10 TB in a single day. This cost us over $5,000 in bandwidth charges, and we had to block the crawler. We emailed this company, reporting a bug in their crawler, and we're working with them on reimbursing us for the costs.

Read the Docs
