The web scraper bot for Anthropic’s AI chatbot Claude hit iFixit’s website nearly a million times in a single day, despite the repair database having terms of service provisions that state “reproducing, copying or distributing any Content, materials or design elements on the Site for any other purpose, including training a machine learning or AI model, is strictly prohibited without the express prior written permission of iFixit.”
iFixit CEO Kyle Wiens tweeted Wednesday “Hey @AnthropicAI: I get you're hungry for data. Claude is really smart! But do you really need to hit our servers a million times in 24 hours? You're not only taking our content without paying, you're tying up our devops resources. Not cool.”
Wiens sent me server logs that showed thousands of requests per minute over a period of several hours. “We're just the largest database of repair information in the world, no big deal if they take it all without asking and swamp our servers in the process,” he told me, adding that iFixit’s website has millions of total pages, including repair guides and their revision histories, blog and news posts, research, forums, community-contributed repair guides, and question-and-answer sections.
This sort of scraping has become incredibly commonplace, and a recent study by the Data Provenance Initiative shows that website owners are increasingly trying to signal to AI companies that they do not want their content scraped for the purpose of training commercial AI tools. Wiens said that iFixit modified its robots.txt file this week to specifically block Anthropic’s crawler bots.
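iFixit hasn't published the exact rules it added, but a minimal robots.txt directive blocking Anthropic's crawler (which, as noted below, identifies itself as ClaudeBot) would look something like this:

```
# Illustrative sketch: disallow Anthropic's crawler site-wide.
# The exact rules iFixit deployed are not public.
User-agent: ClaudeBot
Disallow: /
```

Note that robots.txt is purely advisory: it signals the site owner's wishes, but nothing technically prevents a crawler from ignoring it.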
This is particularly notable because, when I asked Anthropic about the fact that its bot hit iFixit a million times in a day, I was sent a blog post by the company that puts the onus on website owners to specifically block Anthropic’s crawler, called ClaudeBot.
“As per industry standard, Anthropic uses a variety of data sources for model development, such as publicly available data from the internet gathered via a web crawler,” the blog post reads. “Our crawling should not be intrusive or disruptive. We aim for minimal disruption by being thoughtful about how quickly we crawl the same domains and respecting Crawl-delay where appropriate.”
It’s probably time to stop bothering with robots.txt, which is a pre-Trump tradition based on social norms, and time to start configuring nginx to return 403 for AI scraper agents.
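As a sketch of that approach, an nginx `map` on the request's User-Agent header can flag known AI crawlers and refuse them. The agent strings below are illustrative examples, not a complete or authoritative list; verify names against each vendor's published documentation:

```nginx
# Illustrative sketch: return 403 to known AI crawler user agents.
# The map block goes in the http context; the agent names are examples.
map $http_user_agent $ai_scraper {
    default         0;
    ~*ClaudeBot     1;
    ~*GPTBot        1;
    ~*CCBot         1;
}

server {
    listen 80;
    server_name example.com;

    if ($ai_scraper) {
        return 403;
    }

    # ... normal site configuration ...
}
```

Unlike robots.txt, this actually enforces the block at the server, at least until a crawler changes its User-Agent string.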
Of course, they’ll then start faking their user-agent strings, and we’re going to see an increase in approved-users-only websites. RIP open web