@corbet Disgusting behavior from these companies.
@corbet "In June 2024, someone used Facebook's content downloader to download 10 TB of data, mostly Zipped HTML and PDF files. We tried to email Facebook about it with the contact information listed in the bot's user agent, but the email bounced."
Typical Facebook.
@corbet I've black-holed four /16s so far for a site I run just for one evil LLM scraper... 😭
I've also blocked six (so far) ASNs but that has been for different behavior. But I'll happily add them if I need to.
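For anyone wanting to do the same, a minimal sketch of that kind of black-holing using an nftables set; the prefixes here are RFC 5737 documentation ranges, placeholders for the scraper's actual networks (an alternative is a null route via `ip route add blackhole <prefix>`):

```
# nftables fragment -- hypothetical prefixes, substitute the scraper's real ranges
table inet scrapers {
	set blackholed {
		type ipv4_addr
		flags interval
		elements = { 198.51.100.0/24, 203.0.113.0/24 }
	}
	chain input {
		type filter hook input priority 0; policy accept;
		# Drop everything from the listed networks before it reaches the web server
		ip saddr @blackholed drop
	}
}
```

Loaded with `nft -f`, the interval set makes adding further /16s a one-line change.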
@corbet I suspect you are right. It would match their overall ethical profile as far as I can tell.
I'm one of those people who has gotten some shade from parts of the community for thinking that it is possible to train an LLM unilaterally and ethically. (I was also unpopular long ago for thinking that publishing a CD archive of Usenet posts was OK, because it was just a different transmission mechanism, and we didn't complain about the change from UUCP to NNTP. I held that opinion even when it applied to my own work.)
But I'm not completely convinced that we yet have an existence proof for unilateral LLM scraping actually being done in a way that strikes me as ethical. And the DoS attacks we are seeing here make me angry.
@monsieuricon @corbet Are these using a proper user agent or are they disguising themselves?
@corbet Have you considered making older parts of the LWN archive available only to subscribers?
@monsieuricon @corbet Yeah, just asking because we're facing similar issues with the Arch Wiki; on top of some malicious requests (DDoS via expensive requests), it's not a nice combination 😔
@corbet
For the Arch Linux wiki we have also had to put in some measures to keep up with the increase in activity. We aren't sure whether this is due to AI, but it feels likely, as archlinux.org suffers from the same traffic increase.
@corbet I hope the crawlers respect robots.txt rules.
In that case, you can try something like this:
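Something along these lines, assuming the crawlers honor robots.txt at all; GPTBot is OpenAI's documented crawler token, other bots' tokens come from their published user agents, and the `/archive/` path is just a placeholder for whatever pages are expensive to serve:

```
# robots.txt -- only helps against crawlers that actually respect it
User-agent: GPTBot
Disallow: /

# Fallback for everything else: keep bots out of the expensive pages
User-agent: *
Disallow: /archive/
```

Of course, the complaints in this thread are mostly about bots that ignore robots.txt entirely, which is where blocking at the network level comes in.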