Conversation

Jonathan Corbet

The amount of LWN traffic that is just AI bots downloading the same stuff over and over again is staggering; it's increasingly hard to see any sort of human signal in there at all. Something is going to give here at some point...

https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse/

@corbet 100% agreed -- I estimate upwards of 95% of all traffic to kernel.org services is greedy LLM bots that operate in complete disregard of robots.txt.

@corbet yep. Seeing this in Fedora land too. ;(

@corbet Disgusting behavior from these companies.

@corbet "In June 2024, someone used Facebook's content downloader to download 10 TB of data, mostly Zipped HTML and PDF files. We tried to email Facebook about it with the contact information listed in the bot's user agent, but the email bounced."
Typical Facebook.

@corbet This stuff is like a pros/cons comparison, only the 'pros' column is completely empty

@corbet For a site I run, I've black-holed four /16s so far just for one evil LLM scraper... 😭

I've also blocked six ASNs (so far), but that has been for different behavior. I'll happily add more if I need to.
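
Black-holing a range like that usually comes down to either a null route or a firewall drop rule; a minimal sketch, assuming iproute2 and nftables, with a documentation prefix standing in for the offending ranges:

    # null-route the range so replies never go back to it
    ip route add blackhole 203.0.113.0/24

    # or drop it at the firewall instead
    nft add table inet abuse
    nft add chain inet abuse input '{ type filter hook input priority -10; policy accept; }'
    nft add rule inet abuse input ip saddr 203.0.113.0/24 drop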

@mcdanlj I've attempted such things, but the problem with playing whac-a-mole games is that there is an infinite supply of moles. I've found I can block a lot of subnets and not really make a dent in the problem.

I've started to wonder if some of these people aren't renting botnets to get around blocking, rate limiting, and other defenses.

@corbet I suspect you are right. It would match their overall ethical profile as far as I can tell.

I'm one of those people who has gotten some shade from parts of the community for thinking that it is possible to train an LLM unilaterally and ethically. (I was also unpopular long ago for thinking that publishing a CD archive of Usenet posts was OK because it was just a different transmission mechanism, and we didn't complain about changing from UUCP to NNTP. I held that opinion even when it applied to my own work.)

But I'm not completely convinced that we yet have an existence proof for unilateral LLM scraping actually being done in a way that strikes me as ethical. And the DoS attacks we are seeing here make me angry.

@monsieuricon @corbet Are these using a proper user agent or are they disguising themselves?

@gromit @corbet I've seen both -- big corps will stick themselves into the user agent, but I see plenty of what is obviously bot traffic that uses generic browser strings, with both outdated and recent release versions in them.
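
One quick way to see that split is to tally User-Agent strings straight out of the access log; a rough sketch in Python, assuming the common nginx/Apache "combined" log format and a hypothetical log path:

    # Tally User-Agent strings from a combined-format access log; in that
    # format the last two quoted fields are the referer and the user agent.
    from collections import Counter
    import re

    UA_RE = re.compile(r'"[^"]*" "([^"]*)"\s*$')

    counts = Counter()
    with open("/var/log/nginx/access.log") as log:  # hypothetical path
        for line in log:
            match = UA_RE.search(line)
            if match:
                counts[match.group(1)] += 1

    for agent, hits in counts.most_common(20):
        print(f"{hits:8d}  {agent}")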

@corbet Have you considered making older parts of the LWN archive available only to subscribers?

@krans I think it would be a terrible mistake for LWN to do that. LWN is part of the history of the community; walling that off would damage both LWN and (I like to think, at least) the community badly.

@monsieuricon @corbet Yeah, just asking because we're facing similar issues with the Arch Wiki, in addition to some malicious requests (DDoS via expensive requests); it's not a nice combination 😔

@corbet
For the Arch Linux wiki we have also had to put in some measures to keep up with the increased activity. We aren't sure whether this is due to AI, but it feels likely, as archlinux.org suffers from the same traffic increase.

@corbet I hope the crawlers respect robots.txt rules.

If so, you can try something like this:

https://www.nayab.dev/robots.txt
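
A file along those lines might look like the following -- GPTBot, CCBot, and Google-Extended are published crawler tokens, though the exact list here is illustrative and only helps with crawlers that actually honor robots.txt:

    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: *
    Disallow: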

@basha Most of the crawlers do not even identify themselves as such; they will not consult the robots.txt file. There are no solutions to be found there, unfortunately.