Conversation

Jonathan Corbet

The amount of LWN traffic that is just AI bots downloading the same stuff over and over again is staggering; it's increasingly hard to see any sort of human signal in there at all. Something is going to give here at some point...

https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse/

@corbet 100% agreed -- I estimate upwards of 95% of all traffic to kernel.org services is greedy LLM bots that operate in complete disregard of robots.txt.

@corbet yep. Seeing this in Fedora land too. ;(

@corbet Disgusting behavior from these companies.

@corbet "In June 2024, someone used Facebook's content downloader to download 10 TB of data, mostly Zipped HTML and PDF files. We tried to email Facebook about it with the contact information listed in the bot's user agent, but the email bounced."
Typical Facebook.

@corbet This stuff is like a pros/cons comparison, only the 'pros' column is completely empty

@corbet For a site I run, I've black-holed four /16s so far just for one evil LLM scraper... 😭

I've also blocked six ASNs (so far), but that has been for different behavior. I'll happily add more if I need to.
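
Black-holing a range like that usually comes down to either a null route or a firewall drop rule; a minimal sketch, assuming iproute2 and nftables, with a documentation prefix standing in for the offending ranges:

    # null-route the range so replies never go back to it
    ip route add blackhole 203.0.113.0/24

    # or drop it at the firewall instead
    nft add table inet abuse
    nft add chain inet abuse input '{ type filter hook input priority -10; policy accept; }'
    nft add rule inet abuse input ip saddr 203.0.113.0/24 drop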

@mcdanlj I've attempted such things, but the problem with playing whac-a-mole games is that there is an infinite supply of moles. I've found I can block a lot of subnets and not really make a dent in the problem.

I've started to wonder if some of these people aren't renting botnets to get around blocking, rate limiting, and other defenses.

@corbet I suspect you are right. It would match their overall ethical profile as far as I can tell.

I'm one of those people who has gotten some shade from parts of the community for thinking that it is possible to train an LLM unilaterally and ethically. (I was also unpopular long ago for thinking that publishing a CD archive of Usenet posts was OK because it was just a different transmission mechanism, and we didn't complain about changing from UUCP to NNTP. I held that opinion even when it applied to my own work.)

But I'm not completely convinced that we yet have an existence proof for unilateral LLM scraping actually being done in a way that strikes me as ethical. And the DoS attacks we are seeing here make me angry.

@monsieuricon @corbet Are these using a proper user agent or are they disguising themselves?

@gromit @corbet I've seen both -- big corps will stick themselves into the user agent, but I see plenty of what is obviously bot traffic that uses generic browser strings, with both outdated and recent release versions in them.
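
One quick way to see that split is to tally User-Agent strings straight out of the access log; a rough sketch in Python, assuming the common nginx/Apache "combined" log format and a hypothetical log path:

    # Tally User-Agent strings from a combined-format access log; in that
    # format the last two quoted fields are the referer and the user agent.
    from collections import Counter
    import re

    UA_RE = re.compile(r'"[^"]*" "([^"]*)"\s*$')

    counts = Counter()
    with open("/var/log/nginx/access.log") as log:  # hypothetical path
        for line in log:
            match = UA_RE.search(line)
            if match:
                counts[match.group(1)] += 1

    for agent, hits in counts.most_common(20):
        print(f"{hits:8d}  {agent}")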

@corbet Have you considered making older parts of the LWN archive available only to subscribers?

@krans I think it would be a terrible mistake for LWN to do that. LWN is part of the history of the community; walling that off would damage both LWN and (I like to think, at least) the community badly.

@monsieuricon @corbet Yeah, just asking because we're facing similar issues with the Arch Wiki, in addition to some malicious requests (DDoS via expensive requests); it's not a nice combination 😔

@corbet
For the Arch Linux wiki we have also had to put in some measures to keep up with the increased activity. We aren't sure whether this is due to AI, but it feels likely, as archlinux.org suffers from the same traffic increase.

@corbet I hope the crawlers respect robots.txt rules.

If so, you can try something like this:

https://www.nayab.dev/robots.txt
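
A file along those lines might look like the following -- GPTBot, CCBot, and Google-Extended are published crawler tokens, though the exact list here is illustrative and only helps with crawlers that actually honor robots.txt:

    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: *
    Disallow: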

@basha Most of the crawlers do not even identify themselves as such; they will not consult the robots.txt file. There are no solutions to be found there, unfortunately.