Conversation

Jonathan Corbet

So @lwn is currently under the heaviest scraper attack seen yet. It is a DDoS attack involving tens of thousands of addresses, and that is affecting the responsiveness of the site, unfortunately.

There are many things I would like to do with my time. Defending LWN from AI shitheads is rather far from the top of that list. I *really* don't want to put obstacles between LWN and its readers, but it may come to that.

(Another grumpy day, sorry)

@corbet @lwn This, combined with search engines prioritising the stolen content!
This is why I think the web is genuinely doomed. It's not enough that they steal the content and that search engines kill click-throughs and ad revenue; they are literally killing the ability of original authors to serve traffic to the few real users who might want to see it.
Devastating.


@corbet @lwn
As an avid longtime subscriber and reader, I can only give thanks and hope you will weather this blast of willfully wrong behaviour as well. Thank you for your openness.


@corbet @lwn Any inkling which AI (Arsehole Incorporated) it is? The crash can't come soon enough.

@foxylad @lwn There is no way to know who is after the data. The actual attack is likely perpetrated by Bright Data or one of its equally vile competitors.

@corbet @lwn
Just speaking with my user hat on here, but given the circumstances I don't mind the ever-so-slight inconvenience of a challenge.


@corbet @lwn If you need help, email me. I can work with you in case there's low-hanging fruit that you missed.


@corbet @lwn Obviously that sucks, but I am super happy with the RSS integration that I get with my lwn subscription. People who are affected by the outage should check that out. Not really a solution, but maybe part of one.


Ayush Agarwal (आयुष अग्रवाल)

@corbet @lwn I'm not sure how people in the kernel community reconcile using LLMs with the effect these LLMs have on small businesses and individuals hosting their websites for fun. It's not as if the kernel community itself isn't affected by these incessant DDoS attacks.


@corbet @lwn There's subscriber.lwn.net, which is only available to subscribers. One can either join the queue with the AI bots on lwn.net or subscribe and enjoy the snappy subscriber server.

I mean, that's not a great solution, but it's the only one that works.


@corbet @lwn at this point we might as well be offensive. If the client seems even slightly sus, just send them gibberish data talking about how good Chihuahua muffins are. Ideally LLM-generated (yes, gross) because this doesn't add new information (linear algebra yay) and makes models collapse (aka AI inbreeding).
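
(As a rough illustration of the gibberish-feeding idea: a minimal Python sketch that serves shuffled nonsense to any client, with the "seems sus" detection left out entirely. The port and corpus are invented; this is not anything LWN or the poster actually runs.)

```python
# Minimal gibberish tarpit sketch: every request gets a page of shuffled
# nonsense. A real deployment would only route flagged clients here.
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

CORPUS = ["Chihuahua muffins", "are widely regarded", "as a breakfast staple",
          "according to leading bakers", "in competitive dog-pastry circles"]

class TarpitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        # A few kilobytes of plausible-looking text; a meaner tarpit
        # would also trickle the bytes out slowly to tie the client up.
        body = " ".join(random.choice(CORPUS) for _ in range(500))
        self.wfile.write(f"<html><body><p>{body}</p></body></html>".encode())

HTTPServer(("", 8080), TarpitHandler).serve_forever()
```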


@corbet feel you. Same with my Podcast Directory


@cadey @corbet @lwn
I recently saw a traffic spike on a small HTML-only website that never had WordPress on it, but was suddenly getting failed wp-admin logins and hundreds of PHP vulnerability scans, non-stop, all from MSFT IP addresses. Abuse reports were sent, but there was no response and the abuse kept happening.
So now I'm blocking every MSFT CIDR block that I can find, server-wide.
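
(For the curious, a tiny Python sketch of what CIDR-based blocking boils down to, using the standard-library ipaddress module. The file name and example network are placeholders, not real Microsoft ranges.)

```python
# Check client addresses against a hand-maintained list of blocked
# networks, one CIDR per line in blocked_cidrs.txt (e.g. "203.0.113.0/24").
from ipaddress import ip_address, ip_network

with open("blocked_cidrs.txt") as f:
    BLOCKED = [ip_network(line.strip()) for line in f if line.strip()]

def is_blocked(addr: str) -> bool:
    ip = ip_address(addr)
    return any(ip in net for net in BLOCKED)

print(is_blocked("203.0.113.42"))  # True if the address falls in a listed block
```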


@corbet @lwn I've been experiencing about 20x more website traffic than normal myself. It's very likely this scraper bot traffic as well. Things are holding, but only because I took pains to use static site generation (absolutely minimal JavaScript, designed to be lightweight).

@suihkulokki @lwn The problem with that solution is that it may well make it harder for us to bring in new subscribers, which is something we definitely want to do. First impressions matter, so giving new folks a poor experience seems ... not great.

It may yet come to that, though.

@corbet @lwn @suihkulokki

Maybe it doesn't need to be subscriber only, just registered users only? Which can also be a PITA, but if there's no enshittification for non-registered users other than the bandwidth being shared with bots, maybe it's tolerable? Could even have a banner about this explaining the benefits of registering, and how LWN won't sell your data.

@jani @lwn @suihkulokki Such things have crossed our minds, certainly. The gotcha there is that we've already had troubles with bots creating accounts; I don't think they would hesitate to do more of that if that would improve their access.

That and, of course, the fact that everybody starts as an unregistered user. As long as we can avoid making the experience worse for them, I think we should.

@corbet @lwn @suihkulokki

Yeah, it's hard to argue against that.

And maybe you weren't looking for "helpful" advice anyway, but, uh, you know your audience. :)

@jani @lwn @suihkulokki Suggestions are much appreciated! It's not as if we've figured all this stuff out...

@corbet @lwn The "harder to onboard new users" part is certainly one reason why that solution isn't great

I just don't really see anything else working long term. Everything else is just kind whack-a-mole where the mole keeps getting more clever.


@corbet @lwn @jani @suihkulokki I have a simple solution: Stop being so damn relevant!!!

Wait... 🤡


@mupuf @corbet @lwn @suihkulokki

I don't think the scrapers care about that, though.


@jani @corbet @lwn @suihkulokki Sorry, I was being too optimistic... I was thinking they wanted sources with high SNR... But you are probably right...


@corbet @lwn @jani @suihkulokki one day the photocopiers will get busy after office hours again, but this time it's going to be Linux Weekly News instead of the punk fanzines


@corbet @lwn @jani @suihkulokki for RationalWiki I've had to resort to a mandatory JavaScript trick that sets a cookie. Unfortunately it seems to block Googlebot, but the choice comes down to (a) human users can use the site or (b) nobody can use the site, including Googlebot.
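
(A bare-bones Python sketch of that kind of JavaScript-sets-a-cookie gate, for illustration only; the cookie name and port are invented, and RationalWiki's actual implementation is surely different.)

```python
# No cookie -> serve a stub page whose script sets a cookie and reloads;
# cookie present -> serve the real content. Clients that never execute
# JavaScript (most scrapers, and sometimes Googlebot) never get past it.
from http.server import BaseHTTPRequestHandler, HTTPServer

CHALLENGE = b"""<html><head><script>
document.cookie = "js_ok=1; path=/";
location.reload();
</script></head><body>Checking your browser...</body></html>"""

class GateHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        if "js_ok=1" in self.headers.get("Cookie", ""):
            self.wfile.write(b"<html><body>Real content here.</body></html>")
        else:
            self.wfile.write(CHALLENGE)

HTTPServer(("", 8080), GateHandler).serve_forever()
```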


@dec23k @cadey @corbet @lwn I have a tiny site. SSH moved to a different port.
I see hundreds of login attempts a day. I receive a summary via logwatch. Nearly every day I'm blocking a whole /24 or even /16.

I rely on fail2ban to mitigate such webserver DDoS attacks, but maybe that's not enough.
How do you detect those spikes?
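
(One simple answer to "how do you detect those spikes?": count requests per client address in the access log and flag the heavy hitters. A hedged Python sketch; the log path, field position, and threshold are assumptions that vary per setup.)

```python
# Count requests per client address in a common/combined-format access
# log and print the busiest offenders, candidates for fail2ban/firewalling.
import re
from collections import Counter

THRESHOLD = 500  # requests per log window considered suspicious

counts = Counter()
with open("/var/log/nginx/access.log") as log:
    for line in log:
        m = re.match(r"(\S+)", line)  # client address is the first field
        if m:
            counts[m.group(1)] += 1

for addr, n in counts.most_common(20):
    if n >= THRESHOLD:
        print(f"{addr}: {n} requests")
```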


@sbb @corbet @lwn Yep, I've also had a big bump in traffic over the last couple of months (despite levels already having been elevated because of AI scraper activity).

Happily though, it looks like a lot of them have been falling into my LLM tarpit.

I think those figures are under-reporting too - I've also seen a significant rise in the number of 5xx status codes, suggesting my tarpit container might not be keeping up.


@dec23k @cadey @corbet @lwn Some days, the answer is to hit a BGP looking glass and just block every prefix from the origin AS of that service provider.
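
(For illustration, a small Python sketch of gathering every announced prefix for an origin AS, here via RIPEstat's public announced-prefixes endpoint; the ASN is a documentation placeholder, and the output is just a prefix list to feed whatever firewall you use.)

```python
# Fetch all prefixes announced by an origin AS and print them one per
# line, ready to paste into a blocklist.
import json
import urllib.request

ASN = "AS64496"  # placeholder from the documentation range, not a real offender
url = f"https://stat.ripe.net/data/announced-prefixes/data.json?resource={ASN}"

with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

for entry in data["data"]["prefixes"]:
    print(entry["prefix"])
```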


@corbet Instead of making it worse for unregistered users, how about sharding the site, with recent, frequently accessed content separate from the old long tail? It's easier to keep a small site in cache.

@trademark Making things worse for real users is something we have gone far out of our way to avoid. I'm not sure that sharding in that way would help much, though; cache isn't really the problem.

@corbet Can you say what the problem is? CPU? Contention in an unshardable database?

@trademark CPU, primarily, when things get really crazy. More CPU is easily arranged, of course, but it is irritating as hell to have to pay for that to feed our hand-written articles to those people.

@corbet Yeah, you're running Python code? Another option is to try a JIT or Cython, but if you value your time it might be more expensive to debug that than to pay for the extra CPU :(


@dec23k @cadey @corbet @lwn @davidgerard we have a program that hooks into nginx on one end and freebsd blocklistd on the other, and when someone hits up PHP endpoints on our sites that haven't had PHP on them in over 20 years, they get firewalled. sure as shit, most of them come from azure. also we host our own email and most of the spam that makes it through our filters comes from google. makes u think dot jpeg
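
(A simplified Python stand-in for that kind of hook, for flavor only; the original wires nginx to FreeBSD's blocklistd, while this sketch just tails an assumed log path and shells out to a placeholder firewall command.)

```python
# Follow the access log and firewall any client probing .php endpoints
# on a site that serves no PHP at all.
import re
import subprocess
import time

PHP_PROBE = re.compile(r'"(?:GET|POST) \S*\.php')

with open("/var/log/nginx/access.log") as f:
    f.seek(0, 2)  # start at the end of the log, like tail -f
    while True:
        line = f.readline()
        if not line:
            time.sleep(1)
            continue
        if PHP_PROBE.search(line):
            addr = line.split()[0]
            # Placeholder; substitute blocklistctl/pfctl/nft as appropriate.
            subprocess.run(["iptables", "-A", "INPUT", "-s", addr, "-j", "DROP"],
                           check=False)
```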


@atax1a @dec23k @cadey @corbet @lwn my hazard keeps being Ali, particularly their Singapore region; I gave that whole /16 an automatic 503


@dec23k @cadey @corbet @lwn grebedoc.dev gets these on an hourly basis (I solve it by just serving the 404s; it's not a problem for my server software)
