Conversation

Jonathan Corbet

So @lwn is currently under the heaviest scraper attack seen yet. It is a DDoS attack involving tens of thousands of addresses, and that is affecting the responsiveness of the site, unfortunately.

There are many things I would like to do with my time. Defending LWN from AI shitheads is rather far from the top of that list. I *really* don't want to put obstacles between LWN and its readers, but it may come to that.

(Another grumpy day, sorry)

@corbet @lwn This, combined with search engines prioritising the stolen content!
This is why I think the web is genuinely doomed. It's not enough that they steal the content and that search engines kill click-throughs and ad revenue; they are literally killing the ability of original authors to serve traffic to the few real users who might want to see it.
Devastating.


@corbet @lwn
As an avid longtime subscriber and reader, I can only give thanks and hope you will weather this blast of willfully wrong behaviour as well. Thank you for your openness.


@corbet @lwn Any inkling which AI (Arsehole Incorporated) it is? The crash can't come soon enough.

@foxylad @lwn There is no way to know who is after the data. The actual attack is likely perpetrated by Bright Data or one of its equally vile competitors.

@corbet @lwn
Just speaking with my user hat on here, but given the circumstances I don't mind the ever-so-slight inconvenience of a challenge.


@corbet @lwn If you need help, email me. I can work with you in case there's low-hanging fruit that you missed.


@corbet @lwn Obviously that sucks, but I am super happy with the RSS integration that I get with my lwn subscription. People who are affected by the outage should check that out. Not really a solution, but maybe part of one.


Ayush Agarwal (आयुष अग्रवाल)

@corbet @lwn I'm not sure how people in the kernel community reconcile using LLMs with the effect these LLMs have on small businesses and individuals hosting their websites for fun. It's not as if the kernel community itself isn't affected by these incessant DDoS attacks.


@corbet @lwn There's subscriber.lwn.net, which is only available to subscribers. One can either join the queue with the AI bots on lwn.net or subscribe and enjoy the snappy subscriber server.

I mean, that's not a great solution, but it's the only one that works.


@corbet @lwn at this point we might as well be offensive. If the client seems even slightly sus, just send them gibberish data talking about how good Chihuahua muffins are. Ideally LLM-generated (yes, gross) because this doesn't add new information (linear algebra yay) and makes models collapse (aka AI inbreeding).
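
(As a rough illustration of the gibberish-feeding idea: a minimal Python sketch that serves shuffled nonsense to any client, with the "seems sus" detection left out entirely. The port and corpus are invented; this is not anything LWN or the poster actually runs.)

```python
# Minimal gibberish tarpit sketch: every request gets a page of shuffled
# nonsense. A real deployment would only route flagged clients here.
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

CORPUS = ["Chihuahua muffins", "are widely regarded", "as a breakfast staple",
          "according to leading bakers", "in competitive dog-pastry circles"]

class TarpitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        # A few kilobytes of plausible-looking text; a meaner tarpit
        # would also trickle the bytes out slowly to tie the client up.
        body = " ".join(random.choice(CORPUS) for _ in range(500))
        self.wfile.write(f"<html><body><p>{body}</p></body></html>".encode())

HTTPServer(("", 8080), TarpitHandler).serve_forever()
```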


@corbet feel you. Same with my Podcast Directory


@cadey @corbet @lwn
I recently saw a traffic spike on a small HTML-only website that never had WordPress on it, but was suddenly getting failed wp-admin logins and hundreds of PHP vulnerability scans, non-stop, all from MSFT IP addresses. Abuse reports were sent, but there was no response and the abuse kept happening.
So now I'm blocking every MSFT CIDR block that I can find, server-wide.
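
(For the curious, a tiny Python sketch of what CIDR-based blocking boils down to, using the standard-library ipaddress module. The file name and example network are placeholders, not real Microsoft ranges.)

```python
# Check client addresses against a hand-maintained list of blocked
# networks, one CIDR per line in blocked_cidrs.txt (e.g. "203.0.113.0/24").
from ipaddress import ip_address, ip_network

with open("blocked_cidrs.txt") as f:
    BLOCKED = [ip_network(line.strip()) for line in f if line.strip()]

def is_blocked(addr: str) -> bool:
    ip = ip_address(addr)
    return any(ip in net for net in BLOCKED)

print(is_blocked("203.0.113.42"))  # True if the address falls in a listed block
```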


@corbet @lwn I've been experiencing about 20x more website traffic than normal myself. It's very likely this scraper bot traffic as well. Things are holding, but only because I took pains to use static site generation (absolutely minimal JavaScript, designed to be lightweight).

@suihkulokki @lwn The problem with that solution is that it may well make it harder for us to bring in new subscribers, which is something we definitely want to do. First impressions matter, so giving new folks a poor experience seems ... not great.

It may yet come to that, though.

@corbet @lwn @suihkulokki

Maybe it doesn't need to be subscriber only, just registered users only? Which can also be a PITA, but if there's no enshittification for non-registered users other than the bandwidth being shared with bots, maybe it's tolerable? Could even have a banner about this explaining the benefits of registering, and how LWN won't sell your data.

@jani @lwn @suihkulokki Such things have crossed our minds, certainly. The gotcha there is that we've already had troubles with bots creating accounts; I don't think they would hesitate to do more of that if that would improve their access.

That and, of course, the fact that everybody starts as an unregistered user. As long as we can avoid making the experience worse for them, I think we should.

@corbet @lwn @suihkulokki

Yeah, it's hard to argue against that.

And maybe you weren't looking for "helpful" advice anyway, but, uh, you know your audience. :)

@jani @lwn @suihkulokki Suggestions are much appreciated! It's not as if we've figured all this stuff out...

@corbet @lwn The "harder to onboard new users" part is certainly one reason why that solution isn't great

I just don't really see anything else working long term. Everything else is just kind whack-a-mole where the mole keeps getting more clever.


@corbet @lwn @jani @suihkulokki I have a simple solution: Stop being so damn relevant!!!

Wait... 🤡


@mupuf @corbet @lwn @suihkulokki

I don't think the scrapers care about that, though.


@jani @corbet @lwn @suihkulokki Sorry, I was being too optimistic... I was thinking they wanted sources with high SNR... But you are probably right...


@corbet @lwn @jani @suihkulokki one day the photocopiers will get busy after office hours again, but this time it's going to be Linux Weekly News instead of the punk fanzines


@corbet @lwn @jani @suihkulokki for RationalWiki I've had to resort to a mandatory JavaScript trick that sets a cookie. Unfortunately it seems to block Googlebot, but the choice comes down to (a) human users can use the site or (b) nobody can use the site, including Googlebot.
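
(A bare-bones Python sketch of that kind of JavaScript-sets-a-cookie gate, for illustration only; the cookie name and port are invented, and RationalWiki's actual implementation is surely different.)

```python
# No cookie -> serve a stub page whose script sets a cookie and reloads;
# cookie present -> serve the real content. Clients that never execute
# JavaScript (most scrapers, and sometimes Googlebot) never get past it.
from http.server import BaseHTTPRequestHandler, HTTPServer

CHALLENGE = b"""<html><head><script>
document.cookie = "js_ok=1; path=/";
location.reload();
</script></head><body>Checking your browser...</body></html>"""

class GateHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        if "js_ok=1" in self.headers.get("Cookie", ""):
            self.wfile.write(b"<html><body>Real content here.</body></html>")
        else:
            self.wfile.write(CHALLENGE)

HTTPServer(("", 8080), GateHandler).serve_forever()
```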


@dec23k @cadey @corbet @lwn I have a tiny site. SSH moved to a different port.
I see hundreds of login attempts a day. I receive a summary via logwatch. Nearly every day I'm blocking a whole /24 or even /16.

I rely on fail2ban to mitigate such webserver DDoS attacks, but maybe that's not enough.
How do you detect those spikes?
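
(One simple answer to "how do you detect those spikes?": count requests per client address in the access log and flag the heavy hitters. A hedged Python sketch; the log path, field position, and threshold are assumptions that vary per setup.)

```python
# Count requests per client address in a common/combined-format access
# log and print the busiest offenders, candidates for fail2ban/firewalling.
import re
from collections import Counter

THRESHOLD = 500  # requests per log window considered suspicious

counts = Counter()
with open("/var/log/nginx/access.log") as log:
    for line in log:
        m = re.match(r"(\S+)", line)  # client address is the first field
        if m:
            counts[m.group(1)] += 1

for addr, n in counts.most_common(20):
    if n >= THRESHOLD:
        print(f"{addr}: {n} requests")
```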


@sbb @corbet @lwn Yep, I've also had a big bump in traffic over the last couple of months (despite levels already having been elevated because of AI scraper activity).

Happily though, it looks like a lot of them have been falling into my LLM tarpit.

I think those figures are under-reporting too - I've also seen a significant rise in the number of 5xx status codes, suggesting my tarpit container might not be keeping up.


@dec23k @cadey @corbet @lwn Some days, the answer is to hit a BGP looking glass and just block every prefix from the origin AS of that service provider.
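
(For illustration, a small Python sketch of gathering every announced prefix for an origin AS, here via RIPEstat's public announced-prefixes endpoint; the ASN is a documentation placeholder, and the output is just a prefix list to feed whatever firewall you use.)

```python
# Fetch all prefixes announced by an origin AS and print them one per
# line, ready to paste into a blocklist.
import json
import urllib.request

ASN = "AS64496"  # placeholder from the documentation range, not a real offender
url = f"https://stat.ripe.net/data/announced-prefixes/data.json?resource={ASN}"

with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

for entry in data["data"]["prefixes"]:
    print(entry["prefix"])
```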


@corbet Instead of making it worse for unregistered users, how about sharding the site, with recent, frequently accessed content separate from the old long tail? It's easier to keep a small site in cache.

@trademark Making things worse for real users is something we have gone far out of our way to avoid. I'm not sure that sharding in that way would help much, though; cache isn't really the problem.

@corbet Can you say what the problem is? CPU? Contention in an unshardable database?

@trademark CPU, primarily, when things get really crazy. More CPU is easily arranged, of course, but it is irritating as hell to have to pay for that to feed our hand-written articles to those people.

@corbet Yeah, you're running Python code? Another option is to try a JIT or Cython, but if you value your time it might be more expensive to debug that than to pay for the extra CPU :(


@dec23k @cadey @corbet @lwn @davidgerard we have a program that hooks into nginx on one end and freebsd blocklistd on the other, and when someone hits up PHP endpoints on our sites that haven't had PHP on them in over 20 years, they get firewalled. sure as shit, most of them come from azure. also we host our own email and most of the spam that makes it through our filters comes from google. makes u think dot jpeg
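
(A simplified Python stand-in for that kind of hook, for flavor only; the original wires nginx to FreeBSD's blocklistd, while this sketch just tails an assumed log path and shells out to a placeholder firewall command.)

```python
# Follow the access log and firewall any client probing .php endpoints
# on a site that serves no PHP at all.
import re
import subprocess
import time

PHP_PROBE = re.compile(r'"(?:GET|POST) \S*\.php')

with open("/var/log/nginx/access.log") as f:
    f.seek(0, 2)  # start at the end of the log, like tail -f
    while True:
        line = f.readline()
        if not line:
            time.sleep(1)
            continue
        if PHP_PROBE.search(line):
            addr = line.split()[0]
            # Placeholder; substitute blocklistctl/pfctl/nft as appropriate.
            subprocess.run(["iptables", "-A", "INPUT", "-s", addr, "-j", "DROP"],
                           check=False)
```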


@atax1a @dec23k @cadey @corbet @lwn my hazard keeps being Ali, particularly their Singapore region; I gave that whole /16 an automatic 503


@dec23k @cadey @corbet @lwn grebedoc.dev gets these on an hourly basis (I solve it by just serving the 404s; it's not a problem for my server software)
