social.kernel.org

Conversation

Jonathan Corbet

corbet

The @lwn web site is currently under the most intense scraper attack I have seen yet. 1.3M unique IP addresses within the last couple of hours, and it's not done yet. The work we have done on defenses appears to be paying off, though; the server is holding up reasonably well — so far.

...just in case anybody wonders why I have a rather dim view of the whole AI industry...

warthog9

warthog9@social.afront.org

Reply to @corbet

@corbet @lwn if you need more capacity, give a shout, I've a bit I can throw to the cause if you need it

Bart Veldhuizen 🚀

BartV@mastodon.social

Reply to @corbet

@corbet @lwn same on @blenderartists right now with up to 3.7M requests/hour. Looks more like a DDoS in our case than AI scraping though (which I suspected at first)

Michael Cook

foobarsoft@mastodon.social

Reply to @corbet

@corbet @lwn I hate sites with only a paywall. I like open or your “open later” model far better.

But I’m surprised more sites haven’t gone paywall only at this point. How is anyone supposed to survive this?

(Longtime subscriber 👍)

Very Human Robot

StompyRobot@mastodon.gamedev.place

Reply to @corbet

@corbet @lwn
What's the user agent of the scrapers?
Are you using any captcha or cloud armor?

Chris Samuel

chris_bloke@mastodon.acm.org

Reply to @corbet

@corbet @lwn ouch - fingers crossed for you all!

I wonder if this is related to weather.gov going down earlier (and forecast.weather.gov still being down), or just coincidence?

Nick Silkey 📻 N5ILK 🪓🪵 🪣💧

nicksilkey@hachyderm.io

Reply to @corbet

@corbet thank you and the @lwn team for your service. ✌️💙

Jernej Simončič

jernej__s@infosec.exchange

Reply to @corbet

@corbet @lwn A few months ago one of my clients' sites was getting 5000 requests per second, about 10-20 requests from a single IP, all residential. Server held up quite well until /var filled up because the logs didn't rotate fast enough (daily rotation, /var is 16 GB and normally has around 13 GB free).
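For a sense of scale, here is a rough back-of-the-envelope sketch of how fast access logs can eat free space at that request rate. The request rate and free space come from the post above; the 250 bytes per log line is an assumed, illustrative figure, since the post does not give one.

```python
def seconds_until_full(free_bytes, requests_per_second, bytes_per_line):
    """Time until the partition fills if every request writes one log line."""
    return free_bytes / (requests_per_second * bytes_per_line)


FREE_BYTES = 13 * 10**9      # "around 13 GB free" (from the post)
REQUESTS_PER_SECOND = 5000   # from the post
BYTES_PER_LINE = 250         # ASSUMPTION: typical access-log line length

hours = seconds_until_full(FREE_BYTES, REQUESTS_PER_SECOND, BYTES_PER_LINE) / 3600
print(f"/var would fill in roughly {hours:.1f} hours")
```

Under that assumption the free space is gone in under three hours, long before a daily rotation would run.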

Alexander S. Kunz

alexskunz@mas.to

Reply to @StompyRobot@mastodon.gamedev.place

@StompyRobot on my site it’s 90% residential proxies that camouflage as some Chrome. Not easy to block. Some pass simple automated challenges. @corbet @lwn

Andres

Andres4NY@social.ridetrans.it

Reply to @corbet

@corbet @lwn Website operators should be especially vocal about this, because too many people who have a positive view of AI companies have no idea..

𝔅icyclet𝓽𝓲𝓷𝓰

bicycletting@mastodon.ie

Reply to @corbet

@corbet @lwn Inkscape also got heated today:

https://floss.social/@doctormo/116825974597784492

Jonathan Corbet

corbet

Reply to @StompyRobot@mastodon.gamedev.place

@StompyRobot @lwn User agent is whatever random fiction they choose to put in there; there is no useful signal there. We really don't want to inflict captchas or cloudflare or any of that onto our readers, so we've had to find other ways to defend the site.

sergiodj

sergiodj@snac.sergiodj.net

Reply to @corbet

@corbet@social.kernel.org @lwn@lwn.net Amazing. I'd be interested in a post describing what you folks did, lessons learned, etc. (Assuming one doesn't exist already!)

Fazal Majid

fazalmajid@vivaldi.net

Reply to @BartV@mastodon.social

Edited 1 month ago

@BartV @corbet @lwn @blenderartists
Who would have a motive to DDoS LWN or Blender? Other than Microsoft and Adobe, of course.

Most likely those IPs are from residential proxies so you can't do an easy filtering rule like "Block all IPs in AWS/GCP/Azure address spaces". There were revelations last week that half of all Smart TV apps include residential proxy SDKs.
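To illustrate what the "easy filtering rule" mentioned above would look like, and why it misses residential proxies, here is a minimal sketch using Python's standard `ipaddress` module. The ranges are reserved documentation networks standing in for a provider's published list; they are placeholders, not real AWS/GCP/Azure ranges.

```python
import ipaddress

# ASSUMPTION: placeholder ranges (RFC 5737 / RFC 3849 documentation networks).
# A real deployment would load the ranges each cloud provider publishes.
DATACENTER_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "2001:db8::/32")
]


def is_datacenter(ip: str) -> bool:
    """Return True if the address falls inside any listed datacenter range."""
    addr = ipaddress.ip_address(ip)
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so mixing both families in one list is safe.
    return any(addr in net for net in DATACENTER_RANGES)


print(is_datacenter("192.0.2.55"))    # inside a listed range
print(is_datacenter("198.51.100.1"))  # not listed, e.g. a residential address
```

Traffic relayed through a home connection never matches such a list, which is the point the post makes.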

Alan Langford 🇨🇦🧤🧊

alan@mindly.social

Reply to @corbet

@corbet @lwn I am in the process of rolling out the next major version of my WAF and I've connected to the Abuse IP DB, which I now use to short circuit all the rest of the tests if the score is >=75. It's killing about 95% of the incoming traffic, and the WAF is getting about 95% of the rest (largely through ASN-wide blocks; host an AI scraper and you're dead to me unless you're whitelisted.)
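The short-circuit described above could be sketched roughly as follows. This is not the poster's WAF and not the AbuseIPDB client API: `lookup_score` is a hypothetical callable that returns a 0-100 reputation score for an address, and `other_checks` stands in for the remaining tests. Only the threshold of 75 comes from the post.

```python
from typing import Callable, Iterable

SCORE_THRESHOLD = 75  # from the post: block outright when score >= 75


def should_block(
    ip: str,
    lookup_score: Callable[[str], int],               # hypothetical reputation lookup
    other_checks: Iterable[Callable[[str], bool]],    # hypothetical remaining WAF tests
) -> bool:
    """Block on a bad reputation score first; run other tests only if needed."""
    if lookup_score(ip) >= SCORE_THRESHOLD:
        return True  # short circuit: the remaining tests never run
    return any(check(ip) for check in other_checks)


# Illustrative data only (documentation addresses, made-up scores).
scores = {"192.0.2.10": 90, "198.51.100.7": 10}


def lookup(ip: str) -> int:
    return scores.get(ip, 0)


def in_blocked_range(ip: str) -> bool:
    return ip.startswith("198.51.100.")


print(should_block("192.0.2.10", lookup, [in_blocked_range]))
print(should_block("203.0.113.5", lookup, [in_blocked_range]))
```

Putting the cheapest, highest-yield test first means most requests are rejected before any of the more expensive checks run.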

FLOSSbOxIN

fbinin@mastodon.fbin.in

Reply to @corbet

@corbet @lwn
I had these a few months back on my sites. Had to send them to hell by blocking them.

Jef Poskanzer

jef@mastodon.social

Reply to @BartV@mastodon.social

@BartV @corbet @lwn @blenderartists What's the difference?

Bradley M. Kühn

bkuhn@copyleft.org

Reply to @corbet

Edited 1 month ago

Anyone scraping to (re)train LLMs is a selfish capitalist who doesn't care who they inconvenience &/or hurt.

We've enough LLMs for the foreseeable future. None are as Free and Open as we'd like, but I'm sure it's not someone trying to build a truly #FOSS LLM that's DDoS'ing #LWN rn.

All new #LLM training should stop *immediately*; continuing now on training is unconscionable. If you work for a company that is still training, I urge you to resign in protest.

Cc: @lwn @Andres4NY
@corbet

Bradley M. Kühn

bkuhn@copyleft.org

Reply to @fazalmajid@vivaldi.net

At SFC, we've been seeing the primary culprit is .cn IP numbers and Zuckerberg.

& I can confirm User-Agent is fiction, at least from those parties. robots.txt of course ignored.

Cc: @BartV @corbet @lwn @blenderartists

bignose

bignose@fosstodon.org

Reply to @bkuhn@copyleft.org

Agreed in the main, @bkuhn.

I imagine an obvious response is “we have to keep putting *new* data into the #LLM so that it stays up to date”. As far as it goes, yes that's true.

But why is that so urgent? Not enough to justify the hammering of websites, the bulldozing of consent, the active deception to pass blocks, the refusal to countenance anything except #Hyperscaler interests. Stop it all, now.

Bradley M. Kühn

bkuhn@copyleft.org

Reply to @bignose@fosstodon.org

Edited 1 month ago

@bignose Even more than whether it is urgent, & whether you're for or against *using* LLM-gen-AI, the world is still figuring out what these monstrosities are useful *for* (if anything).

The ballyhoo is clearly wrong, but I also think those who say they are not useful for anything are also wrong.

We (humanity) need at least two years to even begin to understand what we have & what it's for. Let's pause and figure that out without capitalists in the driver's seat.

TelH90

kkarhan@mastodon.social

Reply to @corbet

@corbet @lwn From my work experience I can say that the only remediation at that scale is #Blackholing traffic at #IX-level from all malicious ASNs used for said #DDoS and sending angry #AbuseReport mails to every originator and their Upstreams.

- Make it THEIR PROBLEM!

Also let us know of the IP ranges so everyone else can block them as well!

Mans R

mansr@society.oftrolls.com

Reply to

@glitzersachen @corbet Is there any reason for them to be so aggressive or are they just incompetent? Regular search engine indexers seem to work just fine without causing trouble.

Edwin Török

edwintorok@discuss.systems

Reply to @corbet

@corbet @lwn @StompyRobot seems like you've found a nicer solution than https://anubis.techaro.lol/ (which introduces a delay for all legitimate users too).
I assume you won't be able to write much about the details of the defense, since that'll make it easier for the bots to circumvent the defenses?

Mike Taylor 🦕

mike@sauropods.win

Reply to @mansr@society.oftrolls.com

@mansr @glitzersachen @corbet I've wondered this, too. There is nothing obvious about web-scraping for AI models that would MAKE the bots behave like assholes; yet they do. Why?

Eloy.

eloy@hsnl.social

Reply to @mike@sauropods.win

@mike @mansr @glitzersachen @corbet My guess is that it's a combination of both incompetence with scraping efficiency and the scale of the scrapers: there are a handful of search engines, but everyone is trying to build their own AI models currently

Mike Taylor 🦕

mike@sauropods.win

Reply to @eloy@hsnl.social

@eloy @mansr @glitzersachen @corbet Ah, solid point that there are probably MANY organizations trying to scrape for their own models, and it only takes a tiny proportion of them to be incompetent or malicious to break everything.

Eloy.

eloy@hsnl.social

Reply to @mike@sauropods.win

@mike @mansr @glitzersachen @corbet Yep, exactly

Mike Anderson

mspcommentary@mastodon.online

Reply to @eloy@hsnl.social

@eloy @mike @mansr @glitzersachen @corbet but you wouldn't expect all of these incompetent scrapers to be hitting the site at the same time, would you?

Mike Taylor 🦕

mike@sauropods.win

Reply to @mspcommentary@mastodon.online

@mspcommentary @eloy @mansr @glitzersachen @corbet No, I'd interpret this as one incompetent scraper that has found a fantastically inefficient and hostile form of incompetence.

Jonathan Corbet

corbet

Reply to @kkarhan@mastodon.social

@kkarhan @lwn The problem is that it's *all* the ASNs. Probably even yours. These scrapers are built into apps and running on devices without the knowledge of their ostensible owners. Perhaps your phone is one of them.

Have a look at companies like Bright Data or opscloudio.com if you want to see how that sleazy business works.

RalfMaximus

ralfmaximus@mastodon.social

Reply to @mike@sauropods.win

@mike @mspcommentary @eloy @mansr @glitzersachen @corbet

Vibe coded!

anyGould

anyGould@kind.social

Reply to @corbet

@corbet OTOH, I'm willing to take that hit, because then I (as a client) am going to be yelling upstream to my providers.

Eric Zarowny

ezarowny@file-explorers.club

Reply to @corbet

@corbet @lwn I would love to hear about your mitigation strategies!

Jonathan Corbet

corbet

Reply to @ezarowny@file-explorers.club

@ezarowny @lwn Someday I would love to talk about them. I'm somewhat reluctant to do that now, though, at least until I've figured out what we're going to do when those strategies stop being effective.

Eric Zarowny

ezarowny@file-explorers.club

Reply to @corbet

@corbet @lwn that’s understandable. I’m mostly just playing whack-a-mole with ASNs at this point.

TelH90

kkarhan@mastodon.social

Reply to @corbet

@corbet @lwn If that's the case then the only valid option is to go " #fail2ban " - Style on said IPs and automate #AbuseReports to said ISPs.

- Cuz even lazy ones like #DTAG in #Germany will forcibly disconnect customers for running #malware.

I for one can guarantee this shit ain't on my devices, because said malware won't run on them!
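A minimal sketch of the "fail2ban-style" idea, assuming a simple per-IP sliding window: ban an address once it exceeds a hit count within a time window. This is an illustration in the spirit of fail2ban, not its actual implementation or configuration; the class name and thresholds are hypothetical.

```python
from collections import defaultdict, deque


class SlidingWindowBan:
    """Ban an IP once it makes more than max_hits requests within window_seconds."""

    def __init__(self, max_hits: int, window_seconds: float):
        self.max_hits = max_hits
        self.window = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent hits
        self.banned = set()

    def record(self, ip: str, now: float) -> bool:
        """Register one request at time `now`; return True if the IP is banned."""
        hits = self._hits[ip]
        hits.append(now)
        # Drop hits that have fallen out of the window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) > self.max_hits:
            self.banned.add(ip)
        return ip in self.banned


ban = SlidingWindowBan(max_hits=3, window_seconds=60)
for t in range(5):
    print(t, ban.record("192.0.2.1", now=float(t)))
```

Note the limitation raised elsewhere in this thread: when each address sends only 10-20 requests in total, a per-IP threshold low enough to catch it risks catching ordinary readers as well.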

TelH90

kkarhan@mastodon.social

Reply to @anyGould@kind.social

@anyGould @corbet +1

Give every affected IP (allocation) / ASN a redirect telling them that they've been blocked due to said #malware on their systems and that they've to remove it!

https://mastodon.social/@kkarhan/116834153763325544

Jonathan Corbet

corbet

Reply to @kkarhan@mastodon.social

@kkarhan @lwn So if we get one hit on, say, an article written in 2010, do we go through that whole process? How do we know that that isn't the one case of a real human following a link of interest? And how do we send, say, two-million abuse reports without just ending up on the spam blacklists ourselves?

Absolutist solutions like that sound good, but lack practicality.

TelH90

kkarhan@mastodon.social

Reply to @corbet

Edited 1 month ago

@corbet @lwn There are means to do just that…

Worst case, bundle the IPs by ASN and send 1 fax per 24 hours.
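The bundling suggested above might look something like this sketch. It assumes the IP-to-ASN mapping is already available from some external source, which is hypothetical here; the ASNs used are reserved documentation numbers and the addresses are documentation ranges.

```python
from collections import defaultdict

REPORT_INTERVAL = 24 * 60 * 60  # one report per ASN per 24 hours, as in the post


def bundle_by_asn(offenders):
    """Group (ip, asn) pairs into {asn: sorted list of unique IP strings}."""
    bundles = defaultdict(set)
    for ip, asn in offenders:
        bundles[asn].add(ip)
    # Sorting is lexicographic on the strings; it only makes output stable.
    return {asn: sorted(ips) for asn, ips in bundles.items()}


def due_reports(bundles, last_sent, now):
    """Return only the bundles whose ASN has not been reported within the interval."""
    return {
        asn: ips
        for asn, ips in bundles.items()
        if now - last_sent.get(asn, float("-inf")) >= REPORT_INTERVAL
    }


# Illustrative data only.
offenders = [
    ("192.0.2.1", 64496),
    ("192.0.2.2", 64496),
    ("198.51.100.9", 64497),
]
bundles = bundle_by_asn(offenders)
print(due_reports(bundles, last_sent={64496: 0}, now=3600))
```

This only caps the number of reports at one per ASN per day; it does not answer the question of telling a real reader from a bot.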

Dan York

danyork@mastodon.social

Reply to @corbet

@corbet Hopefully this news about the NetNut residential proxy platform will help diminish the attacks against @lwn - https://krebsonsecurity.com/2026/07/fbi-seizes-netnut-proxy-platform-popa-botnet/

When you first reported it, I shared it with a colleague involved with tracking residential proxies. They could see that lwn.net was showing up in some of the data they were getting about NetNut attacks.

Jonathan Corbet

corbet

Reply to @danyork@mastodon.social

@danyork @lwn Hopefully that will help, at least for a while. The takedown of IPIDEA earlier this year calmed things considerably for a few months. They always seem to rebuild their botnets, though...
