@corbet @lwn same on @blenderartists right now with up to 3.7M requests/hour. Looks more like a DDoS in our case than AI scraping though (which I suspected at first)
@corbet @lwn A few months ago one of my clients' sites was getting 5000 requests per second, about 10-20 requests from a single IP, all residential. Server held up quite well until /var filled up because the logs didn't rotate fast enough (daily rotation, /var is 16 GB and normally has around 13 GB free).
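As a rough back-of-the-envelope check on how quickly logs can eat that free space, here is a minimal sketch. The 13 GB and 5000 requests/second figures come from the post above; the 250-byte average log-line size is purely an assumption for illustration, and real access-log lines vary a lot.

```python
# Rough estimate of how long free disk space lasts under a request flood.
# ASSUMPTION: ~250 bytes per access-log line; the post does not give this figure.


def hours_until_full(free_bytes: float, requests_per_second: float,
                     bytes_per_log_line: float) -> float:
    """Return the hours until free_bytes is consumed by access-log growth."""
    bytes_per_second = requests_per_second * bytes_per_log_line
    return free_bytes / bytes_per_second / 3600


free = 13 * 10**9   # ~13 GB normally free on /var (from the post)
rate = 5000         # requests per second (from the post)
line_size = 250     # hypothetical average log-line size in bytes

print(f"{hours_until_full(free, rate, line_size):.1f} hours until /var is full")
```

Under those assumptions the disk fills in roughly three hours, far inside a daily rotation interval, so size-based rotation would likely be the safer trigger.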
@StompyRobot on my site it’s 90% residential proxies that masquerade as some version of Chrome. Not easy to block. Some pass simple automated challenges. @corbet @lwn
@BartV @corbet @lwn @blenderartists
Who would have a motive to DDoS LWN or Blender? Other than Microsoft and Adobe, of course.
Most likely those IPs are from residential proxies so you can't do an easy filtering rule like "Block all IPs in AWS/GCP/Azure address spaces". There were revelations last week that half of all Smart TV apps include residential proxy SDKs.
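For reference, the kind of "easy filtering rule" being ruled out here is just a membership test against published provider ranges. A minimal sketch with Python's standard `ipaddress` module follows; the networks listed are reserved documentation ranges used as placeholders, not real cloud provider allocations.

```python
import ipaddress

# ASSUMPTION: placeholder networks (RFC 5737 / RFC 3849 documentation ranges).
# A real deployment would load the ranges published by each cloud provider.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24", "2001:db8::/32")
]


def in_blocked_range(ip: str) -> bool:
    """Return True if ip falls inside any of the listed networks."""
    addr = ipaddress.ip_address(ip)
    # Membership tests across IP versions simply evaluate to False.
    return any(addr in net for net in BLOCKED_NETWORKS)


print(in_blocked_range("192.0.2.17"))
print(in_blocked_range("203.0.113.5"))
```

A rule like this does nothing against residential proxies, because their addresses sit in ordinary ISP allocations rather than in any list of this kind.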
@corbet @lwn I am in the process of rolling out the next major version of my WAF and I've connected to the Abuse IP DB, which I now use to short circuit all the rest of the tests if the score is >=75. It's killing about 95% of the incoming traffic, and the WAF is getting about 95% of the rest (largely through ASN-wide blocks; host an AI scraper and you're dead to me unless you're whitelisted.)
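The short-circuit described above might look roughly like the sketch below. Only the threshold of 75 comes from the post. The `reputation_score` callable and the list of remaining checks are hypothetical stand-ins, and this is not the poster's actual WAF code nor the AbuseIPDB client API.

```python
from typing import Callable, Iterable

BLOCK_THRESHOLD = 75  # from the post: short-circuit when the score is >= 75


def should_block(ip: str,
                 reputation_score: Callable[[str], int],
                 other_checks: Iterable[Callable[[str], bool]]) -> bool:
    """Return True if a request from ip should be rejected."""
    # Reputation lookup first: a high score skips every remaining test.
    if reputation_score(ip) >= BLOCK_THRESHOLD:
        return True
    # Otherwise fall through to the remaining (hypothetical) WAF tests.
    return any(check(ip) for check in other_checks)


# Hypothetical stand-in for a cached reputation lookup.
fake_scores = {"198.51.100.7": 90, "203.0.113.9": 10}


def lookup(ip: str) -> int:
    return fake_scores.get(ip, 0)


print(should_block("198.51.100.7", lookup, []))
print(should_block("203.0.113.9", lookup, []))
```

Putting the reputation lookup first means the more expensive tests only run on the small share of traffic that survives it.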
Anyone scraping to (re)train LLMs is a selfish capitalist who doesn't care who they inconvenience &/or hurt.
We've enough LLMs for the foreseeable future. None are as Free and Open as we'd like, but I'm sure it's not someone trying to build a truly #FOSS LLM that's DDoS'ing #LWN rn.
All new #LLM training should stop *immediately*; continuing to train now is unconscionable. If you work for a company that is still training, I urge you to resign in protest.
At SFC, we've been seeing that the primary culprits are .cn IP numbers and Zuckerberg.
& I can confirm User-Agent is fiction, at least from those parties. robots.txt of course ignored.
Agreed in the main, @bkuhn.
I imagine an obvious response is “we have to keep putting *new* data into the #LLM so that it stays up to date”. As far as it goes, yes that's true.
But why is that so urgent? Not enough to justify the hammering of websites, the bulldozing of consent, the active deception to pass blocks, the refusal to countenance anything except #Hyperscaler interests. Stop it all, now.
@bignose Even more than whether it is urgent, & whether you're for or against *using* LLM-gen-AI, the world is still figuring out what these monstrosities are useful *for* (if anything).
The ballyhoo is clearly wrong, but I think those who say they are not useful for anything are also wrong.
We (humanity) need at least two years to even begin to understand what we have & what it's for. Let's pause and figure that out without capitalists in the driver's seat.
@corbet @lwn From my work experience I can say that the only remediation at that scale is #Blackholing traffic at #IX-level from all malicious ASNs used for said #DDoS and sending angry #AbuseReport mails to every originator and their Upstreams.
- Make it THEIR PROBLEM!
Also let us know of the IP ranges so everyone else can block them as well!
@glitzersachen @corbet Is there any reason for them to be so aggressive or are they just incompetent? Regular search engine indexers seem to work just fine without causing trouble.
@corbet @lwn @StompyRobot seems like you've found a nicer solution than https://anubis.techaro.lol/ (which introduces a delay for all legitimate users too).
I assume you won't be able to write much about the details of the defense, since that'll make it easier for the bots to circumvent the defenses?
@mansr @glitzersachen @corbet I've wondered this, too. There is nothing obvious about web-scraping for AI models that would MAKE the bots behave like assholes; yet they do. Why?
@mike @mansr @glitzersachen @corbet My guess is that it's a combination of both incompetence with scraping efficiency and the scale of the scrapers: there are a handful of search engines, but everyone is trying to build their own AI models currently
@eloy @mansr @glitzersachen @corbet Ah, solid point that there are probably MANY organizations trying to scrape for their own models, and it only takes a tiny proportion of them to be incompetent or malicious to break everything.
@eloy @mike @mansr @glitzersachen @corbet but you wouldn't expect all of these incompetent scrapers to be hitting the site at the same time, would you?
@mspcommentary @eloy @mansr @glitzersachen @corbet No, I'd interpret this as one incompetent scraper that has found a fantastically inefficient and hostile form of incompetence.
@mike @mspcommentary @eloy @mansr @glitzersachen @corbet
Vibe coded!
@corbet OTOH, I'm willing to take that hit, because then I (as client) am going to be yelling upstream to my providers.
@corbet @lwn If that's the case then the only valid option is to go "#fail2ban"-style on said IPs and automate #AbuseReports to said ISPs.
- Cuz even lazy ones like #DTAG in #Germany will forcibly disconnect customers for running #malware.
I for one can guarantee this shit ain't on my devices, because said malware won't run on them!
Give every affected IP (allocation) / ASN a redirect telling them that they've been blocked due to said #malware on their systems and that they have to remove it!
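The "fail2ban-style" idea mentioned above boils down to counting hits per IP inside a time window and banning anything over a threshold. Below is a minimal in-memory sketch of that idea. It is not fail2ban itself, and the class name, threshold and window are all hypothetical.

```python
from collections import defaultdict, deque


class SlidingWindowBanner:
    """Ban an IP once it exceeds max_hits within window_seconds."""

    def __init__(self, max_hits: int, window_seconds: float) -> None:
        self.max_hits = max_hits
        self.window_seconds = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of recent hits
        self.banned = set()

    def record(self, ip: str, timestamp: float) -> bool:
        """Record one request; return True if ip is (now) banned."""
        if ip in self.banned:
            return True
        recent = self.hits[ip]
        recent.append(timestamp)
        # Drop hits that have fallen out of the window.
        while recent and recent[0] <= timestamp - self.window_seconds:
            recent.popleft()
        if len(recent) > self.max_hits:
            self.banned.add(ip)
            del self.hits[ip]
            return True
        return False


banner = SlidingWindowBanner(max_hits=3, window_seconds=10)
for t in range(5):
    print(t, banner.record("192.0.2.1", float(t)))
```

With only 10-20 requests per IP, as reported earlier in the thread, a per-IP counter like this would need a very low threshold to catch anything, which is presumably why several posters lean toward ASN-wide blocks instead.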