@corbet 100% agree. Hosting company MD here, we've seen a massive uptick in AI bullshit. And they don't even respect robots.txt like the better search engines do.
Thank you @corbet and all at @LWN for continuing the work of providing the excellent #LWN.
The "active defenses" against torrents of antisocial web scraping bots, has bad effects on users. They tend to be "if you don't allow JavaScript and cookies, you can't visit the site" even if the site itself works fine without.
I don't have a better defense to offer, but it's really closing off huge portions of the web that would otherwise be fine for secure browsers.
It sucks. Sorry, and thank you.
@corbet @LWN I think we should start doing what the internet can do best: Collaborate on these things.
I see this on my services; Xe recently saw the same: https://xeiaso.net/notes/2025/amazon-crawler/ (and built a solution: https://xeiaso.net/blog/2025/anubis/)
There is https://zadzmo.org/code/nepenthes/
I would love to see some kind of effort to map out bot IPs and get a public block list. I'm tired of their nonsense.
@corbet @johnefrancis @LWN I'm dealing with a similar issue now (though likely at a smaller scale than LWN!), and I found that leading crawlers into a maze helped a lot in discovering the UAs and IP ranges that misbehave. Anyone who spends an unreasonable amount of time in the maze gets rate-limited and served garbage.
So far, the results are very good. I can recommend a similar strategy.
Happy to share details and logs, and whatnot, if you're interested. LWN is a fantastic resource, and AI crawlers don't deserve to see it.
@monsieuricon @LWN @corbet are you implying that there are models busy being trained to call someone a fuckface over a misunderstanding of some obscure ARM coprocessor register, or to respond with Viro insults to the most unsuspecting victims?
@corbet @LWN @monsieuricon it's not the copilot we need, but the copilot we deserve
@corbet @LWN@fosstodon.org Cloudflare has an AI scraper bot block that’s free, guys.
"Any kind of active defense is going to have to figure out how to block subnets rather than individual addresses, and even that may not do the trick. "
If you're using iptables, ipset can block individual IPs (hash:ip) and subnets (hash:net).
I just set it up last night for my much-smaller-traffic instances; feel free to DM.
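A minimal sketch of what that looks like (set names and addresses here are made-up examples, and you'll want persistence on top so the sets survive a reboot):

    # Create one set for individual IPs and one for whole subnets
    ipset create badbot-ips hash:ip
    ipset create badbot-nets hash:net

    # Add offenders (placeholder addresses from the documentation ranges)
    ipset add badbot-ips 192.0.2.10
    ipset add badbot-nets 203.0.113.0/24

    # Drop anything arriving from either set
    iptables -I INPUT -m set --match-set badbot-ips src -j DROP
    iptables -I INPUT -m set --match-set badbot-nets src -j DROP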
Also tarpits! And nepenthes and nepenthes-adjacent tech!
https://tldr.nettime.org/@asrg/113867412641585520
https://gist.github.com/flaviovs/103a0dbf62c67ff371ff75fc62fdded3
@corbet @LWN not sure if it works for LWN but I learned about this today: https://git.madhouse-project.org/algernon/iocaine
@corbet Nope, no JavaScript needed. It operates at Layer 4.
@corbet @LWN You know, what we need is a clearinghouse for this like there are for piholes and porn and such. Could someone with some followers get #AIblacklist trending?
Post your subnets with that hashtag. If we get any traction, I'll host the list.
@corbet @LWN I was just reading about https://git.madhouse-project.org/algernon/iocaine
@corbet@social.kernel.org @LWN@fosstodon.org
Time to set up AI-poisoning bots.
The really great part of this BS is that, if you're not a hyperscale social media platform, your ability to afford adequate defenses is going to be awful.
@corbet @LWN @AndresFreundTec Maybe the bot wrote the code itself?
> Sabot in the Age of AI
> Here is a curated list of strategies, offensive methods, and tactics for (algorithmic) sabotage, disruption, and deliberate poisoning.
@corbet
In my timeline your post appeared directly beneath this one: https://tldr.nettime.org/@asrg/113867412641585520 Coincidence????
@LWN
@corbet @LWN I'm not sure if you've already got a strategy for dealing with the scrapers in mind, but if not:
dialup.cafe's running on nginx, and this has worked well for me so far:
https://rknight.me/blog/blocking-bots-with-nginx/
An Apache translation of that using .htaccess would be possible as well.
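Roughly, the idea is to map the User-Agent header to a flag and return 403 for flagged requests; a sketch along those lines (the UA list is illustrative, not complete, and the map block has to live in the http {} context):

    map $http_user_agent $blocked_bot {
        default       0;
        ~*GPTBot      1;
        ~*ClaudeBot   1;
        ~*CCBot       1;
        ~*Amazonbot   1;
    }

    server {
        listen 80;
        server_name example.com;   # placeholder

        # Return 403 to anything flagged above
        if ($blocked_bot) {
            return 403;
        }

        root /var/www/html;
    }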
@sheogorath @corbet @LWN I agree.... this is very similar to the early days of antispam, IMO. I wonder if there's a way to detect abusive scraping (via hits on hidden links, etc.) and publish to a shared DNS blocklist?
@gme @corbet CloudFlare is only free until they smell money (i.e. significant traffic). Then they tell you you're over the (opaque) free plan limits, and demand you pay up, using the possibility of terminating your service as leverage in the subsequent pricing negotiations. If you think you might want to use them (which I don't recommend), start those negotiations before they have any leverage on you.
@corbet
This just came into my timeline, in case it helps : https://tldr.nettime.org/@asrg/113867412641585520
@LWN
@corbet @LWN @AndresFreundTec Maybe it's sabotage internally, so it's not /quite/ as bad. That's what I'd do.
@corbet @LWN I'm sorry, I know this is a pain in the butt to deal with and that it's kind of demoralizing.
Is there anything I can do to help? I'm already a subscriber, and a very happy one; but if it'll diminish the demoralization at all, I really appreciate that you're tackling this problem. Can I get you a pizza or something?
@algernon @corbet @johnefrancis @LWN thank you for offering to protect a thing I love
@corbet @LWN I sympathize; it's an exasperating problem. I've found microcaching all public-facing content to be extremely effective.
- The web server sits behind a proxy micro cache
- Preseed the cache by crawling every valid public path
- Configure the valid life of cached content to be as short as you want
- Critically, ensure that every request is always served stale cached content while the cache leisurely repopulates with a fresh copy. This eliminates bot overwhelm by decoupling the inbound request rate (from ANY IP) from the request rate hitting the upstream
- Rather than blocking aggressive crawlers, configure rate limiting tuned to the maximum request rate a human could plausibly produce (determined by profiling)
- For bots with known user agents, plus those detected by profiling their traffic, divert all their requests to a duplicate long-lived cache that never invokes the upstream for updates
Microcaching has saved us thousands on compute and eliminated weekly outages caused by abusive bots. There's significant room for tuning to improve efficiency based on your content.
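To make that concrete, here's a minimal nginx sketch of the stale-serving part (paths, ports, and timings are placeholders, and the cache preseeding and bot-profiling pieces aren't shown):

    # In the http {} context
    proxy_cache_path /var/cache/nginx/micro levels=1:2 keys_zone=micro:10m
                     max_size=1g inactive=10m;

    server {
        listen 80;
        server_name example.com;          # placeholder

        location / {
            proxy_pass http://127.0.0.1:8080;   # placeholder upstream
            proxy_cache micro;
            proxy_cache_valid 200 301 302 10s;  # "as short as you want"

            # Serve stale content while one request refreshes it in the background,
            # so the upstream never sees more than the refresh rate
            proxy_cache_use_stale updating error timeout http_500 http_502 http_503 http_504;
            proxy_cache_background_update on;
            proxy_cache_lock on;   # collapse concurrent misses into one upstream fetch
        }
    }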
Shout out to the infrastructure team at NPR@flipboard.com - a blog post they published 9 years ago (now long gone) described this approach.