social.kernel.org

Conversation

Jonathan Corbet

corbet

Should you be wondering why @LWN #LWN is occasionally sluggish... since the new year, the DDOS onslaughts from AI-scraper bots has picked up considerably. Only a small fraction of our traffic is serving actual human readers at this point. At times, some bot decides to hit us from hundreds of IP addresses at once, clogging the works. They don't identify themselves as bots, and robots.txt is the only thing they *don't* read off the site.

This is beyond unsustainable. We are going to have to put time into deploying some sort of active defenses just to keep the site online. I think I'd even rather be writing about accounting systems than dealing with this crap. And it's not just us, of course; this behavior is going to wreck the net even more than it's already wrecked.

Happy new year :)

43

435

354

K. Ryabitsev 🍁

Reply to @corbet

@corbet @LWN I feel your pain so much right now.

2

0

5

Greg Harvey 🌍

greg_harvey@tooting.ch

Reply to @corbet

@corbet 100% agree. Hosting company MD here, we've seen a massive uptick in AI bullshit. And they don't even respect robots.txt like the better search engines do.

0

0

0

Mythic Beasts

beasts@social.mythic-beasts.com

Reply to @corbet

@corbet @LWN in our experience you should prepare for thousands of distinct IPs.

1

0

0

nirik

nirik@fosstodon.org

Reply to @corbet

@corbet @LWN same here in #fedora infra. I had to block a bunch this morning to keep pagure.io usable. 😢

1

0

0

John Francis 🦫🇨🇦🍁💪⬆️

johnefrancis@cosocial.ca

Reply to @corbet

@corbet @LWN sounds like you need an AI poisoner like Nerpenthes or iocaine.

1

0

0

Jonathan Corbet

corbet

Reply to @beasts@social.mythic-beasts.com

@beasts @LWN We are indeed seeing that sort of pattern; each IP stays below the thresholds for our existing circuit breakers, but the overload is overwhelming. Any kind of active defense is going to have to figure out how to block subnets rather than individual addresses, and even that may not do the trick.

3

1

3

bignose

bignose@sw-development-is.social

Reply to @corbet

Thank you @corbet and all at @LWN for continuing the work of providing the excellent #LWN.

The "active defenses" against torrents of antisocial web scraping bots, has bad effects on users. They tend to be "if you don't allow JavaScript and cookies, you can't visit the site" even if the site itself works fine without.

I don't have a better defense to offer, but it's really closing off huge portions of the web that would otherwise be fine for secure browsers.

It sucks. Sorry, and thank you.

1

0

0

Jonathan Corbet

corbet

Reply to @johnefrancis@cosocial.ca

@johnefrancis @LWN Something like nepenthes (https://zadzmo.org/code/nepenthes/) has crossed my mind; it has its own risks, though. We had a suggestion internally to detect bots and only feed them text suggesting that the solution to every world problem is to buy a subscription to LWN. Tempting.

5

3

36

Sheogorath

sheogorath@shivering-isles.com

Reply to @corbet

@corbet @LWN I think we should start doing what the internet can do best: Collaborate on these things.

I see this on my services, Xe recently saw the same. https://xeiaso.net/notes/2025/amazon-crawler/ (and build a solution https://xeiaso.net/blog/2025/anubis/)

There is https://zadzmo.org/code/nepenthes/

I would love to see some kind of effort to map out bot IPs and get a public block list. I'm tired of their nonsense.

1

1

0

Jonathan Corbet

corbet

Reply to @bignose@sw-development-is.social

Edited 4 months ago

@bignose @LWN We have gone far out of our way to never require JavaScript to read LWN; we're not going back on that now.

0

2

11

algernon ludd

algernon@come-from.mad-scientist.club

Reply to @corbet

@corbet @johnefrancis @LWN I'm dealing with a similar issue now (though likely at a smaller scale than LWN!), and I found that leading crawlers into a maze helped a lot in discovering UAs and IP ranges that misbehave. Anyone who spends an unreasonable time in the maze gets rate limited, and served garbage.

So far, the results are very good. I can recommend a similar strategy.

Happy to share details and logs, and whatnot, if you're interested. LWN is a fantastic resource, and AI crawlers don't deserve to see it.

3

0

0

Grant Stephens

rexfuzzle@hub13.xyz

Reply to @corbet

@corbet @LWN @devs can we help?

0

0

0

pope of nope 🇺🇦

lkundrak@metalhead.club

Reply to @monsieuricon

@monsieuricon @LWN @corbet are you implying that there are models that are busy being trained to call someone a fuckface over misunderstanding of some obscure arm coprocessor register or respond with viro insults to the most unsuspecting victims?

1

0

0

Jonathan Corbet

corbet

Reply to @lkundrak@metalhead.club

@lkundrak @monsieuricon @LWN It's a service we provide :)

1

0

11

pope of nope 🇺🇦

lkundrak@metalhead.club

Reply to @corbet

@corbet @LWN @monsieuricon it's not the copilot we need but it's the copilot we deserve

0

0

0

George Ellenburg (he/him/his)

gme@bofh.social

Reply to @corbet

@corbet @LWN@fosstodon.org Cloudflare has an AI scraper bot block that’s free guys.

1

0

0

Adelie

adelie@darkpenguin.social

Reply to @corbet

"Any kind of active defense is going to have to figure out how to block subnets rather than individual addresses, and even that may not do the trick. "

if you're using iptables, ipset can block individual ips (hash:ip), and subnets (hash:net).

Just set it up last night for my much-smaller-traffic instances, feel free to DM

https://ipset.netfilter.org/

1

0

0

Jonathan Corbet

corbet

Reply to @gme@bofh.social

@gme I assume you're referring to https://blog.cloudflare.com/declaring-your-aindependence-block-ai-bots-scrapers-and-crawlers-with-a-single-click/ ?

It would appear to force readers to enable JavaScript, which we don't want to do. Plus it requires running all of our readers through cloudflare, of course...and I suspect that the "free tier" is designed to exclude sites like ours. So probably not a solution for us, but it could well work for others.

1

0

2

Jonathan Corbet

corbet

Reply to @adelie@darkpenguin.social

@adelie @LWN Blocking a subnet is not hard; the harder part is figuring out *which* subnets without just blocking huge parts of the net as a whole.

2

0

1

K. Ryabitsev 🍁

Reply to @corbet

@corbet @adelie @LWN I have been using pyasn to block entire subnets. It's effective, but only in the same way carpet bombing is. I'm sure I've blocked legitimate systems, but c'est la vie.

0

0

5

Adelie

adelie@darkpenguin.social

Reply to @corbet

Probably a good question for the fedi as a whole. I started with any 40x response in my logs, added any spamhaus hits from my mail server, and any user-agents with "bot" in the name. Plus facebook in particular has huge ipv4 blocks just for scraping, also easy to block.

1

0

0

Adelie

adelie@darkpenguin.social

Reply to @adelie@darkpenguin.social

Also tarpits! And nepenthes and nepenthes-adjacent tech!

https://tldr.nettime.org/@asrg/113867412641585520

https://gist.github.com/flaviovs/103a0dbf62c67ff371ff75fc62fdded3

1

0

0

Ronny Adsetts

RonnyAdsetts@mastodon.social

Reply to @corbet

@corbet @LWN would you be so kind as to write up whatever mitigations you come up with? I've been fighting this myself on our websites. You seeing semi-random user agents too?

1

0

0

Jonathan Corbet

corbet

Reply to @RonnyAdsetts@mastodon.social

@RonnyAdsetts @LWN The user agent field is pure fiction for most of this traffic.

0

0

2

George Ellenburg (he/him/his)

gme@bofh.social

Reply to @corbet

@corbet Nope, no Javascript needed. It operates at Layer 4.

1

0

0

Adelie

adelie@darkpenguin.social

Reply to @adelie@darkpenguin.social

@corbet @LWN You know, what we need is a clearinghouse for this like there are for piholes and porn and such. Could someone with some followers get #AIblacklist trending?

Post your subnets with that hashtag. If we get any traction, I'll host the list.

0

0

0

J. "Henry" Waugh

jhwgh1968@chaos.social

Reply to

@ted @corbet @LWN yeah I was going to say, a post directly above this one (with other tools as well):

https://tldr.nettime.org/@asrg/113867412641585520

0

0

0

halfa

halfa@mastodon.tedomum.net

Reply to @corbet

@corbet @LWN @beasts JS challenges somewhat works, at the cost of accessibility for JS-free browser

0

0

0

AndresFreundTec

AndresFreundTec@mastodon.social

Reply to @corbet

@corbet @LWN Do you see a lot of pointlessly redundant requests? I see a lot of related-seeming IPs request the same pages over and over.

1

0

0

Firstyear

firstyear@infosec.exchange

Reply to @corbet

@corbet @LWN I was just reading about https://git.madhouse-project.org/algernon/iocaine

0

0

0

SpaceLifeForm

SpaceLifeForm@infosec.exchange

Reply to @corbet

Check out Nepenthes in defensive mode.

0

0

0

Jonathan Corbet

corbet

Reply to @AndresFreundTec@mastodon.social

@AndresFreundTec @LWN Yes, a lot of really silly traffic. About 1/3 of it results in redirects from bots hitting port 80; you don't see them coming back with TLS, they just keep pounding their head against the same wall.

It is weird; somebody has clearly put some thought into creating a distributed source of traffic that avoid tripping the per-IP circuit breakers. But the rest of it is brainless.

3

0

3

ferricoxide@evil.social

ferricoxide@evil.social

Reply to @corbet

@corbet@social.kernel.org @LWN@fosstodon.org

Time to set up AI-poisoning bots.

Really great part of this BS is that if you're not a hyperscale social media platform, your ability to afford adequate defenses is going to be awful.

0

0

0

Henrik Grindal Bakken

Reply to @corbet

@corbet @LWN @AndresFreundTec Maybe the bot wrote the code itself?

0

0

0

irelephant@calckey.world

irelephant@calckey.world

Reply to @corbet

@corbet @LWN Sounds awful. You should consider setting up something like cloudflare or deflect.

0

0

0

Andreas

ePD5qRxX@mastodon.online

Reply to @corbet

> Sabot in the Age of AI
> Here is a curated list of strategies, offensive methods, and tactics for (algorithmic) sabotage, disruption, and deliberate poisoning.

https://tldr.nettime.org/@asrg/113867412641585520

0

0

0

Fonant

Fonant@vivaldi.net

Reply to @corbet

@corbet @LWN Yup, my servers too. Sometimes GPTBot as the UserAgent, but often not.

The AI bullshit merchants are slowly killing the web.

0

0

0

ZXGuesser

zxguesser@oldbytes.space

Reply to @corbet

@corbet @LWN @beasts
Large amounts coming from Huawei Cloud asns and trying to spider every possible GET parameter?

0

0

0

Johannes Hentschel

johentsch@hostux.social

Reply to @corbet

@corbet
In my timeline your post appeared directlt beneath this one https://tldr.nettime.org/@asrg/113867412641585520 Coincidence????
@LWN

0

0

0

vga256

vga256@dialup.cafe

Reply to @corbet

@corbet @LWN i'm not sure if you've already got a strategy for dealing with the scrapers already in mind, but if not -

dialup.cafe's running on nginx, and this has worked well for me so far:
https://rknight.me/blog/blocking-bots-with-nginx/

an apache translation of that using .htaccess would be possible as well.

0

0

0

Justin Mason

jmason@mastodon.ie

Reply to @corbet

@corbet @LWN it's disgusting to find the LLM companies using these disguised scraping practices. Clearly they recognise that they are acting abusively

0

0

0

sree

sree@ublog.thirdlaw.net

Reply to @corbet

@corbet @LWN Looking forward to the Grumpy Editor article on dealing with AI scraping bots!

0

0

0

Justin Mason

jmason@mastodon.ie

Reply to

@glent @corbet @LWN how on earth does that make any sense

0

0

0

Justin Mason

jmason@mastodon.ie

Reply to @sheogorath@shivering-isles.com

@sheogorath @corbet @LWN I agree.... this is very similar to the early days of antispam, IMO. I wonder if there's a way to detect abusive scraping (via hits on hidden links, etc.) and publish to a shared DNS blocklist?

0

0

0

Andrew Drake

adrake@sfba.social

Reply to @gme@bofh.social

@gme @corbet CloudFlare is only free until they smell money (i.e. significant traffic). Then they tell you you're over the (opaque) free plan limits, and demand you pay up, using the possibility of terminating your service as leverage in the subsequent pricing negotiations. If you think you might want to use them (which I don't recommend), start those negotiations before they have any leverage on you.

0

0

0

Seb-Solon

Seb_Solon@framapiaf.org

Reply to @corbet

@corbet
I'll be happy to hear about the solutions you end up finding / or not

Good luck on that matter.
@LWN

1

0

0

Carl Schwan

carl@kde.social

Reply to @corbet

@corbet @LWN Same for KDE gitlab instance. It's a pain :(

0

0

0

Seb-Solon

Seb_Solon@framapiaf.org

Reply to @Seb_Solon@framapiaf.org

@corbet
This just came into my timeline, in case it helps : https://tldr.nettime.org/@asrg/113867412641585520
@LWN

0

0

0

Furbland's Very Cool Mastodon™

GroupNebula563@mastodon.social

Reply to @corbet

@corbet @LWN I know Cloudflare has some fashion of AI-blocking doodad, it might be worth looking into that?

0

0

0

Alan Langford 🇨🇦🧤🧊摏

alan@mindly.social

Reply to @corbet

@corbet @LWN I have resorted to a wide swath of blocks, in Bytedance's case, blocking entire ASN's (most recently all of Meta). Other wide blocks are on the user agent. Ironically my big load spikes are now from a huge number of servers running ActivityPub whenever one of my sites is linked to!

1

0

0

The Doctor

drwho@hackers.town

Reply to @corbet

@corbet @LWN I'll just leave this here...

https://git.madhouse-project.org/algernon/iocaine

0

0

0

The Doctor

drwho@hackers.town

Reply to @corbet

@corbet @LWN @AndresFreundTec Maybe it's sabotage internally, so it's not /quite/ as bad. That's what I'd do.

0

0

0

Ben Zanin

gnomon@mastodon.social

Reply to @corbet

@corbet @LWN I'm sorry, I know this is a pain in the butt to deal with and that it's kind of demoralizing.

Is there anything I can do to help? I'm already a subscriber, and a very happy one; but if it'll diminish the demoralization at all, I really appreciate that you're tackling this problem. Can I get you a pizza or something?

0

0

0

Ben Zanin

gnomon@mastodon.social

Reply to @algernon@come-from.mad-scientist.club

@algernon @corbet @johnefrancis @LWN thank you for offering to protect a thing I love

0

0

0

Jessica D Dooley

jessicaddooley@infosec.exchange

Reply to @corbet

@corbet @LWN I sympathize, it's an exasperating problem. I've found microcaching all public facing content to be extremely effective.

- The web server sits behind a proxy micro cache
- Preseed the cache by crawling every valid public path
- Configure the valid life of cached content to be as short as you want
- Critically, ensure that every request is always served stale cached content while the cache leisurely repopulates with a fresh copy. This eliminates bot overwhelm by decoupling requests from ANY IP from the request rate hitting the upstream
- Rather than blocking aggressive crawlers, configure rate-limiting customized by profiling max predicted human rate
- For bots with known user agents, plus those detected by profiling their traffic, divert all their requests to a duplicate long lived cache that never invokes the upstream for updates

Micro caching has saved us thousands on compute, and eliminated weekly outages caused by abusive bots. It's open to significant tuning to improve efficiency based on your content.

Shout out to the infrastructure team at NPR@flipboard.com - a blog post they published 9 years ago (now long gone) described this approach.

0

0

0

Phracker

Phracker2Art@mstdn.social

Reply to @corbet

@corbet @LWN
Maybe try implementing some sort of Captcha system where a user or user agent has to prove that they're human in order to use the site.

0

0

0

lord pthenq1

pthenq1@mastodon.la

Reply to @corbet

I had my website behind cloudfare. It works. And mitigates successfully the bot attacks.

We were attacked more than 100 times from 2020....and almost nobody noticed

0

0

0

i.grok

igrok@hachyderm.io

Reply to @nirik@fosstodon.org

@nirik @corbet @LWN perhaps you can start sharing lists?

0

0

0

matthewcroughan

matthewcroughan@defenestrate.it

Reply to @corbet

Time to switch to gemini://

0

0

0

Jojonintendo

Jojonintendo@fosstodon.org

Reply to @corbet

@corbet @LWN Maybe something like @CrowdSec could help somehow? I guess if an "anti-AI" list were to be made, it would protect many users.

0

0

0

DamonHD

DamonHD@mastodon.social

Reply to

@dysfun @beasts @corbet @LWN thing's I do include quickly rejecting no-referer requests on anything other than legit landing pages, rejecting all query paras where not legit, rejecting more edge cases when under stress. All (Apache) server config rules.

0

0

0

Ric

dev_ric@fosstodon.org

Reply to @corbet

@corbet @LWN @beasts https://ip-tool.qweb.co.uk has buttons for generating htaccess, nginx, and iptables configs for entire network blocks. Just paste a malicious IP in, tap the htaccess button, and paste into your htaccess file, for example.

Also helps to have Fail2Ban set up to firewall anything that triggers too many 403s, so that htaccess blocks on one site become server wide blocks protecting all sites.

My general rate limiter for nginx is useful too: https://github.com/qwebltd/Useful-scripts/blob/main/Bash%20scripts%20for%20Linux/nginx-rate-limiting.conf

0

0

0

DamonHD

DamonHD@mastodon.social

Reply to @corbet

@corbet @LWN @AndresFreundTec read Hacker News and the like and you will see that there are hundreds or thousands or more idiots trying to scrape their way to riches. It's distributed idiocy in part rather than algorithmic DDoS.

0

0

0

Ayo

ayo@lonely.town

Reply to @corbet

@corbet @johnefrancis @LWN
Struggling with likely the same bots over here. I deployed a similar tarpit* on a large-ish site a few days ago - taking care not to trap the good bots - but can't say it's been very successful. It might have taken some load off of the main site, but not nearly enough to make a difference.

One more thing I'm considering is prefixing all internal links with a '/botcheck/' path for potentially suspicious visitors, set a cookie on that page and strip that prefix with JS. If the cookie is set on the /botcheck/ endpoint, redirect to the proper page, otherwise tarpit them. This way the site would still work as long as the user has *either* JS or cookies enabled. Still not perfect, but slightly less invasive than most common active defenses.

*) https://code.blicky.net/yorhel/infinite-slop

0

0

1

René Mayrhofer 🇺🇦 🇹🇼

rene_mobile@infosec.exchange

Reply to @corbet

@corbet @LWN I am beginning to work on defenses against bots not respecting robots.txt, but it might take a bit until it comes together (not hard, but not enough time for programming these days).

1

0

0

Kaleissin

kaleissin@wandering.shop

Reply to @algernon@come-from.mad-scientist.club

@algernon @corbet @johnefrancis @LWN Write-up yes, please!

0

0

0

Richard Hughes

hughsie@mastodon.social

Reply to @corbet

@corbet @LWN we've also had to put a IP block on firmware downloads from the #LVFS per day because of AI scrapers -- which makes everyone else's life a little harder.

The scraper useragent is completely wrong and dynamic (but plausible) and they seem to completely ignore robots.txt. Quite what AI robots want with GBs of firmware archives is quite beyond me.

https://lore.kernel.org/lvfs-announce/zDlhotSvKqnMDfkCKaE_u4-8uvWsgkuj18ifLBwrLN9vWWrIJjrYQ-QfhpY3xuwIXuZgzOVajW99ymoWmijTdngeFRVjM0BxhPZquUzbDfM=@hughsie.com/T/#u for details.

0

1

0

chebra

chebra@mstdn.io

Reply to @corbet

@corbet @johnefrancis @LWN You could also make this structure of pages several levels deep and once you are at a level where no living human would reasonably go just automatically add that IP to the blocklist (and share it with others).

0

0

0

chebra

chebra@mstdn.io

Reply to @algernon@come-from.mad-scientist.club

@algernon @corbet @johnefrancis @LWN Can it be turned into a traefik middleware and released under AGPL on codeberg?

0

0

0

Daniel Bovensiepen

daniel@bovi.social

Reply to @corbet

@corbet @LWN how about restricting reading to logged-in people only and then block the bot requests early in the pipeline to reduce the load

1

0

0

Peter Amstutz

tetron@hachyderm.io

Reply to @corbet

@corbet
@LWN
We are seeing this too, on arvados.org and other sites we operate, and have also seen enough complaints across HN, Reddit, fedi etc that I'm sure the whole internet is dealing with this bot problem. It reminds me of the massive wave of telemarketing calls that changed everyone's habits so that people don't answer unfamiliar phone numbers any more.

Poetic justice would be serving enough garbage text to the bots that it accelerates the AI model collapse.

0

0

0

Michael K Johnson

mcdanlj@social.makerforums.info

Reply to @corbet

@corbet @LWN I've been blocking large swathes of subnets coming from cloud providers to the forum I run. Anthropic has been the worst offender by far; everyone else seemed to at least rate limit, but Anthropic seemed determined to destroy the internet.

That may also cut off access from arbitrary VPN providers as well, but then many of our spammers and few of our legit people are coming through VPN providers, and as far as I can tell most of the VPN providers are shady. (Is there a trustworthy VPN provider other than Mullvad?)

0

0

0

Alan Langford 🇨🇦🧤🧊摏

alan@mindly.social

Reply to @alan@mindly.social

@corbet @LWN I should mention that the ASN is available through ipinfo.io. If you're working in PHP, @abivia has a library for it.

0

0

0

Jonathan Corbet

corbet

Reply to @daniel@bovi.social

@daniel @LWN The problem with restricting reading to logged-in people is that it will surely interfere with our long-term goal to have the entire world reading LWN. We really don't want to put roadblocks in front of the people we are trying to reach.

0

0

3

Jake - DC0OL

DC0OL@mstdn.social

Reply to @corbet

@corbet
Just in case it has not been mentioned. What about throtteling if a certain threshold is reached? Humans are no fast readers compared with bots. Might increase the tracking tables for the connections. Just my 2ct.
@LWN

0

0

0

Network == Abstraction Layer

overunderlay@bsd.network

Reply to @corbet

@corbet @LWN @dahukanna has some good suggestions
IIRC blocking by crawler user agent send to work reasonably well

1

0

0

d0rk ✅

drwetter@mastodon.social

Reply to @corbet

@corbet
Don't know whether that is an option but Cloudflare has an option to block AI bots.
@LWN

0

0

0

Geert Uytterhoeven

geert@society.oftrolls.com

Reply to @monsieuricon

@monsieuricon @LWN @corbet Is this why lore is so slow today?

0

0

0

🔗 David Sommerseth

dazo@infosec.exchange

Reply to @corbet

We had a suggestion internally to detect bots and only feed them text suggesting that the solution to every world problem is to buy a subscription to LWN.

What are you waiting for, @corbet? 😉

0

0

0

Solarpunk Davy

SolarDavy@climatejustice.social

Reply to @corbet

@corbet @LWN uhm, someone could, for a friend, check this out? https://git.madhouse-project.org/algernon/iocaine#iocaine

0

0

0

Michael K Johnson

mcdanlj@social.makerforums.info

Reply to @corbet

@corbet @LWN I'm wondering if a link that a human wouldn't click on but an AI wouldn't know any better than to follow could be used in nginx configuration to serve AI robots differently from humans, in a configuration that excluded search crawlers from that configuration. What such a link would look like would be different on different sites. That would require thought from every site, but also that would create diversity which would make it harder to guard against on the scraper side, so possibly could be more effective.

I might be an outlier here for my feelings on whether training genai such as LLMs from publicly-posted information is OK. It felt weird decades ago when I was asked for permission to put content I posted to usenet onto a CD (why would I care whether the bits were carried to the final reader on a phone line someone paid for or a CD someone paid for?) so it's not inconsistent in my view that I would personally feel that it's OK to use what I post publicly to train genai. (I respect that others feel differently here.)

That said, I'm beyond livid at being the target of a DDoS, and other AI engines might end up being collateral damage as I try to protect my site for use by real people.

3

0

0

Network == Abstraction Layer

overunderlay@bsd.network

Reply to @overunderlay@bsd.network

@corbet @LWN @dahukanna
see also https://saturation.social/@clive/113878996590959757

0

0

0

Michael K Johnson

mcdanlj@social.makerforums.info

Reply to @mcdanlj@social.makerforums.info

@corbet @LWN Also, the link that a human wouldn't click on should be <meta name="robots" content="noindex"> and a robots.txt section

User-agent: *
Disallow: /the/honeytrap/url

That way, all well-behaved robots that honor robots.txt, including search engines, would continue to work, and only the idiots who think they are above the rules will fall into it.

0

0

0

Jonathan Corbet

corbet

Reply to @mcdanlj@social.makerforums.info

@mcdanlj @LWN What a lot of people are suggesting (nepethenes and such) will work great against a single abusive robot. None of it will help much when tens of thousands of sites are grabbing a few URLs each. Most of them will never step into the honeypot, and the ones that do will not be seen again regardless.

1

0

2

Joe Brockmeier (@jzb)

jzb@mastodon.social

Reply to @mcdanlj@social.makerforums.info

@mcdanlj Even if you are OK with GenAI being trained on publicly available data -- those scrapers should abide by convention and if a robots.txt says "no" they should honor it. The fact they do not says volumes about the ethics of the people behind a lot of these GenAI companies.

1

0

0

Michael K Johnson

mcdanlj@social.makerforums.info

Reply to @jzb@mastodon.social

@jzb @corbet @LWN I'm saying I'm not reflexively "all AI is evil" and I still am beyond incensed at this abuse. I completely agree that dishonoring robots.txt, not providing a meaningful user agent, and running a continuous DDoS is a sign that they are morally corrupt.

The difference between Anthropic and a script kiddie with a stolen credit card or a botnet is that the script kiddie will eventually get bored and go attack someone else, as far as I can tell.

0

0

0

Michael K Johnson

mcdanlj@social.makerforums.info

Reply to @corbet

@corbet @LWN Oh no, you are right. It was such an enticing idea.

I want to be search-indexed. I care less about VPN access from VPNs that just rent cloud IPs; much of my spam comes in that way anyway, and it's not clear that many site users actually do. If I can distinguish those I might add a lot more ASN blocks. 😢

0

0

0

René Mayrhofer 🇺🇦 🇹🇼

rene_mobile@infosec.exchange

Reply to @rene_mobile@infosec.exchange

@corbet @LWN After quite a bit of playing with different options, https://www.mayrhofer.eu.org/post/defenses-against-abusive-ai-scrapers/ is my current setup. I am going to watch my logs for the next couple of days to see what the scrapers get up to.

0

1

0

About social.kernel.org

Terms of service

Please do not use this service in violation of the Linux Kernel Code of Conduct. Doing so will result in your account suspension with the referral of the matter to the CoC committee.
"Repeating"/"boosting" someone else's status on this platform will be treated as endorsement and will fall under rule #1.
You are encouraged to use this platform to promote your work on the Linux Kernel, but there is no restriction on permitted topics (with the exception of anything covered by #1 above).
There is no requirement to post in English, but it should be considered the primary language of communication on this platform.

Privacy notice

The admins of this service have access to all posted statuses. They aren't looking, but if it's something they shouldn't know about, then you should not post it on this platform.

Please see the Linux Foundation Privacy Policy, which applies to this platform as well.

Getting your own account

If you would like an account on this instance, please check that the following applies to you:

You are listed in MAINTAINERS or CREDITS
OR: You have a kernel.org account or email address
OR: You have a long and established history of involvement with the Linux Kernel

If the above is true and you agree with the Terms of Service and Privacy Notice listed above, please use these instructions to request an account:

How to request an account on social.kernel.org