@corbet I think you're right; it might be worth looking up a few in https://www.spamhaus.org/ip-reputation/ to see if any have been used in spamming, as well
@corbet SourceHut has been having problems too. Might try talking to them to see how they mitigated it
@corbet IP based blocks have been useless for decades. Block behaviors. Most bots cost money to run via botnet rental fees.
@corbet `User-Agent: libcurl/Oral B 38.01`
@monsieuricon @corbet so you know the behavior and the pattern. Construct countermeasures. I'm honestly astounded to see guys close to the kernel unable to do this. Think like your opponent. Find his weak spots. Nothing has changed since Sun Tzu made his observations. All bots have weak spots.
I personally have had success with Geo blocking. I have some custom caching as well to make sure that the lookup for a given IP doesn't get run more than once per hour.
I know where my area of service is and any requests from outside it are silently dropped without a reply. This has cut down on automated bots almost completely.
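As a rough illustration of that setup, here is a minimal sketch, assuming the geoip2 Python package with a local MaxMind GeoLite2 country database; the allowed-country set, database path, and function name are placeholders rather than what I actually run:

```python
import time

import geoip2.database
import geoip2.errors

ALLOWED_COUNTRIES = {"US", "CA"}  # hypothetical "area of service"
CACHE_TTL = 3600                  # look up a given IP at most once per hour

_reader = geoip2.database.Reader("/var/lib/GeoIP/GeoLite2-Country.mmdb")
_cache = {}                       # ip -> (allowed, timestamp)

def ip_allowed(ip: str) -> bool:
    """True if the IP geolocates inside the service area; results are cached."""
    now = time.time()
    cached = _cache.get(ip)
    if cached and now - cached[1] < CACHE_TTL:
        return cached[0]
    try:
        country = _reader.country(ip).country.iso_code
        allowed = country in ALLOWED_COUNTRIES
    except geoip2.errors.AddressNotFoundError:
        allowed = False           # unknown origin: treat it like outside traffic
    _cache[ip] = (allowed, now)
    return allowed
```

The caller would then drop the connection without any reply whenever ip_allowed() returns False.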
@corbet @monsieuricon your response is revealing. No wonder you aren't getting anywhere. I'll try to explain in general terms.
You are in a game. You have to respect your opponent. If he is smarter or more able than you, you have to find someone capable of playing better than him. This is a private game, so rule 1:
Stop talking about this in public! This is a private game. It is not open source. Don't say what you know. Don't reveal what you learn.
1/
Read the Art of War. Really.
@corbet @monsieuricon to win this game requires understanding that, again, you respect your opponent. I can tell you I know at least one guy who used to deal with Linux kernel people and now refuses to, because he's very smart. If Linus has driven real hackers away then... Not good.
Rule 2: think outside the box. That's where your opponent plays. So that's where you should be. There is no greater compliment in my experience than from a skilled opponent. Nothing comes close. See respect.
2/
@corbet @monsieuricon thinking I can whip out a solution shows you don't understand the game and don't respect your opponent. 1st, I'm tired of that game, and 2nd, if I have to play it I get paid.
Rule 3: groupthink is hacker death. It's a private game. He's free to do what he wants, so that has to be matched.
Are you seriously telling me you know 0 great hackers in your world? Sometimes you don't need to be great, just stubborn and persistent. Qualities very rare in the corporate sector.
3/
@corbet @monsieuricon good luck. Though really luck has very little to do with it. But do get rid of the public lkml mindset. Is your opponent telling everyone what's going on? So why are you? When I was new to this I'd let them know they'd been trapped but realized how dumb that was. Then I'd do something subtle. Now it's totally transparent.
There are always decisions to make. How much does a false positive matter? Though if the weakness is detected, there will be few false positives.
4/
@corbet I see similar patterns trying to log in to my exim mail server. All of them already know valid user IDs on my server to attempt, and the same IP address rarely repeats (certainly not after I set up fail2ban). Seems like a large network is sharing notes about my server.
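For what it's worth, one quick way to see that pattern is to group the auth failures in the logs by the user ID being tried and count distinct source IPs. A sketch only; the regex and log path are illustrative and would need adjusting to the actual exim log format:

```python
import re
from collections import defaultdict

# Illustrative pattern for exim auth-failure lines; adjust to the real format.
FAIL = re.compile(r"authenticator failed.*\[(?P<ip>[0-9.]+)\].*set_id=(?P<user>[^)\s]+)")

ips_per_user = defaultdict(set)
with open("/var/log/exim4/mainlog") as log:
    for line in log:
        m = FAIL.search(line)
        if m:
            ips_per_user[m.group("user")].add(m.group("ip"))

# Many distinct IPs all probing the same valid user IDs points at a botnet
# working from a shared target list rather than one host retrying.
for user, ips in sorted(ips_per_user.items(), key=lambda kv: -len(kv[1])):
    print(f"{user}: {len(ips)} distinct IPs")
```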
@corbet Try to lose them in an AI poisoning maze running on a different host from the real content? robots.txt blocks the permitted, well-behaved crawlers from visiting the maze, but every real page contains a link into it. The maze is much bigger than the real site, so it should eat a higher proportion of the traffic, but it can be heavily throttled since there are no real users to worry about. Logs from the maze give you IPs to block from the real site, though as you note that doesn't help much if they only visit 3 times each.
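A minimal sketch of what such a maze host could look like, assuming Flask; the pages are derived deterministically from the path so no storage is needed, and the sleep stands in for "heavily throttled":

```python
import hashlib
import time

from flask import Flask

app = Flask(__name__)

def _tokens(seed: str, n: int):
    """Derive deterministic pseudo-random tokens from the path; no state needed."""
    digest = hashlib.sha256(seed.encode()).hexdigest()
    return [digest[i * 6:(i + 1) * 6] for i in range(n)]

@app.route("/maze/", defaults={"path": ""})
@app.route("/maze/<path:path>")
def maze(path):
    time.sleep(2)  # heavy throttling: no real users ever end up here
    words = _tokens(path or "root", 10)
    base = f"/maze/{path}".rstrip("/")
    links = "".join(f'<p><a href="{base}/{w}">{w}</a></p>' for w in words[:5])
    filler = " ".join(words)  # junk "content" for the scraper to ingest
    return f"<html><body><p>{filler}</p>{links}</body></html>"
```

The real site would link into /maze/ from every page, robots.txt would disallow it for well-behaved crawlers, and the maze host's access log becomes the source of IPs to block.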
@corbet I recently dealt with something similar: a million IP addresses scraping my gitweb.
I wonder if you're seeing hits to individual comment pages on LWN? That would match the pattern I was seeing.
How I dealt with it was noticing that the bots were hitting deep links that were only occasionally used by regular users. (Like LWN comment pages I imagine.) So a small degradation in an edge case that affected regular users was acceptable.
So I rate-limited access to those URLs (see the sketch after this post). When over the rate limit, my server serves up a response that says "Please wait..." and refreshes after a few seconds. The bots didn't wait for the refresh. When the server is not being swarmed by bots, regular users will see only a brief interruption. This cleared the botswarm for me in a couple of days, so it was apparently only incompetent spidering and not malicious.
I could imagine you doing something similar with lwn.net/Articles/* pages of type comment, and similarly not affecting most LWN users most of the time.
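A minimal sketch of that interstitial approach, assuming Flask; the /Articles/ path check, the rate numbers and the in-memory counter are all placeholders for whatever the real site would use:

```python
import time

from flask import Flask, request

app = Flask(__name__)

RATE = 20      # illustrative: deep-link requests allowed per window
WINDOW = 10    # seconds
_hits = []     # timestamps of recent deep-link hits (single-process sketch)

INTERSTITIAL = """<html><head><meta http-equiv="refresh" content="5"></head>
<body><p>Please wait...</p></body></html>"""

@app.before_request
def throttle_deep_links():
    # Only guard the rarely-used deep links (individual comment pages, say);
    # ordinary pages are never touched.
    if not request.path.startswith("/Articles/"):  # hypothetical URL pattern
        return None
    now = time.time()
    _hits[:] = [t for t in _hits if now - t < WINDOW]
    if len(_hits) >= RATE:
        # Over the limit: a real browser follows the meta-refresh after a few
        # seconds and gets through; the bots described above never came back.
        return INTERSTITIAL
    _hits.append(now)
    return None
```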
@corbet I'm trying to work out what kind of AI training operation would be using compromised hosts for scraping; I thought that for training you had to do the training itself on one host, or a small number of tightly coupled hosts, so then what is this?
@corbet That's what confuses me - an org that has a large enough mothership to do AI training, but illicit enough to have to use a botnet to gather the data.
@corbet So, @artandtechnic has been waging a war against these folks for his site and might have some ideas.
@penguin42 @corbet I mean, a lot of the money around LLM training has challenges with basic ethics and consent, so as more organizations have taken to blocking OpenAI and other scrapers, they may be spending time making their scrapers stealthy and harder to distinguish from normal client requests. But aside from the "legit" companies, one of the most lucrative use cases for LLM-generated slop is spam/con/fraud, which is primarily run by global organized crime groups. So maybe the more established operators are kicking out the obvious criminal groups, and those groups want their own infrastructure with fewer ethical constraints than OpenAI and are using their botnet fleet to build it. Running "pig butchering" scams at higher scale using LLMs rather than enslaved humans, since human trafficking draws attention from the authorities.
@DanielRThomas @corbet I just saw an article about something like this called Nepenthes: https://www.404media.co/developer-creates-infinite-maze-to-trap-ai-crawlers-in/ I'm not sure how well it'll work here, though, because each crawler source is only used for a handful of requests before moving on, so it's hard to identify a bot on so few requests in order to serve it a tarpit that doesn't trap legit users. Maybe combine it with something like @joeyh proposed, where a page with bogus honeypot links is served as an interstitial in place of the less common documents, with an auto-refresh to the legit document. The scraper then starts seeding its queue of requests with links that you know are bogus, so those links become an instant trap for future requests. Ideally a normal client would just follow the meta-refresh and the user-agent wouldn't try to pre-load the booby-trapped URLs, but this would require some validation, and for the LWN crowd that might include user-agents such as elinks, lynx and dillo, not just Firefox and Chrome.
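One way to sketch that combination: the interstitial carries hidden links whose URLs are HMAC-signed, so a later request for one of them proves the client scraped the page rather than following the refresh. Flask again, and the paths, secret and handlers are all made up for illustration:

```python
import hashlib
import hmac

from flask import Flask, abort, request

app = Flask(__name__)
SECRET = b"change-me"  # hypothetical signing key

def trap_link(seed: str) -> str:
    """Build a bogus but verifiable URL to seed a scraper's request queue."""
    tag = hmac.new(SECRET, seed.encode(), hashlib.sha256).hexdigest()[:16]
    return f"/docs/{seed}-{tag}"  # the /docs/ prefix is purely illustrative

@app.route("/interstitial/<name>")
def interstitial(name):
    # Hidden honeypot links plus an immediate refresh to the real document.
    links = "".join(f'<a href="{trap_link(f"{name}{i}")}">more</a>' for i in range(3))
    return (f'<html><head><meta http-equiv="refresh" content="0; url=/real/{name}">'
            f'</head><body hidden>{links}</body></html>')

@app.route("/docs/<name>")
def docs(name):
    # A request for a signed trap link can only come from something that
    # scraped the hidden links instead of following the meta-refresh.
    seed, _, tag = name.rpartition("-")
    expected = hmac.new(SECRET, seed.encode(), hashlib.sha256).hexdigest()[:16]
    if seed and hmac.compare_digest(tag, expected):
        app.logger.warning("trap link hit from %s", request.remote_addr)
        abort(403)
    abort(404)  # placeholder: nothing real lives under this prefix
```

A real deployment would feed the logged addresses into whatever blocking is already in place rather than just returning 403.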
@tristan957 @corbet sharing log analysis, blocked IP ranges and techniques with other FOSS site operators so that they all can benefit and not have to figure this out on their own is a great idea. Maybe a spamhaus RBL-like service could be created as a shared resource for participating members.
Tangent: it's kind of funny that, given how much time and effort has been spent over the last couple of decades on high-volume, low-latency, scalable distributed key/value database design at big tech companies, it turns out that (within some constraints) *DNS* is the best key/value store out there.
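That is essentially how the existing DNSBLs already work: a lookup is just a reversed-octet query against a list zone. A tiny sketch, with the zone name being a placeholder for whatever shared list got set up:

```python
import socket

def listed(ip: str, zone: str = "bl.example.org") -> bool:
    """DNSBL-style lookup: reverse the octets and query <reversed>.<zone>.

    Any DNS answer (conventionally in 127.0.0.0/8) means "listed";
    NXDOMAIN means not listed.
    """
    query = ".".join(reversed(ip.split("."))) + "." + zone
    try:
        socket.gethostbyname(query)
        return True
    except socket.gaierror:
        return False

# e.g. listed("192.0.2.1") queries 1.2.0.192.bl.example.org
```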
@raven667 @corbet @joeyh Yes something like that, or even just putting links on pages with robots.txt set such that well behaved crawlers ignore them. Longer list of options here: https://tldr.nettime.org/@asrg/113867412641585520