@corbet I think you're right; it might be worth looking up a few in https://www.spamhaus.org/ip-reputation/ to see if any have been used in spamming, as well
@corbet SourceHut has been having problems too. Might try talking to them to see how they mitigated it
@corbet IP-based blocks have been useless for decades. Block behaviors. Most bots cost money to run via botnet rental fees.
@monsieuricon @corbet so you know the behavior and the pattern. Construct countermeasures. I'm honestly astounded to see guys close to the kernel unable to do this. Think like your opponent. Find his weak spots. Nothing has changed since Sun Tzu made his observations. All bots have weak spots.
I personally have had success with geo-blocking. I also have some custom caching to make sure that the lookup for a given IP doesn't get run more than once per hour.
I know where my area of service is and any requests from outside it are silently dropped without a reply. This has cut down on automated bots almost completely.
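A rough sketch of the "at most one lookup per IP per hour" idea, in Python. This is not the poster's actual code: the `lookup` callable (IP to country code), the class name and the allowed-country set are all hypothetical stand-ins, and the actual silent drop would normally happen in the firewall or web server rather than here.

```python
import time
from typing import Callable, Dict, Set, Tuple


class CachedGeoFilter:
    """Decide whether an IP is in the service area, caching each lookup."""

    def __init__(self, lookup: Callable[[str], str], allowed: Set[str],
                 ttl: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic):
        self._lookup = lookup    # hypothetical geo lookup: IP -> country code
        self._allowed = allowed  # country codes inside the service area
        self._ttl = ttl          # seconds a cached answer stays valid (1 hour)
        self._clock = clock      # injectable so the expiry can be tested
        self._cache: Dict[str, Tuple[str, float]] = {}

    def country(self, ip: str) -> str:
        now = self._clock()
        hit = self._cache.get(ip)
        if hit is not None and now - hit[1] < self._ttl:
            return hit[0]        # still fresh: skip the real lookup
        country = self._lookup(ip)
        self._cache[ip] = (country, now)
        return country

    def is_allowed(self, ip: str) -> bool:
        return self.country(ip) in self._allowed
```

When `is_allowed` returns False the caller would drop the request without replying. The cache here grows without bound, so a real deployment would probably also want to evict old entries.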
@corbet @monsieuricon your response is revealing. No wonder you aren't getting anywhere. I'll try to explain in general terms.
You are in a game. You have to respect your opponent. If he is smarter or more able than you, you have to find someone capable of playing better than him. This is a private game, so rule 1:
Stop talking about this in public! This is a private game. It is not open source. Don't say what you know. Don't reveal what you learn.
1/
Read the Art of War. Really.
@corbet @monsieuricon to win this game requires understanding that, again, you respect your opponent. I can tell you I know at least one guy who refuses to deal with Linux kernel people anymore because he's very smart. He used to. If Linus has driven real hackers away then... Not good.
Rule 2: think outside the box. That's where your opponent plays. So that's where you should be. There is no greater compliment in my experience than from a skilled opponent. Nothing comes close. See respect.
2/
@corbet @monsieuricon thinking I can whip out a solution shows you don't understand the game and don't respect your opponent. 1st, I'm tired of that game, and 2nd, if I have to play it I get paid.
Rule 3: groupthink is hacker death. It's a private game. He's free to do what he wants, so that has to be matched.
Are you seriously telling me you know 0 great hackers in your world? Sometimes you don't need to be great, just stubborn and persistent. Qualities very rare in the corporate sector.
3/
@corbet @monsieuricon good luck. Though really luck has very little to do with it. But do get rid of the public lkml mindset. Is your opponent telling everyone what's going on? So why are you? When I was new to this I'd let them know they'd been trapped but realized how dumb that was. Then I'd do something subtle. Now it's totally transparent.
There are always decisions to make. How much does a false positive matter? Though if the weakness is detected there will be few false results.
4/
@corbet I see similar patterns trying to log in to my exim mail server. All of them already know valid user IDs on my server to attempt, and the same IP address rarely repeats (certainly not after I set up fail2ban). Seems like a large network is sharing notes about my server.
@corbet Try to lose them in an AI poisoning maze running on a different host to the real content? Robots.txt blocks permitted crawlers from visiting the maze but every real page contains a link into the maze. Maze is much bigger than real site so should eat higher proportion of traffic, but heavily throttled as no real users to worry about. Logs from maze give you IPs to block from the real site, though as you note that doesn't help much if they only visit 3 times each.
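One possible shape for the maze part, as a minimal Python sketch. Everything named here is an assumption for illustration: the `https://maze.example` host, the fanout of 5, and the function names. It only covers generating an effectively endless set of linked pages without storing them, plus a robots.txt for the maze host; throttling and harvesting IPs from the logs are left out.

```python
import hashlib
from typing import List

# Served on the maze host only: well-behaved crawlers are told to stay out.
ROBOTS_TXT = "User-agent: *\nDisallow: /\n"

MAZE_BASE = "https://maze.example"  # hypothetical separate host


def maze_links(path: str, fanout: int = 5) -> List[str]:
    """Derive child links from a hash of the path, so no pages are stored."""
    links = []
    for i in range(fanout):
        digest = hashlib.sha256(f"{path}#{i}".encode()).hexdigest()[:12]
        links.append(f"{MAZE_BASE}/{digest}")
    return links


def maze_page(path: str, fanout: int = 5) -> str:
    """Render one maze page: nothing but links deeper into the maze."""
    items = "".join(
        f'<li><a href="{href}">{href}</a></li>'
        for href in maze_links(path, fanout)
    )
    return f"<html><body><ul>{items}</ul></body></html>"
```

Because the links are a pure function of the requested path, the same URL always yields the same page, and the maze can be arbitrarily larger than the real site at no storage cost.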
@corbet I recently dealt with something similar, a million IP addresses scraping my gitweb
I wonder if you're seeing hits to individual comment pages on LWN? That would match the pattern I was seeing.
How I dealt with it was noticing that the bots were hitting deep links that were only occasionally used by regular users. (Like LWN comment pages I imagine.) So a small degradation in an edge case that affected regular users was acceptable.
So I rate limited access to those urls. When over rate limit, my server serves up a response that says "Please wait..." and refreshes after a few seconds. The bots didn't wait to refresh. When the server is not being swarmed by bots, regular users will see only a brief interruption. This cleared the botswarm for me in a couple of days, so it was apparently only incompetent spidering and not malicious.
I could imagine you doing something similar with lwn.net/Articles/* pages of type comment, and similarly not affecting most LWN users most of the time.
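Roughly what that could look like, as a Python sketch rather than my real setup. The assumptions: the limit is global for the whole class of deep-link URLs (per-IP limits don't help when each address only shows up a few times), the `/deep/` prefix, the limits, and the `render` callable are hypothetical, and the wait page uses a plain HTML meta refresh.

```python
import time
from collections import deque
from typing import Callable, Deque

WAIT_PAGE = (
    '<html><head><meta http-equiv="refresh" content="{delay}"></head>'
    "<body>Please wait...</body></html>"
)


class DeepLinkLimiter:
    """Sliding-window limit shared by all clients of the deep-link URLs."""

    def __init__(self, max_hits: int, window: float,
                 clock: Callable[[], float] = time.monotonic):
        self._max_hits = max_hits
        self._window = window
        self._clock = clock
        self._hits: Deque[float] = deque()  # timestamps of admitted requests

    def admit(self) -> bool:
        now = self._clock()
        # Forget admitted requests that have aged out of the window.
        while self._hits and now - self._hits[0] >= self._window:
            self._hits.popleft()
        if len(self._hits) >= self._max_hits:
            return False
        self._hits.append(now)
        return True


def respond(path: str, limiter: DeepLinkLimiter,
            render: Callable[[str], str],
            prefix: str = "/deep/", delay: int = 5) -> str:
    """Serve the wait page for rate-limited deep links, else the real page."""
    if path.startswith(prefix) and not limiter.admit():
        return WAIT_PAGE.format(delay=delay)
    return render(path)
```

A browser follows the refresh and gets the real page once the window has room again; a bot that never re-requests just goes away. Pages outside the prefix are never touched by the limiter.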
@corbet I'm trying to think of the AI training that would be using compromised hosts for scraping; I thought for training you had to do the training part on one or a small number of tightly coupled hosts; so then what is it?
@corbet That's what confuses me - an org that has a large enough mothership to do AI training, but illicit enough to have to use a botnet to gather the data.