@corbet I think you're right; it might be worth looking up a few in https://www.spamhaus.org/ip-reputation/ to see if any have been used in spamming, as well
@corbet SourceHut has been having problems too. Might try talking to them to see how they mitigated it
@corbet IP based blocks have been useless for decades. Block behaviors. Most bots cost money to run via botnet rental fees.
@corbet `User-Agent: libcurl/Oral B 38.01`
@monsieuricon @corbet so you know the behavior and the pattern. Construct countermeasures. I'm honestly astounded to see guys close to the kernel unable to do this. Think like your opponent. Find his weak spots. Nothing has changed since Sun Tzu made his observations. All bots have weak spots.
I personally have had success with Geo blocking. I have some custom caching as well to make sure that the lookup for a given IP doesn't get run more than once per hour.
I know where my area of service is and any requests from outside it are silently dropped without a reply. This has cut down on automated bots almost completely.
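As a rough illustration of that setup, here is a minimal sketch, assuming the geoip2 Python package with a local MaxMind GeoLite2 country database; the allowed-country set, database path, and function name are placeholders rather than what I actually run:

```python
import time

import geoip2.database
import geoip2.errors

ALLOWED_COUNTRIES = {"US", "CA"}  # hypothetical "area of service"
CACHE_TTL = 3600                  # look up a given IP at most once per hour

_reader = geoip2.database.Reader("/var/lib/GeoIP/GeoLite2-Country.mmdb")
_cache = {}                       # ip -> (allowed, timestamp)

def ip_allowed(ip: str) -> bool:
    """True if the IP geolocates inside the service area; results are cached."""
    now = time.time()
    cached = _cache.get(ip)
    if cached and now - cached[1] < CACHE_TTL:
        return cached[0]
    try:
        country = _reader.country(ip).country.iso_code
        allowed = country in ALLOWED_COUNTRIES
    except geoip2.errors.AddressNotFoundError:
        allowed = False           # unknown origin: treat it like outside traffic
    _cache[ip] = (allowed, now)
    return allowed
```

The caller would then drop the connection without any reply whenever ip_allowed() returns False.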
@corbet @monsieuricon your response is revealing. No wonder you aren't getting anywhere. I'll try to explain in general terms.
You are in a game. You have to respect your opponent. If he is smarter or more able than you, you have to find someone capable of playing better than him. This is a private game, so rule 1:
Stop talking about this in public! This is a private game. It is not open source. Don't say what you know. Don't reveal what you learn.
1/
Read the Art of War. Really.
@corbet @monsieuricon to win this game requires understanding that, again, you respect your opponent. I can tell you I know at least one guy who used to deal with Linux kernel people and now refuses to, because he's very smart. If Linus has driven real hackers away then... Not good.
Rule 2: think outside the box. That's where your opponent plays. So that's where you should be. There is no greater compliment in my experience than from a skilled opponent. Nothing comes close. See respect.
2/
@corbet @monsieuricon thinking I can whip out a solution shows you don't understand the game and don't respect your opponent. 1st, I'm tired of that game, and 2nd, if I have to play it I get paid.
Rule 3: groupthink is hacker death. It's a private game. He's free to do what he wants, so that has to be matched.
Are you seriously telling me you know 0 great hackers in your world? Sometimes you don't need to be great, just stubborn and persistent. Qualities very rare in the corporate sector.
3/
@corbet @monsieuricon good luck. Though really luck has very little to do with it. But do get rid of the public lkml mindset. Is your opponent telling everyone what's going on? So why are you? When I was new to this I'd let them know they'd been trapped but realized how dumb that was. Then I'd do something subtle. Now it's totally transparent.
There are always decisions to make. How much does a false positive matter? Though if the weakness is detected, there will be few false positives.
4/
@corbet I see similar patterns trying to log in to my exim mail server. All of them already know valid user IDs on my server to attempt, and the same IP address rarely repeats (certainly not after I set up fail2ban). Seems like a large network is sharing notes about my server.
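For what it's worth, one quick way to see that pattern is to group the auth failures in the logs by the user ID being tried and count distinct source IPs. A sketch only; the regex and log path are illustrative and would need adjusting to the actual exim log format:

```python
import re
from collections import defaultdict

# Illustrative pattern for exim auth-failure lines; adjust to the real format.
FAIL = re.compile(r"authenticator failed.*\[(?P<ip>[0-9.]+)\].*set_id=(?P<user>[^)\s]+)")

ips_per_user = defaultdict(set)
with open("/var/log/exim4/mainlog") as log:
    for line in log:
        m = FAIL.search(line)
        if m:
            ips_per_user[m.group("user")].add(m.group("ip"))

# Many distinct IPs all probing the same valid user IDs points at a botnet
# working from a shared target list rather than one host retrying.
for user, ips in sorted(ips_per_user.items(), key=lambda kv: -len(kv[1])):
    print(f"{user}: {len(ips)} distinct IPs")
```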
@corbet Try to lose them in an AI poisoning maze running on a different host from the real content? robots.txt blocks the permitted, well-behaved crawlers from visiting the maze, but every real page contains a link into it. The maze is much bigger than the real site, so it should eat a higher proportion of the traffic, but it can be heavily throttled since there are no real users to worry about. Logs from the maze give you IPs to block from the real site, though as you note that doesn't help much if they only visit 3 times each.
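A minimal sketch of what such a maze host could look like, assuming Flask; the pages are derived deterministically from the path so no storage is needed, and the sleep stands in for "heavily throttled":

```python
import hashlib
import time

from flask import Flask

app = Flask(__name__)

def _tokens(seed: str, n: int):
    """Derive deterministic pseudo-random tokens from the path; no state needed."""
    digest = hashlib.sha256(seed.encode()).hexdigest()
    return [digest[i * 6:(i + 1) * 6] for i in range(n)]

@app.route("/maze/", defaults={"path": ""})
@app.route("/maze/<path:path>")
def maze(path):
    time.sleep(2)  # heavy throttling: no real users ever end up here
    words = _tokens(path or "root", 10)
    base = f"/maze/{path}".rstrip("/")
    links = "".join(f'<p><a href="{base}/{w}">{w}</a></p>' for w in words[:5])
    filler = " ".join(words)  # junk "content" for the scraper to ingest
    return f"<html><body><p>{filler}</p>{links}</body></html>"
```

The real site would link into /maze/ from every page, robots.txt would disallow it for well-behaved crawlers, and the maze host's access log becomes the source of IPs to block.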
@corbet I recently dealt with something similar: a million IP addresses scraping my gitweb.
I wonder if you're seeing hits to individual comment pages on LWN? That would match the pattern I was seeing.
How I dealt with it was noticing that the bots were hitting deep links that were only occasionally used by regular users. (Like LWN comment pages I imagine.) So a small degradation in an edge case that affected regular users was acceptable.
So I rate-limited access to those URLs (see the sketch after this post). When over the rate limit, my server serves up a response that says "Please wait..." and refreshes after a few seconds. The bots didn't wait for the refresh. When the server is not being swarmed by bots, regular users will see only a brief interruption. This cleared the botswarm for me in a couple of days, so it was apparently only incompetent spidering and not malicious.
I could imagine you doing something similar with lwn.net/Articles/* pages of type comment, and similarly not affecting most LWN users most of the time.
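A minimal sketch of that interstitial approach, assuming Flask; the /Articles/ path check, the rate numbers and the in-memory counter are all placeholders for whatever the real site would use:

```python
import time

from flask import Flask, request

app = Flask(__name__)

RATE = 20      # illustrative: deep-link requests allowed per window
WINDOW = 10    # seconds
_hits = []     # timestamps of recent deep-link hits (single-process sketch)

INTERSTITIAL = """<html><head><meta http-equiv="refresh" content="5"></head>
<body><p>Please wait...</p></body></html>"""

@app.before_request
def throttle_deep_links():
    # Only guard the rarely-used deep links (individual comment pages, say);
    # ordinary pages are never touched.
    if not request.path.startswith("/Articles/"):  # hypothetical URL pattern
        return None
    now = time.time()
    _hits[:] = [t for t in _hits if now - t < WINDOW]
    if len(_hits) >= RATE:
        # Over the limit: a real browser follows the meta-refresh after a few
        # seconds and gets through; the bots described above never came back.
        return INTERSTITIAL
    _hits.append(now)
    return None
```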
@corbet I'm trying to work out what kind of AI training operation would be using compromised hosts for scraping; I thought that for training you had to do the training itself on one host, or a small number of tightly coupled hosts, so then what is this?
@corbet That's what confuses me - an org that has a large enough mothership to do AI training, but illicit enough to have to use a botnet to gather the data.
@corbet So, @artandtechnic has been waging a war against these folks for his site and might have some ideas.
@penguin42 @corbet I mean, a lot of the money around LLM training has challenges with basic ethics and consent, so as more organizations have taken to blocking OpenAI and other scrapers, they may be spending time making their scrapers stealthy and harder to distinguish from normal client requests. But aside from the "legit" companies, one of the most lucrative use cases for LLM-generated slop is spam/con/fraud, which is primarily run by global organized crime groups. So maybe the more established operators are kicking out the obvious criminal groups, and those groups want their own infrastructure with fewer ethical constraints than OpenAI and are using their botnet fleet to build it. Running "pig butchering" scams at higher scale using LLMs rather than enslaved humans, since human trafficking draws attention from the authorities.
@DanielRThomas @corbet I just saw an article about something like this called Nepenthes: https://www.404media.co/developer-creates-infinite-maze-to-trap-ai-crawlers-in/ I'm not sure how well it'll work here, though, because each crawler source is only used for a handful of requests before moving on, so it's hard to identify a bot on so few requests in order to serve it a tarpit that doesn't trap legit users. Maybe combine it with something like @joeyh proposed, where a page with bogus honeypot links is served as an interstitial in place of the less common documents, with an auto-refresh to the legit document. The scraper then starts seeding its queue of requests with links that you know are bogus, so those links become an instant trap for future requests. Ideally a normal client would just follow the meta-refresh and the user-agent wouldn't try to pre-load the booby-trapped URLs, but this would require some validation, and for the LWN crowd that might include user-agents such as elinks, lynx and dillo, not just Firefox and Chrome.
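One way to sketch that combination: the interstitial carries hidden links whose URLs are HMAC-signed, so a later request for one of them proves the client scraped the page rather than following the refresh. Flask again, and the paths, secret and handlers are all made up for illustration:

```python
import hashlib
import hmac

from flask import Flask, abort, request

app = Flask(__name__)
SECRET = b"change-me"  # hypothetical signing key

def trap_link(seed: str) -> str:
    """Build a bogus but verifiable URL to seed a scraper's request queue."""
    tag = hmac.new(SECRET, seed.encode(), hashlib.sha256).hexdigest()[:16]
    return f"/docs/{seed}-{tag}"  # the /docs/ prefix is purely illustrative

@app.route("/interstitial/<name>")
def interstitial(name):
    # Hidden honeypot links plus an immediate refresh to the real document.
    links = "".join(f'<a href="{trap_link(f"{name}{i}")}">more</a>' for i in range(3))
    return (f'<html><head><meta http-equiv="refresh" content="0; url=/real/{name}">'
            f'</head><body hidden>{links}</body></html>')

@app.route("/docs/<name>")
def docs(name):
    # A request for a signed trap link can only come from something that
    # scraped the hidden links instead of following the meta-refresh.
    seed, _, tag = name.rpartition("-")
    expected = hmac.new(SECRET, seed.encode(), hashlib.sha256).hexdigest()[:16]
    if seed and hmac.compare_digest(tag, expected):
        app.logger.warning("trap link hit from %s", request.remote_addr)
        abort(403)
    abort(404)  # placeholder: nothing real lives under this prefix
```

A real deployment would feed the logged addresses into whatever blocking is already in place rather than just returning 403.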
@tristan957 @corbet sharing log analysis, blocked IP ranges and techniques with other FOSS site operators so that they all can benefit and not have to figure this out on their own is a great idea. Maybe a spamhaus RBL-like service could be created as a shared resource for participating members.
Tangent: it's kind of funny that, given how much time and effort has been spent over the last couple of decades on high-volume, low-latency, scalable distributed key/value database design at big tech companies, it turns out that (within some constraints) *DNS* is the best key/value store out there.
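That is essentially how the existing DNSBLs already work: a lookup is just a reversed-octet query against a list zone. A tiny sketch, with the zone name being a placeholder for whatever shared list got set up:

```python
import socket

def listed(ip: str, zone: str = "bl.example.org") -> bool:
    """DNSBL-style lookup: reverse the octets and query <reversed>.<zone>.

    Any DNS answer (conventionally in 127.0.0.0/8) means "listed";
    NXDOMAIN means not listed.
    """
    query = ".".join(reversed(ip.split("."))) + "." + zone
    try:
        socket.gethostbyname(query)
        return True
    except socket.gaierror:
        return False

# e.g. listed("192.0.2.1") queries 1.2.0.192.bl.example.org
```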
@raven667 @corbet @joeyh Yes something like that, or even just putting links on pages with robots.txt set such that well behaved crawlers ignore them. Longer list of options here: https://tldr.nettime.org/@asrg/113867412641585520