I'm having trouble figuring out what kind of botnet has been hammering our web servers over the past week. Requests come in from tens of thousands of addresses, just once or twice each (so they aren't getting blocked by fail2ban), with different browser strings (Chrome versions ranging from 24.0.1292.0 to 108.0.5163.147) and ridiculous cobbled-together paths like /about-us/1-2-3-to-the-zoo/the-tiny-seed/10-little-rubber-ducks/1-2-3-to-the-zoo/the-tiny-seed/the-nonsense-show/slowly-slowly-slowly-said-the-sloth/the-boastful-fisherman/the-boastful-fisherman/brown-bear-brown-bear-what-do-you-see/the-boastful-fisherman/brown-bear-brown-bear-what-do-you-see/brown-bear-brown-bear-what-do-you-see/pancakes-pancakes/pancakes-pancakes/the-tiny-seed/pancakes-pancakes/pancakes-pancakes/slowly-slowly-slowly-said-the-sloth/the-tiny-seed
(I just put together a bunch of Eric Carle titles as an example. The actual paths are pasted together from valid paths on our server but in invalid order, with as many as 32 subdirectories.)
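Since per-address rate limiting doesn't catch one or two requests per IP, one option is to look at the shape of the path instead. Here is a rough Python sketch of that idea, not something anyone in this thread is running: it flags a request path when it is unusually deep or when the same segment keeps recurring. The function names and both thresholds (`max_depth=8`, `max_repeats=2`) are assumptions picked for illustration and would need tuning against a site's real URL structure.

```python
from collections import Counter


def path_segments(url_path: str) -> list[str]:
    """Split a request path into non-empty segments, ignoring any query string."""
    path_only = url_path.split("?", 1)[0]
    return [seg for seg in path_only.split("/") if seg]


def looks_cobbled(url_path: str, max_depth: int = 8, max_repeats: int = 2) -> bool:
    """Heuristic: True if the path is too deep or a segment recurs too often.

    Both thresholds are illustrative assumptions, not measured values.
    """
    segs = path_segments(url_path)
    if len(segs) > max_depth:
        return True
    # Highest number of times any single segment appears (0 for an empty path).
    most_repeats = max(Counter(segs).values(), default=0)
    return most_repeats > max_repeats


if __name__ == "__main__":
    sample = "/about-us/the-tiny-seed/pancakes-pancakes/the-tiny-seed/the-tiny-seed"
    print(looks_cobbled(sample))          # "the-tiny-seed" recurs three times
    print(looks_cobbled("/about-us/contact"))
```

A check like this only looks at the URL, so it would not depend on the user agent or source address, both of which vary from request to request here.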
Has anyone else been seeing this and do you have an idea what's behind it?
@linuxandyarn AI crawlers. And yes, we see them all. They use embedded libraries in a lot of ad-enabled apps to hide behind domestic IPs. See https://jan.wildeboer.net/2025/02/Blocking-Stealthy-Botnets/ for my experience with them when they are used to brute-force mail servers.
@jwildeboer I wondered, but since they're not being as "friendly" as ClaudeBot or PetalBot by identifying themselves, they've been much harder to manage. I also thought a malicious browser plugin could be involved.
@linuxandyarn They don't give a shit about robots.txt and are hard to identify as they randomly change their browser identification. It's the Wild West out there when it comes to collecting training data for LLMs (Large Language Models), done by third parties that are not directly related to the big AI providers. It's a shady market.
@jwildeboer If they're mobile apps then presumably most of them will be behind CGNAT so even one device on an ASN will likely seem to have multiple IPv4 addresses (e.g., the 4 per ASN you've seen).
Also, that might be why they tend not to use v6: a device would presumably have a stable v6 address for at least a few hours.
@edavies @linuxandyarn Yep. IPv6 still accounts for less than 2% of what my servers classify as abusive behaviour.
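For anyone wanting to check that ratio on their own servers, a minimal Python sketch follows. It assumes you already have a list of source addresses your own tooling flagged as abusive; the function name `ipv6_share` is made up here, and the sample addresses come from the reserved documentation ranges, so they are not real data.

```python
import ipaddress


def ipv6_share(addresses: list[str]) -> float:
    """Return the fraction of the given addresses that are IPv6 (0.0 if empty)."""
    versions = [ipaddress.ip_address(addr).version for addr in addresses]
    if not versions:
        return 0.0
    return sum(1 for v in versions if v == 6) / len(versions)


if __name__ == "__main__":
    # Placeholder addresses from documentation ranges, not real offenders.
    flagged = ["192.0.2.1", "198.51.100.7", "203.0.113.9", "2001:db8::1"]
    print(f"IPv6 share of flagged addresses: {ipv6_share(flagged):.1%}")
```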