@monsieuricon have you tried routing known bots to a garbage generator to contain them?
To my surprise, doing that had a massive impact on my infra, in a good way. Rate limiting known bots or serving them a static page didn't work; they came back in disguise. Serving them small amounts of cheaply generated garbage in the form of an infinite maze seems to have satisfied them.
Doesn't catch all of them, but my backend suddenly sees a lot less traffic, about 4-5 million fewer requests a day - and I'm just a nobody. That's 99% of my traffic, and it never reaches the backend now.
The perpetrators are the same, though, so maybe something similar might help you, too?
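A minimal sketch of what such an infinite-maze garbage endpoint could look like, assuming known bots are routed to it at the proxy level. This is illustrative only, not the setup described above; every name and parameter here is made up:

```python
# Sketch of an "infinite maze" garbage endpoint. Everything is derived from
# the request path, so pages are deterministic, stateless, and cheap to make:
# every URL resolves, and every page links five levels deeper.
import hashlib
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "quantum", "syzygy", "zephyr"]

def page_for(path: str) -> bytes:
    # Seed a PRNG from the path so the same URL always yields the same page.
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    text = " ".join(rng.choices(WORDS, k=200))
    base = path.rstrip("/")
    links = " ".join(
        f"<a href='{base}/{rng.randrange(10**6)}'>more</a>" for _ in range(5)
    )
    return f"<html><body><p>{text}</p><p>{links}</p></body></html>".encode()

class Maze(BaseHTTPRequestHandler):
    def do_GET(self):
        body = page_for(self.path)
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Keep the tarpit quiet; logging millions of bot hits defeats the point.
        pass

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), Maze).serve_forever()
```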
@monsieuricon can these methods be made public, or could that lead to the assholes becoming bigger assholes?
@monsieuricon But our investors and shareholders! You can't block us, think about our profit.
On a serious note: Is it possible to identify who the offending companies are? And somehow take the war to their turf?
@monsieuricon I read it, and that was my experience too (unidentifiable agents from random IPs) while I was trying to block them or serve them static content. The moment I started serving known agents infinite garbage, the unidentifiable random IPs pretty much disappeared. That was surprising, but it's been like that for two months straight now.
(I did have to special-case Alibaba, because they never identified themselves, so I match them by IP range, which is a bit unfortunate, but sadly necessary.)
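The range-matching part could look something like this sketch using Python's ipaddress module; the CIDR blocks below are documentation placeholders, not Alibaba's actual ranges:

```python
# Sketch of matching clients by IP range when they never identify themselves.
# The networks listed are placeholders (RFC 5737 documentation ranges).
import ipaddress

BLOCKED_NETS = [
    ipaddress.ip_network("203.0.113.0/24"),   # placeholder, TEST-NET-3
    ipaddress.ip_network("198.51.100.0/24"),  # placeholder, TEST-NET-2
]

def is_blocked(remote_addr: str) -> bool:
    ip = ipaddress.ip_address(remote_addr)
    return any(ip in net for net in BLOCKED_NETS)

print(is_blocked("203.0.113.42"))  # True -> route to the maze instead of the backend
```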
@monsieuricon @rails would some combination of a proof-of-work challenge like Xe's Anubis and, for other protocols (or for legitimate clients not using JavaScript), requiring authentication be an acceptable tradeoff? I know it's making the normal users pay the price for the bot traffic, but maybe it's just until it dies down?
(I have no technical expertise in the matter, so it may not make any sense)
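For context, the idea behind a proof-of-work challenge like Anubis is roughly the following; this is a generic sketch of the concept, not Anubis's actual protocol. The server hands out a random challenge, and the client must find a nonce whose hash meets a difficulty target before it gets the real content: cheap for one human page view, expensive at crawler scale.

```python
# Generic proof-of-work sketch in the spirit of tools like Anubis.
# Not Anubis's actual protocol; the difficulty is kept low for the demo.
import hashlib
import os

DIFFICULTY_BITS = 16  # required number of leading zero bits in the hash

def make_challenge() -> str:
    # Server side: a fresh random challenge per client/session.
    return os.urandom(16).hex()

def verify(challenge: str, nonce: int) -> bool:
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    value = int.from_bytes(digest, "big")
    return value >> (256 - DIFFICULTY_BITS) == 0

def solve(challenge: str) -> int:
    # Client side: brute-force a nonce until the target is met.
    nonce = 0
    while not verify(challenge, nonce):
        nonce += 1
    return nonce

challenge = make_challenge()
nonce = solve(challenge)
assert verify(challenge, nonce)
print(f"solved challenge {challenge} with nonce {nonce}")
```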
@monsieuricon @mariusor There is always the option to poison your pages with links invisible to users that lead crawlers into an infinite amount of random BS, no?
@monsieuricon @rails @esgariot At some point it feels like it will become more efficient, admin-time-wise, to switch from an IP blocklist to an allowlist. Though, yes, it still gives up on the open web. :(
@monsieuricon Same here, really. Fortunately, blocking Chrome and Firefox user agents that are years out of date seems to have addressed the problem for MacPorts for now, but who knows how long this will actually last.
Unfortunately this affects a few legitimate users on legacy systems, too, but that's a price I'm afraid I have to pay now.
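A rough sketch of what blocking years-out-of-date browser user agents could look like; the version cutoffs here are made up for illustration, not MacPorts' actual thresholds:

```python
# Sketch of rejecting requests whose Chrome/Firefox major version is years
# out of date. The cutoffs are illustrative assumptions, not real policy.
import re

MIN_MAJOR = {"Chrome": 110, "Firefox": 110}  # assumed cutoffs

UA_RE = re.compile(r"(Chrome|Firefox)/(\d+)")

def is_stale_browser(user_agent: str) -> bool:
    m = UA_RE.search(user_agent)
    if not m:
        return False  # unknown agents get handled elsewhere
    browser, major = m.group(1), int(m.group(2))
    return major < MIN_MAJOR[browser]

ua = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
print(is_stale_browser(ua))  # True
```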
@monsieuricon @algernon what if you just add links on every page to slow them down? I have an instance of Nepenthes that has served over 1M pages so far this month from a single link on a personal GH page with literally no traffic (and a robots.txt that disallows crawling).
Even if they still crawl the real pages, as long as they fall for the tarpit URLs they'll waste far more time there, not counting the satisfaction of polluting their model.
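For concreteness, such a hidden tarpit link can be as small as an anchor that real users never see; the /maze/ path below is a placeholder for wherever a Nepenthes-style tarpit is actually mounted:

```python
# Sketch of injecting a hidden tarpit link into every served page.
# "/maze/start" is a placeholder path; real users never see the link,
# link-following crawlers do.
TARPIT_SNIPPET = (
    "<a href='/maze/start' style='display:none' "
    "aria-hidden='true' tabindex='-1'>archive</a>"
)

def with_tarpit_link(html: str) -> str:
    # Naive injection just before </body>; assumes well-formed pages.
    return html.replace("</body>", TARPIT_SNIPPET + "</body>", 1)

print(with_tarpit_link("<html><body><p>hello</p></body></html>"))
```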
@monsieuricon @algernon I suggested that in a thread about the same article...
https://noc.social/@dermoth/114189890006499596