Conversation

K. Ryabitsev 🍁

I know I haven't been able to work on b4 and other tooling as much as I was hoping, but between the Equinix exodus, having to continuously mitigate against LLM bot DDoS'ing our infra, and just general geopolitical sh*t that lives rent-free in my head... it's been difficult. But I have high hopes and lots of good ideas -- that's got to count for something, right?
FYI, Drew isn't making it up in this article. At any given time, if you check what I'm doing, chances are I'm trying to figure out ways to deal with bots.

https://drewdevault.com/2025/03/17/2025-03-17-Stop-externalizing-your-costs-on-me.html

@monsieuricon have you tried routing known bots to a garbage generator to contain them?

To my surprise, doing that had a massive impact on my infra, in a good way. Rate-limiting known bots and serving them a static page didn't work; they came back in disguise. Serving them small amounts of cheaply generated garbage in the form of an infinite maze seems to have satisfied them.

It doesn't catch all of them, but my backend suddenly sees a lot less traffic, about 4-5 million fewer requests a day - and I'm just a nobody. That's 99% of my traffic that now never reaches the backend.

The perpetrators are the same, though, so maybe something similar might help you, too?
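For anyone curious what that can look like, here is a minimal sketch of such an infinite maze using only Python's standard library; the wordlist, the /maze/ path, the link count, and the port are made up, and a real setup would sit behind whatever rule matches the bots in the first place:

```python
# Minimal sketch of an infinite garbage maze (standard library only).
# Every path deterministically produces cheap filler text plus links
# deeper into the maze, so crawlers never run out of pages to fetch.
import hashlib
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

WORDS = ["kernel", "patch", "series", "thread", "branch", "merge", "queue", "lore"]

class MazeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Seed the RNG from the path so each URL is stable but looks unique.
        rng = random.Random(hashlib.sha256(self.path.encode()).digest())
        filler = " ".join(rng.choice(WORDS) for _ in range(200))
        links = " ".join(
            f'<a href="/maze/{rng.getrandbits(64):x}">{rng.choice(WORDS)}</a>'
            for _ in range(10)
        )
        body = f"<html><body><p>{filler}</p><p>{links}</p></body></html>".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # don't let the tarpit flood your own logs

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), MazeHandler).serve_forever()
```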

@monsieuricon can these methods be made public, or could that lead to the assholes becoming bigger assholes?

@algernon The gist of the problem is that it is impossible to identify "known bots." Yeah, there's a subset of requests that clearly identify themselves as "LLMWhatnotBot 1.x", but if you read Drew's article, the vast majority of traffic is one or two requests from random IPs with generic browser user-agents. There is no reliable way of telling them apart from legitimate requests. The only viable solution is to put everything behind CloudFlare or Fastly or Akamai and let them protect you against bot traffic, but *that is not a win*. That's capitulating and admitting that the open web has failed.

@monsieuricon But our investors and shareholders! You can't block us, think about our profit.

On a serious note: Is it possible to identify who the offending companies are? And somehow take the war to their turf?

@monsieuricon I read it, and that was my experience too (unidentifiable agents from random IPs) while I was trying to block them or serve them static content. The moment I started serving known agents infinite garbage, the unidentifiable random IPs pretty much disappeared. That was surprising, but it's been like that for two months straight now.

(I did have to special-case Alibaba, because they never identified themselves, so I match them by range, which is a bit unfortunate but sadly necessary.)
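A range match of that sort can be as simple as the sketch below; the CIDR blocks are documentation placeholders, not Alibaba's actual allocations:

```python
# Match a crawler by network range when it never identifies itself.
# These CIDRs are placeholders (TEST-NET ranges), not real allocations.
import ipaddress

BLOCKED_NETS = [
    ipaddress.ip_network(n) for n in ("198.51.100.0/24", "203.0.113.0/24")
]

def is_blocked(remote_addr: str) -> bool:
    addr = ipaddress.ip_address(remote_addr)
    return any(addr in net for net in BLOCKED_NETS)
```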

@mariusor Everything that used to work no longer does. 🤷 First, we rate-limited by IP, but they switched to using public cloud farms. Next, we banned based on user-agent, but they started using a generic user-agent. Then, we started banning the same user-agent once it exceeded a number of requests, but that never really worked very well, and they switched to varied user-agents. Next, we started banning whole subnets and ASNs, but they switched to using residential IPs.

This is where we are now -- bots descend on your public resource from tens of thousands of IPs from all over the world, with reasonably recent, varied user-agents, with any one IP sending no more than 1-2 requests. It's clearly all bot traffic, because there's clearly nobody who is going to be suddenly interested in random commits from 5 years ago, or in random conversations on linux-fsdevel from 9 years ago, but it's impossible to turn this logic into a reliable "no, you are a bot, go away" action without turning to fronting services or various anti-bot captchas.
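For reference, the first step in that sequence amounts to roughly the sliding-window check below, keyed on client IP; the window and threshold are made-up numbers, and as described above it stops working once the traffic is spread across tens of thousands of addresses sending 1-2 requests each:

```python
# Naive per-IP sliding-window rate limiter -- the first mitigation step,
# defeated as soon as each IP only ever sends one or two requests.
import time
from collections import defaultdict, deque

WINDOW = 60.0  # seconds
LIMIT = 30     # requests allowed per IP per window

_hits = defaultdict(deque)

def allow(ip):
    now = time.monotonic()
    q = _hits[ip]
    while q and now - q[0] > WINDOW:
        q.popleft()
    if len(q) >= LIMIT:
        return False
    q.append(now)
    return True
```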
@rails There is not. There is, in fact, no reliable way to tell legitimate requests from bot traffic if you're only looking at logs or packets. The only way to reliably tell is by getting yourself into the page-rendering client. E.g. this is what happens when you get CloudFlare's "prove you're not a bot" screen -- they use javascript to collect information about your browser and to watch the pointer behaviour to figure out whether you're a bot or not (plus the massive amounts of data they hold internally on your IP address).

@monsieuricon @rails would some combination of a proof-of-work challenge like Xe’s Anubis and, for other protocols (or for legitimate clients not using javascript), requiring authentication be an acceptable tradeoff? I know it’s making normal users pay the price for the bot traffic, but maybe it’s only until it dies down?
(I have no technical expertise in the matter, so it may not make any sense)
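The proof-of-work idea, in rough outline, looks like the sketch below -- this is the generic scheme, not Anubis's actual protocol: the server hands out a random challenge and only serves the page once the client has spent CPU finding a nonce whose hash has enough leading zero bits.

```python
# Generic proof-of-work sketch: the client must find a nonce such that
# sha256(challenge + nonce) starts with DIFFICULTY_BITS zero bits.
import hashlib
import os

DIFFICULTY_BITS = 20  # arbitrary; tune so solving takes a second or two

def make_challenge():
    return os.urandom(16).hex()

def valid(challenge, nonce, bits=DIFFICULTY_BITS):
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - bits) == 0

def solve(challenge, bits=DIFFICULTY_BITS):
    nonce = 0
    while not valid(challenge, nonce, bits):
        nonce += 1
    return nonce
```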

@esgariot @rails Yes, it would work, but would it be an acceptable trade-off? That's not clear. Right now, I'm leaning towards setting up separate, authentication-required duplicates of some services that I can give to maintainers and developers, but that, again, is capitulating and admitting that the open web has failed.

@monsieuricon @mariusor There is always the option to poison your pages with links invisible to users that would lead crawlers into an infinite amount of random BS, no?
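A sketch of what such an invisible link could look like, generated server-side; the CSS offscreen trick and the /maze/ path are illustrative assumptions, not anyone's actual setup:

```python
# Generate a link that sighted users and screen readers won't notice,
# but that crawlers following raw hrefs will happily descend into.
import secrets

def hidden_maze_link():
    token = secrets.token_hex(8)
    return (
        f'<a href="/maze/{token}" '
        'style="position:absolute;left:-9999px" '
        'aria-hidden="true" tabindex="-1">archive</a>'
    )
```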

@monsieuricon @rails @esgariot At some point it feels like it will become admin-time effective to switch from an IP blocklist to an allowlist. Though, yes, that still gives up on the open web. :(

@monsieuricon Same here, really. Fortunately, blocking Chrome and Firefox user agents that are years out of date seems to have addressed the problem for MacPorts for now, but who knows how long this will actually last.

Unfortunately this affects a few legitimate users on legacy systems, too, but that's a price I'm afraid I have to pay now.
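Presumably that boils down to something like the version check below; the cutoff numbers are placeholders rather than MacPorts' actual policy:

```python
# Reject browser user agents whose major version is years out of date.
# Cutoff numbers are illustrative placeholders.
import re

MIN_MAJOR = {"Chrome": 110, "Firefox": 110}

def is_ancient(user_agent):
    for browser, min_major in MIN_MAJOR.items():
        m = re.search(rf"{browser}/(\d+)", user_agent)
        if m and int(m.group(1)) < min_major:
            return True
    return False
```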

@monsieuricon @algernon what if you just add links on every page to slow them down? I have an instance of Nepenthes that has served over 1M pages so far this month from a single link on a personal GH page with literally no traffic (and a robots.txt that disallows crawling).

Even if they still crawl the real pages, as long as they fall for the tarpit URLs they'll waste far more time there, not counting the satisfaction of polluting their model.
