One thing that I really do not understand about the AI scraperbot plague is why these campaigns are so disruptive.
Why do the people running these damn things feel that they need to slurp down every resource with so much urgency that it is effectively DDoS'ing the sites they're targeting?
I don't love feeding AI training sets to begin with, but if it weren't hammering sites to the point of non-responsiveness it would be tolerable-ish. I mean, I surely wouldn't care if someone was scraping LWN or my sites gently to do academic research or something so long as it wasn't affecting anything.
@jzb "With both the Unlocker API and the SERP API, you are charged only for successful requests to your target domain, ensuring cost efficiency and maximum value for every API interaction." https://docs.brightdata.com/scraping-automation/web-unlocker/introduction
They have an incentive to launch multiple requests for the same resource from different IP addresses to maximize their chances of getting paid. Extra requests are free to the crawler because they got someone to "consent" to the use of their device and residential connection.
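If that incentive is real, it should show up in access logs as the same path being fetched from several different addresses within a short span. A rough Python sketch of that check follows. The function name, the `(timestamp, ip, path)` tuple layout, and the `window` and `min_ips` thresholds are all assumptions for illustration, not anything Bright Data or any log format defines. It is only a heuristic and cannot prove the requests come from one crawler.

```python
from collections import defaultdict


def duplicate_fetches(requests, window=60.0, min_ips=3):
    """Flag paths fetched by many distinct IPs within `window` seconds.

    `requests` is an iterable of (timestamp_seconds, ip, path) tuples
    (a hypothetical, pre-parsed form of an access log).
    Returns {path: peak number of distinct IPs seen in any window}.
    """
    by_path = defaultdict(list)
    for ts, ip, path in requests:
        by_path[path].append((ts, ip))

    flagged = {}
    for path, hits in by_path.items():
        hits.sort()
        counts = defaultdict(int)  # ip -> hits currently inside the window
        left = 0
        best = 0
        for ts, ip in hits:
            counts[ip] += 1
            # Drop hits that have fallen out of the trailing window.
            while hits[left][0] < ts - window:
                old_ip = hits[left][1]
                counts[old_ip] -= 1
                if counts[old_ip] == 0:
                    del counts[old_ip]
                left += 1
            best = max(best, len(counts))
        if best >= min_ips:
            flagged[path] = best
    return flagged
```

Feeding it parsed log entries gives a short list of paths worth a closer look. Popular pages will also trip it, so the thresholds need tuning per site.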
@dmarti I do not have any kind words for Bright Data. Not a one.
@jzb Linux article idea: how to block Bright Data's C&C on the router (in case people's family members or guests install some software with the Bright Data SDK)
(I'm not qualified to write this but would read it. Eventually the residential ISPs who are giving free bandwidth to Bright Data will copy it for the routers they distribute, too)
@jzb Not only that, but they do it repeatedly. Sometimes daily. And that’s for my tiny, irrelevant, personal site, which I last updated 2 years ago.
Maybe it is time to fight back in creative ways. If all you have as a resource is mindshare in the geek community, let's try to use that.
For example, a Kickstarter/Open Collective crowdfunding campaign to fund an investigative-style article on Bright Data, which would then be published on LWN?
I have no idea how much it would cost, but I would donate to the cause. And I believe many others would.
@corbet @jzb Yes. The surveillance marketers are also using your IP address to profile you, so you would want to generate a traffic profile that shows your station in life is high enough to be entitled to the good deals, and that you are young and hot enough not to be targeted for the Boomer scams but old enough not to get the gambling and BNPL schemes the kids get.
@jzb I wonder if they're making a connection successfully, then DDoS'ing the site to "jam the radio waves" against others. If they were, that could probably be found by logging TCP connection start/end times by source address.
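A minimal Python sketch of what that logging could feed into, assuming you already have connection records with a source address and open/close timestamps (from conntrack, a firewall log, or similar). The `Conn` record and `summarize_by_source` function are hypothetical names for illustration. It reports, per source, how many connections were made, the peak number open at once, and the total time held open.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Conn:
    src: str      # source IP address
    start: float  # connection open time, in seconds
    end: float    # connection close time, in seconds


def summarize_by_source(conns):
    """Return {src: (connection_count, peak_concurrent, total_seconds)}."""
    by_src = defaultdict(list)
    for c in conns:
        by_src[c.src].append(c)

    summary = {}
    for src, items in by_src.items():
        # Sweep over open/close events to find the peak number of
        # simultaneously open connections from this source.
        events = []
        for c in items:
            events.append((c.start, 1))
            events.append((c.end, -1))
        # At equal timestamps, closes (-1) sort before opens (+1), so
        # back-to-back connections are not counted as overlapping.
        events.sort()
        open_now = peak = 0
        for _, delta in events:
            open_now += delta
            peak = max(peak, open_now)
        total = sum(c.end - c.start for c in items)
        summary[src] = (len(items), peak, total)
    return summary
```

A source that gets one clean fetch and then holds many overlapping connections open would stand out in the second and third fields. Whether that pattern actually exists is the open question here.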