One thing that I really do not understand about the AI scraperbot plague is why these campaigns are so disruptive.

Why do the people running these damn things feel that they need to slurp down every resource with so much urgency that it is effectively DDoS'ing the sites they're targeting?

I don't love feeding AI training sets to begin with, but if it weren't hammering sites to the point of non-responsiveness it would be tolerable-ish. I mean, I surely wouldn't care if someone was scraping LWN or my sites gently to do academic research or something so long as it wasn't affecting anything.

@jzb "With both the Unlocker API and the SERP API, you are charged only for successful requests to your target domain, ensuring cost efficiency and maximum value for every API interaction." https://docs.brightdata.com/scraping-automation/web-unlocker/introduction

They have an incentive to launch multiple requests for the same resource from different IP addresses to maximize chances of getting paid. Extra requests are free to the crawler because they got someone to "consent" to the use of their device and residential connection
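
A rough way to see that incentive, as a sketch only: assume each attempt succeeds independently with probability `p` (a made-up number, not anything from Bright Data's docs), and that failed attempts cost the crawler nothing. Then the chance that at least one of `n` parallel attempts is billable is `1 - (1 - p)**n`, which climbs quickly with `n`.

```python
def chance_of_success(p: float, n: int) -> float:
    """Probability that at least one of n independent attempts succeeds.

    p is a hypothetical per-attempt success rate; independence between
    attempts is an assumption made purely for illustration.
    """
    return 1 - (1 - p) ** n


if __name__ == "__main__":
    # Hypothetical per-attempt success rate of 50%.
    for n in (1, 2, 4, 8):
        print(n, round(chance_of_success(0.5, n), 4))
```

If the extra attempts really are free to the sender, nothing in this arithmetic pushes `n` back down; the cost lands on the target site instead.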

@dmarti I do not have any kind words for Bright Data. Not a one.

@jzb Linux article idea: how to block Bright Data's C&C on the router (in case people's family members or guests install some software with the Bright Data SDK)

(I'm not qualified to write this but would read it. Eventually the residential ISPs who are giving free bandwidth to Bright Data will copy it for the routers they distribute, too)

@dmarti @jzb That is indeed an interesting thought. Of course, there's more than just Bright Data out there... Another idea might be an app you could put on a phone that would tell you how much your device is being used to attack others.

@jzb Not only that, but they do it repeatedly. Sometimes daily. And that’s for my tiny, irrelevant, personal site, which I last updated 2 years ago.

@dmarti @jzb

Maybe it is time to fight back in creative ways. If all you have as a resource is mindshare in the geek community, let's try to use that.

For example, a Kickstarter/Open Collective crowdfunding campaign to fund an investigative-style article on Bright Data, which would then be published on LWN?

I have no idea how much it could cost, but I would donate to the cause. And I believe many others would too.

@corbet @jzb Yes, also the surveillance marketers are using your IP address to profile you, so you would tend to want to generate a traffic profile that shows your station in life is high enough to be entitled to the good deals—and you are young and hot enough not to be targeted for the Boomer scams, but old enough not to get the gambling and BNPL schemes the kids get

@jzb I wonder if they're making a connection successfully, then DDoSing the site to "jam the radio waves" against others. If they were, that could probably be found by logging TCP connection start/end times by source address.
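
A minimal sketch of what that analysis could look like, assuming you already have connection records as `(source, start, end)` tuples (a hypothetical format; how you get them out of your server or firewall logs is up to you). It reports the peak number of simultaneously open connections per source address. A high peak would not prove the jamming theory, only flag sources worth a closer look.

```python
from collections import defaultdict


def peak_concurrency(records):
    """Return {source: peak number of simultaneously open connections}.

    records: iterable of (source, start, end) tuples with numeric
    timestamps. This record layout is an assumption for illustration.
    """
    events = defaultdict(list)
    for source, start, end in records:
        events[source].append((start, 1))   # connection opened
        events[source].append((end, -1))    # connection closed

    peaks = {}
    for source, evs in events.items():
        # Sort by time; at equal timestamps, closes (-1) sort before opens (+1),
        # so back-to-back connections are not counted as overlapping.
        evs.sort()
        current = peak = 0
        for _, delta in evs:
            current += delta
            peak = max(peak, current)
        peaks[source] = peak
    return peaks


# Made-up sample data using documentation-reserved addresses.
sample = [
    ("203.0.113.5", 0.0, 10.0),
    ("203.0.113.5", 1.0, 9.0),
    ("203.0.113.5", 2.0, 3.0),
    ("198.51.100.7", 0.0, 1.0),
    ("198.51.100.7", 1.0, 2.0),
]

if __name__ == "__main__":
    print(peak_concurrency(sample))
```

With residential proxies the same campaign is spread over many addresses, so grouping by source alone will likely undercount it; grouping by requested URL or time window might be a useful second pass.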

@bookwar @jzb I'm going to SCALE in March and will make a point of checking out their OpenWrt setup. If there is a documented way to do AI scraper blocking in OpenWrt, it could make its way to other routers (ISPs could save significant $ by giving away less bandwidth)

https://github.com/socallinuxexpo/scale-network
