Long story short: my VPS, which forwards traffic to my home servers over Tailscale, got hammered by thousands of requests per minute from Anthropic’s Claude AI, all of them coming from different AWS IPs.
The VPS has a 1TB monthly cap, but it’s still kinda shitty to get huge spikes like today’s 13GB in just a couple of minutes.
How do you deal with something like this?
I’m only really running a Caddy reverse proxy on the VPS, which forwards my home server’s services through Tailscale.
I’d really like to avoid solutions like Cloudflare, since they f over CGNAT users very frequently and all that. I don’t think a WAF would help with this at all(?), but rate limiting on the reverse proxy might work (rough sketch below).
(VPS has fail2ban and I’m using /etc/hosts.deny for manual blocking. There’s a WIP website on my root domain with robots.txt that should be denying AWS bots as well…)
I’m still learning and would really appreciate any suggestions.
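The rough sketch I have in mind for the Caddy side (assuming the scrapers keep sending an identifiable User-Agent like ClaudeBot, that the rate_limit directive comes from the third-party mholt/caddy-ratelimit module rather than stock Caddy, and that 100.64.0.1:8080 stands in for the home server’s Tailscale address):

```
example.com {
	# Known AI crawler user agents get an outright 403 (only works while they identify themselves).
	@aibots header_regexp ai User-Agent (?i)(claudebot|gptbot|ccbot|bytespider)
	respond @aibots 403

	# Per-IP rate limit via the third-party mholt/caddy-ratelimit module
	# (built with: xcaddy build --with github.com/mholt/caddy-ratelimit).
	rate_limit {
		zone per_ip {
			key    {remote_host}
			events 100
			window 1m
		}
	}

	# Everything else is forwarded to the home server over Tailscale as before.
	reverse_proxy 100.64.0.1:8080
}
```

Exact plugin syntax may differ by version, so treat this as a starting point rather than something to paste in.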
Try crowdsec.
You can set it up with lists that are updated frequently, have it look at the Caddy proxy logs, and it can easily block AI/bot-like traffic.
I have it blocking over 100k IPs at the moment.
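For the Caddy part, it’s mostly one acquisition file plus the Caddy collection. A rough sketch, assuming Caddy writes its access logs under /var/log/caddy/ (the paths and the collection name are from memory; check `cscli collections list -a`):

```yaml
# /etc/crowdsec/acquis.d/caddy.yaml
filenames:
  - /var/log/caddy/*.log
labels:
  type: caddy
```

Then something like `cscli collections install crowdsecurity/caddy` pulls in the matching parser and the generic HTTP scenarios.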
Not gonna lie, the $3900/mo at the top of the /pricing page is pretty wild.
Searched “crowdsec docker” and they have docs and all that. Thank you very much, I’ve heard of crowdsec before but never paid much attention to it; I absolutely will check this out!
The paid plans get you the “premium” blocklists, which include one made specifically to stop AI scrapers, but a free account will still get you the actual software, the community blocklist, plus up to three “basic” lists.
And the community blocklists are updated when more than a couple of instances of CrowdSec (I think the number is something like 10-50) block an IP within a short timeframe.
The AI blocklist adds an IP as soon as even one instance spots an AI scraper right from the user agent.
So even if the community blocklist has fewer AI IPs, it does eventually include them.
Which CrowdSec blocklists are you using?
I’m using the default list alongside the FireHOL BotScout list and the FireHOL CyberCrime Tracker list, both set to ban.
I’m also using the FireHOL cruzit.com list set to serve a captcha, just in case it’s not actually a bot.
On top of that I’m using the cs-firewall-bouncer, plus a custom bouncer from CrowdSec’s tutorials that detects privilege escalation, in case anybody actually manages to get inside.
Alongside that I’m using a lot of scenario collections for the specific software I run (Nextcloud, Grafana, SSH, …), which helps a lot with attacks aimed directly at a service rather than just general scraping or bot path traversal.
All free, and I’ve been using it for a year. The only complaint I have is that I had to make a cron job to restart the CrowdSec service every day, because it would stop working after a couple of days due to the amount of requests it has to process.
Crowdsec has default scenarios and lists that might block a lot of it, and you can pretty easily make a custom scenario to block IPs that cause large spikes of traffic to your applications if needed.
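Something along these lines, as a very rough sketch; the name and thresholds are made up, and the filter fields depend on how your parser labels HTTP events:

```yaml
# hypothetical /etc/crowdsec/scenarios/custom-http-burst.yaml
type: leaky
name: custom/http-burst
description: "Ban IPs that produce a large burst of HTTP requests"
filter: "evt.Meta.service == 'http'"   # adjust to whatever your parser actually sets
groupby: evt.Meta.source_ip
capacity: 200        # requests the bucket can hold before it overflows
leakspeed: 10s       # how quickly the bucket drains
blackhole: 5m        # ignore the same IP for a while after it triggers
labels:
  remediation: true  # tells the bouncers to actually block the IP
```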
I’m struggling to find it, but there’s something like an “AI tarpit” that causes scrapers to get stuck? I’m sure I saw it posted on Lemmy recently; hopefully someone can link it.
I did find this GitHub link as the first search result, which looks interesting. Thanks for letting me know the term “tar pit”.
there is also https://forge.hackers.town/hackers.town/nepenthes
Now I just want to host a web page and expose it with nepenthes…
First, because I’m a big fan of carnivorous plants.
Second, because it lets you poison LLMs and AIs, and fuck with their data.
Lastly, because I can do my part and say F#CK Y0U to those privacy- and data-hungry a$$holes!
I don’t even expose anything directly to the web (everything is only accessible through a tunnel like WireGuard), and I don’t have any important data to protect from AI or LLMs. But just getting the opportunity to fuck with them while they continuously harvest data from everyone is something I was already thinking of, I just didn’t know how.
Thanks for the link!
If you’re looking to stop them from wasting your traffic, do not use a tarpit. The whole point of it is that it makes the scraper get stuck on your server forever. That means you pay for the traffic the scraper uses, and it will continually rack up those charges until the people running it wise up and ban your server. The question you gotta ask yourself is, who has more money, you or the massive AI corp?
Tarpits are the dumbest bit of anti-AI tech to come out yet.
There’s more than one style of tar pit. In this case you obviously wouldn’t want to use an endless maze style.
What you want to do in this case is send them through an HAProxy instance that routes on user agent: whenever they come in as Claude, you send them over to a box running a WANem process at modem speeds.
They’ll immediately realize they’ve got a hug of death going on and give up.
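Roughly like this, as a sketch only; the backend names and addresses are made up, TLS termination is left out, and it assumes Claude’s crawler keeps reporting itself as ClaudeBot in the User-Agent header:

```
frontend web_in
    bind *:80
    # Anything identifying itself as Claude's crawler goes to the slow lane.
    acl is_claude hdr_sub(User-Agent) -i claudebot
    use_backend slow_lane if is_claude
    default_backend real_services

backend real_services
    server home 100.64.0.1:8080

backend slow_lane
    # Points at the throttled box (e.g. a WANem instance shaping traffic to modem speeds).
    server throttled 127.0.0.1:8081
```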
If you had read the OP, they don’t want the scrapers using up all their traffic.
Yes, I did read the OP.
Edit: I see this was downvoted without a response, but I’ll put this out there anyway.
If you host a public site, which you expect anyone can access, there is very little you can do to exclude an AI scraper specifically.
Hosting your own site for personal use? IP blocks etc will prevent scraping.
But how do you distinguish legitimate users from scrapers? It’s very difficult.
They will use your traffic up either way. Don’t want that? You could waste their time (tarpit), or take your hosting away from public access.
Downvoter: what’s your alternative?
I guess sending tar bombs can be fun
Go on.
You first pick them up.
Then you throw them.
Classic!
instructions unclear, tsar bomba away
Honestly we need some sort of proof of work (PoW)
This is the most realistic solution. Adding a 0.5/1s PoW to hosted services isn’t gonna be a big deal for the end user, but offers a tiny bit of protection against bots, especially if the work factor is variable and escalates.
It’s also practical for bots: it forces people not to abuse resources.
There are a lot of cryptocurrencies that use an increasing PoW work factor to combat spam. Nano is one of them, so it’s pretty proven technology, too.
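For anyone curious what the mechanics look like, here’s a minimal hash-based PoW sketch in Python (the difficulty, nonce size, and choice of SHA-256 are arbitrary; real schemes like Hashcash or Nano’s tune these much more carefully):

```python
import hashlib
import os

def make_challenge() -> bytes:
    """Server side: hand the client a random challenge to work on."""
    return os.urandom(16)

def solve(challenge: bytes, difficulty_bits: int) -> int:
    """Client side: brute-force a nonce until SHA-256(challenge + nonce)
    falls below the target. Cost roughly doubles per extra difficulty bit."""
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce
        nonce += 1

def verify(challenge: bytes, difficulty_bits: int, nonce: int) -> bool:
    """Server side: a single hash to check the client's work."""
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))

if __name__ == "__main__":
    challenge = make_challenge()
    nonce = solve(challenge, difficulty_bits=20)  # around a second of work on one core
    print(verify(challenge, 20, nonce))           # True
```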
You said, “I’m only really running a Caddy reverse proxy on the VPS, which forwards my home server’s services through Tailscale.”
It seems then that you are using a Tailscale Funnel to expose your services to the public web. Is this the case? I ask because the basic premise of Tailscale is that you have to be logged into your Tailscale network to access the services; if you are not logged in, the site you try to access won’t even appear to exist, unless it’s set up via a Funnel.
Assuming then that you did set up a Funnel, you are now 100% exposed to the WWW. AI bots, and bots in general, crawl the WWW daily, and eventually your site will be found. You have a few choices here: rely on a Web Application Firewall (WAF) such as BunkerWeb, which would replace Caddy but would provide a decent firewall of sorts, or use something like ConfigServer Firewall, though I’m not sure if it has AI bot protection; the last time I used it was before AI was a thing.
Build tar pits.
They want to reduce the bandwidth usage. Not increase it!
A good tar pit will reduce your bandwidth. Tarpits aren’t about shoving useless data at bots; they’re about responding as slowly as possible, to keep the bot connected for as long as possible while giving it nothing.
Endlessh accepts the connection and then… does nothing. It never even gets as far as the SSH handshake. It just very… slowly… sends… an endless preamble, until the bot gives up.
As I write, my Internet-facing SSH tarpit currently has 27 clients trapped in it. A few of these have been connected for weeks. In one particular spike it had 1,378 clients trapped at once, lasting about 20 hours.
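The whole idea fits in a few lines. A rough Python sketch of the same trick (port and delay are arbitrary; Endlessh itself is written in C and handles thousands of stuck clients far more efficiently):

```python
import asyncio
import random
import string

async def tarpit(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    # SSH clients wait for the server's version banner. RFC 4253 allows arbitrary
    # lines before it, so we simply never get around to sending the real one.
    try:
        while True:
            line = "".join(random.choices(string.ascii_lowercase, k=16)) + "\r\n"
            writer.write(line.encode())
            await writer.drain()
            await asyncio.sleep(10)  # a trickle of bytes; mostly the bot just waits
    except (ConnectionResetError, BrokenPipeError):
        pass
    finally:
        writer.close()

async def main() -> None:
    server = await asyncio.start_server(tarpit, "0.0.0.0", 2222)  # arbitrary port
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```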
this might not be what you meant, but the word “tar” made me think of tar.gz. Don’t most sites compress the HTTP response body with gzip? What’s to stop you from sending a zip bomb over the network?
Even if that were possible, I don’t want to crash innocent people’s browsers. My tar pits are deployed on live environments that normal users could find themselves navigating to, and it’s overkill anyway: if you simply respond to 404 Not Found with 200 OK and serve 15MB on the “error” page, bots will stop going to your site, because you’re not important enough to deal with. It’s a low bar, but your data isn’t worth someone looking at your tactics and even thinking about circumventing them. They just stop attacking you.
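For illustration, a toy version of that trick (the ~15MB size and Python’s http.server are arbitrary; in practice you’d wire this into the reverse proxy rather than run a separate server):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

JUNK = b"<!-- nothing to see here -->" * 550_000  # about 15 MB of filler

class DecoyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Every unknown path "exists": a 200 OK with a large, useless body instead of a 404.
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(JUNK)))
        self.end_headers()
        self.wfile.write(JUNK)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), DecoyHandler).serve_forever()
```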