Long story short: my VPS, which forwards traffic to my home servers over Tailscale, got hammered by thousands of requests per minute from Anthropic’s Claude AI, all of them coming from different AWS IPs.
The VPS has a 1TB monthly cap, but it’s still kinda shitty to get huge spikes like today’s 13GB in just a couple of minutes.
How do you deal with something like this?
I’m only really running a Caddy reverse proxy on the VPS, which forwards my home server’s services through Tailscale.
I’d really like to avoid solutions like Cloudflare, since they f over CGNAT users very frequently and all that. I don’t think a WAF would help with this at all(?), but rate limiting on the reverse proxy might work (rough sketch below).
(VPS has fail2ban and I’m using /etc/hosts.deny for manual blocking. There’s a WIP website on my root domain with robots.txt that should be denying AWS bots as well…)
I’m still learning and would really appreciate any suggestions.
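The rough sketch I have in mind for the Caddy side (assuming the scrapers keep sending an identifiable User-Agent like ClaudeBot, that the rate_limit directive comes from the third-party mholt/caddy-ratelimit module rather than stock Caddy, and that 100.64.0.1:8080 stands in for the home server’s Tailscale address):

```
example.com {
	# Known AI crawler user agents get an outright 403 (only works while they identify themselves).
	@aibots header_regexp ai User-Agent (?i)(claudebot|gptbot|ccbot|bytespider)
	respond @aibots 403

	# Per-IP rate limit via the third-party mholt/caddy-ratelimit module
	# (built with: xcaddy build --with github.com/mholt/caddy-ratelimit).
	rate_limit {
		zone per_ip {
			key    {remote_host}
			events 100
			window 1m
		}
	}

	# Everything else is forwarded to the home server over Tailscale as before.
	reverse_proxy 100.64.0.1:8080
}
```

Exact plugin syntax may differ by version, so treat this as a starting point rather than something to paste in.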
Try crowdsec.
You can set it up with lists that are updated frequently, have it look at the Caddy proxy logs, and it can easily block AI/bot-like traffic.
I have it blocking over 100k IPs at the moment.
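For the Caddy part, it’s mostly one acquisition file plus the Caddy collection. A rough sketch, assuming Caddy writes its access logs under /var/log/caddy/ (the paths and the collection name are from memory; check `cscli collections list -a`):

```yaml
# /etc/crowdsec/acquis.d/caddy.yaml
filenames:
  - /var/log/caddy/*.log
labels:
  type: caddy
```

Then something like `cscli collections install crowdsecurity/caddy` pulls in the matching parser and the generic HTTP scenarios.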
Not gonna lie, the $3900/mo at the top of the /pricing page is pretty wild.
Searched “crowdsec docker” and they have docs and all that. Thank you very much, I’ve heard of crowdsec before but never paid much attention to it; I absolutely will check this out!
The paid plans get you the “premium” blocklists, which include one made specifically to stop AI scrapers, but a free account will still get you the actual software, the community blocklist, plus up to three “basic” lists.
And the community blocklists are updated when more than a couple of instances of CrowdSec (I think the number is something like 10-50) block an IP within a short timeframe.
The AI blocklist adds an IP as soon as even one instance spots an AI scraper right from the user agent.
So even if the community blocklist has fewer AI IPs, it does eventually include them.
Which CrowdSec blocklists are you using?
I’m using the default list alongside the FireHOL BotScout list and the FireHOL CyberCrime Tracker list, both set to ban.
I’m also using the FireHOL cruzit.com list set to serve a captcha, just in case it’s not actually a bot.
On top of that I’m using the cs-firewall-bouncer, plus a custom bouncer from CrowdSec’s tutorials that detects privilege escalation, in case anybody actually manages to get inside.
Alongside that I’m using a lot of scenario collections for the specific software I run (Nextcloud, Grafana, SSH, …), which helps a lot with attacks aimed directly at a service rather than just general scraping or bot path traversal.
All free, and I’ve been using it for a year. The only complaint I have is that I had to make a cron job to restart the CrowdSec service every day, because it would stop working after a couple of days due to the amount of requests it has to process.
Crowdsec has default scenarios and lists that might block a lot of it, and you can pretty easily make a custom scenario to block IPs that cause large spikes of traffic to your applications if needed.
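Something along these lines, as a very rough sketch; the name and thresholds are made up, and the filter fields depend on how your parser labels HTTP events:

```yaml
# hypothetical /etc/crowdsec/scenarios/custom-http-burst.yaml
type: leaky
name: custom/http-burst
description: "Ban IPs that produce a large burst of HTTP requests"
filter: "evt.Meta.service == 'http'"   # adjust to whatever your parser actually sets
groupby: evt.Meta.source_ip
capacity: 200        # requests the bucket can hold before it overflows
leakspeed: 10s       # how quickly the bucket drains
blackhole: 5m        # ignore the same IP for a while after it triggers
labels:
  remediation: true  # tells the bouncers to actually block the IP
```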
I’m struggling to find it, but there’s something like an “AI tarpit” that causes scrapers to get stuck? I’m sure I saw it posted on Lemmy recently; hopefully someone can link it.
I did find this GitHub link as the first search result, which looks interesting. Thanks for letting me know the term “tar pit”.
there is also https://forge.hackers.town/hackers.town/nepenthes
Now I just want to host a web page and expose it with nepenthes…
First, because I’m a big fan of carnivorous plants.
Second, because it lets you poison LLMs and AIs, and fuck with their data.
Lastly, because I can do my part and say F#CK Y0U to those privacy- and data-hungry a$$holes!
I don’t even expose anything directly to the web (everything is only accessible through a tunnel like WireGuard), and I don’t have any important data to protect from AI or LLMs. But just getting the opportunity to fuck with them while they continuously harvest data from everyone is something I was already thinking of, I just didn’t know how.
Thanks for the link!
If you’re looking to stop them from wasting your traffic, do not use a tarpit. The whole point of it is that it makes the scraper get stuck on your server forever. That means you pay for the traffic the scraper uses, and it will continually rack up those charges until the people running it wise up and ban your server. The question you gotta ask yourself is, who has more money, you or the massive AI corp?
Tarpits are the dumbest bit of anti-AI tech to come out yet.
There’s more than one style of tar pit. In this case you obviously wouldn’t want to use an endless maze style.
What you want to do in this case is send them through an HAProxy instance that routes on user agent: whenever they come in as Claude, you send them over to a box running a WANem process at modem speeds.
They’ll immediately realize they’ve got a hug of death going on and give up.
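Roughly like this, as a sketch only; the backend names and addresses are made up, TLS termination is left out, and it assumes Claude’s crawler keeps reporting itself as ClaudeBot in the User-Agent header:

```
frontend web_in
    bind *:80
    # Anything identifying itself as Claude's crawler goes to the slow lane.
    acl is_claude hdr_sub(User-Agent) -i claudebot
    use_backend slow_lane if is_claude
    default_backend real_services

backend real_services
    server home 100.64.0.1:8080

backend slow_lane
    # Points at the throttled box (e.g. a WANem instance shaping traffic to modem speeds).
    server throttled 127.0.0.1:8081
```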
If you had read the OP, they don’t want the scrapers using up all their traffic.
Yes, I did read the OP.
Edit: I see this was downvoted without a response, but I’ll put this out there anyway.
If you host a public site, which you expect anyone can access, there is very little you can do to exclude an AI scraper specifically.
Hosting your own site for personal use? IP blocks etc will prevent scraping.
But how do you distinguish legitimate users from scrapers? It’s very difficult.
They will use your traffic up either way. Don’t want that? You could waste their time (tarpit), or take your hosting away from public access.
Downvoter: what’s your alternative?
I guess sending tar bombs can be fun
Go on.
You first pick them up.
Then you throw them.
Classic!
instructions unclear, tsar bomba away
Honestly we need some sort of proof of work (PoW)
This is the most realistic solution. Adding a 0.5/1s PoW to hosted services isn’t gonna be a big deal for the end user, but offers a tiny bit of protection against bots, especially if the work factor is variable and escalates.
It’s also practical for bots: it forces people not to abuse resources.
There are a lot of cryptocurrencies that use an increasing PoW work factor to combat spam. Nano is one of them, so it’s pretty proven technology, too.
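For anyone curious what the mechanics look like, here’s a minimal hash-based PoW sketch in Python (the difficulty, nonce size, and choice of SHA-256 are arbitrary; real schemes like Hashcash or Nano’s tune these much more carefully):

```python
import hashlib
import os

def make_challenge() -> bytes:
    """Server side: hand the client a random challenge to work on."""
    return os.urandom(16)

def solve(challenge: bytes, difficulty_bits: int) -> int:
    """Client side: brute-force a nonce until SHA-256(challenge + nonce)
    falls below the target. Cost roughly doubles per extra difficulty bit."""
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce
        nonce += 1

def verify(challenge: bytes, difficulty_bits: int, nonce: int) -> bool:
    """Server side: a single hash to check the client's work."""
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))

if __name__ == "__main__":
    challenge = make_challenge()
    nonce = solve(challenge, difficulty_bits=20)  # around a second of work on one core
    print(verify(challenge, 20, nonce))           # True
```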
You said, “I’m only really running a Caddy reverse proxy on the VPS, which forwards my home server’s services through Tailscale.”
It seems then that you are using a Tailscale Funnel to expose your services to the public web. Is this the case? I ask because the basic premise of Tailscale is that you have to be logged into your Tailscale network to access the services; if you are not logged in, the site you try to access won’t even appear to exist, unless it’s set up via a Funnel.
Assuming then that you did set up a Funnel, you are now 100% exposed to the WWW. AI bots, and bots in general, crawl the WWW daily, and eventually your site will be found. You have a few choices here: rely on a Web Application Firewall (WAF) such as BunkerWeb, which would replace Caddy but would provide a decent firewall of sorts, or use something like ConfigServer Firewall, though I’m not sure if it has AI bot protection; the last time I used it was before AI was a thing.
Build tar pits.
They want to reduce the bandwidth usage. Not increase it!
A good tar pit will reduce your bandwidth. Tarpits aren’t about shoving useless data at bots; they’re about responding as slowly as possible, to keep the bot connected for as long as possible while giving it nothing.
Endlessh accepts the connection and then… does nothing. It never even gets as far as the SSH handshake. It just very… slowly… sends… an endless preamble, until the bot gives up.
As I write, my Internet-facing SSH tarpit currently has 27 clients trapped in it. A few of these have been connected for weeks. In one particular spike it had 1,378 clients trapped at once, lasting about 20 hours.
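The whole idea fits in a few lines. A rough Python sketch of the same trick (port and delay are arbitrary; Endlessh itself is written in C and handles thousands of stuck clients far more efficiently):

```python
import asyncio
import random
import string

async def tarpit(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    # SSH clients wait for the server's version banner. RFC 4253 allows arbitrary
    # lines before it, so we simply never get around to sending the real one.
    try:
        while True:
            line = "".join(random.choices(string.ascii_lowercase, k=16)) + "\r\n"
            writer.write(line.encode())
            await writer.drain()
            await asyncio.sleep(10)  # a trickle of bytes; mostly the bot just waits
    except (ConnectionResetError, BrokenPipeError):
        pass
    finally:
        writer.close()

async def main() -> None:
    server = await asyncio.start_server(tarpit, "0.0.0.0", 2222)  # arbitrary port
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```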
this might not be what you meant, but the word “tar” made me think of tar.gz. Don’t most sites compress the HTTP response body with gzip? What’s to stop you from sending a zip bomb over the network?
Even if that were possible, I don’t want to crash innocent people’s browsers. My tar pits are deployed on live environments that normal users could find themselves navigating to, and it’s overkill anyway: if you simply respond to 404 Not Found with 200 OK and serve 15MB on the “error” page, bots will stop going to your site, because you’re not important enough to deal with. It’s a low bar, but your data isn’t worth someone looking at your tactics and even thinking about circumventing them. They just stop attacking you.
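For illustration, a toy version of that trick (the ~15MB size and Python’s http.server are arbitrary; in practice you’d wire this into the reverse proxy rather than run a separate server):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

JUNK = b"<!-- nothing to see here -->" * 550_000  # about 15 MB of filler

class DecoyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Every unknown path "exists": a 200 OK with a large, useless body instead of a 404.
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(JUNK)))
        self.end_headers()
        self.wfile.write(JUNK)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), DecoyHandler).serve_forever()
```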