If you are using Hugo, use this robots.txt template, which automatically updates on every build:
{{- $url := "https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/refs/heads/main/robots.txt" -}}
{{- $resource := resources.GetRemote $url -}}
{{- with try $resource -}}
  {{ with .Err }}
    {{ errorf "%s" . }}
  {{ else with .Value }}
    {{- .Content -}}
  {{ else }}
    {{ errorf "Unable to get remote resource %q" $url }}
  {{ end }}
{{ end -}}

Sitemap: {{ "sitemap.xml" | absURL }}
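Assuming a standard Hugo setup, this template goes in layouts/robots.txt, and robots.txt generation has to be enabled with enableRobotsTXT = true in the site configuration; Hugo then fetches the upstream block list fresh on every build.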
Optionally, if you want to lead rogue bots to poisoned pages:
{{- $url := "https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/refs/heads/main/robots.txt" -}}
{{- $resource := resources.GetRemote $url -}}
{{- with try $resource -}}
  {{ with .Err }}
    {{ errorf "%s" . }}
  {{ else with .Value }}
    {{- printf "%s\n%s\n\n" "User-Agent: *" "Disallow: /train-me" }}
    {{- .Content -}}
  {{ else }}
    {{ errorf "Unable to get remote resource %q" $url }}
  {{ end }}
{{ end -}}

Sitemap: {{ "sitemap.xml" | absURL }}
Check out how to poison your pages for rogue bots in this article. (Edit: the repo was deleted and the Internet Archive copy was excluded.)
I use Quixotic and a Python script to poison the pages, and I included those in my site update script.
It's all cobbled together in amateur fashion from the deleted article, but it's honest work.
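Roughly, the idea looks like this (a minimal sketch, not the actual script and not what Quixotic does internally; the paths, page counts, and file layout are made-up assumptions): learn a Markov chain from the real content, then dump gibberish HTML under the /train-me path that the template above disallows.

#!/usr/bin/env python3
# Sketch of a page poisoner: learn a Markov chain from real content,
# then write gibberish HTML under public/train-me/ so it sits behind
# the "Disallow: /train-me" rule above. Paths and counts are
# assumptions for illustration only, not Quixotic's behaviour.
import random
import re
from collections import defaultdict
from pathlib import Path

SOURCE_DIR = Path("content")          # real Markdown to learn from (assumed Hugo layout)
OUTPUT_DIR = Path("public/train-me")  # matches the Disallow: /train-me rule
NUM_PAGES = 20
WORDS_PER_PAGE = 400

def build_chain(text):
    # Map each word to the list of words seen immediately after it.
    words = re.findall(r"[A-Za-z']+", text)
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, length):
    # Random-walk the chain to produce plausible-looking nonsense.
    word = random.choice(list(chain))
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        word = random.choice(followers) if followers else random.choice(list(chain))
        out.append(word)
    return " ".join(out)

def main():
    corpus = "\n".join(p.read_text(errors="ignore") for p in SOURCE_DIR.rglob("*.md"))
    chain = build_chain(corpus)
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    for i in range(NUM_PAGES):
        html = "<!doctype html><html><body><p>%s</p></body></html>" % generate(chain, WORDS_PER_PAGE)
        (OUTPUT_DIR / ("page-%d.html" % i)).write_text(html)

if __name__ == "__main__":
    main()

Something like this would run right after hugo in the site update script, so the poisoned pages land in the published output without touching the real content.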
Most AI crawlers don’t respect robots.txt files, but this info might be useful for other forms of blocking.
The repo, despite its name, doesn’t only contain a robots.txt. It also has files for popular reverse proxies to block crawlers outright.
That was kind of the point of my comment, since the name didn't indicate that. Also, many tools that companies would use won't or can't consume these files directly, but could still make use of the info. Since I'm specifically in that case, I wanted people to know that it could still be worth their time to take a look.
robots.txt doesn’t do any sort of blocking. It’s nothing more than a request. This is active blocking.
Although I’m not sure how successful it will be, given the determination of these bots.
A few of them are quite good at randomizing their user-agent and using a large number of IP blocks. I’ve not had a fun time trying to limit them.
Yeah dude, they’re extremely malicious and not even trying to hide it anymore. They don’t give a fuck that they’re DDOSing the entire internet.
If only they could read
That's pretty sweet, but just be aware that a lot of bots are bad actors and don't advertise a proper user agent, so you also have to block by IP. Blocking all Alibaba server IPs is a good start.
This is an nginx reverse proxy configuration. It's not passive like robots.txt, but they probably named it that way in solidarity with the intent of robots.txt. You're on point about Alibaba though, which I'm sure could be added to this nginx blocking strategy easily enough. Anubis is still probably a better solution, since it doesn't depend on LLM bots passing an honest user-agent.
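For anyone who hasn't opened the repo, the nginx side boils down to matching the advertised user agent and refusing the request. A stripped-down sketch (not the repo's actual file; the agent names, server names, backend address, and the Alibaba range are only examples):

# Goes inside the http {} context of an nginx reverse proxy.
# Flag requests whose user agent matches a known AI crawler.
map $http_user_agent $ai_crawler {
    default        0;
    "~*GPTBot"     1;
    "~*ClaudeBot"  1;
    "~*Bytespider" 1;
}

server {
    listen 80;
    server_name example.org;

    # Refuse flagged crawlers outright instead of just asking nicely.
    if ($ai_crawler) {
        return 403;
    }

    # Per the comment above: bots that lie about their agent can only be
    # caught by address, e.g. a whole provider's range (placeholder value,
    # look up the real Alibaba Cloud prefixes).
    deny 47.74.0.0/16;

    location / {
        proxy_pass http://127.0.0.1:8080;
    }
}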