• alecsargent@lemmy.zip · 11 days ago

    If you are using Hugo, save this robots.txt template as layouts/robots.txt (and set enableRobotsTXT = true in your site configuration); it re-fetches the bot list automatically on every build:

    {{- $url := "https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/refs/heads/main/robots.txt" -}}
    {{- with try (resources.GetRemote $url) -}}
      {{ with .Err }}
        {{ errorf "%s" . }}
      {{ else with .Value }}
        {{- .Content -}}
      {{ else }}
        {{ errorf "Unable to get remote resource %q" $url }}
      {{ end }}
    {{ end -}}

    Sitemap: {{ "sitemap.xml" | absURL }}
    

    Optionally, if you want to lead rogue bots to poisoned pages:

    {{- $url := "https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/refs/heads/main/robots.txt" -}}
    {{- with try (resources.GetRemote $url) -}}
      {{ with .Err }}
        {{ errorf "%s" . }}
      {{ else with .Value }}
        {{- printf "%s\n%s\n\n" "User-Agent: *" "Disallow: /train-me" }}
        {{- .Content -}}
      {{ else }}
        {{ errorf "Unable to get remote resource %q" $url }}
      {{ end }}
    {{ end -}}

    Sitemap: {{ "sitemap.xml" | absURL }}
    

    Check out how to poison your pages for rogue bots in this article.

    The repo was deleted and it was excluded from the Internet Archive.

    I use Quixotic and a Python script to poison the pages, and I included those in my site update script.

    It’s all cobbled together in amateur fashion from the deleted article, but it’s honest work.
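    For the serving side, a rough nginx sketch like this could point the /train-me trap from the template above at the poisoned copy of the site (the directory name and layout are my own assumptions, not something from the deleted article):

    # Inside the server block that serves the site: hand out the poisoned
    # (e.g. Quixotic-generated) copy under the path that well-behaved
    # crawlers are told to avoid.
    location /train-me/ {
        alias /var/www/poisoned/;   # assumed output directory of the poisoning step
        index index.html;
    }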

  • zod000@lemmy.dbzer0.com · 12 days ago

    Most AI crawlers don’t respect robots.txt files, but this info might be useful for other forms of blocking.

    • Vittelius@feddit.org (OP) · 12 days ago

      The repo, despite its name, doesn’t only contain a robots.txt. It also has files for popular reverse proxies to block crawlers outright.
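      For nginx, for example, the idea boils down to something like this sketch (the user agents, port, and layout here are illustrative, not the repo’s exact file):

      # Flag known AI crawler user agents; the maintained list lives in the repo’s nginx file.
      map $http_user_agent $is_ai_bot {
          default        0;
          ~*GPTBot       1;
          ~*ClaudeBot    1;
          ~*CCBot        1;
          ~*Bytespider   1;
      }

      server {
          listen 80;
          server_name example.org;

          # Unlike robots.txt, this is enforced: flagged crawlers get a 403.
          if ($is_ai_bot) {
              return 403;
          }

          location / {
              proxy_pass http://127.0.0.1:8080;  # your actual backend
          }
      }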

      • zod000@lemmy.dbzer0.com · 12 days ago

        That was kind of the point of my comment, since the name didn’t indicate that. Also, many of the tools that companies would use won’t or can’t use these files directly, but they could still make use of the info. Since I’m in exactly that situation, I wanted people to know it could still be worth their time to take a look.

    • Ulrich@feddit.org · 12 days ago

      robots.txt doesn’t do any sort of blocking. It’s nothing more than a request. This is active blocking.

      Although I’m not sure how successful it will be, given the determination of these bots.

      • zod000@lemmy.dbzer0.com · 11 days ago

        A few of them are quite good at randomizing their user-agent and using a large number of IP blocks. I’ve not had a fun time trying to limit them.
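        Per-IP rate limiting at least doesn’t care what user agent they send, though a crawler spread across enough address blocks still slips under it. A rough nginx sketch (zone name, size, and rates are arbitrary placeholders):

        # Throttle by client IP, independent of the user agent string.
        limit_req_zone $binary_remote_addr zone=crawlers:10m rate=2r/s;

        server {
            location / {
                limit_req zone=crawlers burst=10 nodelay;
                proxy_pass http://127.0.0.1:8080;
            }
        }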

        • Ulrich@feddit.org · 11 days ago

          Yeah dude, they’re extremely malicious and not even trying to hide it anymore. They don’t give a fuck that they’re DDOSing the entire internet.

  • db0@lemmy.dbzer0.com · 12 days ago

    That’s pretty sweet, but just be aware that a lot of bots are bad actors and don’t advertise a proper user agent, so you also have to block by IP. Blocking all of Alibaba’s server IP ranges is a good start.
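    In nginx that can be as simple as a handful of deny rules in front of the proxy (the ranges below are documentation placeholders; pull the real ones from the provider’s published lists or whois):

    # Placeholder ranges only; substitute the provider’s actual address blocks.
    location / {
        deny  203.0.113.0/24;
        deny  198.51.100.0/24;
        allow all;
        proxy_pass http://127.0.0.1:8080;
    }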

    • plz1@lemmy.world · 12 days ago

      This is an nginx reverse proxy configuration. It’s not passive like robots.txt, but they probably named it that way in solidarity with the intent of robots.txt. You’re on point about Alibaba, though, which I’m sure could be added to this nginx blocking strategy easily enough. Anubis is still probably a better solution, since it doesn’t rely on LLM bots sending an honest user agent.