• alecsargent@lemmy.zip · 11 days ago

    If you are using Hugo, save this robots.txt template as layouts/robots.txt (and set enableRobotsTXT = true in your site configuration); it re-fetches the bot list automatically on every build:

    {{- $url := "https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/refs/heads/main/robots.txt" -}}
    {{- with try (resources.GetRemote $url) -}}
      {{ with .Err }}
        {{ errorf "%s" . }}
      {{ else with .Value }}
        {{- .Content -}}
      {{ else }}
        {{ errorf "Unable to get remote resource %q" $url }}
      {{ end }}
    {{ end -}}

    Sitemap: {{ "sitemap.xml" | absURL }}
    

    Optionally, if you want to lead rogue bots to poisoned pages:

    {{- $url := "https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/refs/heads/main/robots.txt" -}}
    {{- with try (resources.GetRemote $url) -}}
      {{ with .Err }}
        {{ errorf "%s" . }}
      {{ else with .Value }}
        {{- printf "%s\n%s\n\n" "User-Agent: *" "Disallow: /train-me" }}
        {{- .Content -}}
      {{ else }}
        {{ errorf "Unable to get remote resource %q" $url }}
      {{ end }}
    {{ end -}}

    Sitemap: {{ "sitemap.xml" | absURL }}
    

    Check out how to poison your pages for rogue bots in this article.

    The repo was deleted and it was excluded from the Internet Archive.

    I use Quixotic and a Python script to poison the pages, and I included those in my site update script.

    It’s all cobbled together in amateur fashion from the deleted article, but it’s honest work.
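    For the serving side, a rough nginx sketch like this could point the /train-me trap from the template above at the poisoned copy of the site (the directory name and layout are my own assumptions, not something from the deleted article):

    # Inside the server block that serves the site: hand out the poisoned
    # (e.g. Quixotic-generated) copy under the path that well-behaved
    # crawlers are told to avoid.
    location /train-me/ {
        alias /var/www/poisoned/;   # assumed output directory of the poisoning step
        index index.html;
    }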

  • zod000@lemmy.dbzer0.com · 12 days ago

    Most AI crawlers don’t respect robots.txt files, but this info might be useful for other forms of blocking.

    • Vittelius@feddit.org (OP) · 12 days ago

      The repo, despite its name, doesn’t only contain a robots.txt. It also has files for popular reverse proxies to block crawlers outright.
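      For nginx, for example, the idea boils down to something like this sketch (the user agents, port, and layout here are illustrative, not the repo’s exact file):

      # Flag known AI crawler user agents; the maintained list lives in the repo’s nginx file.
      map $http_user_agent $is_ai_bot {
          default        0;
          ~*GPTBot       1;
          ~*ClaudeBot    1;
          ~*CCBot        1;
          ~*Bytespider   1;
      }

      server {
          listen 80;
          server_name example.org;

          # Unlike robots.txt, this is enforced: flagged crawlers get a 403.
          if ($is_ai_bot) {
              return 403;
          }

          location / {
              proxy_pass http://127.0.0.1:8080;  # your actual backend
          }
      }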

      • zod000@lemmy.dbzer0.com · 12 days ago

        That was kind of the point of my comment, since the name didn’t indicate that. Also, many of the tools that companies would use won’t or can’t use these files directly, but they could still make use of the info. Since I’m in exactly that situation, I wanted people to know it could still be worth their time to take a look.

    • Ulrich@feddit.org · 12 days ago

      robots.txt doesn’t do any sort of blocking. It’s nothing more than a request. This is active blocking.

      Although I’m not sure how successful it will be, given the determination of these bots.

      • zod000@lemmy.dbzer0.com · 11 days ago

        A few of them are quite good at randomizing their user-agent and using a large number of IP blocks. I’ve not had a fun time trying to limit them.
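        Per-IP rate limiting at least doesn’t care what user agent they send, though a crawler spread across enough address blocks still slips under it. A rough nginx sketch (zone name, size, and rates are arbitrary placeholders):

        # Throttle by client IP, independent of the user agent string.
        limit_req_zone $binary_remote_addr zone=crawlers:10m rate=2r/s;

        server {
            location / {
                limit_req zone=crawlers burst=10 nodelay;
                proxy_pass http://127.0.0.1:8080;
            }
        }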

        • Ulrich@feddit.org · 11 days ago

          Yeah dude, they’re extremely malicious and not even trying to hide it anymore. They don’t give a fuck that they’re DDOSing the entire internet.

  • db0@lemmy.dbzer0.com · 12 days ago

    That’s pretty sweet, but just be aware that a lot of bots are bad actors and don’t advertise a proper user agent, so you also have to block by IP. Blocking all of Alibaba’s server IP ranges is a good start.
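    In nginx that can be as simple as a handful of deny rules in front of the proxy (the ranges below are documentation placeholders; pull the real ones from the provider’s published lists or whois):

    # Placeholder ranges only; substitute the provider’s actual address blocks.
    location / {
        deny  203.0.113.0/24;
        deny  198.51.100.0/24;
        allow all;
        proxy_pass http://127.0.0.1:8080;
    }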

    • plz1@lemmy.world · 12 days ago

      This is an nginx reverse proxy configuration. It’s not passive like robots.txt, but they probably named it that way in solidarity with the intent of robots.txt. You’re on point about Alibaba, though, which I’m sure could be added to this nginx blocking strategy easily enough. Anubis is still probably a better solution, since it doesn’t rely on LLM bots sending an honest user agent.