So it's been about ten years since I last posted on this thing, guess I should give it another shot 😛

My #web server hosts a bunch of websites, and I've been watching crawlers scan it for a while. Some of them are pretty mysterious (“Thinkbot” indeed? And it suggests you block its IP if you don't want it, despite crawling from a decently large IP range), others are obvious (ChatGPT, Google's non-search user agents).

I certainly don't want to encourage LLMs and “AI” (and I hate calling it that, it's nothing of the sort), so I made a very basic attempt at blocking them, using a blog post from someone else doing the exact same thing as a starting point.

At this point I'll give a small warning – the list I started with included common scripting-language user agents, like the ones used by Python and Go. At first glance this makes sense, since some bots are indeed written in those languages and don't bother setting a custom user agent, but on the other hand plenty of legitimate apps do the same. The one it took me three months to notice was BlueSky, which uses Go: my PDS-based account still “worked”, it's just that all the posts I made during that time disappeared into the ether (they're still in the PDS's database, but you'll never see them visiting bsky.app). So don't do that!
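For concreteness, this is the kind of overly broad blocking I mean (a sketch, not my actual config – the map variable name is made up, and the patterns are just examples):

```nginx
# DON'T do this. Go's net/http sends "Go-http-client/1.1" by default,
# so this pattern also matches legitimate Go software – in my case,
# BlueSky's infrastructure fetching posts from my PDS.
map $http_user_agent $blocked_ua {
    default            0;
    ~*Go-http-client   1;
    ~*python-requests  1;
}
```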

So I had a bunch of user agents blocked, with nginx responding 418 I'm a Teapot. I add more user agents as I see them appear (e.g. the aforementioned Thinkbot), but I've noticed that the crawlers hugely prefer an issue tracker I have installed on one site. My assumption is that they've been programmed to prefer forum-like web apps, in the hope that those will have more “human” content with less LLM pollution.
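The setup is roughly this shape (a minimal sketch, not my full list – the specific patterns here are illustrative):

```nginx
# In the http block: flag requests by User-Agent.
# A few example entries; the real list grows as new bots show up.
map $http_user_agent $blocked_ua {
    default      0;
    ~*GPTBot     1;
    ~*thinkbot   1;
}

server {
    # ...

    # Short-circuit flagged crawlers with 418 I'm a Teapot.
    if ($blocked_ua) {
        return 418;
    }
}
```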

Supporting this theory: I've noticed that a number of smaller but active forums, still running more oldschool forum software (better, as it's more reserved about things like DHTML infinite scrolling), all started using Cloudflare around the same time, regularly challenging users to the point that posts were often getting lost, since Cloudflare apparently has no provision for saving a POST it decides to interrupt. It's interesting because some of these forums are very anti-big-#internet and looking for a web 1.0 experience, so they're pretty anti-Cloudflare in everything but the need for anti-bot protection.

Sadly, I'm not really sure what the solution is. Since the crawlers are targeting such specific subsets of a site, it does feel like there's an opportunity to poison the well, but it'd need to be done in a way that's transparent to the real users.
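Purely as a sketch of the idea (not something I've deployed, and assuming the same kind of $blocked_ua user-agent map described above), the crudest version would be to quietly hand flagged crawlers a page of generated nonsense instead of a teapot:

```nginx
# Hypothetical: rather than return 418, feed flagged crawlers a
# static decoy page of junk text. Real visitors never hit this branch.
location / {
    if ($blocked_ua) {
        rewrite ^ /decoy.html last;
    }
    # ... normal site config ...
}
```

Of course, that only poisons bots you can already identify by user agent; the interesting (and hard) part is doing it for the ones you can't.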