I occasionally have robots that cause problems for my site. I have now put in checks to catch robots that ignore meta robots tags. While this particular fault isn't a problem for me, there is a correlation between robots that ignore these tags and ones used to extract email addresses for spammers and ones that grab pages as fast as they can. Eventually I might put in checks to block robots that grab pages too fast even if they do pay attention to meta robots tags.
What I do is sprinkle hidden links (as in the signature area at the bottom of this page) throughout most of my web pages, all pointing to a trap page whose meta tags instruct robots to neither index the page nor follow links off it. That page carries a warning telling humans not to follow any links. The links on that page point to a cgi-bin application which adds the remote IP address to a block list.
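As a sketch, the hidden link and the trap page could look something like the following. The paths and file names here are placeholders, not the ones I actually use:

```html
<!-- hidden link sprinkled into ordinary pages -->
<a href="/robot/trap.html"><img src="/images/blank.gif" alt="" width="1" height="1" border="0"></a>

<!-- head of the trap page: tell well-behaved robots to stay away -->
<meta name="robots" content="noindex,nofollow">

<!-- body of the trap page: warn humans, then bait the robot -->
<p>Do not follow any links on this page.</p>
<a href="/cgi-bin/block.cgi">page 1</a>
```

A robot that honors the meta robots tag never sees the bait link; one that ignores it fetches the CGI and blacklists itself.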
I have also added some mailto links pointing to spamtrap addresses, which add the sending host to my blacklist; mail sent to these addresses is run through spamtrap.pl. I have also bound a key in my MUA so I can easily block an address used to send me spam that got through my filters.
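The core of that step can be sketched in a few lines. This is an illustrative Python version, not the actual spamtrap.pl; the header parsing and the blacklist path are assumptions:

```python
import re

def trap_sender_ip(message_text):
    """Pull the connecting host's IP address out of the topmost
    Received: header of a message sent to a spamtrap address."""
    m = re.search(r'^Received:.*?\[(\d{1,3}(?:\.\d{1,3}){3})\]',
                  message_text, re.MULTILINE | re.DOTALL)
    return m.group(1) if m else None

def add_to_blacklist(ip, path='/var/www/html/robot/blacklist'):
    """Append the host to the block list (path is hypothetical);
    the real spamtrap.pl would also deduplicate entries."""
    with open(path, 'a') as f:
        f.write(ip + '\n')
```

A delivery agent (e.g. via .forward or an alias) would pipe the raw message into a script calling these two functions.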
I also use the data to block email, by sharing the database with rbldns.
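If memory serves, the rbldns data file is just a list of addresses and CIDR prefixes, preceded by a line choosing the A and TXT records returned for listed hosts; something like this (addresses invented):

```
:127.0.0.2:Connection refused - your host is on my robot blacklist
192.0.2.4
198.51.100.0/24
```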
The next step is to also use outside blocklists (in particular, Osirusoft looks like a good source) to add to the list. I have some zone information from Osirusoft, but it is pretty large. I want to convert it to CIDR format (by forming groups, not just using /32s and /24s) and read that data whenever my block list gets updated.
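The grouping itself is mechanical. A sketch using Python's standard ipaddress module (my actual tooling is Perl; this just shows the idea of merging adjacent ranges into larger CIDR blocks):

```python
import ipaddress

def to_cidr_groups(ips):
    """Collapse a list of addresses/prefixes into the minimal set
    of CIDR blocks, merging adjacent ranges rather than keeping
    one /32 per address."""
    nets = [ipaddress.ip_network(s, strict=False) for s in ips]
    return [str(n) for n in ipaddress.collapse_addresses(nets)]
```

For example, four consecutive /32s collapse into a single /30, and two adjacent /25s collapse into a /24.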
I prefer not to use Osirusoft's DNS servers for privacy reasons. The compressed file is big enough that I will probably only get it once a month for as long as they let me. Eventually I might trade mirroring them for better access, but since they don't use rbldns-friendly data and I can't afford a lot of traffic right now, it isn't going to happen anytime soon.
In your Apache configuration file you can include something similar to the following:

RewriteEngine On
RewriteMap blocked prg:/var/www/html/robot/check.pl
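check.pl speaks Apache's prg: RewriteMap protocol: read one lookup key per line on stdin, answer one value per line on stdout, unbuffered, with NULL meaning no match. A minimal Python sketch of such a map program follows; the real check.pl reads a CDB file via CDB_File, and the plain-text blacklist here is an assumption:

```python
import sys

def load_blocked(path):
    """Read one blocked IP address per line; missing file = empty list."""
    try:
        with open(path) as f:
            return {line.strip() for line in f if line.strip()}
    except OSError:
        return set()

def lookup(ip, blocked):
    # Apache expects exactly one answer line per query;
    # "NULL" tells mod_rewrite the key was not found.
    return "BLOCKED" if ip in blocked else "NULL"

def main(path="/var/www/html/robot/blacklist"):
    # Wired into Apache with: RewriteMap blocked prg:/path/to/this/script
    blocked = load_blocked(path)
    for line in sys.stdin:          # Apache writes one key per line
        sys.stdout.write(lookup(line.strip(), blocked) + "\n")
        sys.stdout.flush()          # the answer must not sit in a buffer
```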
In your top .htaccess file put the following:
RewriteEngine On
RewriteCond ${blocked:%{REMOTE_ADDR}} BLOCKED [NC]
RewriteRule ^.*$ - [F]
In my case I provide an alternate error page in the rewrite that explains why the access is forbidden.
Below is a link to all of the Perl source code to do this. You also need the CDB_File module.
There is a bug in Apache 2.0.39 and earlier (and I think 1.3.26 and earlier) with synchronizing an external map program between the server processes that breaks this. It should be fixed in the next release of Apache. I am currently using a fix from the Apache group that hasn't been committed to CVS yet.
If you do use this code, please consider changing some of the link names. That increases the work required for someone modifying an abusive robot to get around this block without actually honoring the meta robots instructions.
A secure version of this page is located at: https://wolff.to/robot/