The title may sound like something out of a Terminator movie, but spidering robots that won’t obey the settings in your robots.txt file can be a disaster.
Spidering robots are used by all search engines to read the contents of web pages so they can return relevant results when someone enters a search term. Most spidering robots are friendly, and in most cases they are not just harmless but beneficial: happy spider robots mean high rankings on Google.
But there are also less legitimate spidering robots out there that can cause lag, downtime and increased hosting costs, and can even be used for theft and plagiarism.
To catch these nasty robots, have your web dev set up a directory on your server with a single file inside. In our example, we’ll call the directory honeypot and the file bait.txt. They should then declare that directory off limits in your robots.txt settings by entering the following lines:
    User-agent: *
    Disallow: /honeypot/
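This snippet of code tells all robots that they are not welcome in the honeypot.
When your web dev checks your server access logs, they’ll be able to see immediately which robots let themselves in anyway: well-behaved robots never request anything under the honeypot directory, so any request for it comes from a robot that ignored the rule. You can ban the IP addresses of these robots to prevent them from harassing you any further.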
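Checking the logs by hand gets tedious, so here is a minimal sketch of how your web dev might automate it in Python. It assumes the access log uses the common or combined log format, with the client IP as the first field and the request in quotes; the function name honeypot_visitors, the sample log lines and the IP addresses are made up for illustration, so adjust the pattern to whatever your server actually writes.

```python
import re
from collections import Counter

# Assumed log layout (common/combined log format), for example:
# 203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /honeypot/bait.txt HTTP/1.1" 200 12
# Group 1 captures the client IP, group 2 the requested path.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "[A-Z]+ (\S+)')


def honeypot_visitors(lines, trap_prefix="/honeypot/"):
    """Count requests per client IP whose path starts with trap_prefix."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group(2).startswith(trap_prefix):
            hits[match.group(1)] += 1
    return hits


# Hypothetical sample lines; in practice, pass an open log file instead.
sample = [
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /honeypot/bait.txt HTTP/1.1" 200 12',
    '198.51.100.4 - - [10/Oct/2023:13:56:01 +0000] "GET /index.html HTTP/1.1" 200 5120',
    '203.0.113.7 - - [10/Oct/2023:13:57:12 +0000] "GET /honeypot/ HTTP/1.1" 403 199',
]

offenders = honeypot_visitors(sample)
for ip, count in offenders.most_common():
    print(ip, count)
```

To run it against a real log, open the file and pass the file object to honeypot_visitors in place of sample. The resulting list of addresses is only a starting point for a ban list; it is worth a quick look before blocking anything, since the sketch does nothing to tell a curious human apart from a robot.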
Remember to check your access logs at frequent intervals. There are new malicious spider bots built every day, and it’s good practice to stay on top of them.