Once in a while someone attempts to download every page on this site. That in itself is not a problem, but it becomes one when someone does it at too high a rate. If pages are downloaded too fast, my little home server has trouble keeping up and other visitors may have trouble viewing pages. So I've added some scripts to detect those downloaders and try to stop them.
Today, just two days after I installed the scripts, I caught the first offender: 862 pages in 16 minutes, which is almost a page per second. No human can read pages that fast, even when skipping the boring stuff.
The scripts record in a database who reads which page at what moment (stolen from ip-tracking). Whenever a page is requested, the scripts check how many pages you've read in the last period, and if that's too many, they redirect you to a simple page telling you to slow down and wait a while before resuming.
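To make the idea concrete, here is a minimal sketch of that check, written in Python with an in-memory SQLite database purely for illustration. The table layout, the function names and the numbers (`WINDOW_SECONDS`, `MAX_PAGES`) are assumptions of mine for this example, not the values or code the site actually uses.

```python
import sqlite3

# Hypothetical settings: the real "last period" and page limit are not stated.
WINDOW_SECONDS = 60
MAX_PAGES = 30

def open_db():
    """Create an in-memory database with one row per page view."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE hits (ip TEXT, page TEXT, ts REAL)")
    db.execute("CREATE INDEX hits_ip_ts ON hits (ip, ts)")
    return db

def allow_request(db, ip, page, now):
    """Record the hit, then report whether this visitor is still under the limit.

    Returns False when the visitor should be redirected to the
    'slow down' page instead of getting the requested page.
    """
    db.execute("INSERT INTO hits (ip, page, ts) VALUES (?, ?, ?)", (ip, page, now))
    (count,) = db.execute(
        "SELECT COUNT(*) FROM hits WHERE ip = ? AND ts > ?",
        (ip, now - WINDOW_SECONDS),
    ).fetchone()
    return count <= MAX_PAGES
```

A page handler would call `allow_request(db, ip, page, time.time())` first and send the redirect when it returns `False`. Because only hits inside the window are counted, a visitor who waits a while automatically gets access again.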
Things I would like to change:
- Create the pages with less database work, by optimizing some scripts or changing the database.
- Make the 'lock' last longer: a day or even a week.
- Detect the fast downloaders earlier, perhaps by checking how many times they follow all the links in the menu, perhaps with a cron job, or maybe even both.
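The last two wishes could look something like the sketch below: a lock that expires after a fixed time, triggered as soon as a visitor has followed every menu link within a short span. This is only one possible approach under assumptions of mine; the menu paths, the class name `Tracker` and both durations are made up for the example, and everything is kept in memory rather than in the database.

```python
# Hypothetical values: none of these come from the real site.
LOCK_SECONDS = 24 * 60 * 60      # keep the lock for one day
MENU_WINDOW_SECONDS = 60         # "followed the whole menu" means within a minute
MENU_LINKS = frozenset({"/home", "/projects", "/photos", "/links", "/contact"})

class Tracker:
    def __init__(self):
        self.locked_until = {}   # ip -> timestamp at which the lock expires
        self.menu_hits = {}      # ip -> {menu page: time it was last requested}

    def is_locked(self, ip, now):
        return self.locked_until.get(ip, 0.0) > now

    def record(self, ip, page, now):
        """Return True if the request may be served, False if the visitor is locked."""
        if self.is_locked(ip, now):
            return False
        if page in MENU_LINKS:
            seen = self.menu_hits.setdefault(ip, {})
            seen[page] = now
            # Menu pages this visitor requested within the recent window.
            recent = {p for p, ts in seen.items() if now - ts <= MENU_WINDOW_SECONDS}
            if recent == MENU_LINKS:
                # Every menu link followed in quick succession: lock for a day.
                self.locked_until[ip] = now + LOCK_SECONDS
                del self.menu_hits[ip]
                return False
        return True
```

A normal reader who clicks through the menu over several minutes never trips the check, because older menu hits fall out of the window. Changing `LOCK_SECONDS` to `7 * 24 * 60 * 60` would give the week-long lock.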
For now it's just testing and checking the logs.
update: The system works OK. I've caught 58 offenders in 2 months, who together downloaded over 60,000 pages. I'm still trying to generate less database work. Funny thing is that even Google, Alexa and Live Search have been banned for a short while, despite my attempts to slow the bots down in robots.txt; obviously this is sometimes ignored. The most used software is HTTrack; they have a funny FAQ page about abuse, which has some nice solutions too. I'm not trying to exclude anyone, I'm just trying to keep everything working.
Traffic is still going up: I now have more than 5 times as many hits as last year.
If this keeps up, I might have to take drastic measures.
update2: Just got hit by someone (184.108.40.206) using Snoopy v1.2.3 who downloaded 8000 pages and tried posting something 1480 times in less than 2 hours. I've added the check to even more pages. The hostname is easily found on the internet and has posted all kinds of spam links.