We recently got paged that a customer’s small cloud VM running a CMS was unresponsive. Upon looking into it, we found the load skyrocketing and the machine deep in swap. There were lots of port 80 connections, so we shut down Apache until the load settled down.
A quick look at the logs showed the site was being hit by multiple dispersed IPs, all crawling the site. All of them carried a user-agent string that read:
(compatible; 008/0.83; http://www.80legs.com/webcrawler.html) Gecko/2008032620
So we turned Apache back on and, by tailing the logs, firewalled out the crawling IPs as they appeared. After about 20 of those, the site was stable, even though a few 80legs.com stragglers were still coming through.
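The tail-and-block loop can be sketched in a few lines of Python. This assumes the Apache "combined" log format and matches on the crawler token from the user-agent string above; the sample log lines and the printed iptables commands are illustrative, not our actual logs.

```python
import re

# Token from the crawler's user-agent string seen in our logs.
CRAWLER_TOKEN = "008/0.83"

def crawler_ips(log_lines):
    """Return the unique client IPs whose user-agent contains the crawler token."""
    ips = set()
    for line in log_lines:
        if CRAWLER_TOKEN in line:
            m = re.match(r"(\S+)", line)  # client IP is the first field in combined format
            if m:
                ips.add(m.group(1))
    return sorted(ips)

# Two made-up log lines: one crawler hit, one normal visitor.
sample = [
    '203.0.113.7 - - [10/Oct/2009:13:55:36 -0500] "GET /page HTTP/1.1" 200 512 "-" '
    '"(compatible; 008/0.83; http://www.80legs.com/webcrawler.html) Gecko/2008032620"',
    '198.51.100.4 - - [10/Oct/2009:13:55:37 -0500] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]

for ip in crawler_ips(sample):
    # Print the firewall command rather than running it; adapt to your firewall.
    print(f"iptables -A INPUT -s {ip} -j DROP")
```

In practice you would feed this from `tail -f` on the access log and actually execute the commands, but the extraction step is the part worth getting right.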
So we googled around and 80legs.com appears to be a legitimate outfit.
They run a crawling service: customers pay them to crawl sites. Their ‘distributed’ system seems to mean they have embedded their code in browser toolbars, so random people’s home and business machines collect the data (i.e. parasitic computing) while a control process at their facilities coordinates everything. Kind of an opt-in botnet.
In their FAQ, they claim to allow only one request per second, and that holds up: we saw more like one request every 2-3 seconds per IP. The problem is that many IPs were all doing that at once, which overwhelmed the little site. The installed caching module didn’t help, because the crawler was constantly asking for new pages, each one a trip to the database, page generation, and so on.
According to the 80legs FAQ, they follow robots.txt conventions, and our customer didn’t have one, so we added a robots.txt file that slows crawlers to a single request every 15 seconds. Even with 15-20 IPs all crawling, that should keep them down to a dull roar: an aggregate of one to two requests per second.
User-agent: *
Crawl-Delay: 15
which is a ‘good thing’ anyway.
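Python’s standard library can sanity-check a robots.txt like this one before you deploy it. A minimal sketch using `urllib.robotparser` (the `crawl_delay` method is available in Python 3.6+); the agent name "008" is the token 80legs uses:

```python
from urllib.robotparser import RobotFileParser

# The rate-limiting robots.txt described above.
robots_txt = """\
User-agent: *
Crawl-Delay: 15
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The 80legs agent ("008") has no specific group, so it falls under "*".
print(rp.crawl_delay("008"))            # → 15
print(rp.can_fetch("008", "/any/page"))  # → True (still allowed, just slowed)
```

Note that `Crawl-Delay` is a de facto extension, not part of the original robots.txt convention, and not every crawler honors it; 80legs claims to.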
Later on, when the customer was informed of the problem, he had us add a complete block for that crawler. His reasoning: if someone wants his data badly enough to pay for it, they should come to the site themselves and get the whole effect rather than some report.
User-agent: 008
Disallow: /
We haven’t seen any 80legs.com activity in those logs since then.
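The combined file (the block plus the earlier rate limit) can be checked the same way. A sketch with `urllib.robotparser`, again assuming "008" is the 80legs agent token and "Googlebot" stands in for any other crawler:

```python
from urllib.robotparser import RobotFileParser

# The final robots.txt: 008 fully blocked, everyone else rate-limited.
robots_txt = """\
User-agent: 008
Disallow: /

User-agent: *
Crawl-Delay: 15
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("008", "/"))        # → False (blocked entirely)
print(rp.can_fetch("Googlebot", "/"))  # → True (other crawlers unaffected)
```

The blank line between groups matters: each user-agent group must be separated, or the parser folds the rules together.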
So 80legs.com is a little aggressive but seems to follow the rules. They could perhaps tune up their control bot a bit, but I imagine it’s hard to coordinate all those minions to the second.
The lesson is that every site should have a robots.txt file. It’s useful for any crawler.