Greg Yardley makes a great point that I’ve been thinking about as well: The broad realization that there is value in aggregating content, and that much of that content lies exposed on websites everywhere, has given rise to a whole new generation of companies built around web-scraping technology. While scraping has been around forever, and every crawler from Google on down scrapes sites, the difference now is the number and aggressiveness of such scrapers, as well as how they’re making money from your content:
… there’s going to be more aggressive spidering and site-scraping, to the point where it puts a serious load on any sites dependent on database queries. … These additional spiders will be a natural consequence of the entrepreneurial boomlet and the growing realization that aggregated content has real value; people looking for easy business models combined with the spread of easy-to-scrape formats across the Web are going to drive an explosion of vertical search engines and niche directories. The availability of better-structured data is only going to drive this – your site might not publish a feed, use microformats, or make use of structured blogging, but the mere knowledge that such content is out there somewhere is going to drive scrapers to your site to look for it.
…by the end of the year, I predict the rise of easy-to-use services that block all but the most cautious ‘bots automatically, giving webmasters the option of approving crawlers after they’re detected. These services will put an end to the scraping, but they’ll also make it very difficult for new aggregation-based businesses to get off the ground.
Spot-on. And I’m expecting a new cadre of startups aimed at detecting stealthy spiders, the earliest couple of which will likely be snapped up by security vendors by the end of the year.
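For the curious, the core of such a detection service doesn't have to be exotic. A minimal sketch of one obvious heuristic, rate-based flagging, might look like the following. This is purely illustrative: the class name, thresholds, and sliding-window approach are my own assumptions, not anyone's actual product, and a real service would layer on user-agent checks, robots.txt compliance, and behavioral fingerprinting.

```python
from collections import defaultdict, deque


class SpiderDetector:
    """Illustrative sketch: flag any client whose request rate exceeds
    a threshold within a sliding time window. Thresholds are arbitrary."""

    def __init__(self, max_requests=30, window_seconds=10.0):
        self.max_requests = max_requests
        self.window = window_seconds
        # Per-client deque of recent request timestamps.
        self.hits = defaultdict(deque)

    def record(self, client_ip, timestamp):
        """Record one request; return True if the client now looks like a bot."""
        q = self.hits[client_ip]
        q.append(timestamp)
        # Drop timestamps that have aged out of the window.
        while q and timestamp - q[0] > self.window:
            q.popleft()
        return len(q) > self.max_requests


# A scraper hammering 100 requests in one second gets flagged;
# an occasional human-paced visitor does not.
detector = SpiderDetector()
flagged = False
for i in range(100):
    flagged = detector.record("10.0.0.5", timestamp=i * 0.01)
print(flagged)  # True
```

In practice the hard part is exactly what the prediction implies: a cautious bot that paces itself like a human slips under any simple threshold, which is why detecting stealthy spiders is a business in itself rather than a one-liner.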