The Rise (and Fall?) of the Scraper-conomy

Greg Yardley makes a great point that I’ve been thinking about as well: The broad realization that there is value in aggregating content, and that much of that content lies exposed on websites everywhere, has given rise to a whole new generation of companies built around web-scraping technology. While scraping has been around forever, and every crawler from Google on down scrapes sites, the difference now is the number and aggressiveness of such scrapers, as well as how they’re making money from your content:

… there’s going to be more aggressive spidering and site-scraping, to the point where it puts a serious load on any sites dependent on database queries. … These additional spiders will be a natural consequence of the entrepreneurial boomlet and the growing realization that aggregated content has real value; people looking for easy business models combined with the spread of easy-to-scrape formats across the Web are going to drive an explosion of vertical search engines and niche directories. The availability of better-structured data is only going to drive this – your site might not publish a feed, use microformats, or make use of structured blogging, but the mere knowledge that such content is out there somewhere is going to drive scrapers to your site to look for it.

…by the end of the year, I predict the rise of easy-to-use services that block all but the most cautious ‘bots automatically, giving webmasters the option of approving crawlers after they’re detected. These services will put an end to the scraping, but they’ll also make it very difficult for new aggregation-based businesses to get off the ground.

Spot-on. And I’m expecting a new cadre of startups aimed at detecting stealthy spiders, the earliest couple of which will likely be snapped up by security vendors by the end of the year.


  1. The place to do this is at the hardware router or load-balancer level, although I suspect initial efforts will be software-based, running on the web server itself (which is the wrong place: by the time you accept a request and figure out it is evil, it would have been cheaper to just blindly serve them the content).
    This is just another traffic-shaping problem, like denial-of-service defense, blacklists, etc. Hence it should reside where other traffic shaping is done.

  2. I expect that when these guys start to catalog the “stealthy spiders” it will do two things:
    1.) Decimate the ability of new search applications to rise up, forcing developers to rely on existing search APIs from the big guys (not good).
    2.) Start tagging normal web browsers and blacklisting them within their service’s reach because of the spider assumption, which can’t be good for the average Joe user, who won’t have any clue what he’s doing wrong.
    I hope the innovation is good enough to handle these problems. Would most certainly be an interesting business if it did.
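The kind of traffic shaping the first commenter describes usually starts with something simple: count requests per client over a sliding window and flag anything that hammers the site faster than a human could browse. As a minimal sketch (the window length, threshold, and class name are illustrative assumptions, not anyone’s actual product):

```python
import time
from collections import defaultdict, deque

# Illustrative values only -- a real service would tune these per site.
WINDOW_SECONDS = 10
MAX_REQUESTS = 20

class RateTracker:
    """Sliding-window request counter keyed by client IP (a toy sketch)."""

    def __init__(self, window=WINDOW_SECONDS, limit=MAX_REQUESTS):
        self.window = window
        self.limit = limit
        self.hits = defaultdict(deque)  # client IP -> recent request timestamps

    def allow(self, ip, now=None):
        """Record a request from `ip`; return False once it exceeds the limit."""
        now = time.monotonic() if now is None else now
        q = self.hits[ip]
        # Drop timestamps that have fallen out of the sliding window.
        while q and now - q[0] > self.window:
            q.popleft()
        q.append(now)
        return len(q) <= self.limit

tracker = RateTracker()
# A scraper firing 30 requests in ~3 seconds trips the limit...
verdicts = [tracker.allow("10.0.0.1", now=t * 0.1) for t in range(30)]
print(verdicts[-1])   # False
# ...while an occasional request from another client sails through.
print(tracker.allow("10.0.0.2", now=100.0))  # True
```

This also illustrates why the commenter’s point about placement matters: the counting itself is cheap, but doing it on the web server means the request has already been accepted; at the router or load balancer, the same check can drop the connection before the application ever sees it.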