LazyWeb Request: Automated Locale for Arbitrary URLs

Here is what I need, so if someone has a solution I’d happily hear it: take as input an arbitrary list of company URLs, and deliver as output a list of locations for all the companies. I don’t want to know where they’re hosted, or any other domain-registration information; I want headquarters office data for each company.

Anyone had to do this before?

I run into this problem all the time and have yet to come up with a good solution. The obvious one, of course, would be to do some semi-intelligent scraping, but the problem is just infrequent enough, and the requisite code just complex enough (the location is never in the same place on the page, nor marked up the same way), that I’ve never rolled a Perl script for it.
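For what the "semi-intelligent scraping" might look like, here is a minimal sketch in Python: a heuristic that pulls US-style "City, ST 12345" strings out of a page's text. The regex and the sample page are my own illustration, not a robust solution — real contact pages vary wildly, which is exactly the problem described above.

```python
import re

# Heuristic: match "City, ST 12345" (optionally ZIP+4). This will miss
# non-US formats and can false-positive, so treat results as candidates.
ADDR_RE = re.compile(
    r"\b([A-Z][A-Za-z .]+),\s*([A-Z]{2})\s+(\d{5})(?:-\d{4})?\b"
)

def find_hq_candidates(html_text):
    """Return (city, state, zip) tuples that look like postal addresses."""
    return ADDR_RE.findall(html_text)

# Hypothetical contact-page snippet for illustration:
page = "Acme Corp · 123 Main St, Mountain View, CA 94043 · Contact us"
print(find_hq_candidates(page))  # [('Mountain View', 'CA', '94043')]
```

In practice you would fetch each company's "Contact" or "About" page first and rank the candidates (e.g. prefer matches near the word "headquarters"), but even then the error rate is real — which is why the comments below keep coming back to verified data sources.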

  1. Heh… this is what the “Semantic Web” people have been on about all these years. They say “If we could get people to post unambiguous, clearly marked-up data in a machine-readable form, you wouldn’t need to screen-scrape any more and much magic would ensue.” Maybe it would if they could, but so far nobody is willing to put in the work of posting clean data today in exchange for nebulous benefits tomorrow.

  2. Kinda. At one point we used a shell script and a Perl function to rummage around in Yahoo’s financial website and return data on the company: stock price, market and so on. The location of the headquarters was part of the data returned, but I discarded it.
    It all went into a SQL DB. I’ve been toying with reviving that DB and pushing it to the web. All I need is motivation.

  3. This is done in local search, where location matters. Some tech doesn’t try to verify the validity of the info (less of an issue with corporate HQs), which leaves some data very poor, but the smart players verify across multiple sources.
    Give me the URLs and I’ll get you address info, hours of operation, long/lat, etc., all formatted to your heart’s content. 😉

  4. You may want to try our Google Finance scraper at 😉

  5. Sounds like a job for Amazon’s Mechanical Turk….

  6. Re: our Google Finance scraper handles searching for private companies relatively well. From the FAQ, it seems they are using data from (although it is still possible to find data from for a company that does not offer)
    Case in point: some corporate info and headquarters are available for YouTube and MySpace.

  7. David Beroff says:

    Dang! Ryan beat me to the punch, but I’ll add another vote for! :-)

  8. There are a number of large directory companies that hire hordes of people to establish and maintain this type of data. Automating this process is possible, but one has to set expectations for a certain amount of error. I worked for a company (WhizBang!Labs) which developed a fully automated solution. We had a number of early adopters, but ultimately fell foul of the dot-com collapse.

  9. If the “adr” [1] and “geo” [2] microformats could get some bottom-up traction like RSS got a while ago, the hard part of the problem would be solved…