LazyWeb Request: Automated Locale for Arbitrary URLs

Here is what I need; if someone has a solution, I’d happily hear it: take as input an arbitrary list of company URLs, and deliver as output a list of locations for all of the companies. I don’t want to know where they’re hosted, or other domain-registration information, but headquarters office data for every company.

Anyone had to do this before?

I run into this problem all the time and have yet to come up with a good solution. The obvious one, of course, would be some semi-intelligent scraping, but the problem comes up just infrequently enough, and the requisite code is just complex enough (the locale is never in the same place, nor stored in the same way), that I’ve never gotten around to rolling a Perl script for it.
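For the record, here is roughly the shape of the thing, as a minimal, untested Perl sketch; the contact-page paths and the US-address regex below are guesses, which is exactly the problem:

    #!/usr/bin/perl
    # Minimal sketch: try a few likely contact pages per site and grep
    # for a US-style "City, ST 12345" address. Paths and regex are
    # guesses; plenty of sites will defeat both.
    use strict;
    use warnings;
    use LWP::Simple qw(get);

    my @paths = ('', '/contact', '/about', '/about-us', '/company');

    for my $url (@ARGV) {
        for my $path (@paths) {
            my $html = get($url . $path) or next;
            (my $text = $html) =~ s/<[^>]*>/ /g;    # crude de-tagging
            if ($text =~ /([A-Z][\w .]+,\s*[A-Z]{2}\s+\d{5}(?:-\d{4})?)/) {
                print "$url\t$1\n";
                last;
            }
        }
    }

Run it as ./hq.pl http://example.com http://example.org; everything it misses (non-US addresses, addresses rendered as images, Flash-only sites) is the long tail that makes this annoying.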

Thoughts?

Comments

  1. Tim Bray says:

    Heh… this is what the “Semantic Web” people have been on about all these years. They say “If we could get people to post unambiguous clearly marked up data in a machine-readable form, you wouldn’t need to screen-scrape any more and much magic would ensue.” Maybe it would if they could, but so far nobody is willing to go to the work of posting clean data today in exchange for nebulous benefits tomorrow.

  2. Brian says:

    Kinda. At one point we used a shell script and a Perl function to rummage around in Yahoo’s finance site and return data on each company: stock price, market, and so on. The location of the headquarters was part of the data returned, but I discarded it.
    It all went into a SQL db. I’ve been toying with reviving that DB and pushing it to the web. All I need is motivation.
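    Roughly this shape, if memory serves (the profile URL and the address regex here are approximations, and Yahoo’s markup has surely drifted since):

        # Sketch of the Yahoo-to-SQL pipeline described above.
        use strict;
        use warnings;
        use LWP::Simple qw(get);
        use DBI;

        my $dbh = DBI->connect('dbi:SQLite:dbname=companies.db', '', '',
                               { RaiseError => 1 });
        $dbh->do('CREATE TABLE IF NOT EXISTS hq (ticker TEXT, address TEXT)');

        for my $ticker (@ARGV) {
            my $html = get("http://finance.yahoo.com/q/pr?s=$ticker") or next;
            (my $text = $html) =~ s/<[^>]*>/\n/g;   # tags to newlines
            if ($text =~ /([\w .]+\n\s*[\w .]+,\s*[A-Z]{2}\s+\d{5})/) {
                $dbh->do('INSERT INTO hq (ticker, address) VALUES (?, ?)',
                         undef, $ticker, $1);
            }
        }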

  3. ian says:

    This is done in local search (location matters). Some tech doesn’t try to verify the validity of the info (not such an issue with corporate HQs), rendering some data very poor, but the smart ones verify across multiple sources.
    Give me the URLs and I’ll get you address info, hours of operation, long/lat, etc., all formatted to your heart’s content. ;)
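    The verification step itself is just agreement-counting. A toy Perl version (the per-source lookups are left out, and the normalization is deliberately crude):

        # Accept an address only when at least two independent sources
        # agree on its normalized form; otherwise punt to a human.
        use strict;
        use warnings;

        sub normalize {
            my ($addr) = @_;
            $addr = lc $addr;
            $addr =~ s/[^a-z0-9]+/ /g;      # strip punctuation
            $addr =~ s/^\s+|\s+$//g;
            return $addr;
        }

        sub verified_address {
            my @candidates = grep { defined } @_;   # one string per source
            my (%count, %first);
            for my $c (@candidates) {
                my $key = normalize($c);
                $first{$key} = $c unless exists $first{$key};
                return $first{$key} if ++$count{$key} >= 2;
            }
            return undef;   # no two sources agree
        }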

  4. You may want to try our Google Finance scraper at https://code.poly9.com/trac/wiki/HQQuery ;)

  5. Ryan Coleman says:

    Sounds like a job for Amazon’s Mechanical Turk…

  6. Re: our Google Finance scraper
    finance.google.com handles searches for private companies relatively well. From the FAQ, it seems they are using data from hoovers.com (though it is still possible to find data on hoovers.com for a company that finance.google.com does not cover).
    Case in point: corporate info and headquarters are available for both YouTube and MySpace.

  7. David Beroff says:

    Dang! Ryan beat me to the punch, but I’ll add another vote for MTurk.com! :-)

  8. There are a number of large directory companies that hire hordes of people to establish and maintain this type of data. Automating the process is possible, but one has to expect a certain amount of error. I worked for a company (WhizBang!Labs) that developed a fully automated solution; we had a number of early adopters, but ultimately fell victim to the dot-com collapse.

  9. Jim says:

    If the “adr” [1] and “geo” [2] microformats could get some bottom-up traction like RSS got a while ago, the hard part of the problem would be solved…

    1. http://microformats.org/wiki/adr
    2. http://microformats.org/wiki/geo
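    For reference, the markup itself is lightweight; a company could publish something like this (address and coordinates invented for illustration):

        <div class="adr">
          <div class="street-address">123 Main St.</div>
          <span class="locality">San Francisco</span>,
          <span class="region">CA</span>
          <span class="postal-code">94105</span>
          <span class="country-name">USA</span>
        </div>
        <span class="geo">
          <span class="latitude">37.7749</span>;
          <span class="longitude">-122.4194</span>
        </span>

    Any scraper could then just look for class="adr" instead of guessing at page layouts.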