A few people have pointed out to me today that Google long ago planned to do at least part of what Amazon has announced with Alexa. Here’s the relevant para from the “backrub” paper at Stanford:
Our final design goal was to build an architecture that can support novel research activities on large-scale web data. To support novel research uses, Google stores all of the actual documents it crawls in compressed form. One of our main goals in designing Google was to set up an environment where other researchers can come in quickly, process large chunks of the web, and produce interesting results that would have been very difficult to produce otherwise. In the short time the system has been up, there have already been several papers using databases generated by Google, and many others are underway. Another goal we have is to set up a Spacelab-like environment where researchers or even students can propose and do interesting experiments on our large-scale web data.
Does anyone know if Google is supporting this kind of work? I’m not aware of any non-Google researchers playing with Google’s data directly.
I don’t claim to be a search expert by any means, but this doesn’t seem like a big deal to me.
Technically, it’s no big deal: having the index (and programmatic access to it) is nice, but that’s not the difficult problem in search. Expensive, perhaps. But difficult? No.
The difficult problem is relevancy. That’s why Google killed Inktomi and all the rest: Google’s results were just better. No one goes to the inferior solution, even if it’s 95% as good. Everyone wants to know what that extra 5% of search relevancy might offer them.
Not only is Google’s index great, its PageRank and relevancy algorithms are even better. So what Alexa is offering you is an inferior index, leaving it up to you to create the relevancy application.
I’m sure this will be used in creative and imaginative ways, which is great. I just don’t see what all the fuss is about.
If I’m missing something, please fill me in….
Having programmatic access to an industrial-strength search system is the start of commoditizing the search business. I can see it being difficult for people to just whip up their own indexes, but the research community will pick up on this immediately, and it will be interesting to watch.
Google does make it very easy to do this kind of analysis internally with its powerful parallel processing tools like Sawzall and MapReduce.
http://glinden.blogspot.com/2005/07/google-sawzall.html
But I’m not aware of any example where they have let people outside Google use these tools. I know I sure would love to get my hands on them.