Playing with the AOL Search Data

As I continue to play with the AOL search data, one conclusion has become obvious: the tools for wandering through large datasets are dreadful. There are more than 20 million searches in the data, which means, for example, that Excel (with its 65,536-row limit) can hold only about 0.3% of the data. I have also been using SPSS, which is better. It can hold something like 2 billion cases, which is nice, but even with a few hundred thousand cases loaded it gets very, very slow.
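
For a sense of what a single streaming pass looks like, here is a rough Python sketch that counts rows and the most frequent queries without loading the file into a spreadsheet. It assumes a tab-separated file with a header row and a column named `Query`; the function names are made up for illustration, so adjust them to whatever your copy of the data actually looks like.

```python
import csv
from collections import Counter

EXCEL_ROW_LIMIT = 65_536  # the Excel row limit cited in the post


def summarize(path, query_column="Query", top_n=10):
    """Stream a tab-separated file once; return (row_count, top queries).

    Assumption: the file has a header row and a column called
    `query_column`. Only the per-query counts are kept in memory,
    never the rows themselves.
    """
    counts = Counter()
    rows = 0
    with open(path, newline="", encoding="utf-8", errors="replace") as f:
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            rows += 1
            counts[row[query_column]] += 1
    return rows, counts.most_common(top_n)


def excel_fraction(total_rows):
    """Fraction of `total_rows` that fits under the Excel row limit."""
    return min(1.0, EXCEL_ROW_LIMIT / total_rows)
```

Calling `excel_fraction(20_000_000)` gives a little over 0.003, which is where the "about 0.3%" figure comes from. The counter still grows with the number of distinct queries, so this is only a first look, not a substitute for a real database.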

Comments

  1. The data would be extremely useful if properly loaded into MySQL or another relational database, and then worked with from there.
    Might be a good, extreme case for really testing Google Spreadsheets (I can’t say that I know what the limitations of that product are, if any) 😀

  2. Scratch that, from the FAQ:
    “You may create up to 100 spreadsheets, each of which may contain up to 20 tabs, 50,000 cells, 256 columns or 10,000 rows – whichever comes first (meaning, any one of these limits may prevent you from continuing to add data to a spreadsheet). We allow you to import .xls and .csv files which are approximately 400k in size originally.”

  3. Yeah, I think I’ll just pour it all into MySQL and be done with it. The lazy part of me liked doing ad hoc exploration in a more interactive tool, but that’s apparently not in the cards.

  4. Just wait for SAP’s plans to introduce in-memory databases. Only be prepared to also pay for an army of Accenture consultants…
    What I am more interested in is: after all your mining, did you find a jewel or two?

  5. I’ve had good luck with SQLite for things like this. It’s pretty much a full database, but has almost zero setup, no separate engine to run, etc:
    http://www.parand.com/say/index.php/2006/05/30/simple-data-analysis-with-sqlite/
    http://sqlite.org/

  6. I’ve created an Overture-like tool that shows the top 1000 results for any given keyword/phrase. The tool then allows you to view the websites clicked and their average page placement. We DON’T show the user data because even we are a little miffed about the privacy concerns.
    This should hopefully help with people’s research.
    http://dontdelete.com (the domain is still replicating)
    OR the IP
    63.212.167.185
    Let me know what you think..

  7. Nifty tool man…is that ALL the results?

  8. yes, this is all of the results…some of the AOL search sites do not have all of the results but this one does

  9. That’s why they invented grep.

  10. A Wiki has been set up for collaborative analysis and sharing of results from analyzing these data – please contribute.
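
For anyone who wants to try the SQLite route suggested in comment 5, the following is a minimal sketch using Python's built-in `sqlite3` module. The table layout and the five column names are assumptions about the file format (tab-separated, with a header row), and `load_searches` and `top_queries` are hypothetical helper names, not part of any existing tool.

```python
import csv
import sqlite3

N_COLS = 5  # assumed columns: anon_id, query, query_time, item_rank, click_url


def _rows(reader):
    """Yield rows padded or trimmed to N_COLS, with empty fields as NULL."""
    for row in reader:
        if not row:
            continue  # skip blank lines
        padded = row[:N_COLS] + [None] * (N_COLS - len(row))
        yield [value if value != "" else None for value in padded]


def load_searches(tsv_path, db_path=":memory:"):
    """Load a tab-separated search log into SQLite and return the connection."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS searches ("
        "anon_id TEXT, query TEXT, query_time TEXT, "
        "item_rank INTEGER, click_url TEXT)"
    )
    with open(tsv_path, newline="", encoding="utf-8", errors="replace") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        next(reader, None)  # skip the assumed header row
        conn.executemany("INSERT INTO searches VALUES (?, ?, ?, ?, ?)", _rows(reader))
    # An index on the query text keeps ad hoc lookups reasonably fast.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_searches_query ON searches(query)")
    conn.commit()
    return conn


def top_queries(conn, limit=10):
    """Return the most frequent queries as (query, count) pairs."""
    sql = (
        "SELECT query, COUNT(*) AS n FROM searches "
        "GROUP BY query ORDER BY n DESC, query LIMIT ?"
    )
    return conn.execute(sql, (limit,)).fetchall()
```

Passing a file path as `db_path` instead of the default `:memory:` keeps the database on disk, so the load only has to happen once and later sessions can go straight to queries.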