Playing with the AOL Search Data

As I continue to play with the AOL search data, one conclusion has become obvious: the tools for wandering through large datasets are dreadful. There are more than 20 million searches in the data, which means, for example, that Excel (with its 65,536-row limit) can hold only about 0.3% of the data. I have also been using SPSS, which is better. It can hold something like 2 billion cases, which is nice, but even with a few hundred thousand cases loaded it gets very, very slow.


  1. apokalyptik says:

    The data would be extremely useful if properly loaded into MySQL or another relational database, and then worked with from there.
    Might be a good, extreme case for really, truly testing Google Spreadsheets (I can’t say that I know what, if any, the limitations of that product are) :D

  2. apokalyptik says:

    scratch that, from the FAQ:
    “You may create up to 100 spreadsheets, each of which may contain up to 20 tabs, 50,000 cells, 256 columns or 10,000 rows – whichever comes first (meaning, any one of these limits may prevent you from continuing to add data to a spreadsheet). We allow you to import .xls and .csv files which are approximately 400k in size originally.”

  3. yeah, I think i’ll just pour it all into mysql and be done with it. the lazy part of me liked doing ad hoc exploration in a more interactive tool, but that’s apparently not in the cards.

  4. just wait for SAP’s plans to introduce in-memory databases. Only be prepared to also pay for an army of Accenture consultants…
    what I am more interested in is after all your mining, did you find a jewel or two?

  5. Parand says:

    I’ve had good luck with SQLite for things like this. It’s pretty much a full database, but has almost zero setup, no separate engine to run, etc:

  6. Rob says:

    I’ve created an Overture-like tool that shows top 1000 results for any given keyword/phrase. The tool then allows you to view the websites clicked and their average page placement. We DON’T show the user data cause even we are a little miffed about the privacy concerns..
    This should help with people’s research hopefully. (the domain is still replicating)
    OR the IP
    Let me know what you think..

  7. Jackson says:

    Nifty tool man…is that ALL the results?

  8. mike says:

    yes, this is all of the results…some of the AOL search sites do not have all of the results but this one does

  9. CypherXero says:

    That’s why they invented grep.

  10. A Wiki has been set up for collaborative analysis and sharing of results from analyzing these data – please contribute.
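
Several of the comments above suggest pouring the data into a database (MySQL, SQLite) rather than a spreadsheet. As a rough sketch of the SQLite route, the Python below streams tab-separated rows into a table and counts the most frequent queries. It is not anyone's actual setup from this thread. The column layout (AnonID, Query, QueryTime, ItemRank, ClickURL, with a header row) is an assumption about the file format, and the table name `searches` and the helper names `load_searches` and `top_queries` are made up for illustration.

```python
import csv
import sqlite3


def load_searches(conn, lines):
    """Stream tab-separated search rows into a SQLite table.

    `lines` can be an open file or any iterable of strings, so the
    whole dataset never has to sit in memory at once.
    Assumed columns: AnonID, Query, QueryTime, ItemRank, ClickURL.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS searches ("
        "anon_id INTEGER, query TEXT, query_time TEXT, "
        "item_rank INTEGER, click_url TEXT)"
    )
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the header row (assumed to be present)
    # Pad short rows (searches with no click) out to five columns.
    rows = (row[:5] + [None] * (5 - len(row)) for row in reader if row)
    with conn:  # one transaction for the whole load, committed on exit
        conn.executemany("INSERT INTO searches VALUES (?, ?, ?, ?, ?)", rows)
    # Index the query text so repeated ad hoc lookups stay fast.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_query ON searches (query)")


def top_queries(conn, limit=10):
    """Return (query, count) pairs for the most frequent queries."""
    cur = conn.execute(
        "SELECT query, COUNT(*) AS n FROM searches "
        "GROUP BY query ORDER BY n DESC, query ASC LIMIT ?",
        (limit,),
    )
    return cur.fetchall()
```

To try it, open a connection with `sqlite3.connect("aol.db")` (a hypothetical file name), pass an open data file to `load_searches`, and call `top_queries(conn, 20)`. Because the rows are streamed inside a single transaction, the load is not bound by a spreadsheet-style row limit, though how fast it runs on all 20 million rows will depend on the machine.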