Interesting-looking paper on data mining mistakes, “Stupid Data Miner Tricks: Overfitting the S&P 500,” in the current issue of the Journal of Investing. I think I read the 1997 First Quadrant monograph by David Leinweber, but I’ll see if I can track down a copy of the new paper. It was in that earlier paper, if I’m remembering correctly, that Leinweber “discovered,” while mining some U.N. data, that the best predictor of the S&P 500 was butter production in Bangladesh.
This article originated over ten years ago as a set of joke slides showing silly spurious correlations. … Without taking a hatchet to the original, the advice offered remains valuable, perhaps even more so now that there is so much more data to mine. Monthly data arrives as a single data point, once a month. It’s hard to avoid data mining sins if you look twice. Ticks, quotes, and executions arrive in millions per minute, and many of the practices which fail the statistical sniff tests for low frequency data can now be used responsibly. Nevertheless, fooling yourself remains an occupational hazard in quantitative trading.
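To make the point about sample size concrete, here is a rough sketch of my own (not Leinweber's method, and using purely synthetic noise rather than any market or U.N. data). It searches a pile of random candidate series for the one that best "predicts" a random target, once with 12 observations and once with 1,200. The function names `pearson` and `best_spurious_fit`, the seed, and the sample sizes are all assumptions chosen for illustration.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def best_spurious_fit(n_obs, n_candidates, seed=0):
    """Return the largest |correlation| found between a random target
    and n_candidates unrelated random series, each of length n_obs.

    Every series is independent Gaussian noise, so any correlation
    found is spurious by construction.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    best = 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_obs)]
        best = max(best, abs(pearson(target, candidate)))
    return best


if __name__ == "__main__":
    # One year of monthly points versus a much longer sample.
    few = best_spurious_fit(n_obs=12, n_candidates=1000)
    many = best_spurious_fit(n_obs=1200, n_candidates=1000)
    print(f"best |r| with 12 observations:   {few:.2f}")
    print(f"best |r| with 1200 observations: {many:.2f}")
```

With only a dozen observations, the best of a thousand unrelated series should look like a strong predictor, while the same search over a much longer sample should turn up only a weak one. This is a toy: real series are autocorrelated and trending, which tends to make chance fits easier to find, so more data reduces the hazard without removing it.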