Stupid Data Miner Tricks

David Leinweber
This article originated over ten years ago as a set of joke slides showing silly spurious correlations. These statistically appealing relationships between the stock market and dairy products and third world livestock populations have been cited often, in Business Week, the Wall Street Journal, the book “A Mathematician Looks at the Stock Market,” and elsewhere. Students from Bill Sharpe's classes at Stanford seem to be familiar with them. The slides were expanded to include some actual content…
Predicting Financial Markets with Google Trends and Not so Random Keywords
We check the claims that data from Google Trends contain enough data to predict future financial index returns. We first discuss the many subtle (and less subtle) biases that may affect the backtest…
Big Data Mining: An Overview
Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information. Although big data doesn't…
Event-Driven Trading and the “New News”
Two information revolutions are underway in trading and investing. Most headlines focus on structured quantitative market information at ever higher frequencies, but the other technology revolution…
The era of Big Data has begun. Computer scientists, physicists, economists, mathematicians, political scientists, bio-informaticists, sociologists, and other scholars are clamoring for access to the…
The Golden Dilemma
While gold objects have existed for thousands of years, gold's role in diversified portfolios is not well understood. We critically examine popular stories such as 'gold is an inflation hedge'. We…
Do Google Trend Data Contain More Predictability than Price Returns?
Using non-linear machine learning methods and a proper backtest procedure, we critically examine the claim that Google Trends can predict future price returns. We first review the many potential…
Gold, the Golden Constant, and Déjà Vu
Currently, the real, or inflation-adjusted, price of gold is almost as high as it was in January 1980 and August 2011. Since 1975, periods of high real gold prices have occurred during periods of…
Data, Data, Everywhere
The amount of data available combined with the number of variables that need to be considered is of a scale far beyond what is amenable to manual inspection, and automated and semi-automated data analysis is thus essential to sieve through the data for meaningful conclusions.
What Big Data May Mean for Surveys
Two converging trends raise questions about the future of large-scale probability surveys conducted by or for National Statistical Institutes (NSIs). First, increasing costs and rising rates of…
Automated algorithmic trading: machine learning and agent-based modelling in complex adaptive financial markets
Two systems are proposed: an autonomous system that uses novel machine learning techniques to predict the price return over well-documented seasonal events and uses these predictions to develop a profitable trading strategy, and an adaptation of that system for predicting the price impact of order book events.


Data Mining: Statistics and More?
Data mining is a new discipline lying at the interface of statistics, database technology, pattern recognition, machine learning, and other areas. It is concerned with the secondary analysis…
Data-Snooping Biases in Tests of Financial Asset Pricing Models
We investigate the extent to which tests of financial asset pricing models may be biased by using properties of the data to construct the test statistics. Specifically, we focus on tests using…
Selection Models and the File Drawer Problem
Meta-analysis consists of quantitative methods for combining evidence from different studies about a particular issue. A frequent criticism of meta-analysis is that it may be based on a biased sample…
Maximizing Predictability in the Stock and Bond Markets
We construct portfolios of stocks and of bonds that are maximally predictable with respect to a set of ex ante observable economic variables, and show that these levels of predictability are…
A Note on Screening Regression Equations
Consider developing a regression model in a context where substantive theory is weak. To focus on an extreme case, suppose that in fact there is no relationship between the dependent…
The Theory and Practice of Econometrics
The Classical Inference Approach for the General Linear Model, Statistical Decision Theory and Biased Estimation, and the Bayesian Approach to Inference are reviewed.
Behind the Smoke and Mirrors: Gauging the Integrity of Investment Simulations
Fund sponsors and others who must evaluate simulated investment results should carefully question the simulation process. In particular, they should ask about the data base used, the portfolio…
Specification Searches
1978