Stupid Data Miner Tricks

David Leinweber
This article originated over ten years ago as a set of joke slides showing silly spurious correlations. These statistically appealing relationships between the stock market and dairy products and third-world livestock populations have been cited often, in Business Week, the Wall Street Journal, the book “A Mathematician Looks at the Stock Market,” and elsewhere. Students from Bill Sharpe's classes at Stanford seem to be familiar with them. The slides were expanded to include some actual content… 

Big Data Mining: An Overview

A HACE theorem is presented that characterizes the features of the Big Data revolution and enables companies to "drill down" into summary information to view detail transactional data.

The Golden Dilemma

Although gold has been around for thousands of years, its role in diversified portfolios is not well understood. The authors critically examined such popular stories as “gold is an inflation hedge.”

Event-Driven Trading and the “New News”

In this article, Leinweber and Sisk include event studies and show U.S. portfolio simulation results for “pure news” signals applied over the period 2006–2009 as well as a true out-of-sample period in 2010, which indicates alpha in excess of 10% a year.


The era of Big Data has begun. Computer scientists, physicists, economists, mathematicians, political scientists, bio-informaticists, sociologists, and other scholars are clamoring for access to the

Gold, the Golden Constant, and Déjà Vu

Currently, the real, or inflation-adjusted, price of gold is almost as high as it was in January 1980 and August 2011. Since 1975, periods of high real gold prices have occurred during periods of

Data, Data, Everywhere

What Big Data May Mean for Surveys

Two converging trends raise questions about the future of large-scale probability surveys conducted by or for National Statistical Institutes (NSIs). First, increasing costs and rising rates of

Automated algorithmic trading: machine learning and agent-based modelling in complex adaptive financial markets

Two systems are proposed: an autonomous system that uses novel machine learning techniques to predict the price return over well-documented seasonal events and uses these predictions to develop a profitable trading strategy, and an adaptation of that system for predicting the price impact of order book events.

A Perceptron Based Neural Network Data Analytics Architecture for the Detection of Fraud in Credit Card Transactions in Financial Legacy Systems

The paper examines the feasibility and practicality of implementing a proof-of-concept Perceptron-based Artificial Neural Network (ANN) architecture that can be directly plugged into a legacy paradigm financial system platform that has been trained on specific fraudulent patterns.

Big Data Techniques and Applications

In this chapter, past and current research on big data techniques and their applications is reviewed.



Data Mining: Statistics and More?

Data mining is a new discipline lying at the interface of statistics, database technology, pattern recognition, machine learning, and other areas. It is concerned with the secondary analysis

Data-Snooping Biases in Tests of Financial Asset Pricing Models

We investigate the extent to which tests of financial asset pricing models may be biased by using properties of the data to construct the test statistics. Specifically, we focus on tests using

Selection Models and the File Drawer Problem

This paper uses selection models, or weighted distributions, to deal with one source of bias, namely the failure to report studies that do not yield statistically significant results, and applies selection models to two approaches that have been suggested for correcting the bias.


We construct portfolios of stocks and bonds that are maximally predictable with respect to a set of ex-ante observable economic variables, and show that these levels of predictability are

A Note on Screening Regression Equations

Consider developing a regression model in a context where substantive theory is weak. To focus on an extreme case, suppose that in fact there is no relationship between the dependent

The Theory and Practice of Econometrics

The Classical Inference Approach for the General Linear Model, Statistical Decision Theory and Biased Estimation, and the Bayesian Approach to Inference are reviewed.

Behind the Smoke and Mirrors: Gauging the Integrity of Investment Simulations

Fund sponsors and others who must evaluate simulated investment results should carefully question the simulation process. In particular, they should ask about the data base used, the portfolio