Bump hunting in high-dimensional data

  • Jerome H. Friedman, Nicholas I. Fisher
  • Statistics and Computing
Many data analytic questions can be formulated as (noisy) optimization problems. They explicitly or implicitly involve finding simultaneous combinations of values for a set of (“input”) variables that imply unusually large (or small) values of another designated (“output”) variable. Specifically, one seeks a set of subregions of the input variable space within which the value of the output variable is considerably larger (or smaller) than its average value over the entire input domain. In… 
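The formulation above (find a subregion of the input space where the output mean is much larger than its global mean) can be illustrated with a minimal sketch of greedy top-down peeling on an axis-aligned box. This is a simplified illustration under stated assumptions, not the authors' full procedure: the function name `peel_box`, the parameters `alpha` and `min_support`, and the synthetic data are all hypothetical, and the sketch omits refinements such as re-expanding the box or searching for further boxes.

```python
import random


def peel_box(X, y, alpha=0.1, min_support=0.1):
    """Shrink an axis-aligned box step by step to raise the mean of y inside it.

    X is a list of rows (lists of floats), y a list of floats.
    alpha is the fraction of remaining points peeled per step (assumed name).
    min_support is the smallest allowed fraction of points in the box (assumed name).
    Returns (box, inside): per-dimension (low, high) bounds and kept row indices.
    """
    n, d = len(X), len(X[0])
    # Start with the box that covers all of the data.
    box = [(min(r[j] for r in X), max(r[j] for r in X)) for j in range(d)]
    inside = list(range(n))
    min_count = max(1, int(min_support * n))

    def mean(idx):
        return sum(y[i] for i in idx) / len(idx)

    while True:
        k = max(1, int(alpha * len(inside)))  # points removed by one peel
        best = None  # (mean after peel, dimension, new bounds, kept indices)
        for j in range(d):
            order = sorted(inside, key=lambda i: X[i][j])
            # Candidate peels: drop the k lowest or the k highest points on dimension j.
            for keep in (order[k:], order[:-k]):
                if len(keep) < min_count:
                    continue
                m = mean(keep)
                if best is None or m > best[0]:
                    # keep is sorted on dimension j, so its ends give the new bounds.
                    # Simplification: tied values at the cut are not handled specially.
                    best = (m, j, (X[keep[0]][j], X[keep[-1]][j]), keep)
        # Stop when no admissible peel improves the mean inside the box.
        if best is None or best[0] <= mean(inside):
            break
        _, j, bounds, inside = best
        box[j] = bounds
    return box, inside


# Synthetic example (hypothetical data): y is large only in a central square.
rng = random.Random(0)
X = [[rng.random(), rng.random()] for _ in range(500)]
y = [1.0 if 0.4 < a < 0.6 and 0.4 < b < 0.6 else 0.0 for a, b in X]
box, inside = peel_box(X, y, alpha=0.1, min_support=0.05)
print("box bounds:", box)
print("mean inside box:", sum(y[i] for i in inside) / len(inside))
print("global mean:", sum(y) / len(y))
```

Peeling only a small fraction `alpha` of the points at each step keeps the search patient: a poor early cut costs little data, so later steps can still recover. The trade-off between the box mean and the number of points it retains is governed here by `min_support`.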

Local Sparse Bump Hunting

  • J. Dazard, J. S. Rao
  • Computer Science
    Journal of Computational and Graphical Statistics: a joint publication of American Statistical Association, Institute of Mathematical Statistics, Interface Foundation of North America
  • 2010
This work introduces a novel supervised and multivariate bump hunting strategy for exploring modes or classes of a target function of many continuous variables, which outperforms a naive PRIM as well as competitive nonparametric supervised and unsupervised methods in the problem of class discovery.

Mixtures of Rectangles: Interpretable Soft Clustering

This work explores a clustering technique that requires no user-supplied parameters except for the desired number of clusters, and demonstrates the usefulness of the method in subspace clustering for synthetic data, and in real-life datasets.

Data Exploration by Representative Region Selection: Axioms and Convergence

A new type of unsupervised learning problem is presented in which a small set of representative regions is found that approximates a larger data set without relying on cluster structure in the data.

Conditional Sparse Linear Regression

This work considers the problem of jointly identifying a significant segment of a population in which there is a highly sparse linear regression fit, together with the coefficients for the linear fit, and gives algorithms for such problems under the sup norm.

Real-valued All-Dimensions Search: Low-overhead Rapid Searching over Subsets of Attributes

A new, efficient approach to searching the combinatorial space of contingency tables during the inner loop of a nonlinear statistical optimization, called RADSEARCH (Real-valued All-Dimensions-tree Search), which finds the global optimum.

SuRF: Identification of Interesting Data Regions with Surrogate Models

The proposed framework, coined SuRF (SUrrogate Region Finder), leverages historical region evaluations to train surrogate models that learn to approximate the distribution of the statistic of interest and makes use of evolutionary multi-modal optimization to effectively and efficiently identify regions of interest regardless of data size and dimensionality.

Analysis of large-scale scalar data using hixels

A new data representation for scalar data, called hixels, is introduced that stores a histogram of values for each sample point of a domain, and new feature detection algorithms are proposed that combine topological and statistical methods.

Comparing Algorithms for Scenario Discovery

This study offers three measures of merit (coverage, density, and interpretability) and uses them to evaluate the capabilities of PRIM, a bump-hunting algorithm, and CART, a classification algorithm, and finds that both algorithms can perform the required task, but often imperfectly.

Subgroup discovery in data sets with multi-dimensional responses

This work has developed a technique that uses a combination of agglomerative clustering to find subgroup candidates in the space of output attributes, and predictive modeling to score and describe these candidates in the input attribute space.

Scenario Discovery via Rule Extraction

This work proposes a new procedure for scenario discovery: an intermediate statistical model that generalizes fast is used to label (a lot of) data for PRIM, and the method is shown to be much better than PRIM itself.



Model Search and Inference By Bootstrap "bumping"

A bootstrap-based method for searching through a space of models is proposed that is well suited to complex, adaptively fitted models and provides a convenient method for finding better local minima, for resistant fitting, and for optimization under constraints.

Neural Networks for Pattern Recognition

Spline Models for Observational Data

Foreword 1. Background 2. More splines 3. Equivalence and perpendicularity, or, what's so special about splines? 4. Estimating the smoothing parameter 5. 'Confidence intervals' 6. Partial spline

Projection Pursuit Regression

A new method for nonparametric multiple regression is presented. The procedure models the regression surface as a sum of general smooth functions of linear combinations of the predictor

Approximation of Functions

Theory of Approximation of Functions of a Real Variable. By A. F. Timan. Translated by J. Berry. English translation edited and editorial preface by J. Cossar. (International Series of Monographs on

Classification and Regression Trees

This chapter discusses tree classification in the context of medicine, where right-sized trees and honest estimates are considered and Bayes rules and partitions are used as guides to optimal pruning.

Data mining and knowledge discovery: making sense out of data

Without a concerted effort to develop knowledge discovery techniques, organizations stand to forfeit much of the value from the data they currently collect and store.

Cr-Pyrope Garnets in the Lithospheric Mantle. I. Compositional Systematics and Relations to Tectonic Setting

Chrome-pyrope garnet is a minor but widespread phase in ultramafic association with Mg. The position and slope of the lherzolite trend vary with temperature and tectonic setting, suggesting that the

Pattern Recognition and Neural Networks


The Nature of Statistical Learning Theory

  • V. Vapnik
  • Computer Science
    Statistics for Engineering and Information Science
  • 2000
Setting of the learning problem; consistency of learning processes; bounds on the rate of convergence of learning processes; controlling the generalization ability of learning processes; constructing