Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR
  • Howard D. Bondell, Brian J. Reich
  • Volume 64, Issue 1
Variable selection can be challenging, particularly in situations with a large number of predictors with possibly high correlations, such as gene expression data. In this article, a new method called the OSCAR (octagonal shrinkage and clustering algorithm for regression) is proposed to simultaneously select variables while grouping them into predictive clusters. In addition to improving prediction accuracy and interpretation, these resulting groups can then be investigated further to discover… 
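The OSCAR penalty alluded to in this abstract combines an L1 norm with a pairwise L-infinity term, sum_j |b_j| + c * sum_{j<k} max(|b_j|, |b_k|), which is equivalent to a weighted L1 norm on the sorted absolute coefficients. A minimal numpy sketch of the penalty (function names are illustrative, not from the paper):

```python
import numpy as np

def oscar_penalty(beta, c):
    """OSCAR penalty: sum_j |b_j| + c * sum_{j<k} max(|b_j|, |b_k|)."""
    a = np.abs(np.asarray(beta, dtype=float))
    pairwise = sum(max(a[j], a[k])
                   for j in range(a.size) for k in range(j + 1, a.size))
    return a.sum() + c * pairwise

def oscar_penalty_sorted(beta, c):
    """Equivalent weighted-L1 form: weight 1 + c*(rank - 1) on |b| sorted ascending."""
    a = np.sort(np.abs(np.asarray(beta, dtype=float)))
    weights = 1.0 + c * np.arange(a.size)  # larger |b| gets a larger weight
    return float(weights @ a)
```

The sorted form makes visible why OSCAR clusters predictors: larger coefficient magnitudes carry heavier weights, which encourages exact ties among magnitudes.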
Regression shrinkage and grouping of highly correlated predictors with HORSES
Identifying homogeneous subgroups of variables can be challenging in high dimensional data analysis with highly correlated predictors. We propose a new method called Hexagonal Operator for Regression
Regularization and Estimation in Regression with Cluster Variables
The Clustering Lasso, a new regularization method for linear regression, is proposed in this paper. The Clustering Lasso can select variables while keeping the correlation structures among them. In
The Cluster Elastic Net for High-Dimensional Regression With Unknown Variable Grouping
This work proposes the cluster elastic net, which selectively shrinks the coefficients for such variables toward each other, rather than toward the origin, in the high-dimensional regression setting.
Consistent Group Identification and Variable Selection in Regression With Correlated Predictors
  • Dhruv B. Sharma, H. Bondell, Hao Helen Zhang
  • Computer Science, Medicine
    Journal of computational and graphical statistics : a joint publication of American Statistical Association, Institute of Mathematical Statistics, Interface Foundation of North America
  • 2013
A penalization procedure is proposed that performs variable selection while clustering groups of predictors automatically, and compares favorably with existing selection approaches in both prediction accuracy and model discovery, while retaining its computational efficiency.
A Bayesian Approach to Multicollinearity and the Simultaneous Selection and Clustering of Predictors in Linear Regression
High correlation among predictors has long been an annoyance in regression analysis. The crux of the problem is that the linear regression model assumes each predictor has an independent effect on
High-Dimensional Regression and Variable Selection Using CAR Scores
Variable selection is a difficult problem that is particularly challenging in the analysis of high-dimensional genomic data. Here, we introduce the CAR score, a novel and highly effective criterion
MCEN: a method of simultaneous variable selection and clustering for high-dimensional multinomial regression
A novel penalty function that incorporates both regression coefficients and pairwise correlation to define clusters of variables is used and provides a one-stop solution to select and group important variables associated with different classes of multinomial response at the same time.
Penalized regression combining the L1 norm and a correlation based penalty.
We consider the problem of feature selection in the linear regression model with p covariates and n observations. We propose a new method to simultaneously select variables and favor a grouping effect,
An extended variable inclusion and shrinkage algorithm for correlated variables
A new method is proposed to simultaneously select variables and encourage a grouping effect where strongly correlated predictors tend to be in or out of the model together, which is capable of selecting a sparse model while avoiding the overshrinkage of a Lasso-type estimator.
Group variable selection for data with dependent structures
Variable selection methods have been widely used in the analysis of high-dimensional data, for example, gene expression microarray data and single nucleotide polymorphism data. A special feature of


Model selection and estimation in regression with grouped variables
Summary. We consider the problem of selecting grouped variables (factors) for accurate prediction in regression. Such a problem arises naturally in many practical situations with the multifactor
Regression Shrinkage and Selection via the Lasso
SUMMARY We propose a new method for estimation in linear models. The 'lasso' minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a
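For an orthonormal design, the L1-constrained least squares problem sketched in this summary has a closed-form solution: soft-thresholding of the ordinary least squares coefficients. A minimal sketch, where the threshold `lam` plays the role of the Lagrange multiplier for the L1 constraint:

```python
import numpy as np

def soft_threshold(b, lam):
    """Lasso solution for an orthonormal design: shrink toward zero by lam,
    setting coefficients with |b| <= lam exactly to zero (sparsity)."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)
```

Small coefficients are zeroed out entirely, which is the source of the lasso's variable selection property.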
Finding predictive gene groups from microarray data
Microarray experiments generate large datasets with expression values for thousands of genes, but not more than a few dozens of samples. A challenging task with these data is to reveal groups of
Simultaneous Gene Clustering and Subset Selection for Sample Classification Via MDL
An algorithm for the simultaneous clustering of genes and subset selection of gene clusters for sample classification is presented and a new model selection criterion based on Rissanen's MDL (minimum description length) principle is developed.
Sparsity and smoothness via the fused lasso
Summary. The lasso penalizes a least squares regression by the sum of the absolute values (L1-norm) of the coefficients. The form of this penalty encourages sparse solutions (with many coefficients
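The fused lasso adds a second L1 term on successive coefficient differences, lam1 * sum_j |b_j| + lam2 * sum_j |b_{j+1} - b_j|, so solutions are both sparse and piecewise constant. A minimal sketch of the penalty (names are illustrative):

```python
import numpy as np

def fused_lasso_penalty(beta, lam1, lam2):
    """Sparsity term (L1 on coefficients) plus smoothness term
    (L1 on differences of adjacent coefficients)."""
    b = np.asarray(beta, dtype=float)
    return lam1 * np.abs(b).sum() + lam2 * np.abs(np.diff(b)).sum()
```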
Regularization and variable selection via the elastic net
Summary. We propose the elastic net, a new regularization and variable selection method. Real world data and a simulation study show that the elastic net often outperforms the lasso, while enjoying a
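The elastic net penalty mixes the lasso's L1 term with a ridge-style squared L2 term, lam1 * ||b||_1 + lam2 * ||b||_2^2; the L2 part is what produces the grouping effect among correlated predictors. A minimal sketch:

```python
import numpy as np

def elastic_net_penalty(beta, lam1, lam2):
    """L1 term gives sparsity; squared L2 term shrinks correlated
    predictors' coefficients toward each other."""
    b = np.asarray(beta, dtype=float)
    return lam1 * np.abs(b).sum() + lam2 * (b ** 2).sum()
```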
Supervised harvesting of expression trees
It is found that the procedure may require a large number of experimental samples to successfully discover interactions, and is a potentially useful tool for exploration of gene expression data and identification of interesting clusters of genes worthy of further investigation.
Averaged gene expressions for regression.
By averaging the genes within the clusters obtained from hierarchical clustering, supergenes are defined and used to fit regression models, thereby attaining concise interpretation and accuracy in regression of DNA microarray data.
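The "supergene" construction described here is simply column-averaging within each cluster of genes. A minimal numpy sketch, assuming `clusters` is a list of column-index lists (an assumed input format, not from the paper):

```python
import numpy as np

def supergenes(X, clusters):
    """Average expression columns within each cluster to form
    one 'supergene' column per cluster (samples x clusters)."""
    return np.column_stack([X[:, idx].mean(axis=1) for idx in clusters])
```

The resulting low-dimensional matrix can then be used as the design matrix in an ordinary regression.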
Ridge regression: biased estimation for nonorthogonal problems
In multiple regression it is shown that parameter estimates based on minimum residual sum of squares have a high probability of being unsatisfactory, if not incorrect, if the prediction vectors are
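Ridge regression replaces the least squares estimate with (X'X + lam*I)^{-1} X'y; the added diagonal keeps the system well conditioned even when predictors are nearly collinear. A minimal sketch:

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimate: solve (X'X + lam*I) b = X'y.
    lam > 0 stabilizes the solve for nonorthogonal (collinear) designs."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```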
Piecewise linear regularized solution paths
We consider the generic regularized optimization problem β̂(λ) = argmin_β L(y, Xβ) + λJ(β). Efron, Hastie, Johnstone and Tibshirani [Ann. Statist. 32 (2004) 407-499] have shown that for the LASSO-that