Unsupervised Learning With Random Forest Predictors

@article{Shi2006UnsupervisedLW,
  title={Unsupervised Learning With Random Forest Predictors},
  author={Tao Shi and Steve Horvath},
  journal={Journal of Computational and Graphical Statistics},
  year={2006},
  volume={15},
  pages={118--138}
}
  • T. Shi, S. Horvath
  • Published 1 March 2006
  • Computer Science
  • Journal of Computational and Graphical Statistics
A random forest (RF) predictor is an ensemble of individual tree predictors. As part of their construction, RF predictors naturally lead to a dissimilarity measure between the observations. One can also define an RF dissimilarity measure between unlabeled data: the idea is to construct an RF predictor that distinguishes the “observed” data from suitably generated synthetic data. The observed data are the original unlabeled data and the synthetic data are drawn from a reference distribution… 
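The construction sketched above can be illustrated in a few lines. The following is a minimal sketch, assuming scikit-learn: the observed data are labeled class 1, synthetic data are generated by independently permuting each feature (one common choice of reference distribution that preserves marginals but destroys dependencies), an RF is trained to separate the two, and the dissimilarity between observed points is derived from how often they share a terminal node. The function name and parameters here are illustrative, not from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_dissimilarity(X, n_trees=500, random_state=0):
    rng = np.random.default_rng(random_state)
    # Synthetic data: permute each column independently.
    X_synth = np.column_stack([rng.permutation(col) for col in X.T])
    X_all = np.vstack([X, X_synth])
    y_all = np.concatenate([np.ones(len(X)), np.zeros(len(X_synth))])

    # RF predictor distinguishing observed from synthetic data.
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=random_state)
    rf.fit(X_all, y_all)

    # Proximity: fraction of trees in which two observed points land
    # in the same terminal node; dissimilarity = sqrt(1 - proximity).
    leaves = rf.apply(X)  # shape (n_samples, n_trees), leaf indices
    prox = np.zeros((len(X), len(X)))
    for t in range(leaves.shape[1]):
        prox += leaves[:, t][:, None] == leaves[:, t][None, :]
    prox /= leaves.shape[1]
    return np.sqrt(1.0 - prox)

X = np.random.default_rng(1).normal(size=(60, 4))
D = rf_dissimilarity(X, n_trees=50)
print(D.shape)  # (60, 60)
```

The resulting dissimilarity matrix can then be fed into any standard clustering or scaling method (e.g. PAM or multidimensional scaling), which is how the related work below builds on this idea.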
Unsupervised feature selection with ensemble learning
TLDR
Empirical results are provided indicating that RCE, boosted with a recursive feature elimination scheme (RFE) can lead to significant improvement in terms of clustering accuracy, over several state-of-the-art supervised and unsupervised algorithms, with a very limited subset of features.
Similarity Kernel and Clustering via Random Projection Forests
TLDR
The theoretical analysis reveals a highly desirable property of the rpf-kernel: far-away (dissimilar) points have a low similarity value while nearby (similar) points have a high similarity value, and the similarities have a native interpretation as the probability of points remaining in the same leaf nodes during the growth of rpForests.
Cluster Forests
Cluster ensemble based on Random Forests for genetic data
TLDR
It is illustrated that applying a cluster ensemble approach, combining multiple RF clusterings, produces more robust and higher-quality results as a consequence of feeding the ensemble with diverse views of high-dimensional genetic data obtained through bagging and random subspace, the two key features of the RF algorithm.
Visualizing Random Forest with Self-Organising Map
TLDR
A novel method based on Self-Organising Maps (SOM) is presented for revealing intrinsic relationships in data that lie inside the RF used for classification tasks, and it improves the accuracy of the SOM.
Geometry- and Accuracy-Preserving Random Forest Proximities
TLDR
This paper proves that the proximity-weighted sum (regression) or majority vote (classification) using RF-GAP exactly match the out-of-bag random forest prediction, thus capturing the data geometry learned by the random forest.
Cross-Cluster Weighted Forests
TLDR
It is found that constructing ensembles of forests trained on clusters determined by algorithms such as k-means results in significant improvements in accuracy and generalizability over the traditional Random Forest algorithm.
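The cluster-then-ensemble idea described in that finding can be sketched simply: partition the training data with k-means, fit one forest per cluster, and combine the forests' predictions at test time. This is a simplified illustration using an unweighted average, not the paper's exact cross-cluster weighting scheme; all names and parameters are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor

def fit_cluster_forests(X, y, k=3, seed=0):
    # Partition the training set with k-means.
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    forests = []
    for c in range(k):
        mask = km.labels_ == c
        # One forest per cluster, trained only on that cluster's data.
        rf = RandomForestRegressor(n_estimators=100, random_state=seed)
        rf.fit(X[mask], y[mask])
        forests.append(rf)
    return forests

def predict_ensemble(forests, X_new):
    # Unweighted average across cluster-specific forests; the paper
    # studies weighted ("cross-cluster") combinations instead.
    preds = np.stack([rf.predict(X_new) for rf in forests])
    return preds.mean(axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=150)
forests = fit_cluster_forests(X, y)
out = predict_ensemble(forests, X[:5])
print(out.shape)  # (5,)
```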
Rule Extraction from Random Forest: the RF+HC Methods
TLDR
Experimental results show that the proposed RF+HC methods for rule extraction from RF outperform one of the state-of-the-art methods in terms of scalability and comprehensibility while preserving the same level of accuracy.
Probabilistic Random Forest: A machine learning algorithm for noisy datasets
TLDR
Apart from improving the prediction accuracy in noisy data sets, the PRF naturally copes with missing values in the data, and outperforms RF when applied to a data set with different noise characteristics in the training and test sets, suggesting that it can be used for transfer learning.
Partition Maps
TLDR
It is shown that Homogeneity Analysis, a technique mostly used in psychometrics, can be leveraged to provide interesting and meaningful visualizations of tree ensemble predictions, and its advantages and shortcomings are analyzed in comparison with multidimensional scaling of proximity matrices.

References

Showing 1-10 of 31 references
Random Forests
TLDR
Internal estimates monitor error, strength, and correlation; these are used to show the response to increasing the number of features used in the forest, and the method is also applicable to regression.
Classification and Regression by randomForest
TLDR
Random forests are proposed, which add an additional layer of randomness to bagging and are robust against overfitting; the randomForest package provides an R interface to the Fortran programs by Breiman and Cutler.
Tumor classification by tissue microarray profiling: random forest clustering applied to renal cell carcinoma
TLDR
This is the first tumor class discovery analysis of renal cell carcinoma patients based on protein expression profiles and the resulting molecular grouping provides better prediction of survival than this classical pathological grouping.
The Elements of Statistical Learning: Data Mining, Inference, and Prediction
TLDR
This book is a valuable resource, both for the statistician needing an introduction to machine learning and related fields and for the computer scientist wishing to learn more about statistics, and statisticians will especially appreciate that it is written in their own language.
Objective Criteria for the Evaluation of Clustering Methods
TLDR
This article proposes several criteria which isolate specific aspects of the performance of a method, such as its retrieval of inherent structure, its sensitivity to resampling and the stability of its results in the light of new data.
Nonparametric Estimation from Incomplete Observations
Abstract: In lifetesting, medical follow-up, and other fields the observation of the time of occurrence of the event of interest (called a death) may be prevented for some of the items of the sample…
Gene Expression Profiling of Gliomas Strongly Predicts Survival
TLDR
It is found that gene expression-based grouping of tumors is a more powerful survival predictor than histologic grade or age and a list of 44 genes whose expression patterns reliably classify gliomas into previously unrecognized biological and prognostic groups are described.
Systematic variation in gene expression patterns in human cancer cell lines
TLDR
Using cDNA microarrays to explore the variation in expression of approximately 8,000 unique genes among the 60 cell lines used in the National Cancer Institute's screen for anti-cancer drugs provided a novel molecular characterization of this important group of human cell lines and their relationships to tumours in vivo.
Finding Groups in Data: An Introduction to Cluster Analysis
TLDR
This book provides an accessible, modern treatment of cluster analysis, presenting methods that efficiently find accurate clusters in data.
Global histone modification patterns predict risk of prostate cancer recurrence
TLDR
Widespread changes in specific histone modifications indicate previously undescribed molecular heterogeneity in prostate cancer and might underlie the broad range of clinical behaviour in cancer patients.