Carolin Strobl

Learn More
Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are(More)
Random forests are becoming increasingly popular in many scientific fields because they can cope with "small n large p" problems, complex interactions and even highly correlated predictor variables. Their variable importance measures have recently been suggested as screening tools for, e.g., gene expression studies. However, these variable importance(More)
Recursive partitioning methods have become popular and widely used tools for nonparametric regression and classification in many scientific fields. Especially random forests, which can deal with large numbers of predictor variables even in the presence of complex interactions, have been applied successfully in genetics, clinical medicine, and bioinformatics(More)
Random forests have become a widely-used predictive model in many scientific disciplines within the past few years. Additionally, they are increasingly popular for assessing variable importance, e.g., in genetics and bioinformatics. We highlight both advantages and limitations of different variable importance scores and associated testing procedures,(More)
Most genetic diseases are complex, i.e. associated to combinations of SNPs rather than individual SNPs. In the last few years, this topic has often been addressed in terms of SNP-SNP interaction patterns given as expressions linked by logical operators. Methods for multiple testing in high-dimensional settings can be applied when many SNPs are considered(More)
A variety of statistical methods have been suggested for detecting differential item functioning (DIF) in the Rasch model. Most of these methods are designed for the comparison of pre-specified focal and reference groups, such as males and females. Latent class approaches, on the other hand, allow the detection of previously unknown groups exhibiting DIF.(More)
Classification trees are a popular statistical tool with multiple applications. Recent advancements of traditional classification trees, such as the approach of classification trees based on imprecise probabilities by Abellán and Moral (2004), effectively address their tendency to overfitting. However, another flaw inherent in traditional classification(More)
BACKGROUND Polymorphisms in the serine protease inhibitor gene serine peptidase inhibitor Kazal type 5 (SPINK5) and the serine protease kallikrein-related peptidase 7 gene (KLK7) appear to confer risk to eczema in some cohorts, but these findings have not been widely replicated. These genes encode proteins thought to be involved in the regulation of(More)
The random forest (RF) method is a commonly used tool for classification with high dimensional data as well as for ranking candidate predictors based on the so-called random forest variable importance measures (VIMs). However the classification performance of RF is known to be suboptimal in case of strongly unbalanced data, i.e. data where response class(More)
BACKGROUND In biometric practice, researchers often apply a large number of different methods in a "trial-and-error" strategy to get as much as possible out of their data and, due to publication pressure or pressure from the consulting customer, present only the most favorable results. This strategy may induce a substantial optimistic bias in prediction(More)