Feature Selection in Proteomic Pattern Data with Support Vector Machines

  • Published 2017

Abstract

This paper introduces novel methods for feature selection (FS) based on support vector machines (SVM). The methods combine feature subsets produced by a variant of SVM-RFE, a popular feature ranking/selection algorithm based on SVM. Two combination strategies are proposed: the union of frequently occurring features, and an ensemble of classifiers built on single feature subsets. The resulting methods are applied to proteomic pattern data for tumor diagnostics. Results of experiments on three proteomic pattern datasets indicate that combining feature subsets positively affects the prediction accuracy of both SVM and SVM-RFE. A discussion of the biological interpretation of the selected features is provided.

I. Introduction

FS can be formalized as a combinatorial optimization problem: finding the feature set that maximizes the quality of the hypothesis learned from these features. FS is viewed as a major bottleneck of supervised learning and data mining [1], [2]. For the sake of learning performance, it is highly desirable to discard irrelevant features prior to learning, especially when the number of available features significantly exceeds the number of examples, as is the case in bioinformatics.

In particular, biological experiments based on laboratory technologies such as microarray and proteomic techniques generate data with a very high number of attributes, in general much larger than the number of examples. FS is therefore a fundamental step in the analysis of this type of data [3]. By selecting only a subset of attributes, the prediction accuracy can possibly improve and more insight into the nature of the prediction problem can be gained.

A number of effective FS methods for classification rank features and discard those whose rank is smaller than a given threshold [1], [4]. This threshold can be either provided by the user, as in [5], or determined automatically, as in [6], by means of the estimated rank of a new random feature.
A popular algorithm based on the above approach is SVM-RFE [5]. It is an iterative algorithm in which each iteration consists of two steps. First, feature weights, obtained by training a linear SVM on the training set, are used in a scoring function for ranking features. Next, the feature with minimum rank is removed from the data. In this way, a chain of feature subsets of decreasing size is obtained. SVM classifiers are trained on training sets restricted to the feature subsets, and the classifier with the best predictive performance is selected.

In the original SVM-RFE algorithm one feature is discarded at each iteration. Other choices are suggested in [5], where at each iteration features with rank lower than a user-given threshold are removed. The choice of the threshold affects the results of SVM-RFE. Heuristics for choosing a threshold value have been proposed [5], [6].

In this paper the problem of choosing a threshold is sidestepped by considering multiple runs of SVM-RFE with different thresholds. Each run produces one feature subset. The resulting feature subsets are combined in order to obtain a robust classification result. Two methods for building a classifier from a combination of feature subsets are proposed, called JOIN and ENSEMBLE. JOIN generates a classifier by training SVM on data restricted to those features that occur more than a given number of times in the list of feature subsets. ENSEMBLE generates a majority vote ensemble of classifiers, where each classifier is obtained by training SVM on data restricted to one feature subset. This combination strategy is used, e.g., in [7], where decision trees trained on data restricted to randomly selected feature subsets are combined into an ensemble. JOIN and ENSEMBLE are compared experimentally with SVM trained on all features, and with a multistart version of SVM-RFE.
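The elimination loop and the two combination strategies described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the paper's implementation: `weight_fn` is a hypothetical callable standing in for "train a linear SVM on the current subset and return one weight per feature", the squared weight is assumed as the scoring function, classifiers are modeled as plain callables, and the feature names (`mz_100`, ...) are invented for the example. The loop follows the original one-feature-per-iteration scheme rather than the threshold variant.

```python
from collections import Counter


def rfe_chain(features, weight_fn):
    """Return the chain of feature subsets of decreasing size.

    weight_fn(subset) must return a mapping feature -> weight. It is a
    hypothetical stand-in for training a linear SVM restricted to `subset`.
    """
    subset = list(features)
    chain = [tuple(subset)]
    while len(subset) > 1:
        weights = weight_fn(subset)
        # Assumed scoring function: squared weight; lowest score is removed.
        worst = min(subset, key=lambda f: weights[f] ** 2)
        subset.remove(worst)
        chain.append(tuple(subset))
    return chain


def join_features(subsets, min_count):
    """JOIN: keep features occurring more than `min_count` times."""
    counts = Counter(f for s in subsets for f in set(s))
    return sorted(f for f, c in counts.items() if c > min_count)


def ensemble_predict(classifiers, x):
    """ENSEMBLE: majority vote over classifiers (one per feature subset)."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]


# Toy usage with fixed, made-up weights instead of a trained SVM.
toy_weights = {"mz_100": 0.9, "mz_200": -0.1, "mz_300": 0.4}
chain = rfe_chain(list(toy_weights), lambda subset: toy_weights)
print(chain)
print(join_features(chain, min_count=1))
print(ensemble_predict([lambda x: 1, lambda x: 0, lambda x: 1], x=None))
```

In a real run the subsets passed to `join_features` would come from separate SVM-RFE runs with different thresholds rather than from a single chain. With an even number of classifiers a vote can tie; `Counter.most_common` then returns the label that was counted first, so an odd ensemble size avoids the ambiguity for two-class problems.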
Multistart SVM-RFE performs multiple runs of SVM-RFE with different thresholds and selects, among the resulting feature subsets, the one minimizing the error (on a hold-out set) of SVM trained on data restricted to that feature subset.

The four methods are applied to proteomic pattern data from cancer and healthy patients. This type of data is used for cancer detection and potential biomarker identification. The motivations for choosing FS methods based on linear SVM are their robustness with respect to high-dimensional input data and the experimental observation that such data appear to be almost linearly separable (see, e.g., [8], [9]).

Experiments are conducted on three proteomic pattern datasets from prostate and ovarian cancer. On two of the three datasets JOIN and ENSEMBLE achieve significantly better predictive accuracy than SVM and multistart SVM-RFE. On the third dataset JOIN obtains perfect classification and the other methods almost perfect classification. The results indicate that FS methods combining feature subsets from multiple runs provide a robust and effective approach for feature selection in proteomic pattern data.

The paper is organized as follows. Section II gives an overview of the considered FS methodology. Section III describes the data used in the experiments. Section IV reports on results of the experiments. The paper ends with a discussion and pointers to future research.
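The multistart selection rule described at the start of this passage can be illustrated with a short sketch. It assumes two hypothetical callables that are not part of the paper: `run_rfe(threshold)`, which returns the feature subset produced by one SVM-RFE run, and `holdout_error(subset)`, which returns the hold-out error of an SVM trained on that subset. The thresholds, subsets and error values below are made up for illustration.

```python
def multistart_select(thresholds, run_rfe, holdout_error):
    """Run one selection per threshold and keep the subset with the
    lowest hold-out error. Returns (best_subset, best_error)."""
    candidates = [run_rfe(t) for t in thresholds]
    errors = [holdout_error(s) for s in candidates]
    # Index of the smallest error; ties resolve to the earliest threshold.
    best = min(range(len(candidates)), key=errors.__getitem__)
    return candidates[best], errors[best]


# Toy usage: lookup tables stand in for real SVM-RFE runs and evaluations.
toy_runs = {
    0.1: ("mz_100", "mz_200", "mz_300"),
    0.3: ("mz_100", "mz_300"),
    0.5: ("mz_100",),
}
toy_errors = {
    toy_runs[0.1]: 0.12,
    toy_runs[0.3]: 0.05,
    toy_runs[0.5]: 0.20,
}
best_subset, best_error = multistart_select(
    [0.1, 0.3, 0.5], toy_runs.get, toy_errors.get
)
print(best_subset, best_error)
```

Unlike JOIN and ENSEMBLE, this rule keeps a single subset and discards the rest, which is why it serves as a baseline for the combination methods.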

14 Figures and Tables

Cite this paper

@inproceedings{2017FeatureSI, title={Feature Selection in Proteomic Pattern Data with Support Vector Machines}, author={}, year={2017} }