Qingyao Wu

For high-dimensional data, a large portion of the features are often not informative of the class of the objects. Random forest algorithms tend to use simple random sampling of features when building their decision trees, and consequently select many subspaces that contain few, if any, informative features. In this paper we propose a stratified sampling method to …
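The point about uninformative subspaces can be made concrete with a small sketch. The first function computes the hypergeometric probability that a simple random subspace contains no informative feature at all. The second is a hypothetical stratified draw, not the method from the paper (whose description is truncated above). It assumes the features have already been split into a `strong` and a `weak` group by some external criterion, and all names are illustrative.

```python
import random
from math import comb


def prob_no_informative(p, k, m):
    """Probability that a simple random subspace of m features, drawn
    without replacement from p features of which k are informative,
    contains no informative feature (hypergeometric)."""
    return comb(p - k, m) / comb(p, m)


def stratified_subspace(strong, weak, m, rng):
    """Hypothetical stratified draw (an assumption, not the paper's method):
    sample from both groups roughly in proportion to their size, but always
    take at least one feature from the non-empty `strong` group."""
    total = len(strong) + len(weak)
    n_strong = max(1, round(m * len(strong) / total))
    n_strong = min(n_strong, len(strong), m)
    n_weak = min(m - n_strong, len(weak))
    return rng.sample(strong, n_strong) + rng.sample(weak, n_weak)


if __name__ == "__main__":
    # 1000 features, 10 informative, subspace size 31 (about sqrt(1000)).
    print(prob_no_informative(1000, 10, 31))
    rng = random.Random(0)
    print(stratified_subspace(list(range(10)), list(range(10, 1000)), 31, rng))
```

Under simple random sampling, the first value shows how often a tree node would see only uninformative features. The stratified draw rules that case out by construction, at the cost of needing a prior split of the features.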
Multilabel learning aims to predict labels of unseen instances by learning from training samples that are associated with a set of known labels. In this paper, we propose to use a hierarchical tree model for multilabel learning, and to develop the ML-Tree algorithm for finding the tree structure. ML-Tree considers a tree as a hierarchy of data and …
In this paper, we propose a new Random Forest (RF) based ensemble method, ForesTexter, to solve imbalanced text categorization problems. RF has shown great success in many real-world applications. However, the problem of learning from text data with class imbalance is a relatively new challenge that needs to be addressed. An RF algorithm tends to use a …
Automated assignment of functions to unknown proteins is one of the most important tasks in computational biology. The development of experimental methods for genome-scale analysis of molecular interaction networks offers new ways to infer protein function from protein-protein interaction (PPI) network data. Existing techniques for collective classification …
With the rapid accumulation of proteomic and genomic datasets, in terms of genome-scale features and interaction networks, through high-throughput experimental techniques, the process of manually predicting functional properties of proteins has become increasingly cumbersome, and computational methods to automate this annotation task are urgently needed. …
In recent years, much effort has been devoted to Collective Classification (CC) techniques for predicting labels of linked instances. Given a large amount of labeled data, conventional CC algorithms can make use of local labeled neighbours to increase accuracy. However, in many real-world applications, labeled data are limited and very expensive to obtain. …
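How labeled neighbours can inform predictions for linked instances can be illustrated with a minimal sketch. This is a generic iterative neighbour-vote baseline, not the algorithm from the paper (whose description is truncated above). The function name, the adjacency-dict graph format and the synchronous update rule are all assumptions made for illustration.

```python
from collections import Counter


def neighbour_vote(graph, labels, max_iters=10):
    """Iteratively assign each unlabeled node the most common label among
    its currently labeled neighbours (illustrative baseline only).

    graph:  dict mapping node -> list of neighbouring nodes
    labels: dict mapping node -> known label (only for labeled nodes)
    """
    known = dict(labels)
    for _ in range(max_iters):
        updates = {}
        for node, neighbours in graph.items():
            if node in labels:
                continue  # never overwrite the given labels
            votes = Counter(known[n] for n in neighbours if n in known)
            if votes:
                updates[node] = votes.most_common(1)[0][0]
        # Stop once a full pass changes nothing.
        if all(known.get(n) == lab for n, lab in updates.items()):
            break
        known.update(updates)
    return known


if __name__ == "__main__":
    # A four-node chain a - b - c - d with a single labeled node.
    chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
    print(neighbour_vote(chain, {"a": "x"}))
```

With only one labeled node the label has to travel one hop per pass, and a node with no labeled neighbour within reach stays unlabeled. This is a small illustration of why scarce labels make neighbour-based inference harder.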
Single-nucleotide polymorphism (SNP) selection and identification are the most important tasks in genome-wide association data analysis. The problem is difficult because genome-wide association data is very high-dimensional and a large portion of the SNPs in the data are irrelevant to the disease. Advanced machine learning methods have been successfully used …
Predicting functional properties of proteins in protein-protein interaction (PPI) networks presents a challenging problem and has important implications in computational biology. Collective classification (CC), which utilizes both attribute features and relational information to jointly classify related proteins in PPI networks, has been shown to be a powerful …