Treatment of pediatric acute lymphoblastic leukemia (ALL) is based on the concept of tailoring the intensity of therapy to a patient's risk of relapse. To determine whether gene expression profiling could enhance risk assignment, we used oligonucleotide microarrays to analyze the pattern of genes expressed in leukemic blasts from 360 pediatric ALL patients.
Feature selection plays an important role in classification. We present a comparative study of six feature selection heuristics, applying them to two data sets. The first consists of gene expression profiles from acute lymphoblastic leukemia (ALL) patients. The second consists of proteomic patterns from ovarian cancer patients. Based on ...
Many semistructured objects are similarly, though not identically, structured. We study the problem of discovering "typical" substructures of a collection of semistructured objects. The discovered structures can serve the following purposes: (a) the "table-of-contents" for gaining general information of a source, (b) a road map for browsing and querying ...
We introduce a new method, called CS4, to construct committees of decision trees for classification. The method considers different top-ranked features as the root nodes of member trees. This idea is particularly suitable for dealing with high-dimensional bio-medical data as top-ranked features in this type of data usually possess similar merits for ...
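The committee idea described above can be illustrated with a minimal Python sketch. It is not the authors' CS4 implementation. It assumes numeric features, ranks features by the information gain of their best binary split, forces the i-th ranked feature to be the root of the i-th tree, and combines trees by plain majority vote. The ranking criterion, the voting rule, the toy data, and all function names are assumptions made only for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values()) if n else 0.0

def best_split(X, y, f):
    """Return (information gain, threshold) of the best binary split on feature f."""
    values = sorted(set(row[f] for row in X))
    base, best = entropy(y), (0.0, None)
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [lab for row, lab in zip(X, y) if row[f] <= t]
        right = [lab for row, lab in zip(X, y) if row[f] > t]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if gain > best[0]:
            best = (gain, t)
    return best

def build_tree(X, y, depth, root_feature=None):
    """Greedy tree; if root_feature is given, the root split is forced onto it.
    Leaves are class labels (assumed not to be tuples); inner nodes are tuples."""
    majority = Counter(y).most_common(1)[0][0]
    if depth == 0 or len(set(y)) == 1:
        return majority
    feats = [root_feature] if root_feature is not None else range(len(X[0]))
    gain, t, f = max((best_split(X, y, f) + (f,) for f in feats), key=lambda g: g[0])
    if t is None:  # no split improves purity
        return majority
    left = [(r, l) for r, l in zip(X, y) if r[f] <= t]
    right = [(r, l) for r, l in zip(X, y) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [l for _, l in left], depth - 1),
            build_tree([r for r, _ in right], [l for _, l in right], depth - 1))

def predict_tree(tree, row):
    """Walk from the root to a leaf and return the leaf label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def build_committee(X, y, k=3, depth=2):
    """One tree per top-ranked feature, each rooted at a different feature."""
    ranked = sorted(range(len(X[0])), key=lambda f: best_split(X, y, f)[0], reverse=True)
    return [build_tree(X, y, depth, root_feature=f) for f in ranked[:k]]

def predict_committee(trees, row):
    """Majority vote over member trees (an assumed, simplified voting rule)."""
    return Counter(predict_tree(t, row) for t in trees).most_common(1)[0][0]

# Toy data (hypothetical): features 0 and 1 separate the classes equally well.
X = [[1.0, 5.0, 0.2], [1.2, 4.8, 0.9], [0.9, 5.1, 0.4],
     [3.0, 1.0, 0.8], [3.2, 1.2, 0.1], [2.9, 0.9, 0.5]]
y = ["A", "A", "A", "B", "B", "B"]
committee = build_committee(X, y, k=3)
print(predict_committee(committee, [1.1, 5.0, 0.3]))
```

In the toy data two features are equally discriminative, which mirrors the situation the abstract describes: a single greedy tree would use only one of them, while the committee uses each as a root.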
METHODS AND RESULTS: We introduce a new method to discover many diversified and significant rules from high-dimensional profiling data. We also propose to aggregate the discriminating power of these rules for reliable predictions. The discovered rules are found to contain low-ranked features; these features are found to be sometimes necessary for classifiers ...
MOTIVATIONS AND RESULTS: For classifying gene expression profiles or other types of medical data, simple rules are preferable to non-linear distance or kernel functions. This is because rules may help us understand more about the application in addition to performing an accurate classification. In this paper, we discover novel rules that describe the gene ...
We describe a methodology, as well as some related data mining tools, for analyzing sequence data. The methodology comprises three steps: (a) generating candidate features from the sequences, (b) selecting relevant features from the candidates, and (c) integrating the selected features to build a system to recognize specific properties in sequence data. We ...
BACKGROUND: MicroRNAs regulate mRNA levels in a tissue-specific way, either by inducing degradation of the transcript or by inhibiting translation or transcription. Putative mRNA targets of microRNAs identified from seed sequence matches are available in many databases. However, such matches have a high false positive rate and cannot identify tissue ...
This paper presents a machine learning method to predict polyadenylation signals (PASes) in human DNA and mRNA sequences by analysing features around them. This method consists of three sequential steps of feature manipulation: generation, selection and integration of features. In the first step, new features are generated using k-gram nucleotide acid or ...
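The feature generation step can be sketched in Python as follows. This covers only the generation step, and only nucleotide k-grams. The window size, the choice of k = 2, the use of the hexamer AATAAA as the candidate signal, the example sequence, and the function names are assumptions for illustration, not details taken from the paper.

```python
from collections import Counter
from itertools import product

def kgram_counts(seq, k):
    """Count every overlapping k-gram over the alphabet ACGT in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {"".join(g): counts.get("".join(g), 0) for g in product("ACGT", repeat=k)}

def pas_features(seq, pos, window=10, k=2, signal_len=6):
    """Feature vector for a candidate signal starting at index pos.

    Features are k-gram counts in the windows immediately upstream and
    downstream of the candidate signal (window and k are assumed values).
    """
    upstream = seq[max(0, pos - window):pos]
    downstream = seq[pos + signal_len:pos + signal_len + window]
    feats = {}
    for name, region in (("up", upstream), ("down", downstream)):
        for gram, n in kgram_counts(region, k).items():
            feats[f"{name}_{gram}"] = n
    return feats

# Hypothetical sequence with one candidate signal (AATAAA) in the middle.
seq = "GGTTTTGTTTAATAAAGTGTGTCTGT"
pos = seq.find("AATAAA")
feats = pas_features(seq, pos)
print(len(feats), feats["up_TT"], feats["down_GT"])
```

Each candidate signal yields a fixed-length vector (here 2 x 16 counts), which is the form that later selection and integration steps would consume.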