Treatment of pediatric acute lymphoblastic leukemia (ALL) is based on the concept of tailoring the intensity of therapy to a patient's risk of relapse. To determine whether gene expression profiling could enhance risk assignment, we used oligonucleotide microarrays to analyze the pattern of genes expressed in leukemic blasts from 360 pediatric ALL patients.
Feature selection plays an important role in classification. We present a comparative study on six feature selection heuristics by applying them to two sets of data. The first set of data are gene expression profiles from Acute Lymphoblastic Leukemia (ALL) patients. The second set of data are proteomic patterns from ovarian cancer patients. Based on ...
Many semistructured objects are similarly, though not identically, structured. We study the problem of discovering "typical" substructures of a collection of semistructured objects. The discovered structures can serve the following purposes: (a) the "table-of-contents" for gaining general information of a source, (b) a road map for browsing and querying ...
Human ESCs (hESCs) respond to signals that determine their pluripotency, proliferation, survival, and differentiation status. In this report, we demonstrate that phosphatidylinositol 3-kinase (PI3K) antagonizes the ability of hESCs to differentiate in response to transforming growth factor beta family members such as Activin A and Nodal. Inhibition of PI3K ...
The translation initiation site (TIS) prediction problem is about how to correctly identify TIS in mRNA, cDNA, or other types of genomic sequences. High prediction accuracy can be helpful in a better understanding of protein coding from nucleotide sequences. This is an important step in genomic analysis to determine protein coding from nucleotide sequences.
Clear cell renal cell carcinoma (ccRCC) is the predominant RCC subtype, but even within this classification, the natural history is heterogeneous and difficult to predict. A sophisticated understanding of the molecular features most discriminatory for the underlying tumor heterogeneity should be predicated on identifiable and biologically meaningful ...
We describe a methodology, as well as some related data mining tools, for analyzing sequence data. The methodology comprises three steps: (a) generating candidate features from the sequences, (b) selecting relevant features from the candidates, and (c) integrating the selected features to build a system to recognize specific properties in sequence data. We ...
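The three steps named in this abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: k-gram frequencies stand in for candidate features, a class-mean gap stands in for the selection heuristic, a nearest-centroid rule stands in for the integration step, and the sequences and function names are hypothetical toy examples.

```python
from collections import Counter
from itertools import product


def kgram_features(seq, k=2, alphabet="ACGT"):
    """Step (a), assumed form: relative frequency of every possible k-gram."""
    windows = max(len(seq) - k + 1, 1)
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {"".join(g): counts["".join(g)] / windows
            for g in product(alphabet, repeat=k)}


def select_features(rows, labels, top_n=3):
    """Step (b), assumed heuristic: keep features whose class means differ most."""
    pos = [r for r, y in zip(rows, labels) if y == 1]
    neg = [r for r, y in zip(rows, labels) if y == 0]

    def gap(name):
        mean_pos = sum(r[name] for r in pos) / len(pos)
        mean_neg = sum(r[name] for r in neg) / len(neg)
        return abs(mean_pos - mean_neg)

    return sorted(rows[0], key=gap, reverse=True)[:top_n]


def train_centroids(rows, labels, feats):
    """Step (c), assumed integrator: one mean vector per class."""
    centroids = {}
    for y in set(labels):
        members = [r for r, lab in zip(rows, labels) if lab == y]
        centroids[y] = [sum(r[f] for r in members) / len(members) for f in feats]
    return centroids


def predict(seq, centroids, feats, k=2):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    row = kgram_features(seq, k)
    vec = [row[f] for f in feats]
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(vec, centroids[y])))


# Hypothetical toy data: label 1 = GC-rich, label 0 = AT-rich.
train_seqs = ["GCGCGCGC", "GGCCGGCC", "ATATATAT", "AATTAATT"]
train_labels = [1, 1, 0, 0]

rows = [kgram_features(s) for s in train_seqs]
feats = select_features(rows, train_labels)
centroids = train_centroids(rows, train_labels, feats)

print(feats)
print(predict("GCGGCGCC", centroids, feats), predict("ATTATAAT", centroids, feats))
```

Each step is a separate function, so any one of them could be swapped for a different feature generator, selection heuristic, or classifier without touching the other two.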
To formulate a meaningful query on semistructured data, such as on the Web, that matches some of the source’s structure, we need first to discover something about how the information is represented in the source. This is referred to as schema discovery and was considered for a single object recently. In the case of multiple objects, the task of schema ...
We introduce a new method, called CS4, to construct committees of decision trees for classification. The method considers different top-ranked features as the root nodes of member trees. This idea is particularly suitable for dealing with high-dimensional bio-medical data as top-ranked features in this type of data usually possess similar merits for ...