For many supervised learning problems, we possess prior knowledge about which features yield similar information about the target variable. In predicting the topic of a document, we might know that two words are synonyms, and when performing image recognition, we know which pixels are adjacent. Such synonymous or neighboring features are near-duplicates and …
We adapt a network simulation algorithm called quantitative simulation (QSim) for use in the alignment of biological networks. Unlike most network alignment methods, QSim finds local matches for one network in another, making it asymmetric, and takes full advantage of different edge types. We use QSim to simulate a protein-protein interaction (PPI) network …
MOTIVATION Many entity taggers and information extraction systems make use of lists of terms of entities such as people, places, genes or chemicals. These lists have traditionally been constructed manually. We show that distributional clustering methods, which group words based on the contexts that they appear in, including neighboring words and syntactic …
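The distributional idea above can be sketched in a few lines: represent each word by a bag of its neighboring words, then compare words by the cosine similarity of those context vectors. The toy corpus below is hypothetical and the one-word window is an illustrative simplification; a real system would use large corpora and richer (e.g. syntactic) contexts.

```python
from collections import Counter
import math

# Hypothetical toy corpus; a real tagger would use a large text collection.
corpus = [
    "the gene encodes a protein",
    "the gene expresses a protein",
    "paris is a city",
    "london is a city",
]

def context_vectors(sentences, window=1):
    """Map each word to a bag of neighboring words (its distributional context)."""
    vecs = {}
    for s in sentences:
        toks = s.split()
        for i, w in enumerate(toks):
            ctx = vecs.setdefault(w, Counter())
            for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                if j != i:
                    ctx[toks[j]] += 1
    return vecs

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    num = sum(a[k] * b[k] for k in a if k in b)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

vecs = context_vectors(corpus)
# Words appearing in identical contexts ("paris"/"london") score 1.0;
# words with disjoint contexts ("gene"/"paris") score 0.0.
print(cosine(vecs["paris"], vecs["london"]))  # → 1.0
print(cosine(vecs["gene"], vecs["paris"]))    # → 0.0
```

Words whose context vectors are close would then be grouped into the same entity list (e.g. place names), replacing the manual list construction described above.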
MOTIVATION Rapidly advancing genome technology has allowed access to a large number of diverse genomes and annotation data. We have defined a systems model that integrates assembly data, comparative genomics, gene predictions, mRNA and EST alignments and physiological tissue expression. Using these as predictive parameters, we engineered a machine learning …
A pervasive problem in large relational databases is identity uncertainty, which occurs when multiple entries in a database refer to the same underlying entity in the world. Relational databases exhibit rich graphical structure and are naturally modeled as graphs whose nodes represent entities and whose typed edges represent relations between them. We …
We present an EM-based clustering method that can be used for constructing or augmenting ontologies such as MeSH. Our algorithm simultaneously clusters verbs and nouns using both verb-noun and noun-noun co-occurrence pairs. This strategy provides greater coverage of words than using either set of pairs alone, since not all words appear in both datasets. We …
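A minimal sketch of EM over verb-noun co-occurrence pairs is an aspect model with latent clusters z, parameters p(z), p(v|z), p(n|z): the E-step computes the posterior p(z|v,n) for each observed pair, and the M-step reestimates the parameters from the expected counts. The pairs and cluster count below are toy assumptions, not the paper's data; the full method additionally uses noun-noun pairs, which this sketch omits.

```python
import random
from collections import defaultdict

# Hypothetical toy verb-noun co-occurrence pairs (stand-in for parsed corpus counts).
pairs = [("eat", "apple"), ("eat", "bread"), ("drink", "water"),
         ("drink", "juice"), ("eat", "apple"), ("drink", "water")]

K = 2  # assumed number of latent clusters
random.seed(0)
verbs = sorted({v for v, _ in pairs})
nouns = sorted({n for _, n in pairs})

# Random initialization of p(z), p(v|z), p(n|z), normalized to distributions.
pz = [1.0 / K] * K
pvz = [{v: random.random() for v in verbs} for _ in range(K)]
pnz = [{n: random.random() for n in nouns} for _ in range(K)]
for z in range(K):
    sv = sum(pvz[z].values()); pvz[z] = {v: p / sv for v, p in pvz[z].items()}
    sn = sum(pnz[z].values()); pnz[z] = {n: p / sn for n, p in pnz[z].items()}

for _ in range(50):  # EM iterations
    # E-step: posterior p(z | v, n) for each observed pair.
    post = []
    for v, n in pairs:
        w = [pz[z] * pvz[z][v] * pnz[z][n] for z in range(K)]
        s = sum(w)
        post.append([x / s for x in w])
    # M-step: reestimate parameters from expected counts.
    for z in range(K):
        cz = sum(p[z] for p in post)
        pz[z] = cz / len(pairs)
        cv = defaultdict(float); cn = defaultdict(float)
        for (v, n), p in zip(pairs, post):
            cv[v] += p[z]; cn[n] += p[z]
        pvz[z] = {v: cv[v] / cz for v in verbs}
        pnz[z] = {n: cn[n] / cz for n in nouns}

# Hard-assign each verb to its most probable cluster; on clean data like this,
# the clusters typically tie each verb to its characteristic nouns.
cluster_of = {v: max(range(K), key=lambda z: pvz[z][v]) for v in verbs}
print(cluster_of)
```

Clustering verbs and nouns jointly, as above, is what lets evidence flow between the two vocabularies: a noun's cluster is informed by the verbs it co-occurs with and vice versa.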