Learn More
Markov networks are widely used in a wide variety of applications, in problems ranging from computer vision, to natural language, to computational biology. In most current applications, even those that rely heavily on learned models, the structure of the Markov network is constructed by hand, due to the lack of effective algorithms for learning Markov(More)
L 1 regularized logistic regression is now a workhorse of machine learning: it is widely used for many classification problems, particularly ones with many features. L 1 regularized logistic regression requires solving a convex optimization problem. However, standard algorithms for solving convex optimization problems do not scale well enough to handle the(More)
The aspergilli comprise a diverse group of filamentous fungi spanning over 200 million years of evolution. Here we report the genome sequence of the model organism Aspergillus nidulans, and a comparative study with Aspergillus fumigatus, a serious human pathogen, and Aspergillus oryzae, used in the production of sake, miso and soy sauce. Our analysis of(More)
We apply linear and nonlinear independent component analysis (ICA) to project microarray data into statistically independent components that correspond to putative biological processes, and to cluster genes according to over- or under-expression in each component. We test the statistical significance of enrichment of gene annotations within clusters. ICA(More)
The functional consequences of genetic variation in mammalian regulatory elements are poorly understood. We report the in vivo dissection of three mammalian enhancers at single-nucleotide resolution through a massively parallel reporter assay. For each enhancer, we synthesized a library of >100,000 mutant haplotypes with 2-3% divergence from the wild-type(More)
Genome-wide RNA expression data provide a detailed view of an organism's biological state; hence, a dataset measuring expression variation between genetically diverse individuals (eQTL data) may provide important insights into the genetics of complex traits. However, with data from a relatively small number of individuals, it is difficult to distinguish(More)
We consider the problem of estimating high-dimensional Gaussian graphical models corresponding to a single set of variables under several distinct conditions. This problem is motivated by the task of recovering transcriptional regulatory networks on the basis of gene expression data containing heterogeneous samples, such as different disease states,(More)
Sequence polymorphisms affect gene expression by perturbing the complex network of regulatory interactions. We propose a probabilistic method, called Geronemo, which directly aims to identify the mechanism by which genetic changes perturb the regulatory network. Geronemo automatically constructs a set of coregulated genes (modules), whose regulation can(More)
In many prediction tasks, selecting relevant features is essential for achieving good generalization performance. Most feature selection algorithms consider all features to be a priori equally likely to be relevant. In this paper, we use transfer learning---learning on an ensemble of related tasks---to construct an informative prior on feature relevance. We(More)
We introduce a new approach to learning statistical models from multiple sequence alignments (MSA) of proteins. Our method, called GREMLIN (Generative REgularized ModeLs of proteINs), learns an undirected probabilistic graphical model of the amino acid composition within the MSA. The resulting model encodes both the position-specific conservation statistics(More)