We report the draft genome of the black cottonwood tree, Populus trichocarpa. Integration of shotgun sequence assembly with genetic mapping enabled chromosome-scale reconstruction of the genome. More than 45,000 putative protein-coding genes were identified. Analysis of the assembled genome revealed a whole-genome duplication event; about 8000 pairs of ...
The availability of the assembled mouse genome makes possible, for the first time, an alignment and comparison of two large vertebrate genomes. We investigated different strategies of alignment for the subsequent analysis of conservation of genomes that are effective for assemblies of different quality. These strategies were applied to the comparison of the ...
The analytics platform at Twitter has experienced tremendous growth over the past few years in terms of size, complexity, number of users, and variety of use cases. In this paper, we discuss the evolution of our infrastructure and the development of capabilities for data mining on "big data". One important lesson is that successful big data mining in ...
Comparative analysis of DNA sequences is becoming one of the major methods for discovery of functionally important genomic intervals. The VISTA family of computational tools presented here was built to help researchers in this undertaking. These tools allow the researcher to align DNA sequences, quickly visualize conservation levels between them, identify ...
In recent years, there has been a substantial amount of work on large-scale data analytics using Hadoop-based platforms running on large clusters of commodity machines. A less-explored topic is how those data, dominated by application logs, are collected and structured to begin with. In this paper, we present Twitter's production logging infrastructure and ...
The EPB41 (protein 4.1) genes epitomize the resourcefulness of the mammalian genome to encode a complex proteome from a small number of genes. By utilizing alternative transcriptional promoters and tissue-specific alternative pre-mRNA splicing, EPB41, EPB41L2, EPB41L3, and EPB41L1 encode a diverse array of structural adapter proteins. Comparative genomic ...
MapReduce, especially the Hadoop open-source implementation, has recently emerged as a popular framework for large-scale data analytics. Given the explosion of unstructured data begotten by social media and other web-based applications, we take the position that any modern analytics platform must support operations on free-text fields as first-class ...
Recent studies in the field of comparative genomics demonstrate that multi-species DNA comparison presents a powerful method for discovering functional genomic sequences. Similarity across large evolutionary distances usually reveals conserved, and, by inference, important biological features. Expanding the capabilities of computational tools designed for ...