We present a fast and flexible program for clustering large protein databases at different sequence identity levels. It takes less than 2 h for the all-against-all sequence comparison and clustering of the non-redundant protein database of over 560,000 sequences on a high-end PC. The output database, including only the representative sequences, can be used ...
MOTIVATION: Sequence clustering replaces groups of similar sequences in a database with single representatives. Clustering large protein databases like the NCBI Non-Redundant database (NR) using even the best currently available clustering algorithms is very time-consuming and only practical at relatively high sequence identity thresholds. Our previous ...
The boreal forests, identified as a critical "tipping element" of the Earth's climate system, play a critical role in the global carbon budget. Recent findings have suggested that terrestrial carbon sinks in northern high-latitude regions are weakening, but there has been little observational evidence to support the idea of a reduction of carbon sinks in ...
Due to its diverse, wondrous plants and unique topography, Western China has drawn great attention from explorers and naturalists from the Western World. Among them, Ernest Henry Wilson (1876-1930), known as 'Chinese' Wilson, travelled to Western China five times from 1899 to 1918. He took more than 1,000 photos during his travels. These valuable photos ...
Over the last decade, workflows have been established as a mechanism for scientific developers to create simplified views of complex scientific processes. However, there is a need for a comprehensive system architecture to link scientific developers creating workflows with researchers launching workflows in large scale computing environments. We present the ...
Microbial communities that live on the outside and inside of the human body dramatically influence human health and diseases. In recent years, major progress has been made in understanding the human microbiome communities through projects such as the Human Microbiome Project (http://commonfund.nih.gov/hmp/), using next generation sequencing technologies and ...
MOTIVATION: Two proteins can have a similar 3-dimensional structure and biological function, but have sequences sufficiently different that traditional protein sequence comparison algorithms do not identify their relationship. The desire to identify such relations has led to the development of more sensitive sequence alignment strategies. One such strategy ...
Observations from human microbiome studies are often conflicting or inconclusive. Many factors likely contribute to these issues including small cohort sizes, sample collection, and handling and processing differences. The field of microbiome research is moving from 16S rDNA gene sequencing to a more comprehensive genomic and functional representation ...
Next-generation DNA sequencing machines are generating a very large amount of sequence data with applications in many scientific challenges and placing unprecedented demands on traditional single-processor bioinformatics algorithms. Middleware and technologies for scientific workflows and data-intensive computing promise new capabilities to enable rapid ...
In the Big Data era, workflow systems need to embrace data parallel computing techniques for efficient data analysis and analytics. Here, the authors present an easy-to-use, scalable approach to build and execute Big Data applications using actor-oriented modeling in data parallel computing. They use two bioinformatics use cases for next-generation ...