DivergentSet, a Tool for Picking Non-redundant Sequences from Large Sequence Collections

Jeremy Widmann, Micah Hamady, Rob Knight
Molecular & Cellular Proteomics, pp. 1520-1532
DivergentSet addresses the important but so far neglected bioinformatics task of choosing a representative set of sequences from a larger collection. We found that using a phylogenetic tree to guide the construction of divergent sets of sequences can be up to 2 orders of magnitude faster than the naive method of using a full distance matrix. By providing a user-friendly interface (available online) that integrates the tasks of finding additional sequences, building and refining the divergent… 
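
To make the distance-matrix baseline mentioned in the abstract concrete, the following is a minimal sketch of one plausible version of it: greedy farthest-point selection over a full pairwise distance matrix. It is not taken from DivergentSet itself. The function names (`p_distance`, `distance_matrix`, `greedy_divergent_set`), the use of p-distance on pre-aligned sequences, and the seeding rule are all assumptions made for illustration.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of mismatched positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(seqs):
    """Full symmetric matrix of pairwise distances (O(n^2) comparisons)."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = p_distance(seqs[i], seqs[j])
    return d


def greedy_divergent_set(seqs, k):
    """Return indices of k mutually distant sequences.

    Assumed heuristic (not the published algorithm): seed with the
    sequence that has the largest total distance to all others, then
    repeatedly add the sequence whose minimum distance to the already
    chosen ones is largest. Ties go to the lowest index.
    """
    n = len(seqs)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of sequences")
    d = distance_matrix(seqs)
    seed = max(range(n), key=lambda i: sum(d[i]))
    chosen = [seed]
    # min_dist[i] is the distance from sequence i to its nearest chosen one.
    min_dist = list(d[seed])
    while len(chosen) < k:
        candidates = [i for i in range(n) if i not in chosen]
        nxt = max(candidates, key=lambda i: min_dist[i])
        chosen.append(nxt)
        min_dist = [min(m, x) for m, x in zip(min_dist, d[nxt])]
    return chosen


if __name__ == "__main__":
    toy = ["AAAA", "AAAT", "TTTT", "AATT"]  # made-up aligned sequences
    print(greedy_divergent_set(toy, 3))
```

Building the matrix requires a comparison for every pair of sequences, which is the quadratic cost that a tree-guided approach would avoid.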

Efficient large-scale protein sequence comparison and gene matching to identify orthologs and co-orthologs
An iterative graph-matching strategy is developed in which the best gene assignments are identified at each iteration, yielding a set of orthologs and co-orthologs. The afree algorithm is found to be faster than existing methods while maintaining high accuracy in identifying similar genes.
Fast UniFrac: Facilitating high-throughput phylogenetic analyses of microbial communities including analysis of pyrosequencing and PhyloChip data
The potential of Fast UniFrac is shown using examples from three data types: Sanger-sequencing studies of diverse free-living and animal-associated bacterial assemblages and from the gut of obese humans as they diet, pyrosequencing data integrated from studies of the human hand and gut, and PhyloChip data from a study of citrus pathogens.
MotifCluster: an interactive online tool for clustering and visualizing sequences using shared motifs
MotifCluster finds related motifs in a set of sequences and clusters the sequences into families using the motifs they contain; its accuracy is demonstrated using gold-standard protein superfamilies.
Global patterns in bacterial diversity
This work reports the most comprehensive analysis of the environmental distribution of bacteria to date, based on 21,752 16S rRNA sequences compiled from 111 studies of diverse physical environments, and finds that sediments are more phylogenetically diverse than any other environment type.
PyCogent: a toolkit for making sense from sequence
The COmparative GENomic Toolkit, implemented in Python, is a fully integrated and thoroughly tested framework for novel probabilistic analyses of biological sequences, devising workflows, and generating publication-quality graphics.
Loop 7 of E2 Enzymes: An Ancestral Conserved Functional Motif Involved in the E2-Mediated Steps of the Ubiquitination Cascade
It is shown that the acidic loop is a conserved ancestral motif in E2s, relying on the presence of alternate hydrophobic and acidic residues, and the results suggest a crucial role for L7 of family 3 E2s in all the E2-mediated steps of the ubiquitination cascade.
Phylogeography of microbial phototrophs in the dry valleys of the high Himalayas and Antarctica
Although microbial biomass levels are as low as those of the Dry Valleys of Antarctica, there are abundant microbial photoautotrophs, displaying unexpected phylogenetic diversity, in barren soils from just below the permanent ice line of the central Himalayas. This study is the first to demonstrate the remarkable similarities between microbial life in the arid soils of Antarctica and in the high Himalayan soil systems.


Removing near-neighbour redundancy from large protein sequence collections
This work clusters closely similar sequences to yield a covering of sequence space by a representative subset of sequences, derived by an exhaustive search for close similarities in the sequence database in which the need for explicit sequence alignment is significantly reduced by applying deca- and pentapeptide composition filters.
Shotgun: getting more from sequence similarity searches
The Shotgun program is developed and used, with BLAST database searches, both to identify new superfamily members and to reconstruct several known enzyme superfamilies. An analysis of the false-positive rates, together with other control experiments, provides evidence that high Shotgun scores indicate real evolutionary relationships.
A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins
A computer adaptable method for finding similarities in the amino acid sequences of two proteins has been developed. From these findings it is possible to determine whether significant homology
MUSCLE: multiple sequence alignment with high accuracy and high throughput.
R. Edgar. Nucleic Acids Research, 2004.
MUSCLE is a new computer program for creating multiple alignments of protein sequences that includes fast distance estimation using kmer counting, progressive alignment using a new profile function the authors call the log-expectation score, and refinement using tree-dependent restricted partitioning.
FastGroup: A program to dereplicate libraries of 16S rDNA sequences
The FastGroup program simplifies the dereplication of 16S rDNA sequence libraries and prepares the raw sequences for subsequent analyses.
Incomplete taxon sampling is not a problem for phylogenetic inference
M. Rosenberg, S. Kumar. Proceedings of the National Academy of Sciences of the United States of America, 2001.
Computer simulation studies by using natural collections of evolutionary parameters—rates of evolution, species sampling, and gene lengths—determined from data available in genomic databases suggest that longer sequences, rather than extensive sampling, will better improve the accuracy of phylogenetic inference.
Introducing DOTUR, a Computer Program for Defining Operational Taxonomic Units and Estimating Species Richness
A computer program, DOTUR, is developed to address the challenge of assigning sequences to operational taxonomic units (OTUs) based on the genetic distances between sequences. It assigns sequences to OTUs using the furthest, average, or nearest neighbor algorithm at each distance level.
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original.
Basic local alignment search tool.
Prospects for inferring very large phylogenies by using the neighbor-joining method.
The simulation results show that the accuracy of NJ trees declines only by approximately 5% when the number of sequences used increases from 32 to 4,096 (128 times), even in the presence of extensive variation in the evolutionary rate among lineages or significant biases in the nucleotide composition and transition/transversion ratio.