Efficiently Computing Arbitrarily-Sized Robinson-Foulds Distance Matrices

Seung-Jin Sul, Grant R. Brammer, and Tiffani L. Williams
In this paper, we introduce the HashRF(p,q) algorithm for computing RF matrices of large collections of binary evolutionary trees. The novelty of our algorithm is that it can compute arbitrarily sized (p×q) RF matrices without running into physical memory limitations. We explore the performance of our HashRF(p,q) approach on two collections of biological trees obtained from a Bayesian analysis: 20,000 trees of 150 taxa and 33,306 trees of 567 taxa. When computing the all-to-all RF…
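To make the shape of a p×q RF matrix concrete, here is a minimal sketch that compares p row trees against q column trees using plain Python sets. It is not the HashRF(p,q) algorithm itself: the nested-tuple tree encoding and the function names `bipartitions` and `rf_block` are assumptions made for illustration, and the RF distance is taken here as the size of the symmetric difference of the two bipartition sets (some definitions halve this value).

```python
def bipartitions(tree):
    """Collect the nontrivial bipartitions of a tree given as nested tuples of taxon names.

    Each bipartition is stored as the side that does NOT contain the
    alphabetically smallest taxon, so both trees use the same orientation.
    """
    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaves(child) for child in node))

    taxa = leaves(tree)
    anchor = min(taxa)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        # Skip trivial splits (a single leaf on either side, or the whole taxon set).
        if 1 < len(below) < len(taxa) - 1:
            found.add(below if anchor not in below else taxa - below)
        return below

    visit(tree)
    return found


def rf_block(row_trees, col_trees):
    """Return a p x q list of lists of RF distances (symmetric-difference counts)."""
    row_bips = [bipartitions(t) for t in row_trees]
    col_bips = [bipartitions(t) for t in col_trees]
    return [[len(r ^ c) for c in col_bips] for r in row_bips]


t1 = (("A", "B"), ("C", "D"))
t2 = (("A", "C"), ("B", "D"))
t3 = (("D", "C"), ("B", "A"))  # same topology as t1, written differently
print(rf_block([t1, t2], [t1, t2, t3]))
```

Because each block depends only on its own row and column trees, a large matrix could in principle be produced one block at a time; how HashRF(p,q) actually schedules and stores its blocks is not described in the abstract above.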
Fast hash-based algorithms for analyzing large collections of evolutionary trees
This thesis presents two fast algorithms, HashCS and HashRF, for analyzing large collections of evolutionary trees. Both are based on a novel hash table data structure that provides a convenient and fast way to store and access the bipartition information collected from the tree collections.
MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees
MrsRF (MapReduce Speeds up RF) is a multi-core algorithm that uses the MapReduce paradigm to generate a t × t Robinson-Foulds distance matrix between t trees. The authors conclude that RF matrices play a critical role in developing techniques to summarize large collections of trees.
Efficient algorithms for comparing, storing, and sharing large collections of evolutionary trees
This dissertation presents several efficient algorithms that allow biologists to easily compare, store, and share collections of tens to hundreds of thousands of phylogenetic trees, and develops Noria, a novel version control system that allows biologists to seamlessly manage and share their phylogenetic analyses.
Efficient algorithms for phylogenetic post-analysis
This dissertation proposes bootstopping criteria which are designed to provide on-the-fly guidance for determining when enough bootstrap replicates have been reconstructed, and presents novel theory and efficient algorithms to identify rogue taxa, as well as a novel technique for interpreting the results.
How Many Bootstrap Replicates Are Necessary?
This paper proposes stopping criteria, that is, thresholds computed at runtime to determine when enough replicates have been generated, and reports on the first large-scale experimental study to assess the effect of the number of replicates on the quality of support values, including the performance of the proposed criteria.
Big Cat Phylogenies, Consensus Trees, and Computational Thinking
The pantherine cats provide a small, relevant example for exploring the computational techniques used to construct consensus trees. The authors hope that life scientists enjoy peeking under the computational hood of consensus tree construction and share their positive experiences with others in their community.
Using tree diversity to compare phylogenetic heuristics
This work develops new techniques that evaluate phylogenetic heuristics on both tree scores and topologies, uses them to compare Pauprat and Rec-I-DCM3, two popular Maximum Parsimony search algorithms, and shows that there is value in comparing heuristics beyond the parsimony scores that they find.
Accurate simulation of large collections of phylogenetic trees
Suzanne J. Matthews. 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2015.
This paper introduces TreeSim, a software package that simulates large tree collections from published consensus trees and implements a new simulation algorithm, the combined consensus. TreeSim is expected to play a critical role in guiding the algorithmic development of new approaches that support the growth of phylogenetic data.
Phylomark, a Tool To Identify Conserved Phylogenetic Markers from Whole-Genome Alignments
The Phylomark algorithm was developed to identify a minimal number of useful phylogenetic markers that recapitulate the WGA phylogeny and can be employed to determine the minimal marker set for any organism that has sufficient genome sequencing.


A Randomized Algorithm for Comparing Sets of Phylogenetic Trees
Hash-RF is a new randomized algorithm that computes the all-to-all Robinson-Foulds (RF) distance, the most common distance metric for comparing two phylogenetic trees. It uses a hash table to organize the bipartitions of a tree, and its universal hashing function is what makes the algorithm randomized.
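The idea of organizing bipartitions in a hash table can be sketched as follows. This is an illustrative sketch under stated assumptions, not the published Hash-RF implementation: bipartitions are encoded as tuples of 0/1 flags (one per taxon), the hash is a random linear function modulo a large prime, and the names `make_hash` and `rf_matrix` are hypothetical. Two different bipartitions that happen to collide would be counted as shared, which is where the randomized character comes from.

```python
import random
from collections import defaultdict

M = (1 << 61) - 1  # large prime modulus (an assumed choice for this sketch)


def make_hash(n_taxa, coeffs=None, seed=None):
    """Build h(bits) = sum of coeffs[i] over set bits, mod M; coeffs are random unless given."""
    if coeffs is None:
        rng = random.Random(seed)
        coeffs = [rng.randrange(1, M) for _ in range(n_taxa)]

    def h(bits):
        return sum(a for a, bit in zip(coeffs, bits) if bit) % M

    return h


def rf_matrix(trees, h):
    """trees: list of sets of bipartitions (0/1 tuples). Returns the all-to-all RF matrix."""
    table = defaultdict(set)  # hash value -> ids of trees containing that bipartition
    for tree_id, bips in enumerate(trees):
        for bits in bips:
            table[h(bits)].add(tree_id)

    t = len(trees)
    shared = [[0] * t for _ in range(t)]
    for ids in table.values():
        for i in ids:
            for j in ids:
                shared[i][j] += 1  # trees i and j share this (hashed) bipartition

    # RF as the symmetric-difference count: |B_i| + |B_j| - 2 * shared.
    return [[len(trees[i]) + len(trees[j]) - 2 * shared[i][j] for j in range(t)]
            for i in range(t)]


# Taxa order (A, B, C, D); each split is oriented so that taxon A has flag 0.
trees = [{(0, 0, 1, 1)}, {(0, 1, 0, 1)}, {(0, 0, 1, 1)}]
h = make_hash(4, coeffs=[1, 2, 4, 8])  # fixed coefficients keep this example deterministic
print(rf_matrix(trees, h))
```

Filling the table once and then reading pairs out of each bucket avoids comparing every pair of bipartition sets directly, which is the appeal of a hash-based layout for large tree collections.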
Efficiently Computing the Robinson-Foulds Metric
A randomized approximation scheme that provides, in sublinear time and with high probability, a (1 + epsilon) approximation of the true RF metric, and gives a unified framework for edge-based tree algorithms in which implementation tradeoffs are clear.
Statistically based postprocessing of phylogenetic analysis by clustering
This paper proposes bicriterion problems, in particular using the concept of information loss, and new consensus trees called characteristic trees that minimize the information loss that are obtained by using clustering algorithms on the set of candidate trees.
Optimal algorithms for comparing trees with labeled leaves
Algorithms are described that exploit a special representation of the clusters of any tree T ∈ R_n, one that permits testing in constant time whether a given cluster exists in T, and enable well-known indices of consensus between two trees to be computed in O(n) time.
Analysis and visualization of tree space.
The use of multidimensional scaling of tree-to-tree pairwise distances to visualize the relationships among sets of phylogenetic trees is explored and found to be useful for exploring "tree islands", for comparing sets of trees obtained from bootstrapping and Bayesian sampling, and for comparing multiple Bayesian analyses.
A 567‐Taxon Data Set for Angiosperms: The Challenges Posed by Bayesian Analyses of Large Data Sets
Bayesian analyses of a three‐gene, 567‐taxon (560 angiosperms, seven outgroups) data set revealed the analytical challenges posed by such large data sets, and recovered a topology highly similar to that found previously with parsimony.
Bayesian Inference of Phylogeny and Its Impact on Evolutionary Biology
Bayesian inference of phylogeny brings a new perspective to a number of outstanding issues in evolutionary biology, including the analysis of large phylogenetic trees and complex evolutionary models and the detection of the footprint of natural selection in DNA sequences.
Unearthing the molecular phylodiversity of desert soil green algae (Chlorophyta).
Exclusive molecular phylodiversity (E) is used to quantify the amount of evolutionary divergence unique to desert-dwelling green algae (Chlorophyta) in microbiotic crust communities and challenge conventional wisdom, which holds that there was a single origin of terrestrial green plants and that green algae are merely incidental visitors rather than indigenous components of desert communities.
MRBAYES: Bayesian inference of phylogenetic trees
The program MRBAYES performs Bayesian inference of phylogeny using a variant of Markov chain Monte Carlo, and an executable is available at http://brahms.rochester.edu/software.html.
CLUTO—software for clustering high-dimensional datasets
  • Internet Website (last accessed,
  • 2008