Phylogenies without Branch Bounds: Contracting the Short, Pruning the Deep

@article{Daskalakis2011PhylogeniesWB,
  title={Phylogenies without Branch Bounds: Contracting the Short, Pruning the Deep},
  author={Constantinos Daskalakis and Elchanan Mossel and S{\'e}bastien Roch},
  journal={SIAM J. Discret. Math.},
  year={2011},
  volume={25},
  pages={872--893}
}
We introduce a new phylogenetic reconstruction algorithm which, unlike most previous rigorous inference techniques, does not rely on assumptions regarding the branch lengths or the depth of the tree. The algorithm returns a forest which is guaranteed to contain all edges that are (1) sufficiently long and (2) sufficiently close to the leaves. How much of the true tree is recovered depends on the sequence length provided. The algorithm is distance-based and runs in polynomial time. 
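
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the first step common to distance-based methods, not the authors' procedure. It assumes binary sequences evolving under the symmetric two-state (CFN) model, where a disagreement fraction p between two leaves gives the distance estimate -(1/2) ln(1 - 2p). The function names `cfn_distance` and `reliable_pairs` and the cutoff `max_distance` are hypothetical. Dropping pairs whose estimated distance exceeds the cutoff reflects the general idea that, for a fixed sequence length, estimates across deep parts of the tree are too noisy to trust.

```python
import math


def cfn_distance(seq_a, seq_b):
    """Estimate the CFN distance between two aligned binary sequences.

    Returns (distance, disagreement_fraction). The distance is math.inf
    when at least half of the sites disagree, since the estimator is
    undefined there.
    """
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    p = sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)
    if p >= 0.5:
        return math.inf, p
    return -0.5 * math.log(1.0 - 2.0 * p), p


def reliable_pairs(sequences, max_distance):
    """Return {(u, v): distance} for leaf pairs within max_distance.

    `max_distance` is an assumed, user-chosen cutoff; pairs estimated
    to be farther apart are treated as too deep to use.
    """
    names = sorted(sequences)
    kept = {}
    for idx, u in enumerate(names):
        for v in names[idx + 1:]:
            d, _ = cfn_distance(sequences[u], sequences[v])
            if d <= max_distance:
                kept[(u, v)] = d
    return kept


if __name__ == "__main__":
    leaves = {"a": "00000000", "b": "00000001", "c": "11110000"}
    print(reliable_pairs(leaves, max_distance=1.0))
```

With longer sequences the estimates concentrate, so a larger cutoff becomes usable and more leaf pairs survive the filter. This is consistent with the abstract's statement that the amount of the tree recovered depends on the sequence length.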

Fast Phylogenetic Tree Reconstruction Using Locality-Sensitive Hashing
We present the first sub-quadratic time algorithm that with high probability correctly reconstructs phylogenetic trees for short sequences generated by a Markov model of evolution. Due to rapid
Phylogenetic mixtures: Concentration of measure in the large-tree limit
Using concentration of measure techniques, it is shown that mixtures of large trees are typically identifiable, and sequence-length requirements for high-probability reconstruction are derived.
Fast error-tolerant quartet phylogeny algorithms
Fast Algorithms for Large-Scale Phylogenetic Reconstruction
Three novel fast phylogenetic algorithms are developed and LSHTree, the first sub-quadratic time algorithm with theoretical performance guarantees under a Markov model of sequence evolution, is applied to the problem of placing large numbers of short sequence reads onto a fixed phylogenetic tree.
Towards a Practical O(n log n) Phylogeny Algorithm
A variety of extensions are presented which, while only slowing the algorithm down by a constant factor, make its performance nearly comparable to that of neighbour-joining, which requires O(n^3) runtime.
Identifiability and inference of non-parametric rates-across-sites models on large-scale phylogenies
A new approach for estimating general rates-across-sites models, based on a novel algorithm that clusters sites according to their mutation rate, implies, in particular, that large phylogenies are typically identifiable under rate variation.
Coalescent-based species tree estimation: a stochastic Farris transform
This paper proposes an algorithm for phylogeny reconstruction under the multispecies coalescent model with a standard model of site substitution, and obtains a new identifiability result of independent interest: for any species tree with $n \geq 3$ species, the rooted species tree can be identified from the distribution of its unrooted weighted gene trees even in the absence of a molecular clock.
Estimating Optimal Species Trees from Incomplete Gene Trees Under Deep Coalescence
This paper considers the problem of estimating species trees from gene trees and alignments for the general case where the gene trees or alignments can be incomplete, which means that not all the genes contain sequences for all the species.
