Alignment-Free Phylogenetic Reconstruction: Sample Complexity via a Branching Process Analysis

Constantinos Daskalakis and Sébastien Roch
We present an efficient phylogenetic reconstruction algorithm allowing insertions and deletions which provably achieves a sequence-length requirement (or sample complexity) growing polynomially in the number of taxa. Our algorithm is distance-based, that is, it relies on pairwise sequence comparisons. More importantly, our approach largely bypasses the difficult problem of multiple sequence alignment. 
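As a rough illustration of what "distance-based" and "alignment-free" mean in practice, the sketch below computes a pairwise distance table directly from unaligned sequences of unequal length. It is not the estimator analysed in the paper: the k-mer frequency comparison, the function names (`kmer_counts`, `kmer_distance`, `distance_matrix`) and the default `k=2` are all assumptions chosen only to keep the example short.

```python
from collections import Counter
from itertools import combinations


def kmer_counts(seq, k):
    """Count every length-k substring (k-mer) of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=2):
    """Squared Euclidean distance between normalized k-mer frequency vectors.

    Illustrative stand-in for a pairwise comparison that needs no alignment;
    NOT the distance used by Daskalakis and Roch.
    """
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    # Normalize by the number of k-mers so sequences of different
    # lengths (for example after indels) remain comparable.
    na = max(sum(ca.values()), 1)
    nb = max(sum(cb.values()), 1)
    return sum((ca[w] / na - cb[w] / nb) ** 2 for w in set(ca) | set(cb))


def distance_matrix(seqs, k=2):
    """Return a symmetric dict-of-pairs distance table for named sequences."""
    names = sorted(seqs)
    dist = {(x, x): 0.0 for x in names}
    for x, y in combinations(names, 2):
        dist[(x, y)] = dist[(y, x)] = kmer_distance(seqs[x], seqs[y], k)
    return dist


if __name__ == "__main__":
    # Hypothetical toy data: "b" is "a" with one deletion, "c" is unrelated.
    toy = {"a": "ACGTACGT", "b": "ACGTACG", "c": "TTTTTTTT"}
    for pair, d in sorted(distance_matrix(toy).items()):
        print(pair, round(d, 4))
```

A table like this would then be passed to a tree-building step that works from pairwise distances alone. The sample-complexity question is how long the sequences must be for such estimates to be accurate enough to recover the correct tree.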

Hands-on Introduction to Sequence-Length Requirements in Phylogenetics
  • S. Roch
  • Physics
    Bioinformatics and Phylogenetics
  • 2019
In this tutorial, through a series of analytical computations and numerical simulations, we review many known insights into a fundamental question: how much data is needed to reconstruct the Tree of
Efficient estimation of evolutionary distances
A new alignment-free approach for phylogeny reconstruction is introduced, and the corresponding program, andi, is orders of magnitude faster than classical approaches and also superior to comparable alignment-free methods.
Ultra-large alignments using phylogeny-aware profiles
UPP is presented, a multiple sequence alignment method that uses a new machine learning technique, the ensemble of hidden Markov models, which produces highly accurate alignments for both nucleotide and amino acid sequences, even on ultra-large datasets or datasets containing fragmentary sequences.
Genome-wide alignment-free phylogenetic distance estimation under a no strand-bias model
This work shows how to calculate distances under a no strand-bias restriction of the General Time Reversible (GTR) model, called TK4, without relying on alignments, and examines the effectiveness of the method on real genomic data.
Distance-based Species Tree Estimation: Information-Theoretic Trade-off between Number of Loci and Sequence Length under the Coalescent
This work derives an information-theoretic trade-off between the number of genes needed for an accurate reconstruction and the sequence length of the genes, and shows that to detect a branch of length $f$, one needs $m = \Theta(1/[f^{2} \sqrt{k}])$.
Statistically Consistent k-mer Methods for Phylogenetic Tree Reconstruction
It is shown that a standard approach of using the squared Euclidean distance between k-mer vectors to approximate a tree metric can be statistically inconsistent, and model-based distance corrections for orthologous sequences without gaps are derived, which lead to consistent tree inference.
Phase transition in the sample complexity of likelihood-based phylogeny inference
This work proves a new upper bound on the sequence-length requirement of maximum likelihood that matches the known lower bound, up to constants, for some standard models of evolution, and shows in a precise quantitative manner that the more different two evolutionary trees are, the easier it is to distinguish their output.
Impossibility of phylogeny reconstruction from k-mer counts
It is established that the joint leaf distributions of $k$-mer counts on two distinct trees have total variation distance bounded away from $1$ as the sequence length tends to infinity, so that the two distributions cannot be distinguished with probability going to one in that asymptotic regime.
Impossibility of Consistent Distance Estimation from Sequence Lengths Under the TKF91 Model.
It is established that the distributions of pairs of sequence lengths at different distances cannot be distinguished with probability going to one.
Statistically consistent and computationally efficient inference of ancestral DNA sequences in the TKF91 model under dense taxon sampling
This work considers the problem of reconstructing ancestral sequences on a known phylogeny in a model of sequence evolution incorporating nucleotide substitutions, insertions and deletions, specifically the classical TKF91 process, and gives the first explicit reconstruction algorithm with provable guarantees under constant rates of mutation.


Alignment-Free Phylogenetic Reconstruction
This work introduces the first polynomial-time phylogenetic reconstruction algorithm under a model of sequence evolution allowing insertions and deletions (or indels) and requires sequence lengths growing polynomially in the number of leaf taxa.
Phylogenies without Branch Bounds: Contracting the Short, Pruning the Deep
We introduce a new phylogenetic reconstruction algorithm which, unlike most previous rigorous inference techniques, does not rely on assumptions regarding the branch lengths or the depth of the tree.
Phylogenetic Reconstruction with Insertions and Deletions
This paper gives the first efficient algorithm for phylogenetic reconstruction of evolutionary trees that uses sequences of polylogarithmic length, and introduces two new tools tailored to insertions and deletions: a new distance measure and a new reconstruction guarantee.
On the Complexity of Multiple Sequence Alignment
It is shown that multiple alignment with SP-score is NP-complete and that multiple tree alignment is MAX SNP-hard; the complexity of tree alignment with a given phylogeny is also considered.
Invertibility of the TKF model of sequence evolution.
  • B. Thatte
  • Biology, Mathematics
    Mathematical biosciences
  • 2006
The Performance of Neighbor-Joining Methods of Phylogenetic Reconstruction
An upper bound on the amount of data necessary to reconstruct the topology with high confidence is demonstrated by finding conditions under which these methods will determine the correct tree topology and showing that these perform as well as possible in a certain sense.
Sequence Length Requirement of Distance-Based Phylogeny Reconstruction: Breaking the Polynomial Barrier
  • S. Roch
  • Computer Science
    2008 49th Annual IEEE Symposium on Foundations of Computer Science
  • 2008
We introduce a new distance-based phylogeny reconstruction technique which provably achieves, at sufficiently short branch lengths, a sequence length requirement growing slower than any polynomial.
Inching toward reality: An improved likelihood model of sequence evolution
Parameter estimation and alignment procedures that incorporate generalizations to permit approximate treatment of multiple-base insertions and deletions as well as regional heterogeneity of substitution rates are developed.
BAli-Phy: simultaneous Bayesian inference of alignment and phylogeny
BAli-Phy is a Bayesian posterior sampler that employs Markov chain Monte Carlo to explore the joint space of alignment and phylogeny given molecular sequence data, and automatically uses information in shared insertions and deletions to help infer phylogenies.