Ultra-large alignments using phylogeny-aware profiles

@article{Nguyen2015UltralargeAU,
  title={Ultra-large alignments using phylogeny-aware profiles},
  author={Nam-phuong Nguyen and Siavash Mirarab and Keerthana Kumar and Tandy J. Warnow},
  journal={Genome Biology},
  year={2015},
  volume={16}
}
Many biological questions, including the estimation of deep evolutionary histories and the detection of remote homology between protein sequences, rely upon multiple sequence alignments and phylogenetic trees of large datasets. However, accurate large-scale multiple sequence alignment is very difficult, especially when the dataset contains fragmentary sequences. We present UPP, a multiple sequence alignment method that uses a new machine learning technique, the ensemble of hidden Markov models… 

Fast and accurate large multiple sequence alignments using root-to-leave regressive computation
TLDR
This work develops and validates, on protein sequences, a regressive algorithm that reverses the usual progressive order by aligning the most dissimilar sequences first; it produces more accurate alignments than non-regressive methods, especially on datasets larger than 10,000 sequences.
Scaling statistical multiple sequence alignment to large datasets
TLDR
A method is given for incorporating BAli-Phy into PASTA and UPP, two strategies for enabling alignment methods to scale to large datasets, together with alignment and tree accuracy results measured against the ground truth from simulations.
Accurate large-scale phylogeny-aware alignment using BAli-Phy
TLDR
It is shown that this approach achieves high accuracy, greatly superior to Prank, the current most popular phylogeny-aware alignment method, and is even more accurate than MAFFT, one of the top performing alignment methods in common use.
Multiple Sequence Alignment for Large Heterogeneous Datasets Using SATé, PASTA, and UPP.
TLDR
Two software packages, PASTA and UPP, for constructing alignments on large and ultra-large datasets are described; both methods have been able to produce highly accurate alignments on 1,000,000 sequences, and trees computed on these alignments are also highly accurate.
UPP2: Fast and Accurate Alignment Estimation of Datasets with Fragmentary Sequences
TLDR
UPP2 is presented, a direct improvement on UPP that produces more accurate alignments compared to leading MSA methods on datasets exhibiting substantial sequence length heterogeneity, and is among the most accurate otherwise.
Phylogeny Estimation Given Sequence Length Heterogeneity
TLDR
This study finds in particular that using UPP to align sequences and RAxML to compute a tree on the alignment provides the best accuracy, substantially outperforming trees computed using phylogenetic placement methods.
Generalized Bootstrap Supports for Phylogenetic Analyses of Protein Sequences Incorporating Alignment Uncertainty
TLDR
Unistrap, a novel approach that estimates the combined effect of alignment uncertainty and site sampling on phylogenetic tree branch supports, provides branch support estimates that take into account a larger fraction of the parameters impacting tree instability when processing datasets containing a large number of sequences.
Fully automated sequence alignment methods are comparable to, and much faster than, traditional methods in large data sets: an example with hepatitis B virus
Aligning sequences for phylogenetic analysis (multiple sequence alignment; MSA) is an important, but increasingly computationally expensive step with the recent surge in DNA sequence data. Much of
Large multiple sequence alignments with a root-to-leaf regressive method
TLDR
The regressive algorithm uses an efficient divide-and-conquer strategy to run third-party alignment methods in linear time, regardless of their original complexity, to enable analyses of extremely large genomic datasets such as the recently announced Earth BioGenome Project.

References

SHOWING 1-10 OF 82 REFERENCES
Inferring phylogenies of evolving sequences without multiple sequence alignment
TLDR
Compared to a multiple sequence alignment approach, D2 methods are more robust against among-site rate heterogeneity, compositional biases, genetic rearrangements and insertions/deletions, but are more sensitive to recent sequence divergence and sequence truncation.
Multiple sequence alignment: a major challenge to large-scale phylogenetics
TLDR
It is shown that as the number of sequences increases, the number of alignment methods that can analyze the datasets decreases, and the most accurate alignment methods are unable to analyze the very largest datasets, so that only moderately accurate alignment methods can be used on the largest datasets.
Rapid and Accurate Large-Scale Coestimation of Sequence Alignments and Phylogenetic Trees
TLDR
SATé (simultaneous alignment and tree estimation), an automated method to quickly and accurately estimate both DNA alignments and trees with the maximum likelihood criterion, is presented, showing that coestimation can be both rapid and accurate in phylogenetic studies.
Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega
TLDR
A new program called Clustal Omega is described, which can align virtually any number of protein sequences quickly and that delivers accurate alignments, and which outperforms other packages in terms of execution time and quality.
PASTA: Ultra-Large Multiple Sequence Alignment for Nucleotide and Amino-Acid Sequences
TLDR
A study on biological and simulated data with up to 200,000 sequences is presented, showing that PASTA produces highly accurate alignments, improving on the accuracy and scalability of the leading alignment methods (including SATé).
SATé-II: very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees.
TLDR
A modification to the original SATé algorithm is presented that improves upon SATé (which is now called SATé-I) in terms of speed and of phylogenetic and alignment accuracy, along with two lines of evidence that using ML with gaps treated as missing data to optimize the alignment and tree produces very poor results.
FASTSP: linear time calculation of alignment accuracy
TLDR
This article proves that each of the standard techniques for comparing alignments, Developer, Modeler and Total Column scores can be computed in linear time, and presents FastSP, a linear-time algorithm for calculating these scores.
PASTA: Ultra-Large Multiple Sequence Alignment
TLDR
A study on biological and simulated data with up to 200,000 sequences is presented, showing that PASTA produces highly accurate alignments, improving on the accuracy of the leading alignment methods on large datasets, and is able to analyze much larger datasets than the current methods.
SEPP: SATé-Enabled Phylogenetic Placement
TLDR
This study presents SEPP, a general "boosting" technique to improve the accuracy and/or speed of phylogenetic placement techniques, and shows that SATé-boosting improves HMMALIGN+pplacer, placing short sequences more accurately when the set of input sequences has a large evolutionary diameter and produces placements of comparable accuracy in a fraction of the time for easier cases.
INDELible: A Flexible Simulator of Biological Sequence Evolution
TLDR
INDELible is a portable and flexible application for generating nucleotide, amino acid and codon sequence data by simulating insertions and deletions (indels) as well as substitutions, which should be useful for evaluating the performance of many inference methods, including those for multiple sequence alignment, phylogenetic tree inference, and ancestral sequence or genome reconstruction.