Who watches the watchmen? An appraisal of benchmarks for multiple sequence alignment.

@article{Iantorno2014WhoWT,
  title={Who watches the watchmen? An appraisal of benchmarks for multiple sequence alignment.},
  author={Stefano A Iantorno and Kevin Gori and Nick Goldman and Manuel Gil and Christophe Dessimoz},
  journal={Methods in molecular biology},
  year={2014},
  volume={1079},
  pages={59-73}
}
Multiple sequence alignment (MSA) is a fundamental and ubiquitous technique in bioinformatics used to infer related residues among biological sequences. Thus, alignment accuracy is crucial to a vast range of analyses, often in ways that are difficult to assess within those analyses. To compare the performance of different aligners and help detect systematic errors in alignments, a number of benchmarking strategies have been pursued. Here we present an overview of the main strategies, based on simulation…
Alignathon: A competitive assessment of whole genome alignment methods
TLDR
It is found that there are substantial accuracy differences between contemporary alignment tools: many tools worked well at shorter evolutionary distances, but fewer performed competitively at longer distances.
Alignathon: a competitive assessment of whole-genome alignment methods.
TLDR
It is found that many tools worked well at shorter evolutionary distances, but fewer performed competitively at longer distances, indicating that there are substantial accuracy differences between contemporary alignment tools.
GUIDANCE2: accurate detection of unreliable alignment regions accounting for the uncertainty of multiple parameters
TLDR
It is shown that GUIDANCE2 outperforms all previously developed methodologies to detect unreliable MSA regions and provides a set of alternative MSAs which can be useful for downstream analyses.
DECIPHER: harnessing local sequence context to improve protein multiple sequence alignment
  • E. Wright
  • Biology, Computer Science
    BMC Bioinformatics
  • 2015
TLDR
Predicting secondary structure based on local sequence context is an efficient means of breaking the independence assumption in alignment. Since secondary structure is more conserved than primary sequence, it can be leveraged to improve the alignment of distantly related proteins.
Using de novo protein structure predictions to measure the quality of very large multiple sequence alignments
TLDR
This work takes advantage of recent developments in protein structure prediction methods to create a benchmark (ContTest) for protein MSAs containing many thousands of sequences in each test case and which is based on empirical biological data.
Protein Multiple Alignments: Sequence-based vs Structure-based Programs
TLDR
This paper compared the multiple alignments resulting from 24 programs, based on sequence, structure, or both, to reference alignments deposited in five databases and found that sequence-based programs place fewer gaps than structure-based programs and the databases are the more challenging for the programs.
Benchmarking Statistical Multiple Sequence Alignment
TLDR
The results of an extensive study evaluating the most popular protein alignment methods as well as the statistical co-estimation method BAli-Phy on 1192 protein data sets from established benchmarks as well as on 120 simulated data sets show that BAli-Phy is dramatically more accurate than the other alignment methods on the simulated data sets, but is among the least accurate on the biological benchmarks.
Protein multiple sequence alignment benchmarking through secondary structure prediction
TLDR
QuanTest is described, a fully automated and highly scalable test system for protein MSAs which is based on using secondary structure prediction accuracy (SSPA) to measure alignment quality and it is shown that the scores from QuanTest are highly correlated with existing benchmark scores.
Current Methods for Automated Filtering of Multiple Sequence Alignments Frequently Worsen Single-Gene Phylogenetic Inference
TLDR
It is shown that the trees obtained from filtered MSAs are on average worse than those obtained from unfiltered MSAs and that alignment filtering often leads to an increase in the proportion of well-supported branches that are actually wrong.
Evaluating the Accuracy and Efficiency of Multiple Sequence Alignment Methods
TLDR
Alignment quality was highly dependent on the number of deletions and insertions in the sequences, whereas sequence length and indel size had a weaker effect; ProbCons was consistently at the top of the list of the evaluated MSA tools.

References

Analysis and Comparison of Benchmarks for Multiple Sequence Alignment
TLDR
HOMSTRAD, a collection of alignments of homologous proteins, is regularly used as a benchmark for sequence alignment though it is not designed as such, and lacks annotation of reliable regions within the alignment.
Quality measures for protein alignment benchmarks
  • R. Edgar
  • Computer Science
    Nucleic acids research
  • 2010
Multiple protein sequence alignment methods are central to many applications in molecular biology. These methods are typically assessed on benchmark datasets including BALIBASE, OXBENCH, PREFAB and
How well does the HoT score reflect sequence alignment accuracy?
  • B. Hall
  • Biology
    Molecular biology and evolution
  • 2008
TLDR
This study shows that HoT scores and alignment accuracies are positively correlated, so alignments with higher HoT scores are preferable. However, HoT scores are in general overestimates of alignment accuracy, with the extent of overestimation depending on the method used for multiple sequence alignment.
Phylogenetic assessment of alignments reveals neglected tree signal in gaps
TLDR
This study provides the broad community relying on sequence alignment with important practical recommendations, sets superior standards for assessing alignment accuracy, and paves the way for the development of phylogenetic inference methods of significantly higher resolution.
Measuring the distance between multiple sequence alignments
TLDR
Four metrics to compare MSAs are introduced, which include the position in a sequence where a gap occurs or the location on a phylogenetic tree where an insertion or deletion event occurs, and it is demonstrated how the different metrics in combination can yield more information about MSA methods and the differences between them.
A Comprehensive Benchmark Study of Multiple Sequence Alignment Methods: Current Challenges and Future Perspectives
TLDR
It is demonstrated that the existing MSA methods can be exploited in combination to improve alignment accuracy, although novel approaches will still be needed to fully explore the most difficult regions.
Towards realistic benchmarks for multiple alignments of non-coding sequences
TLDR
A method to generate benchmarks for multiple alignments of Drosophila non-coding sequences that will help practitioners of comparative genomics deal with the effects of alignment errors, by providing accurate estimates of the extent of these errors.
OXBench: A benchmark for evaluation of protein multiple sequence alignment accuracy
TLDR
A suite of reference alignments derived from the comparison of protein three-dimensional structures together with evaluation measures and software that allow automatically generated alignments to be benchmarked provides a convenient method to assess progress in sequence alignment techniques.
Issues in bioinformatics benchmarking: the case study of multiple sequence alignment
TLDR
This work discusses the development of formal benchmarks, designed to represent the current problems encountered in the bioinformatics field, and considers several criteria for building good benchmarks and the advantages to be gained when they are used intelligently.
Automatic assessment of alignment quality
TLDR
This work describes a simple, yet elegant, solution to assess the biological accuracy of alignments automatically, based on the comparison of several alignments of the same sequences: the average overlap score and the multiple overlap score.