• Corpus ID: 57189450

ATHENA: Automated Tuning of Genomic Error Correction Algorithms using Language Models

Mustafa Abdallah, Ashraf Y. Mahgoub, Saurabh Bagchi, Somali Chaterji
The performance of most error-correction algorithms that operate on genomic sequencer reads depends on the proper choice of their configuration parameters, such as the value of k in k-mer-based techniques. In this work, we target the problem of finding the best values of these configuration parameters to optimize error correction. We do this in a data-driven manner, motivated by the observation that different configuration parameters are optimal for different datasets, i.e., from different…
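
To make the role of k concrete, the following is a minimal sketch of how the k-mer spectrum of a read set changes with k. It is not ATHENA's method: the helper names `kmer_counts` and `singleton_fraction` and the toy reads are assumptions made for illustration. It relies only on the common heuristic that k-mers seen exactly once are candidates for containing a sequencing error.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every length-k substring (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        # Reads shorter than k contribute nothing (empty range).
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def singleton_fraction(reads, k):
    """Fraction of distinct k-mers that occur exactly once."""
    counts = kmer_counts(reads, k)
    if not counts:
        return 0.0
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)


if __name__ == "__main__":
    # Toy reads (made up); the third read differs by one substitution.
    reads = ["ACGTACGTAC", "ACGTACGTAC", "ACGTTCGTAC"]
    for k in (3, 5, 7):
        print(k, round(singleton_fraction(reads, k), 2))
```

Running the loop for several values of k shows that the share of rare k-mers depends on both k and the dataset, which is the kind of sensitivity that motivates tuning k per dataset.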


Panel 2 Position Paper: AI could Solve the World’s Healthcare Problems and that too at Scale!

  • S. Chaterji
  • Medicine
    2019 11th International Conference on Communication Systems & Networks (COMSNETS)
  • 2019
Overall, AI can be thought of as augmented intelligence that can leverage both better observational and interventional capabilities for delivering precision healthcare and augmented monitoring capabilities to the masses at scale.



Fiona: a parallel and automatic strategy for read error correction

Fiona is an accurate, parameter-free read error-correction method that runs on inexpensive hardware and uses multicore parallelization when available. It corrects substitution, insertion, and deletion errors and can be applied to any sequencing technology.

Informed and automated k-mer size selection for genome assembly

A fast and accurate sampling method is developed that constructs approximate abundance histograms with several orders of magnitude performance improvement over traditional methods and a fast heuristic is presented that uses the generated abundance histogram for putative k values to estimate the best possible value of k.
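
As a rough illustration of the idea summarized above, the sketch below builds a k-mer abundance histogram and an approximate version from a random subset of reads. The sampling scheme (uniform sampling of whole reads) and the names `abundance_histogram` and `sampled_histogram` are assumptions for this example. The cited paper's actual sampling method and its heuristic for choosing k are not reproduced here.

```python
import random
from collections import Counter


def abundance_histogram(reads, k):
    """Map abundance -> number of distinct k-mers with that abundance."""
    kmers = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return Counter(kmers.values())


def sampled_histogram(reads, k, fraction, seed=0):
    """Approximate the histogram from a uniform random subset of reads.

    Assumed scheme for illustration only; `fraction` is in (0, 1].
    """
    rng = random.Random(seed)
    n = min(len(reads), max(1, int(len(reads) * fraction)))
    return abundance_histogram(rng.sample(reads, n), k)


if __name__ == "__main__":
    # Toy reads (made up) to compare the full and sampled histograms.
    reads = ["ACGTACGTAC", "ACGTACGTAC", "ACGTTCGTAC", "ACGTACGTAC"]
    for k in (3, 5):
        print(k, dict(abundance_histogram(reads, k)),
              dict(sampled_histogram(reads, k, 0.5)))
```

Sampling trades accuracy for speed: the sampled histogram has lower absolute abundances, so any downstream heuristic would need to account for the sampling fraction.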

Blue: correcting sequencing errors using consensus and context

Blue is an error-correction algorithm based on k-mer consensus and context that can correct substitution, deletion and insertion errors, as well as uncalled bases, and is usable on large sequencing datasets.

Evaluation of the impact of Illumina error correction tools on de novo genome assembly

It is confirmed that most EC tools reduce the number of errors in sequencing data without introducing many new errors, but many EC tools suffer from poor performance in certain sequence contexts such as regions with low coverage or regions that contain short repeated or low-complexity sequences.

Reptile: representative tiling for short read error correction

Reptile is a novel approach to error correction in short-read data from next-generation sequencing. It outperforms previous methods in the percentage of errors removed and in the accuracy of true base assignment, while significantly reducing run time and memory usage.

SARVAVID: A Domain Specific Language for Developing Scalable Computational Genomics Applications

Sarvavid is a domain-specific language that provides computational genomics kernels as language constructs and inherently supports parallelism across multiple nodes, which can improve programmer productivity and provide effective scalability with growing data.

SHREC: a short-read error correction method

SHREC, a new algorithm for correcting errors in short-read data that uses a generalized suffix trie on the read data as the underlying data structure, achieves an error correction accuracy of over 80% for simulated data and over 88% for real data.

Enhanced protein domain discovery by using language modeling techniques from speech recognition

This work discovers an unannotated Tf_Otx Pfam domain on the cone rod homeobox protein, which suggests a possible mechanism for how the V242M mutation on this protein causes cone-rod dystrophy.

MG-RAST version 4 - lessons learned from a decade of low-budget ultra-high-throughput metagenome analysis

The MG-RAST team intends to support the Common Workflow Language as a standard to specify bioinformatics workflows, both to facilitate development and efficient high-performance implementation of the community's data analysis tasks.

A survey of error-correction methods for next-generation sequencing

This article provides a comprehensive review of many error-correction methods, and establishes a common set of benchmark data and evaluation criteria to provide a comparative assessment.