Using geometric structures to improve the error correction algorithm of high-throughput sequencing data on MapReduce framework

Abstract

Next-generation sequencing (NGS) data are a rapidly growing example of big data and a source of new knowledge in science. However, sequencing errors remain unavoidable and reduce the quality of NGS data. Error correction is therefore a critical step in the successful utilization of NGS data, including de novo genome assembly and DNA resequencing. Since NGS throughput doubles approximately every five months and the length of NGS records (i.e., reads) is increasing, improvements in the efficiency and effectiveness of computational strategies are needed. In this study, we aim to improve the performance of CloudRS, an open-source MapReduce application designed to correct sequencing errors in NGS data. We introduce the read-message (RM) diagram to represent the set of messages, i.e., the key-value pairs generated on each read. We also present the Gradient-number Votes (GNV) scheme, which trims off portions of the RM diagram, thereby reducing the total size of the messages associated with each read. Experimental results show that the GNV scheme successfully reduces execution time and improves the quality of the de novo genome assembly.

DOI: 10.1109/BigData.2014.7004306

Cite this paper

@article{Chung2014UsingGS, title={Using geometric structures to improve the error correction algorithm of high-throughput sequencing data on MapReduce framework}, author={Wei-Chun Chung and Yu-Jung Chang and Charles Tzu-Chi Lee and Jan-Ming Ho}, journal={2014 IEEE International Conference on Big Data (Big Data)}, year={2014}, pages={784-789} }