High-quality draft assemblies of mammalian genomes from massively parallel sequence data.
Next-generation sequencing (NGS) data are a rapidly growing example of big data and a source of new knowledge in science. However, sequencing errors remain unavoidable and reduce the quality of NGS data. Error correction is therefore a critical step in the successful use of NGS data in applications such as de novo genome assembly and DNA resequencing. Since NGS throughput doubles approximately every five months and the length of NGS records (i.e., reads) is increasing, improvements in the efficiency and effectiveness of computational strategies are needed. In this study, we aim to improve the performance of CloudRS, an open-source MapReduce application designed to correct sequencing errors in NGS data. We introduce the read-message (RM) diagram to represent the set of messages, i.e., the key-value pairs generated on each read. We also present the Gradient-number Votes (GNV) scheme, which trims off portions of the RM diagram and thereby reduces the total size of the messages associated with each read. Experimental results show that the GNV scheme successfully reduces execution time and improves the quality of the de novo genome assembly.
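To make the notion of per-read messages concrete, the following is a minimal sketch of how a MapReduce-style map step might emit key-value pairs for each read. It is an illustration only and is not taken from CloudRS: the k-mer keys, the `(read_id, offset)` values, and the names `map_read` and `shuffle` are all assumptions, and the RM diagram and the GNV trimming rule are not modeled here.

```python
from collections import defaultdict


def map_read(read_id, seq, k):
    """Hypothetical map step: emit one (k-mer, (read_id, offset)) message
    for every k-mer position in the read."""
    return [(seq[i:i + k], (read_id, i)) for i in range(len(seq) - k + 1)]


def shuffle(messages):
    """Group message values by key, as the MapReduce shuffle phase would."""
    grouped = defaultdict(list)
    for key, value in messages:
        grouped[key].append(value)
    return dict(grouped)


# Toy input (assumed data, not from the study).
reads = {"r1": "ACGTAC", "r2": "CGTACG"}
k = 4

messages = [m for rid, seq in reads.items() for m in map_read(rid, seq, k)]
grouped = shuffle(messages)

for kmer, values in sorted(grouped.items()):
    print(kmer, values)
```

Under these assumptions a read of length L produces L - k + 1 messages, so the message volume grows with both read length and read count, which is the kind of cost that trimming messages per read is meant to reduce.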