SNP genotype calling with MapReduce

Abstract

Genotype measurement is a key step in genome-wide association studies, which aim to uncover the underlying genetic causes of physical traits, including disease. The leading technology for measuring genotypes is the SNP microarray, which interrogates hundreds of thousands of genetic variants simultaneously. For some of the most commonly used high-throughput genotyping technologies, the conversion from raw measured data to genotype calls (i.e., identifying the specific genomic variants) requires the concurrent analysis of many samples, and the quality of the results depends crucially on the size of the batch. However, current software for microarray analysis scales poorly with input batch size. In large-scale studies, this limits the ability to harness the large number of available samples to improve the accuracy of genotype calling. Here, we present a MapReduce application that offers both greater scalability and greater flexibility than the current state of the art. The software can process datasets as large as 7000 samples in a day, is more than one order of magnitude faster than previous solutions, and is currently used in production.

DOI: 10.1145/2287016.2287026

Cite this paper

@inproceedings{Leo2012SNPGC,
  title={SNP genotype calling with MapReduce},
  author={Simone Leo and Luca Pireddu and Gianluigi Zanetti},
  booktitle={MapReduce '12},
  year={2012}
}