Scalable Parallelization of a Markov Coalescent Genealogy Sampler
Highly optimized library implementations for important scientific kernels can improve scientific productivity. To this end, we are currently developing the Phylogenetic Likelihood Library (PLL) that implements functions to compute and optimize the phylogenetic likelihood score on evolutionary trees. Here, we focus on novel techniques to orchestrate likelihood computations on large vector-like processors such as GPUs. We present a novel scheme for vectorizing computations and organizing conditional likelihood arrays (CLAs) in such a way that they do not need to be transferred at all between the GPU and the CPU. We compare the performance of our GPU implementation for DNA data with a highly optimized x86 version of the PLL that relies on manually tuned AVX intrinsics. Our GPU implementation accelerates the likelihood computations by a factor of two compared to the, most probably, currently fastest available x86 implementation. We conclude that, a hybrid GPU-CPU version needs to be developed and integrated into the PLL to leverage the computational power of modern desktop systems and clusters.