Marc Baboulin

We highlight the trends leading to the increased appeal of using hybrid multicore + GPU systems for high performance computing. We present a set of techniques that can be used to…
(a) Department of Mathematics, University of Coimbra, Coimbra, Portugal; (b) French National Institute for Research in Computer Science and Control, Lyon, France; (c) Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN, USA; (d) Oak Ridge National Laboratory, Oak Ridge, TN, USA; (e) University of Manchester, Manchester, United Kingdom
We present a computational framework for high-performance tensor contractions on GPUs. High performance is difficult to obtain using existing libraries, especially for many independent contractions where each contraction is very small, e.g., sub-vector/warp in size. However, using our framework to batch contractions, together with application-specific optimizations, we…
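
A rough CPU-side analogue may make the batching idea concrete: the NumPy sketch below groups thousands of very small, independent contractions into a single call instead of looping over them one at a time. The shapes and the contraction C[i] = A[i] B[i] are illustrative assumptions; the framework described above targets GPU batched kernels, which this sketch does not use.

```python
# Illustrative CPU sketch of batching many tiny, independent contractions.
# Shapes are assumptions chosen to mimic "sub-vector/warp sized" problems.
import numpy as np

batch, m, k, n = 10000, 4, 4, 4
A = np.random.rand(batch, m, k)
B = np.random.rand(batch, k, n)

# Naive approach: one tiny contraction per call, paying per-call overhead
# thousands of times.
C_loop = np.stack([A[i] @ B[i] for i in range(batch)])

# Batched approach: a single call expresses the whole set of contractions,
# leaving the library free to schedule them efficiently.
C_batched = np.einsum('bmk,bkn->bmn', A, B)

assert np.allclose(C_loop, C_batched)
```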
We derive closed formulas for the condition number of a linear function of the total least squares solution. Given an overdetermined linear system Ax = b, we show that this condition number can be computed using the singular values and the right singular vectors of [A, b] and A. We also provide an upper bound that requires the computation of the largest…
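
For context, the sketch below computes the total least squares (TLS) solution of an overdetermined system Ax ≈ b from the SVD of the augmented matrix [A, b]; this is the textbook construction, not the paper's contribution. The closed condition-number formulas described above are built from these same singular values and right singular vectors, and are not reproduced here.

```python
# Classical TLS solution via the SVD of [A, b]. Shown only to fix notation;
# the paper's condition-number formulas are not reproduced here.
import numpy as np

m, n = 50, 5
rng = np.random.default_rng(0)
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# SVD of the augmented matrix [A, b]; V holds the right singular vectors.
U, s, Vt = np.linalg.svd(np.column_stack([A, b]))
V = Vt.T

# Generic case: scale the last right singular vector so its last entry is -1.
# (Assumes V[n, n] != 0, i.e. a well-posed TLS problem.)
x_tls = -V[:n, n] / V[n, n]

# Ordinary least squares solution, for comparison.
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]
```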
We illustrate how linear algebra calculations can be enhanced by statistical techniques in the case of a square linear system Ax = b. We study a random transformation of A that enables us to avoid pivoting and then to reduce the amount of communication. Numerical experiments show that this randomization can be performed at…
We address some key issues in designing dense linear algebra (DLA) algorithms that are common for both multi/many-cores and special-purpose architectures (in particular GPUs). We present them in the context of an LU factorization algorithm, where randomization techniques are used as an alternative to pivoting. This approach yields an algorithm based…
We study several solvers for the solution of general linear systems where the main objective is to reduce the communication overhead due to pivoting. We first describe two existing algorithms for the LU factorization on hybrid CPU/GPU architectures. The first one is based on partial pivoting and the second uses a random preconditioning of the original…
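
A minimal sketch of the randomization idea running through the three preceding entries: transform A on both sides with random butterfly matrices so that Gaussian elimination without pivoting becomes safe in practice, then recover the solution of the original system. Depth-1 butterflies and a tiny dense test case are simplifying assumptions; the actual method uses recursive butterflies and tuned, GPU-aware kernels.

```python
# Sketch (with simplifying assumptions) of randomization as an alternative
# to pivoting: Ar = U^T A V with U, V random depth-1 butterfly matrices,
# followed by LU *without* pivoting on Ar.
import numpy as np

def butterfly(n, rng):
    # B = (1/sqrt(2)) [[R0, R1], [R0, -R1]], with R0, R1 random diagonal.
    h = n // 2
    r0 = np.diag(rng.uniform(0.5, 1.5, h))
    r1 = np.diag(rng.uniform(0.5, 1.5, h))
    return np.block([[r0, r1], [r0, -r1]]) / np.sqrt(2.0)

def lu_nopiv(M):
    # Plain LU with no pivot search or row exchanges; reasonable here because
    # the randomization makes zero or tiny pivots unlikely.
    M = M.astype(float).copy()
    n = M.shape[0]
    for k in range(n - 1):
        M[k+1:, k] /= M[k, k]
        M[k+1:, k+1:] -= np.outer(M[k+1:, k], M[k, k+1:])
    return np.tril(M, -1) + np.eye(n), np.triu(M)

rng = np.random.default_rng(1)
n = 8
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

U, V = butterfly(n, rng), butterfly(n, rng)
Ar = U.T @ A @ V                      # two-sided random transformation
L, R = lu_nopiv(Ar)                   # communication-friendly: no pivoting

# Recover x for Ax = b: solve Ar y = U^T b, then x = V y.
y = np.linalg.solve(R, np.linalg.solve(L, U.T @ b))
x = V @ y
assert np.allclose(A @ x, b)
```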
In this paper we describe the parallel distributed implementation of a linear solver for large-scale applications involving real symmetric positive definite or complex symmetric non-Hermitian dense systems. The advantage of this routine is that it performs a Cholesky factorization using half the storage needed by the standard parallel libraries…
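
To make the half-storage point concrete, the toy below performs a Cholesky factorization directly on a packed array holding only the n(n+1)/2 lower-triangle entries, using the classical column-major packed convention as an assumption. It is a serial sketch of the storage layout only; the parallel distributed implementation and the complex symmetric non-Hermitian case described above are not shown.

```python
# Cholesky factorization on packed lower-triangular storage: n(n+1)/2
# entries instead of n^2. Serial toy; the distributed aspects are omitted.
import numpy as np

def idx(i, j, n):
    # Position of A[i, j] (i >= j) in column-major packed lower storage.
    return i + j * n - j * (j + 1) // 2

def packed_cholesky(ap, n):
    # In-place A = L L^T on the packed array ap.
    for j in range(n):
        d = np.sqrt(ap[idx(j, j, n)])
        ap[idx(j, j, n)] = d
        ap[idx(j, j, n) + 1: idx(j, j, n) + n - j] /= d   # scale column j
        for k in range(j + 1, n):                         # trailing update
            s, t = idx(k, k, n), idx(k, j, n)
            ap[s:s + n - k] -= ap[t:t + n - k] * ap[t]
    return ap

n = 5
M = np.random.rand(n, n)
A = M @ M.T + n * np.eye(n)                         # SPD test matrix
ap = np.concatenate([A[j:, j] for j in range(n)])   # pack the lower triangle
packed_cholesky(ap, n)

L = np.zeros((n, n))
for j in range(n):
    L[j:, j] = ap[idx(j, j, n): idx(j, j, n) + n - j]
assert np.allclose(L @ L.T, A)
```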
Empirical auto-tuning and machine learning techniques have shown high potential to improve execution time, power consumption, code size, reliability, and other important metrics of various applications for more than two decades. However, they are still far from widespread production use due to the lack of native support for auto-tuning in an ever-changing…
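
A toy illustration of the empirical side: run the same computation under several candidate parameter values, time each, and keep the fastest. The blocked matrix multiply and its block size are assumed, purely illustrative tuning knobs; production auto-tuners and the machine-learning guidance discussed above search far larger spaces.

```python
# Minimal empirical auto-tuning loop: measure, compare, pick the winner.
# The kernel and its block-size knob are illustrative assumptions.
import time
import numpy as np

def blocked_matmul(A, B, bs):
    n = A.shape[0]
    C = np.zeros_like(A)
    for i in range(0, n, bs):
        for j in range(0, n, bs):
            for k in range(0, n, bs):
                C[i:i+bs, j:j+bs] += A[i:i+bs, k:k+bs] @ B[k:k+bs, j:j+bs]
    return C

n = 512
A, B = np.random.rand(n, n), np.random.rand(n, n)

timings = {}
for bs in (32, 64, 128, 256):          # candidate values of the tuning knob
    t0 = time.perf_counter()
    blocked_matmul(A, B, bs)
    timings[bs] = time.perf_counter() - t0

best = min(timings, key=timings.get)
print(f"best block size: {best} ({timings[best]:.3f}s)")
```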
We propose in this paper a distributed packed storage format that exploits the symmetry or the triangular structure of a dense matrix. This format stores only half of the matrix while maintaining most of the efficiency of full storage for a wide range of operations. This work was motivated by the fact that, contrary to sequential linear…
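
One way such a format can be split across processes, sketched below under an assumed cyclic layout (the paper defines its own distribution scheme, which this does not reproduce): deal the packed columns of the lower triangle out round-robin, so each of p processes holds roughly n(n+1)/(2p) entries and the full matrix is never materialized twice.

```python
# Toy distribution of a packed lower triangle: cyclic dealing of the
# triangle's columns across p processes. The cyclic choice is an assumption
# for illustration, not the paper's actual format.
import numpy as np

def distribute_packed_lower(A, p):
    # Each process receives (column index, packed column) pairs.
    n = A.shape[0]
    local = [[] for _ in range(p)]
    for j in range(n):
        local[j % p].append((j, A[j:, j].copy()))   # column j of the triangle
    return local

n, p = 10, 4
M = np.random.rand(n, n)
A = M @ M.T                                         # symmetric test matrix
local = distribute_packed_lower(A, p)

per_proc = [sum(col.size for _, col in cols) for cols in local]
print(per_proc, "total:", sum(per_proc), "== n(n+1)/2 ==", n * (n + 1) // 2)
```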