Karthikeyan Vaidyanathan

InfiniBand has recently been standardized by the industry for designing next-generation high-end clusters in both the datacenter and high-performance computing domains. Though InfiniBand has been able to support low latency and high bandwidth, traditional sockets-based applications have not been able to take advantage of this; this is mainly attributed to the …
We design and implement a distributed multinode synchronous SGD algorithm, without altering hyperparameters, or compressing data, or altering algorithmic behavior. We perform a detailed analysis of scaling, and identify optimal design points for different networks. We demonstrate scaling of CNNs on 100s of nodes, and present what we believe to be record …
Dense linear algebra has been traditionally used to evaluate the performance and efficiency of new architectures. This trend has continued for the past half decade with the advent of multi-core processors and hardware accelerators. In this paper we describe how several flavors of the Linpack benchmark are accelerated on Intel's recently released …
We present a novel algorithm for reconstructing high-quality defocus blur from a sparsely sampled light field. Our algorithm builds upon recent developments in the area of sheared reconstruction filters and significantly improves reconstruction quality and performance. While previous filtering techniques can be ineffective in regions with complex occlusion, …
Lattice Quantum Chromodynamics (LQCD) is a computationally challenging problem that solves the discretized Dirac equation in the presence of an SU(3) gauge field. Its key operation is a matrix-vector product, known as the Dslash operator. We have developed a novel multicore architecture-friendly implementation of the Wilson-Dslash operator which delivers …
The Intel Xeon Phi architecture from Intel Corporation features parallelism at the level of many x86-based cores, multiple threads per core, and vector processing units. Lattice Quantum Chromodynamics (LQCD) is currently the only known model-independent, non-perturbative computational method for calculations in the theory of the strong interactions, and is of …
This paper demonstrates the first tera-scale performance of Intel® Xeon Phi™ coprocessors on 1D FFT computations. Applying a disciplined performance programming methodology of sound algorithm choice, valid performance model, and well-executed optimizations, we break the tera-flop mark on a mere 64 nodes of Xeon Phi and reach 6.7 TFLOPS with 512 …
Caching has been a very important technique in improving the performance and scalability of web-serving datacenters. The research community has proposed cooperation of caching servers to achieve higher performance benefits. These existing cooperative caching mechanisms often partially duplicate the cached data redundantly on multiple servers for higher …
We present an end-to-end optimization of the innovative Arbitrary high-order DERivative Discontinuous Galerkin (ADER-DG) software SeisSol targeting Intel® Xeon Phi™ coprocessor platforms, achieving unprecedented earthquake model complexity through coupled simulation of full frictional sliding and seismic wave propagation. …