Srinivas Sridharan

We design and implement a distributed multi-node synchronous SGD algorithm without altering hyperparameters, compressing data, or changing algorithmic behavior. We perform a detailed analysis of scaling and identify optimal design points for different networks. We demonstrate scaling of CNNs on hundreds of nodes, and present what we believe to be record…
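To make the data-parallel structure concrete, here is a minimal sketch of one synchronous SGD step, assuming gradients are aggregated with MPI_Allreduce; the local-gradient argument and the flat weight layout are hypothetical placeholders, not the system described above.

// Minimal sketch of one data-parallel synchronous SGD step.
// Assumes MPI for gradient aggregation; the weight/gradient layout is a
// hypothetical placeholder, not the paper's implementation.
#include <mpi.h>
#include <cstddef>
#include <vector>

void sync_sgd_step(std::vector<float>& weights,
                   const std::vector<float>& local_grad,
                   float lr, MPI_Comm comm) {
    int world = 0;
    MPI_Comm_size(comm, &world);

    // Sum gradients from all nodes; every rank receives the global sum.
    std::vector<float> global_grad(local_grad.size());
    MPI_Allreduce(local_grad.data(), global_grad.data(),
                  static_cast<int>(local_grad.size()),
                  MPI_FLOAT, MPI_SUM, comm);

    // Average and apply the update identically on every rank, so the model
    // replicas stay in sync without changing any hyperparameters.
    for (std::size_t i = 0; i < weights.size(); ++i)
        weights[i] -= lr * (global_grad[i] / world);
}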
A number of prior research efforts have investigated thread-scheduling mechanisms that enable better reuse of data in a processor's cache. We propose to exploit the locality of critical-section data by enforcing an affinity between a lock and the processor that has cached the execution state of the critical section protected by that lock. We investigate…
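A minimal sketch of the lock-affinity idea, under the assumption that a lock simply remembers the CPU on which its critical section last ran and an acquiring thread migrates there before contending; the AffinityLock type and the use of Linux sched_setaffinity are illustrative choices, not the mechanism studied above.

// Illustrative sketch of lock affinity: the acquiring thread migrates to the
// CPU that last executed the critical section, so the protected data is more
// likely to still be warm in that processor's cache. AffinityLock is a
// hypothetical type; the actual mechanism may differ.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <atomic>
#include <mutex>
#include <sched.h>

class AffinityLock {
    std::mutex m_;
    std::atomic<int> home_cpu_{-1};   // CPU that last ran the critical section
public:
    void lock() {
        int home = home_cpu_.load(std::memory_order_relaxed);
        if (home >= 0) {
            // Migrate this thread to the lock's "home" CPU before contending.
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(home, &set);
            sched_setaffinity(0, sizeof(set), &set);
        }
        m_.lock();
        home_cpu_.store(sched_getcpu(), std::memory_order_relaxed);
    }
    void unlock() { m_.unlock(); }
};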
The NAS Parallel Benchmarks (NPB) are a well-known suite of benchmarks that proxy scientific computing applications. They specify several problem sizes that represent how such applications may run on different sizes of HPC systems. However, even the largest problem (class F) is still far too small to properly exercise a petascale supercomputer. Our…
Modern high-speed interconnection networks are designed to support communication from multiple processor cores. The MPI endpoints extension has been proposed to ease process and thread count tradeoffs by enabling multithreaded MPI applications to efficiently drive independent network communication. In this work, we present the first…
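The endpoints extension itself is a proposal, so the sketch below shows only the baseline it targets: an MPI_THREAD_MULTIPLE program in which several threads of one process drive independent point-to-point traffic through the same rank, separated by tag. The ring exchange and thread count are arbitrary choices for illustration.

// Baseline that the MPI endpoints extension aims to improve on: threads of one
// MPI process all communicate through the same rank, using MPI_THREAD_MULTIPLE
// and per-thread tags. Sketch only; the payloads are dummies.
#include <mpi.h>
#include <thread>
#include <vector>

int main(int argc, char** argv) {
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nthreads = 4;
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t) {
        threads.emplace_back([=] {
            int peer = (rank + 1) % size;          // simple ring exchange
            int sendbuf = rank * nthreads + t, recvbuf = -1;
            // Each thread issues independent point-to-point traffic; the tag
            // keeps per-thread message streams separate on the shared rank.
            MPI_Sendrecv(&sendbuf, 1, MPI_INT, peer, /*sendtag=*/t,
                         &recvbuf, 1, MPI_INT, MPI_ANY_SOURCE, /*recvtag=*/t,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        });
    }
    for (auto& th : threads) th.join();
    MPI_Finalize();
    return 0;
}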
The Non-Uniform Fast Fourier Transform (NUFFT) is a generalization of the FFT to non-equidistant samples. Its applications range from medical imaging to radio astronomy to the numerical solution of partial differential equations. Despite recent advances in speeding up the NUFFT on various platforms, its practical applications are still limited due to…
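For reference, the transform being accelerated can be written down directly; the sketch below evaluates a 1-D type-2 non-uniform DFT by brute force in O(N·M) time, which is the computation that fast NUFFT algorithms approximate.

// Direct O(N*M) evaluation of a 1-D type-2 non-uniform DFT: given Fourier
// coefficients c_k for centered modes k = -N/2 .. N/2-1, compute
//   f(x_j) = sum_k c_k * exp(i * k * x_j)
// at arbitrary sample points x_j. Fast NUFFT algorithms approximate this sum
// in roughly O(N log N + M) time; this brute-force version is the reference.
#include <complex>
#include <vector>

std::vector<std::complex<double>>
nudft_type2(const std::vector<std::complex<double>>& coeffs,   // size N
            const std::vector<double>& points)                 // size M, in [0, 2*pi)
{
    const int N = static_cast<int>(coeffs.size());
    std::vector<std::complex<double>> out(points.size());
    for (std::size_t j = 0; j < points.size(); ++j) {
        std::complex<double> acc(0.0, 0.0);
        for (int k = 0; k < N; ++k) {
            const double mode = k - N / 2;                      // centered mode index
            acc += coeffs[k] * std::polar(1.0, mode * points[j]);
        }
        out[j] = acc;
    }
    return out;
}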
This paper deals with architectures that expose novel concurrency models through lightweight multithreading and that support a high computation-to-memory ratio by leveraging technologies such as Processing-In-Memory (PIM), Embedded DRAM, or Stacked Memory [1, 2]. In this paper we explore the idea of implementing key parallel processing functions such as…
A subset of the Parallel Research Kernels (PRK), simplified parallel application patterns, is used to study the behavior of different runtimes implementing the PGAS programming model. The goal of this paper is to show that such an approach is practical and effective as we approach the exascale era. Our experimental results indicate that, for the kernels we…
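The access pattern such runtimes provide is a partitioned global address space with one-sided reads and writes; the sketch below illustrates it with MPI RMA windows as a stand-in, which is an assumption made for illustration rather than one of the runtimes evaluated above.

// Sketch of the PGAS-style access pattern the PRK kernels exercise: each rank
// owns a slice of a global array, and any rank can read a remote element with
// a one-sided get. MPI RMA windows stand in here for a real PGAS runtime.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Each rank contributes a local slice of the "global" array.
    std::vector<double> local(1024, static_cast<double>(rank));
    MPI_Win win;
    MPI_Win_create(local.data(), local.size() * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    // One-sided read of element 0 owned by the next rank; no matching send/recv.
    double remote = 0.0;
    int target = (rank + 1) % size;
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    MPI_Get(&remote, 1, MPI_DOUBLE, target, /*disp=*/0, 1, MPI_DOUBLE, win);
    MPI_Win_unlock(target, win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}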
This paper presents the first 15-PetaFLOP deep learning system for solving scientific pattern-classification problems on contemporary HPC architectures. We develop supervised convolutional architectures for discriminating signals in high-energy physics data, as well as semi-supervised architectures for localizing and classifying extreme weather in climate…
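The basic building block of the supervised architectures is the convolution; the sketch below is a deliberately naive single-channel 2-D convolution with valid padding, far from the optimized kernels a 15-PetaFLOP system relies on, but it shows the operation being scaled.

// Naive single-channel 2-D convolution ("valid" padding), the basic building
// block of a supervised convolutional architecture. Illustration only; a
// production HPC deep learning stack uses heavily optimized kernels instead.
#include <vector>

using Image = std::vector<std::vector<float>>;

Image conv2d_valid(const Image& in, const Image& kernel) {
    const std::size_t kh = kernel.size(), kw = kernel[0].size();
    const std::size_t oh = in.size() - kh + 1, ow = in[0].size() - kw + 1;
    Image out(oh, std::vector<float>(ow, 0.0f));
    for (std::size_t y = 0; y < oh; ++y)
        for (std::size_t x = 0; x < ow; ++x)
            for (std::size_t i = 0; i < kh; ++i)
                for (std::size_t j = 0; j < kw; ++j)
                    out[y][x] += in[y + i][x + j] * kernel[i][j];
    return out;
}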
This paper presents the challenges encountered in, and potential solutions to, designing scalable Software Transactional Memory (STM) for large-scale distributed-memory systems with thousands of nodes. We introduce Global Transactional Memory (GTM), a generalized and scalable STM design supporting a dynamic programming model based on thread-level parallelism…
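As background for the STM mechanics that GTM generalizes to distributed memory, the sketch below is a deliberately simplified single-node transaction: reads record a version, writes are buffered, and commit validates the read set under a global lock and retries on conflict. The TxVar type and version counters are illustrative only, not GTM's design.

// Simplified single-node STM sketch: versioned transactional words, buffered
// writes, and commit-time validation of the read set under one global lock,
// with the whole transaction retried on conflict. Illustrative only.
#include <atomic>
#include <functional>
#include <mutex>
#include <unordered_map>

struct TxVar {                      // a transactional word: value + version
    std::atomic<int> value{0};
    std::atomic<unsigned> version{0};
};

class Transaction {
    std::unordered_map<TxVar*, unsigned> reads_;   // var -> version seen
    std::unordered_map<TxVar*, int> writes_;       // var -> buffered new value
    static std::mutex commit_lock_;
public:
    int read(TxVar& v) {
        auto it = writes_.find(&v);
        if (it != writes_.end()) return it->second;           // read-your-writes
        reads_.emplace(&v, v.version.load());                 // record version seen
        return v.value.load();
    }
    void write(TxVar& v, int x) { writes_[&v] = x; }

    bool commit() {
        std::lock_guard<std::mutex> g(commit_lock_);
        for (auto& [var, ver] : reads_)                       // validate read set
            if (var->version.load() != ver) return false;     // conflict: abort
        for (auto& [var, val] : writes_) {                    // publish write set
            var->value.store(val);
            var->version.fetch_add(1);
        }
        return true;
    }
};
std::mutex Transaction::commit_lock_;

// Retry loop: rerun the transaction body until it commits successfully.
void atomically(const std::function<void(Transaction&)>& body) {
    for (;;) {
        Transaction tx;
        body(tx);
        if (tx.commit()) return;
    }
}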