Accelerating program performance via SIMD vector units is common in modern processors, as evidenced by the use of SSE, MMX, VMX, and VSX SIMD instructions in multimedia, scientific, and embedded applications. To take full advantage of these vector capabilities, a compiler needs to generate efficient vector code automatically. However, most commercial and …
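As a rough illustration (not drawn from the paper), the kind of loop an auto-vectorizing compiler targets is a simple unit-stride, element-wise array operation; on the JVM, the JIT compiler may turn such a loop into SIMD instructions when the access pattern allows, though whether it actually does so depends on the compiler and target ISA.

```java
// Element-wise "saxpy": a typical candidate for automatic SIMD vectorization.
public class SaxpyExample {
    static void saxpy(float alpha, float[] x, float[] y, float[] out) {
        // Straight-line, unit-stride loop with no cross-iteration dependences:
        // exactly the shape that vectorizers look for.
        for (int i = 0; i < x.length; i++) {
            out[i] = alpha * x[i] + y[i];
        }
    }

    public static void main(String[] args) {
        float[] x = new float[1024], y = new float[1024], out = new float[1024];
        java.util.Arrays.fill(x, 1.0f);
        java.util.Arrays.fill(y, 2.0f);
        saxpy(3.0f, x, y, out);
        System.out.println(out[0]); // 5.0
    }
}
```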
This paper introduces SLAW, a Scalable Locality-aware Adaptive Work-stealing scheduler. The SLAW scheduler is designed to address two common limitations of current work-stealing schedulers: the use of a fixed task-scheduling policy and locality obliviousness due to randomized stealing. Past work has demonstrated the pros and cons of using fixed scheduling …
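To make the work-stealing setting concrete, here is a minimal sketch (not SLAW itself) using the JDK's ForkJoinPool, which is a work-stealing scheduler: idle workers steal queued subtasks from busy workers' deques. SLAW goes further by choosing between work-first and help-first policies adaptively and by stealing with locality awareness, which this plain example does not do.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Minimal work-stealing example: recursive Fibonacci on ForkJoinPool.
public class FibTask extends RecursiveTask<Long> {
    private final int n;
    FibTask(int n) { this.n = n; }

    @Override
    protected Long compute() {
        if (n < 2) return (long) n;
        FibTask left = new FibTask(n - 1);
        left.fork();                               // push subtask onto this worker's deque (stealable)
        long right = new FibTask(n - 2).compute(); // work on the other half directly
        return right + left.join();
    }

    public static void main(String[] args) {
        long result = new ForkJoinPool().invoke(new FibTask(30));
        System.out.println(result); // 832040
    }
}
```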
Type systems that prevent data races are a powerful tool for parallel programming, eliminating whole classes of bugs that are both hard to find and hard to fix. Unfortunately, it is difficult to apply previous such type systems to "real" programs, as each of them is designed around a specific synchronization primitive or parallel pattern, such as locks …
In this paper, we present the Habanero-Java (HJ) language developed at Rice University as an extension to the original Java-based definition of the X10 language. HJ includes a powerful set of task-parallel programming constructs that can be added as simple extensions to standard Java programs to take advantage of today's multi-core and heterogeneous …
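HJ's task-parallel constructs (such as async and finish) are not standard Java; as a rough analogy only, the structured "spawn child tasks, then wait for all of them" pattern they express can be written with the plain JDK as follows. This is not HJ syntax, just an illustration of the pattern.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Plain-Java analogue of a "finish { async S1; async S2; }" pattern:
// spawn two child tasks and wait for both before proceeding.
public class FinishAsyncAnalogy {
    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);

        CompletableFuture<Void> t1 =
            CompletableFuture.runAsync(() -> System.out.println("child task 1"), pool);
        CompletableFuture<Void> t2 =
            CompletableFuture.runAsync(() -> System.out.println("child task 2"), pool);

        // The "finish" scope: block until all spawned children complete.
        CompletableFuture.allOf(t1, t2).join();
        System.out.println("all child tasks have completed");

        pool.shutdown();
    }
}
```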
Modern computer systems feature multiple homogeneous or heterogeneous computing units with deep memory hierarchies, and expect a high degree of thread-level parallelism from the software. Exploitation of data locality is critical to achieving scalable parallelism, but adds a significant dimension of complexity to performance optimization of parallel …
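A small, generic illustration (not tied to this paper) of why locality matters: traversing a row-major 2D array in row order touches memory sequentially and is cache-friendly, while column order strides across cache lines on every access.

```java
// Row-major vs. column-major traversal of the same array.
// Both compute the same sum; the row-order loop has far better cache locality.
public class LocalityDemo {
    static long sumRowOrder(int[][] a) {
        long s = 0;
        for (int i = 0; i < a.length; i++)          // consecutive elements of a row
            for (int j = 0; j < a[i].length; j++)   //   are adjacent in memory
                s += a[i][j];
        return s;
    }

    static long sumColumnOrder(int[][] a) {
        long s = 0;
        for (int j = 0; j < a[0].length; j++)       // jumps between rows on every access,
            for (int i = 0; i < a.length; i++)      //   touching a new cache line each time
                s += a[i][j];
        return s;
    }

    public static void main(String[] args) {
        int n = 2048;
        int[][] a = new int[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                a[i][j] = 1;
        System.out.println(sumRowOrder(a));    // 4194304
        System.out.println(sumColumnOrder(a)); // same value, typically slower
    }
}
```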
One of the major productivity hurdles for parallel programming is non-determinism: a parallel program may yield different results on different executions with the same input, depending on the order in which operations are interleaved. A major source of non-determinism is data races, and checking for the absence of data races is an important candidate for …
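For concreteness (a generic example, not taken from the paper), the canonical data race is two threads updating a shared counter without synchronization; the final value depends on how the unsynchronized read-modify-write operations interleave.

```java
// Two threads increment a shared counter without synchronization.
// counter++ is a read-modify-write, so updates can be lost; the printed
// result varies from run to run and is usually less than 2,000,000.
public class DataRaceDemo {
    static int counter = 0; // shared, unsynchronized

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            for (int i = 0; i < 1_000_000; i++) {
                counter++; // racy access
            }
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println("counter = " + counter); // non-deterministic
    }
}
```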
Existing dynamic race detectors suffer from at least one of the following three limitations: (i) space overhead per memory location grows linearly with the number of parallel threads [13], severely limiting the parallelism that the algorithm can handle; (ii) sequentialization: the parallel program must be processed in a sequential order, …
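To see why per-location space can grow with the number of threads, here is a simplified sketch of vector-clock-style shadow metadata (my own illustration, not the algorithm proposed or cited in the paper): each monitored location keeps one clock entry per thread, so memory overhead is proportional to the thread count for every location.

```java
// Simplified sketch of vector-clock metadata kept by some dynamic race
// detectors: every monitored memory location stores one timestamp per
// thread, so per-location space grows linearly with the thread count.
public class ShadowLocation {
    private final int[] lastReadClock;  // one entry per thread
    private final int[] lastWriteClock; // one entry per thread

    ShadowLocation(int numThreads) {
        this.lastReadClock = new int[numThreads];
        this.lastWriteClock = new int[numThreads];
    }

    void recordRead(int threadId, int threadClock)  { lastReadClock[threadId] = threadClock; }
    void recordWrite(int threadId, int threadClock) { lastWriteClock[threadId] = threadClock; }

    // Rough space estimate for the metadata attached to a single location.
    long approxBytes() {
        return 4L * (lastReadClock.length + lastWriteClock.length);
    }

    public static void main(String[] args) {
        System.out.println(new ShadowLocation(16).approxBytes());   // 128 bytes per location
        System.out.println(new ShadowLocation(1024).approxBytes()); // 8192 bytes per location
    }
}
```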
Increasing the number of instructions executing in parallel has helped improve processor performance, but the technique is limited. Executing code on parallel threads and processors has fewer limitations, but most computer programs tend to be serial in nature. This paper presents a compiler optimisation that at run-time parallelises code inside a JVM and …
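As a rough, hand-written analogue only (the system described here applies such transformations automatically inside the JVM, without source changes), the effect of run-time loop parallelisation is to take a serial loop over independent iterations and distribute those iterations across threads:

```java
import java.util.stream.IntStream;

// Hand-written illustration of the serial-to-parallel transformation that a
// run-time parallelising compiler would apply automatically to suitable loops.
public class LoopParallelisation {
    public static void main(String[] args) {
        double[] data = new double[1_000_000];

        // Original serial loop: iterations are independent of one another.
        for (int i = 0; i < data.length; i++) {
            data[i] = Math.sqrt(i);
        }

        // Equivalent parallel form, distributing iterations across threads.
        IntStream.range(0, data.length)
                 .parallel()
                 .forEach(i -> data[i] = Math.sqrt(i));

        System.out.println(data[999_999]); // ~999.9995
    }
}
```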
X10 is a new object-oriented PGAS (Partitioned Global Address Space) programming language with support for distributed asynchronous dynamic parallelism that goes beyond past SPMD message-passing models such as MPI and SPMD PGAS models such as UPC and Co-Array Fortran. The concurrency constructs in X10 make it possible to express complex computation and …