Norman Rubin

Spike is a profile-directed optimizer for Alpha/NT executables that is actively being used to optimize shipping products. Spike consists of the Spike Optimization Environment (SOE) and the Spike Optimizer. Through both a graphical interface and a command-line interface, the Spike Optimization Environment provides a simple means to instrument and optimize …
Modern compilers employ sophisticated instruction scheduling techniques to reduce the number of cycles needed to execute the instruction stream. In addition to correctness, the instruction scheduler must also ensure that hardware resources are not oversubscribed in any cycle. For a contemporary processor implementation with multiple pipelines and complex …
GPUs have proven effective for structured applications that map well to the rigid 1D-3D grid of threads in modern bulk synchronous parallel (BSP) programming languages. However, less success has been encountered in mapping data-intensive irregular applications such as graph analytics, relational databases, and machine learning. Recently introduced …
State-of-the-art graphics processing units (GPUs) employ single-instruction, multiple-data (SIMD) execution to achieve both high computational throughput and energy efficiency. As previous work has shown, there is significant computational redundancy in SIMD execution, where different execution lanes operate on the same operand values. Such value …
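The redundancy described in this abstract can be illustrated with a small, purely hypothetical CUDA kernel (not taken from the paper): when an operand is uniform across a warp, every lane repeats the same arithmetic, and the result could instead be computed once and broadcast. All names below are made up for illustration.

// Illustrative sketch of value redundancy across SIMD lanes (hypothetical example).
// 'scale' depends only on blockIdx.x, so every lane of a warp computes the
// identical product -- 32 redundant multiplies per warp.
__global__ void scale_redundant(const float* a, const float* b, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float scale = a[blockIdx.x] * b[blockIdx.x];   // same value in every lane
    if (i < n) out[i] = out[i] * scale;
}

// One way the redundancy could be removed: lane 0 performs the multiply and
// broadcasts the result to the rest of the warp.
__global__ void scale_broadcast(const float* a, const float* b, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x & 31;
    float scale = 0.0f;
    if (lane == 0) scale = a[blockIdx.x] * b[blockIdx.x];
    scale = __shfl_sync(0xffffffff, scale, 0);     // broadcast from lane 0
    if (i < n) out[i] = out[i] * scale;
}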
OpenCL is an industry standard for parallel programming on heterogeneous devices. With OpenCL, compute-intensive portions of an application can be offloaded to a variety of processing units within a system. OpenCL is the first standard that focuses on portability, allowing programs to be written once and run seamlessly on multiple, heterogeneous devices …
Many developers have begun to realize that heterogeneous multi-core and many-core computer systems can provide significant performance opportunities for a range of applications. Typical applications possess multiple components that can be parallelized; developers need to be equipped with proper performance tools to analyze program flow and identify …
On-chip shared memory (a.k.a. local data share) is a critical resource for many GPGPU applications. In current GPUs, the shared memory is allocated when a thread block (also called a workgroup) is dispatched to a streaming multiprocessor (SM) and is released only when the thread block completes. As a result, the limited capacity of shared memory becomes a …
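A rough illustration of the allocation behavior this abstract describes (a sketch under assumed sizes, not the paper's mechanism): in CUDA, a statically declared __shared__ array is reserved for a thread block when the block is dispatched to an SM and is only freed when the block retires, so the per-SM shared-memory capacity caps how many blocks can be resident at once. The kernel name and sizes are invented for illustration.

// Illustrative only: each block reserves 16 KB of shared memory for its whole
// lifetime, even though it is used only in the first phase of the kernel.
// On an SM with, say, 64 KB of shared memory, at most 4 such blocks can be
// resident concurrently, regardless of register or thread availability.
__global__ void staged_kernel(const float* in, float* out, int n)
{
    __shared__ float tile[4096];   // 16 KB, reserved for the block's entire lifetime

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Phase 1: stage data through shared memory.
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();
    float v = (i < n) ? tile[threadIdx.x] : 0.0f;

    // Phase 2: a long arithmetic phase that never touches 'tile' again,
    // yet the 16 KB stays allocated until the block completes and
    // therefore limits how many blocks the SM can host concurrently.
    for (int it = 0; it < 100; ++it)
        v = v * 1.0001f + 0.5f;
    if (i < n) out[i] = v;
}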
A frequently used clustering method is the k-means algorithm. The k-means algorithm consists of two steps: a map step, which is simple to execute on a GPU, and a reduce step, which is more problematic. Previous researchers have used a hybrid approach in which the map step is computed on the GPU and the reduce step is performed on the CPU …
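A minimal sketch of the hybrid scheme the abstract describes, with the map step (nearest-centroid assignment) on the GPU and the reduce step (centroid recomputation) on the CPU. The kernel, host function, and data here are made up for illustration and are not the paper's implementation; the example uses 1-D points to stay short.

#include <cstdio>
#include <vector>
#include <cfloat>

// Map step (GPU): assign each 1-D point to its nearest centroid.
__global__ void assign_clusters(const float* points, const float* centroids,
                                int* labels, int n, int k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float best_d = FLT_MAX;
    int best_c = 0;
    for (int c = 0; c < k; ++c) {
        float d = points[i] - centroids[c];
        d *= d;
        if (d < best_d) { best_d = d; best_c = c; }
    }
    labels[i] = best_c;
}

// Reduce step (CPU): recompute each centroid as the mean of its assigned points.
void update_centroids(const std::vector<float>& points, const std::vector<int>& labels,
                      std::vector<float>& centroids)
{
    std::vector<float> sum(centroids.size(), 0.0f);
    std::vector<int> count(centroids.size(), 0);
    for (size_t i = 0; i < points.size(); ++i) { sum[labels[i]] += points[i]; ++count[labels[i]]; }
    for (size_t c = 0; c < centroids.size(); ++c)
        if (count[c] > 0) centroids[c] = sum[c] / count[c];
}

int main()
{
    const int n = 8, k = 2;
    std::vector<float> pts = {0.f, 1.f, 2.f, 3.f, 10.f, 11.f, 12.f, 13.f};
    std::vector<float> cent = {0.f, 13.f};
    std::vector<int> lab(n);

    float *d_pts, *d_cent; int *d_lab;
    cudaMalloc(&d_pts, n * sizeof(float));
    cudaMalloc(&d_cent, k * sizeof(float));
    cudaMalloc(&d_lab, n * sizeof(int));
    cudaMemcpy(d_pts, pts.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    for (int iter = 0; iter < 10; ++iter) {
        cudaMemcpy(d_cent, cent.data(), k * sizeof(float), cudaMemcpyHostToDevice);
        assign_clusters<<<(n + 255) / 256, 256>>>(d_pts, d_cent, d_lab, n, k);   // map on GPU
        cudaMemcpy(lab.data(), d_lab, n * sizeof(int), cudaMemcpyDeviceToHost);
        update_centroids(pts, lab, cent);                                        // reduce on CPU
    }
    printf("centroids: %f %f\n", cent[0], cent[1]);
    cudaFree(d_pts); cudaFree(d_cent); cudaFree(d_lab);
    return 0;
}

The per-iteration device-to-host copy of the labels is what makes the reduce step "more problematic" in practice: the hybrid approach pays a transfer and synchronization cost on every iteration, which is one motivation for moving the reduction onto the GPU as well.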