A lightweight infrastructure for graph analytics
TLDR
This paper argues that existing DSLs can be implemented on top of a general-purpose infrastructure that supports very fine-grain tasks, implements autonomous, speculative execution of these tasks, and allows application-specific control of task scheduling policies.
A quantitative study of irregular programs on GPUs
TLDR
This paper defines two measures of irregularity called control-flow irregularity and memory-access irregularity, and investigates, using performance-counter measurements, how irregular GPU kernels differ from regular kernels with respect to these measures.
The tao of parallelism in algorithms
TLDR
It is suggested that the operator formulation and tao-analysis of algorithms can be the foundation of a systematic approach to parallel programming.
I-structures: Data structures for parallel computing
TLDR
It is difficult to achieve elegance, efficiency and parallelism simultaneously in functional programs that manipulate large data structures, and I-structures are shown to be invaluable for implementing functional data abstractions.
I-structures: data structures for parallel computing
TLDR
It is difficult to achieve elegance, efficiency, and parallelism simultaneously in functional programs that manipulate large data structures, and it is shown that even in the context of purely functional languages, I-structures are invaluable for implementing functional data abstractions.
Lonestar: A suite of parallel irregular programs
TLDR
This paper characterizes the first five programs from the Lonestar benchmark suite, which target domains such as data mining, survey propagation, and design automation, and shows that even such irregular applications often expose large amounts of parallelism in the form of amorphous data-parallelism.
An Efficient CUDA Implementation of the Tree-Based Barnes Hut n-Body Algorithm
TLDR
This chapter describes the first CUDA implementation of the classical Barnes Hut n-body algorithm that runs entirely on the GPU, concluding that GPUs can be used to accelerate irregular codes, not just regular codes.
Data-centric multi-level blocking
TLDR
This work presents a simple and novel framework for generating blocked codes for high-performance machines with a memory hierarchy based on reasoning directly about the flow of data through the memory hierarchy, which permits a more direct solution to the problem of enhancing data locality.
The program structure tree: computing control regions in linear time
TLDR
A linear-time algorithm is given for finding single-entry single-exit (SESE) regions and for building the program structure tree (PST) of arbitrary control flow graphs (including irreducible ones), and it is shown how to use the algorithm to find control regions in linear time.
Optimistic parallelism requires abstractions
TLDR
The design and implementation of programming abstractions that permit programmers to highlight opportunities for exploiting parallelism in sequential programs are described, along with a runtime system that uses these hints to execute the program in parallel.