The Landscape of Parallel Computing Research: A View from Berkeley
TLDR
The parallel landscape is framed with seven questions, and the following are recommended to explore the design space rapidly: • The overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems. • The target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS per development dollar.
Chisel: Constructing hardware in a Scala embedded language
TLDR
Chisel, a new hardware construction language embedded in the Scala programming language, is introduced; it supports advanced hardware design using highly parameterized generators and layered domain-specific hardware languages, raising the level of hardware design abstraction.
Unbounded Transactional Memory
TLDR
A hardware implementation of unbounded transactional memory, called UTM, is described, which exploits the common case for performance without sacrificing correctness on transactions whose footprint can be nearly as large as virtual memory.
The Rocket Chip Generator
Rocket Chip is an open-source System-on-Chip design generator that emits synthesizable RTL. It leverages the Chisel hardware construction language to compose a library of sophisticated generators for ...
The GAP Benchmark Suite
TLDR
A graph processing benchmark suite that not only specifies graph kernels, input graphs, and evaluation methodologies, but also provides optimized baseline implementations that can be used as a workload representative of graph processing.
Victim replication: maximizing capacity while hiding wire delay in tiled chip multiprocessors
TLDR
This paper presents a new cache management policy, victim replication, which combines the advantages of private and shared schemes, and shows that victim replication reduces the average memory access latency of the shared L2 cache by an average of 16% for multi-threaded benchmarks and 24% for single-threaded benchmarks, providing better overall performance.
Direction-optimizing Breadth-First Search
TLDR
A hybrid approach is proposed that is advantageous for low-diameter graphs; it combines a conventional top-down algorithm with a novel bottom-up algorithm that can dramatically reduce the number of edges examined, which accelerates the search as a whole.
Direction-optimizing breadth-first search
TLDR
A hybrid approach is proposed that is advantageous for low-diameter graphs; it combines a conventional top-down algorithm with a novel bottom-up algorithm that can dramatically reduce the number of edges examined, which accelerates the search as a whole.
Silicon-photonic Clos networks for global on-chip communication
TLDR
Analytical modeling is used to show that a 64-tile photonic Clos network consumes significantly less optical power, thermal tuning power, and area compared to global photonic crossbars over a range of photonic device parameters.
A view of the parallel computing landscape
Writing programs that scale with increasing numbers of cores should be as easy as writing programs for sequential computers.