Publications
Improving GPGPU concurrency with elastic kernels
TLDR: We study concurrent execution of GPU kernels using multiprogrammed workloads on current NVIDIA Fermi GPUs.
Fluidic Kernels: Cooperative Execution of OpenCL Programs on Multiple Heterogeneous Devices
TLDR: In this work, we present FluidiCL, an OpenCL runtime that takes a program written for a single device and uses both the CPU and the GPU to execute it.
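The FluidiCL entry above describes a runtime that runs a single-device OpenCL program on both the CPU and the GPU. A minimal sketch of the underlying idea, splitting one data-parallel index range across two workers and merging the partial results; this is an illustration only, not FluidiCL's actual runtime, and the names (`saxpy_range`, `run_on_two_devices`, `cpu_fraction`) are hypothetical, with Python threads standing in for the two devices:

```python
from concurrent.futures import ThreadPoolExecutor

def saxpy_range(a, x, y, start, end):
    """Compute one sub-range of a data-parallel SAXPY kernel."""
    return [a * x[i] + y[i] for i in range(start, end)]

def run_on_two_devices(a, x, y, cpu_fraction=0.3):
    """Split the index range between a 'CPU' worker and a 'GPU' worker
    (hypothetical stand-ins for real devices) and merge the results."""
    n = len(x)
    split = int(n * cpu_fraction)
    with ThreadPoolExecutor(max_workers=2) as pool:
        cpu_part = pool.submit(saxpy_range, a, x, y, 0, split)
        gpu_part = pool.submit(saxpy_range, a, x, y, split, n)
        # Partial results cover disjoint index ranges, so merging is
        # just concatenation in index order.
        return cpu_part.result() + gpu_part.result()
```

The real system additionally handles data transfers and picks the split point dynamically; the fixed `cpu_fraction` here is purely for illustration.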
Minimizing Buffer Requirements under Rate-Optimal Schedule in Regular Dataflow Networks
TLDR: In this paper we propose a method to minimize the buffer storage requirement in constructing rate-optimal compile-time (MBRO) schedules for multi-rate dataflow graphs.
Bi-Modal DRAM Cache: Improving Hit Rate, Hit Latency and Bandwidth
TLDR: In this paper, we present Bi-Modal Cache, a flexible stacked DRAM cache organization which simultaneously achieves several objectives: (i) improved cache hit ratio, (ii) moving the tag storage overhead to DRAM, (iii) lower cache hit latency than tags-in-SRAM, and (iv) reduction in off-chip bandwidth wastage.
Automatic compilation of MATLAB programs for synergistic execution on heterogeneous processors
TLDR: We present MEGHA, a compiler that automatically compiles MATLAB programs to enable synergistic execution on heterogeneous processors.
Software Pipelined Execution of Stream Programs on GPUs
TLDR: In this paper, we describe the challenges in mapping StreamIt to GPUs and propose an efficient technique to software pipeline the execution of stream programs on GPUs.
Probabilistic Shared Cache Management (PriSM)
TLDR: In this paper, we propose Probabilistic Shared Cache Management (PriSM), a framework to manage the cache occupancy of different cores at cache block granularity by controlling their eviction probabilities.
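The PriSM entry above describes controlling per-core cache occupancy through eviction probabilities. As a toy illustration of probability-weighted victim selection, not the paper's actual hardware algorithm (which derives the probabilities from occupancy targets), assuming a set ordered MRU to LRU and a hypothetical `eviction_prob` map from core id to eviction weight:

```python
import random

def pick_eviction_victim(cache_set, eviction_prob):
    """Choose which core's block to evict, weighted by per-core eviction
    probabilities, then evict that core's LRU block.

    cache_set: list of (core_id, block) tuples, ordered MRU -> LRU.
    eviction_prob: hypothetical map core_id -> eviction weight.
    """
    cores = [core for core, _ in cache_set]
    weights = [eviction_prob[core] for core in cores]
    victim_core = random.choices(cores, weights=weights, k=1)[0]
    # Scan from the LRU end to find the chosen core's least-recently-used
    # block, remove it from the set, and return it.
    for i in range(len(cache_set) - 1, -1, -1):
        if cache_set[i][0] == victim_core:
            return cache_set.pop(i)
```

Raising a core's weight makes its blocks more likely to be evicted, which is the lever such a scheme uses to steer each core's occupancy toward its target share.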
Fast and efficient automatic memory management for GPUs using compiler-assisted runtime coherence scheme
TLDR: We integrate our automatic memory manager into the X10 compiler and runtime, and find that it not only results in smaller and simpler programs, but also eliminates redundant memory transfers.
Emulating Optimal Replacement with a Shepherd Cache
TLDR: The inherent temporal locality in memory accesses is filtered out by the L1 cache.