• Publications
An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness
TLDR
A simple analytical model is proposed that estimates the execution time of massively parallel programs from the number of running threads and the memory bandwidth, estimating the cost of memory requests and thereby the overall execution time of a program.
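The core idea — runtime is bounded by whichever of compute throughput and memory throughput saturates first — can be sketched as follows. This is an illustrative simplification, not the paper's actual MWP/CWP equations; all parameter names and the formula are assumptions made for the sketch.

```python
def estimate_cycles(num_warps, comp_cycles_per_warp, mem_cycles_per_warp,
                    max_overlapped_requests):
    """Toy latency-hiding model for a massively threaded kernel."""
    # Memory-level parallelism actually achieved, capped by running warps.
    mlp = min(num_warps, max_overlapped_requests)
    # Memory-bound estimate: warps' memory requests are serviced mlp at a time.
    memory_time = (num_warps / mlp) * mem_cycles_per_warp
    # Compute-bound estimate: all warps' computation serialized on one core.
    compute_time = num_warps * comp_cycles_per_warp
    # Whichever resource saturates first dominates total execution time.
    return max(memory_time, compute_time)
```

With many warps and cheap computation the memory-bound term dominates; raising arithmetic intensity flips the kernel into the compute-bound regime.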
Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping
TLDR
Adaptive mapping is proposed, a fully automatic technique to map computations to processing elements on a CPU+GPU machine; it is shown that, by judiciously distributing work over the CPU and GPU, automatic adaptive mapping achieves a 25% reduction in execution time and a 20% reduction in energy consumption compared with static mappings, on average, for a set of important computation benchmarks.
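The essence of such an adaptive split — sizing each device's share from its measured throughput so both finish at roughly the same time — can be sketched as below. The function name and rate-based formula are illustrative assumptions, not Qilin's actual implementation.

```python
def adaptive_split(total_work, cpu_rate, gpu_rate):
    """Divide work between CPU and GPU in proportion to measured
    throughput (items/sec), so both devices finish together."""
    cpu_share = total_work * cpu_rate / (cpu_rate + gpu_rate)
    return cpu_share, total_work - cpu_share
```

In practice the rates would come from a short profiling run of each device on a sample of the input.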
An integrated GPU power and performance model
TLDR
An integrated power and performance (IPP) prediction model for a GPU architecture is proposed to predict the optimal number of active processors for a given application, and the outcome of IPP is used to control the number of running cores.
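The decision IPP feeds back to the runtime — pick the core count that gives the best predicted efficiency — can be sketched as below; the dictionary-based interface and the performance-per-watt criterion are assumptions for illustration, not the paper's exact formulation.

```python
def optimal_active_cores(predicted_perf, predicted_power):
    """Given per-core-count predictions (core count -> perf, core count ->
    watts), choose the count with the best performance per watt."""
    return max(predicted_perf,
               key=lambda n: predicted_perf[n] / predicted_power[n])
```

For bandwidth-limited kernels, performance stops scaling past some core count while power keeps rising, so the maximum lands well below the full core count.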
Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers
TLDR
Results show that feedback-directed prefetching eliminates the large negative performance impact incurred on some benchmarks due to prefetching, and that it is applicable to stream-based prefetchers, global-history-buffer-based delta-correlation prefetchers, and PC-based stride prefetchers.
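The feedback loop adjusts prefetcher aggressiveness from three runtime metrics: accuracy, lateness, and cache pollution. A minimal sketch of such a throttling rule follows; the thresholds and degree bounds are illustrative assumptions, not the paper's tuned values.

```python
def adjust_degree(degree, accuracy, lateness, pollution):
    """Feedback-directed throttling sketch: each metric is a fraction
    in [0, 1], degree is the current prefetch aggressiveness level."""
    if accuracy > 0.75 and lateness > 0.5:
        return min(degree + 1, 4)   # accurate but late: prefetch further ahead
    if accuracy < 0.40 or pollution > 0.25:
        return max(degree - 1, 1)   # wasting bandwidth or cache space: throttle
    return degree                   # metrics acceptable: leave prefetcher alone
```

The counters behind these metrics would be sampled at fixed intervals, with the degree update applied once per interval.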
Inferring Fine-grained Control Flow Inside SGX Enclaves with Branch Shadowing
TLDR
A new, yet critical, side-channel attack, branch shadowing, that reveals fine-grained control flows (branch granularity) in an enclave and develops two novel exploitation techniques, a last branch record (LBR)-based history-inferring technique and an advanced programmable interrupt controller (APIC)-based technique to control the execution of an enclave in a finegrained manner.
Transparent Hardware Management of Stacked DRAM as Part of Memory
TLDR
This paper proposes a practical, low-cost architectural solution to efficiently use large fast memory as Part of Memory (PoM) seamlessly, without involving the OS.
GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks
TLDR
GraphPIM is presented, a full-stack solution for graph computing that achieves higher performance using PIM functionality and an extension to PIM operations that can further bring performance benefits for more graph applications.
SD3: A Scalable Approach to Dynamic Data-Dependence Profiling
TLDR
This paper proposes a scalable approach to data-dependence profiling, called SD3, that addresses both runtime and memory overhead in a single framework: it reduces runtime overhead by parallelizing the dependence-profiling step itself, and reduces memory overhead by compressing memory accesses that exhibit stride patterns and computing data dependences directly in the compressed format.
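The stride compression at the heart of this approach can be sketched as collapsing an address trace into (base, stride, count) runs; this is an illustrative simplification, as SD3's actual on-line detector and compressed dependence checks are more elaborate.

```python
def compress_strides(addresses):
    """Collapse a memory-access trace into (base, stride, count) runs."""
    runs = []
    for addr in addresses:
        if runs:
            base, stride, count = runs[-1]
            if count == 1:
                # Second access to a run fixes its stride.
                runs[-1] = (base, addr - base, 2)
                continue
            if addr == base + stride * count:
                # Access continues the stride pattern: extend the run.
                runs[-1] = (base, stride, count + 1)
                continue
        # First access, or pattern broken: start a new run.
        runs.append((addr, 0, 1))
    return runs
```

A regular loop touching `a[0], a[1], a[2], ...` compresses to a single run regardless of trip count, which is why dependence checks on the compressed form scale.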
A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch
TLDR
Two innovations that exploit the bursty nature of memory requests to streamline the DRAM cache are presented: a low-cost Hit-Miss Predictor (HMP) that virtually eliminates the hardware overhead of the previously proposed multi-megabyte MissMap structure, and a Self-Balancing Dispatch mechanism that dynamically sends some requests to off-chip memory even though they may have hit in the die-stacked DRAM cache.
GraphBIG: understanding graph computing in the context of industrial solutions
TLDR
This paper characterizes GraphBIG on real machines, observing extremely irregular memory patterns and significantly diverse behavior across different computations; it helps users understand the impact of modern graph computing on hardware architecture and enables future architecture and system research.
...