Learn More
Many future heterogeneous systems will integrate CPUs and GPUs physically on a single chip and logically connect them via shared memory to avoid explicit data copying. Making this shared memory coherent facilitates programming and fine-grained sharing, but throughput-oriented GPUs can overwhelm CPUs with coherence requests not well-filtered by caches.(More)
Performance, power, and energy (PPE) are critical aspects of modern computing. It is challenging to accurately predict, in real time, the effect of dynamic voltage and frequency scaling (DVFS) on PPE across a wide range of voltages and frequencies. This results in the use of reactive, iterative, and inefficient algorithms for dynamically finding good DVFS(More)
Emerging Non-Volatile Memory (NVM) technologies are explored as potential alternatives to traditional SRAM/DRAM-based memory architecture in future microprocessor design. One of the major disadvantages for NVM is the latency and energy overhead associated with write operations. Mitigation techniques to minimize the write overhead for NVM-based main memory(More)
Future CMPs will combine many simple cores with deep cache hierarchies. With more cores, cache resources per core are fewer, and must be shared carefully to avoid poor utilization due to conflicts and pollution. Explicit motion of data in these architectures, such as message passing, can provide hints about program behavior that can be used to hide latency(More)
Modern CPUs employ Dynamic Voltage and Frequency Scaling (DVFS) to boost performance, lower power, and improve energy efficiency. Good DVFS decisions require accurate performance predictions across frequencies. A new hardware structure for measuring leading load cycles was recently proposed and demonstrated promising performance prediction abilities in(More)
Many future heterogeneous systems will integrate CPUs and GPUs physically on a single chip and logically connect them via shared memory to avoid explicit data copying. Making this shared memory coherent facilitates programming and fine-grained sharing, but throughput-oriented GPUs can overwhelm CPUs with coherence requests not well-filtered by caches.(More)
Deep neural networks (DNN) achieved significant breakthrough in vision recognition in 2012 and quickly became the leading machine learning algorithm in Big Data based large scale object recognition applications. The successful deployment of DNN based applications pose challenges for a cross platform software framework that enable multiple user scenarios,(More)
Deep Neural Networks (DNN), with deep layers and very high dimension of parameters, have demonstrated break-through learning capability in machine learning area. These days DNN with Big Data input are leading a new direction in large scale object recognition. DNN training requires vast amount of computing power, which poses great challenge to system design.(More)
Modern heterogeneous multiprocessors integrate CPU and GPU together to provide a boost to computational performance. With tighter integration of CPU and GPU, it is critical to share and move data more efficiently in order to leverage the computational power that a GPU can provide. Initially, DMA or PCIe devices were used to transfer data between CPU and GPU(More)
Moving data between cores on hardware coherent architectures suffers from memory latency and causes cache misses and coherence traffic, which are obstacles to achieving high performance. In this paper, we evaluate the potential for hardware optimization of message data transfer on chip multiprocessors with a combination of NAS parallel MPI benchmarks, Intel(More)