Jiayuan Meng

SIMD organizations amortize the area and power of fetch, decode, and issue logic across multiple processing units in order to maximize throughput for a given area and power budget. However, throughput is reduced when a set of threads operating in lockstep (a warp) is stalled by long-latency memory accesses. The resulting idle cycles are extremely …
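The latency-hiding trade-off this abstract describes can be illustrated with a toy model (a sketch, not code from the paper; the scheduler, `MEM_LATENCY`, and miss pattern are hypothetical): a round-robin scheduler issues one ready warp per cycle, a warp that performs a long-latency access stalls for several cycles, and any cycle with no ready warp is lost throughput.

```python
MEM_LATENCY = 4  # cycles a warp stalls after a long-latency access (assumed)

def simulate(num_warps, miss_pattern, cycles):
    """Return (issued, idle) cycle counts for a round-robin warp scheduler."""
    ready_at = [0] * num_warps   # cycle at which each warp becomes ready again
    issued = idle = 0
    for t in range(cycles):
        for w in range(num_warps):
            if ready_at[w] <= t:         # first ready warp issues this cycle
                issued += 1
                if miss_pattern(w, t):   # long-latency memory access?
                    ready_at[w] = t + MEM_LATENCY
                break
        else:
            idle += 1                    # every warp stalled: wasted cycle
    return issued, idle

# One warp that always misses issues only every MEM_LATENCY cycles;
# four such warps give the scheduler enough work to hide the latency.
print(simulate(1, lambda w, t: True, 20))   # (5, 15)
print(simulate(4, lambda w, t: True, 20))   # (20, 0)
```

With a single always-missing warp, three of every four cycles idle; with four warps the same miss pattern is fully hidden, which is the throughput effect the abstract's "idle cycles" refer to.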
Graphics processors (GPUs), with many lightweight data-parallel cores, can provide substantial parallel computational power to accelerate general-purpose applications. To best utilize the GPU's parallel computing resources, it is crucial to understand how GPU architectures and programming models can be applied to different categories of traditionally CPU …
  • Prateeksha Satyamoorthy, Kevin Skadron, Stuart Wolf, James H. Aylor, +10 others
  • 2011
In the context of massively parallel processors such as Graphics Processing Units (GPUs), an emerging non-volatile memory, STT-RAM, provides substantial power and area savings and increased capacity compared to the conventionally used SRAM. The use of highly dense, low static power STT-RAM in processors that run just a few threads of execution does not seem …
SIMD organizations have been shown to allow high throughput for data-parallel applications. They operate on multiple datapaths under the same instruction sequencer; the set of lanes operating in lockstep is sometimes referred to as a warp, and a single lane as a thread. However, the ability of SIMD to gather from disparate addresses instead of …
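The cost of gathering from disparate addresses can be sketched with a simple coalescing model (illustrative only; the 128-byte segment size is an assumption typical of GPUs, not taken from the paper): addresses issued by one warp that fall in the same aligned memory segment coalesce into a single transaction, while scattered addresses each pay for their own.

```python
SEGMENT = 128  # bytes per memory transaction (assumed segment size)

def transactions(addresses, segment=SEGMENT):
    """Number of aligned segments touched by one warp's memory accesses."""
    return len({a // segment for a in addresses})

# 32 threads reading consecutive 4-byte words: one 128-byte transaction.
print(transactions([4 * t for t in range(32)]))     # 1
# 32 threads gathering with a 128-byte stride: one transaction per thread.
print(transactions([128 * t for t in range(32)]))   # 32
```

The 32x difference in transaction count for the same amount of useful data is why divergent gathers undercut SIMD throughput.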
Diminishing returns in single thread performance have forced a reevaluation of priorities in microprocessor design. Recent architectures have foregone deeper pipelining in favor of multiple cores per chip and multiple threads per core. The day approaches when processors with hundreds or thousands of cores are commonplace, but programming models for these …
The history of parallel computing shows that good performance is heavily dependent on data locality. Prior knowledge of data access patterns allows for optimizations that reduce data movement, achieving lower data access latencies. Compilers and runtime systems, however, have difficulties in speculating on locality issues among threads. Future multicore …
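How strongly locality governs data movement can be seen with a toy LRU cache model (a hypothetical sketch, not the paper's method; `LINE` and `CAP` are assumed parameters): traversing a row-major matrix along rows reuses each fetched line, while traversing it along columns evicts lines before they are reused.

```python
from collections import OrderedDict

LINE, CAP = 8, 4  # words per cache line, lines of cache capacity (assumed)

def misses(addresses):
    """Count misses for an LRU cache over a word-address access stream."""
    cache, n = OrderedDict(), 0
    for a in addresses:
        line = a // LINE
        if line in cache:
            cache.move_to_end(line)        # refresh LRU position on a hit
        else:
            n += 1
            cache[line] = True
            if len(cache) > CAP:
                cache.popitem(last=False)  # evict least recently used line
    return n

N = 32  # N x N matrix stored row-major
row_major = [i * N + j for i in range(N) for j in range(N)]
col_major = [i * N + j for j in range(N) for i in range(N)]
print(misses(row_major))  # 128: each line fetched once, then reused
print(misses(col_major))  # 1024: every access misses (lines evicted first)
```

The same data and the same total work differ by 8x in misses purely because of access order, which is the kind of pattern knowledge the abstract argues compilers and runtimes struggle to recover automatically.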
Computer games have become a driving application in the personal computer industry. For computer architects designing general-purpose microprocessors, understanding the characteristics of this application domain is important to meet the needs of this growing market demographic. In addition, games and 3D-graphics applications are some of the most demanding …