Naraig Manjikian

Loop fusion improves data locality and reduces synchronization in data-parallel applications. However, loop fusion is not always legal. Even when legal, fusion may introduce loop-carried dependences which reduce parallelism. In addition, performance losses result from cache conflicts in fused loops. We present new, systematic techniques which: (1) allow(More)
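As a minimal illustration of why fusion is not always legal, the sketch below fuses two small loops naively and compares the result with the original loops. The function names and the array computation are hypothetical, chosen only for this example; it does not reproduce the techniques the paper presents.

```python
def two_loops(a):
    """Reference version: two separate loops over the same range."""
    n = len(a)
    b = [0] * n
    c = [0] * (n - 1)
    for i in range(n):
        b[i] = 2 * a[i]
    for i in range(n - 1):
        # Reads b[i + 1], which the first loop has already produced.
        c[i] = b[i] + b[i + 1]
    return c


def naive_fused(a):
    """Illegal fusion: iteration i reads b[i + 1] before it is written."""
    n = len(a)
    b = [0] * n
    c = [0] * (n - 1)
    for i in range(n):
        b[i] = 2 * a[i]
        if i < n - 1:
            # b[i + 1] still holds its initial 0 here, so the result is wrong.
            c[i] = b[i] + b[i + 1]
    return c


def legal_fused(a):
    """Legal fusion: the second statement only reads b[i], written just above."""
    n = len(a)
    b = [0] * n
    d = [0] * n
    for i in range(n):
        b[i] = 2 * a[i]
        d[i] = b[i] + 1
    return d
```

With the input `[1, 2, 3]`, `two_loops` and `naive_fused` disagree because the fused loop consumes values one iteration too early, whereas `legal_fused` is safe because each iteration only uses data it has just produced.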
Small-scale multiprocessors are becoming increasingly economical and common, whereas larger multiprocessors continue to have higher per-node costs. The NUMAchine multiprocessor project seeks to make large-scale multiprocessors more economical while maintaining high performance by exploring architectural and hardware features for low-cost, modular(More)
This paper describes multiprocessor enhancements of the SimpleScalar tool set. The core simulation code has been modified to support multiprocessing, and a run-time library has been introduced for thread creation and synchronization. Measurements using the SPLASH-2 parallel benchmark suite [13] indicate that the multiprocessor enhancements introduce(More)
An approach for high performance parallel logic simulation on a local area network of workstation computers is discussed in this paper. The single, shared transmission medium often found in such networks places limitations on parallel execution, hence a reduction in the frequency of synchronization is pursued by combining a circuit partitioning methodology(More)
Tiling exploits temporal reuse carried by an outer loop of a loop nest to enhance cache locality. Loop skewing is typically required to make tiling legal. This restricts parallelism to wavefronts in the tiled iteration space. For a small number of processors, wavefront parallelism can be efficiently exploited using dynamic self-scheduling with a large tile(More)
This paper describes the modeling and optimization of a hierarchical ring interconnect for system-on-chip multiprocessors. We have selected hierarchical rings for study because they exhibit properties which lend themselves to efficient SoC interconnects. Using our model, we are able to tune certain design parameters in order to reduce energy consumption. We(More)
Wavefront parallelism, in which parallelism is limited to hyperplanes in an iteration space, can arise when compilers apply tiling to loop nests to enhance locality. Previous approaches for scheduling wavefront parallelism focused on maximizing parallelism, balancing workloads, and reducing synchronization. In this paper, we show that on large-scale(More)
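The idea of parallelism confined to hyperplanes can be sketched with a small two-dimensional recurrence. The example below is an assumption made for illustration, not the scheduling scheme from the paper: each element depends on its upper and left neighbours, so all iterations with the same value of `i + j` are mutually independent and could run in parallel, while successive values of `i + j` must run in order.

```python
def sequential(n):
    """Row-by-row evaluation of x[i][j] = x[i-1][j] + x[i][j-1]."""
    x = [[1] * n for _ in range(n)]
    for i in range(1, n):
        for j in range(1, n):
            x[i][j] = x[i - 1][j] + x[i][j - 1]
    return x


def wavefront(n):
    """Same recurrence, traversed one anti-diagonal (i + j == k) at a time."""
    x = [[1] * n for _ in range(n)]
    for k in range(2, 2 * n - 1):  # wavefronts must execute in this order
        lo = max(1, k - (n - 1))
        hi = min(n - 1, k - 1)
        # Every (i, k - i) on this wavefront depends only on the previous
        # wavefront, so this inner loop is the part that could be parallel.
        for i in range(lo, hi + 1):
            j = k - i
            x[i][j] = x[i - 1][j] + x[i][j - 1]
    return x
```

Both traversals produce the same array. The amount of work per wavefront grows and then shrinks, which is one reason workload balance and synchronization matter when such loops are scheduled.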
To achieve speedup for multi-node, multi-GPU computing platforms, it is necessary to overcome performance bottlenecks in networks based on Ethernet or Infiniband. This paper describes an FPGA implementation of a custom network interface for an optical link between PCIe buses of compute nodes. The implementation uses an Altera Stratix IV chip with integrated(More)