Learn More
Recently, several experimental studies have been conducted on block data layout as a data transformation technique used in conjunction with tiling to improve cache performance. In this paper, we provide a theoretical analysis for the TLB and cache performance of block data layout. For standard matrix access patterns, we derive an asymptotic lower bound on(More)
—Recently, several experimental studies have been conducted on block data layout in conjunction with tiling as a data transformation technique to improve cache performance. In this paper, we analyze cache and TLB performance of such alternate layouts (including block data layout and Morton layout) when used in conjunction with tiling. We derive a tight(More)
Run-time array redistribution is necessary to enhance the performance of parallel programs on distributed memory supercomputers. In this paper, we present an efficient algorithm for array redistribution from <i>cyclic</i>(<i>x</i>) on <i>P</i> processors to <i>cyclic</i>(<i>Kx</i>) on <i>Q</i> processors. The algorithm reduces the overall time for(More)
Effective utilization of cache memories is a key factor in achieving high performance in computing the Discrete Fourier Transform (DFT). Most optimization techniques for computing the DFT rely on either modifying the computation and data access order or exploiting low level platform specific details, while keeping the data layout in memory static. In this(More)
A Network Intrusion Detection System (NIDS) monitors all incoming packets in the network and detects packets that are malicious to the internal system. The NIDS should also have ability to update the detection rules because new attack patterns are unpredictable. Incorporating FPGAs into the NIDS is one of the best solutions that can provide both high(More)
—Throughput is a key performance metric for streaming FFT architectures. However, increasing spatial par-allelism to improve throughput introduces complex routing, thus resulting in high power consumption. In this paper, we propose a high throughput energy efficient parallel FFT architecture based on Cooley-Tukey algorithm. Multiple pipeline FFT processors(More)
—Effective utilization of cache memories is a key factor in achieving high performance for computing large signal transforms. Nonunit stride access in the computation of large signal transforms results in poor cache performance, leading to severe degradation in the overall performance. In this paper, we develop a cache-conscious technique, called a dynamic(More)