Neungsoo Park

Recently, several experimental studies have been conducted on block data layout in conjunction with tiling as a data transformation technique to improve cache performance. In this paper, we analyze cache and TLB performance of such alternate layouts (including block data layout and Morton layout) when used in conjunction with tiling. We derive a tight lower ...
Run-time array redistribution is necessary to enhance the performance of parallel programs on distributed memory supercomputers. In this paper, we present an efficient algorithm for array redistribution from cyclic(x) on P processors to cyclic(Kx) on Q processors. The algorithm reduces the overall time for ...
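As an illustration of the notation only (not the paper's algorithm, whose details are cut off above), the sketch below assumes the usual block-cyclic convention: under cyclic(x) on P processors, global element i belongs to processor (i div x) mod P. The function names `cyclic_owner` and `redistribution_pairs` are hypothetical, and the brute-force enumeration is just a way to see which processor pairs would have to exchange data.

```python
def cyclic_owner(i, x, P):
    """Owner of global element i under cyclic(x) on P processors
    (assumed convention: blocks of x elements dealt round-robin)."""
    return (i // x) % P


def redistribution_pairs(n, x, P, K, Q):
    """Brute-force list of (source, destination) processor pairs needed to move
    n elements from cyclic(x) on P processors to cyclic(K*x) on Q processors.
    Illustrative only; an efficient algorithm would avoid scanning every element."""
    pairs = set()
    for i in range(n):
        src = cyclic_owner(i, x, P)
        dst = cyclic_owner(i, K * x, Q)
        pairs.add((src, dst))
    return sorted(pairs)


if __name__ == "__main__":
    # cyclic(1) on 2 processors -> cyclic(2) on 3 processors, 12 elements
    print(redistribution_pairs(12, 1, 2, 2, 3))
```

In this small case every source processor ends up sending to every destination processor, which hints at why the communication schedule matters for redistribution time.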
The Walsh-Hadamard Transform (WHT) is an important algorithm in signal processing because of its simplicity. However, in computing large size WHT, non-unit stride access results in poor cache performance, leading to severe degradation in performance. This poor cache performance is also a critical problem in achieving high performance in other large size ...
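To make the stride issue concrete, here is a minimal sketch of the textbook in-place fast WHT (natural Hadamard ordering, unnormalized). It is not the optimized variant the abstract goes on to describe; the function name `fwht` is an assumption. At each stage the butterfly pairs elements `h` apart, so for large inputs the later stages touch memory with large strides.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform (natural/Hadamard order).
    Returns a new list; the input length must be a power of two."""
    a = list(values)
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # Each butterfly combines a[j] and a[j + h]: the access stride is h,
        # which doubles every stage and eventually exceeds a cache line.
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                u, v = a[j], a[j + h]
                a[j], a[j + h] = u + v, u - v
        h *= 2
    return a


if __name__ == "__main__":
    print(fwht([1, 0, 1, 0]))
```

Applying `fwht` twice returns the input scaled by its length, which is a convenient sanity check.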
Recently, several experimental studies have been conducted on block data layout as a data transformation technique used in conjunction with tiling to improve cache performance. In this paper, we provide a theoretical analysis for the TLB and cache performance of block data layout. For standard matrix access patterns, we derive an asymptotic lower bound on ...
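For readers unfamiliar with the term, the following sketch shows one common form of block data layout: an N x N matrix is cut into B x B blocks, blocks are stored one after another in row-major order, and elements inside each block are also row-major. These ordering choices, the requirement that B divides N, and the name `block_layout_offset` are assumptions made for illustration; the paper's exact definition is not visible in the excerpt.

```python
def block_layout_offset(i, j, N, B):
    """Linear offset of element (i, j) of an N x N matrix stored in B x B blocks.
    Assumes B divides N, row-major block order, row-major order within a block."""
    if N % B != 0:
        raise ValueError("B must divide N in this sketch")
    blocks_per_row = N // B
    block_index = (i // B) * blocks_per_row + (j // B)
    within_block = (i % B) * B + (j % B)
    return block_index * B * B + within_block


if __name__ == "__main__":
    N, B = 4, 2
    for i in range(N):
        print([block_layout_offset(i, j, N, B) for j in range(N)])
```

Because every block occupies B*B contiguous locations, a tile that matches the block size touches a single contiguous region of memory, which is the property such layouts rely on when combined with tiling.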
Effective utilization of cache memories is a key factor in achieving high performance in computing the Discrete Fourier Transform (DFT). Most optimization techniques for computing the DFT rely on either modifying the computation and data access order or exploiting low-level, platform-specific details, while keeping the data layout in memory static. In this ...
A Network Intrusion Detection System (NIDS) monitors all incoming packets in the network and detects packets that are malicious to the internal system. The NIDS should also have the ability to update the detection rules because new attack patterns are unpredictable. Incorporating FPGAs into the NIDS is one of the best solutions that can provide both high ...
Throughput is a key performance metric for streaming FFT architectures. However, increasing spatial parallelism to improve throughput introduces complex routing, thus resulting in high power consumption. In this paper, we propose a high-throughput, energy-efficient parallel FFT architecture based on the Cooley-Tukey algorithm. Multiple pipeline FFT processors ...
In this paper, a two-step interpolation approach is proposed to reduce the amount of computation in quarter-pel interpolation for H.264 motion compensation: the first step is the half-pel interpolation, and the second is the quarter-pel interpolation using the previous results. The quarter-pel interpolation is performed selectively according to the motion ...