The Design and Implementation of FFTW3

@article{Frigo2005TheDA,
  title={The Design and Implementation of FFTW3},
  author={Matteo Frigo and Steven G. Johnson},
  journal={Proceedings of the IEEE},
  year={2005},
  volume={93},
  pages={216-231}
}
FFTW is an implementation of the discrete Fourier transform (DFT) that adapts to the hardware in order to maximize performance. This paper shows that such an approach can yield an implementation that is competitive with hand-optimized libraries, and describes the software structure that makes our current FFTW3 version flexible and adaptive. We further discuss a new algorithm for real-data DFTs of prime size, a new way of implementing DFTs by means of machine-specific single-instruction… 

Fast Fourier Transform in Large-Scale Systems
  • D. Takahashi
  • Computer Science
    The Art of High Performance Computing for Computational Science, Vol. 1
  • 2019
TLDR
This chapter presents an introduction to the basis of the FFT and its implementation in parallel computing, and provides up-to-date computational techniques relevant to the FFT in state-of-the-art processors.
The Fastest Fourier Transform in the South
TLDR
FFTS is a discrete Fourier transform library that achieves state-of-the-art performance using a new cache-oblivious algorithm implemented with run-time specialization, and is, in almost all cases, faster than self-tuning libraries such as FFTW, and even vendor-tuned libraries such as Intel IPP and Apple vDSP.
FFT Implementation on a Streaming Architecture
  • J. Lobeiras, M. Amor, R. Doallo
  • Computer Science
    2011 19th International Euromicro Conference on Parallel, Distributed and Network-Based Processing
  • 2011
TLDR
This paper proposes an efficient implementation of the FFT with AMD's Brook+ language, describing several features and optimization strategies, and analyzing the scalability and performance compared to other well-known existing solutions.
High performance implementation of the inverse TFT
TLDR
A high performance implementation of the inverse truncated Fourier transform is reported which poses additional challenges compared to that of the forward transform and provides significant performance improvement over zero-padding approaches even when high-performance FFT libraries are used.
Generating symmetric DFTs and equivariant FFT algorithms
This paper presents a code generator which produces efficient implementations of multi-dimensional fast Fourier transform (FFT) algorithms which utilize symmetries in the input data to reduce memory
Automatic Tuning for Parallel FFTs
  • D. Takahashi
  • Computer Science
    Software Automatic Tuning, From Concepts to State-of-the-Art Results
  • 2010
TLDR
The performance results demonstrate that the proposed implementation of parallel FFTs with automatic performance tuning is efficient for improving the performance.
IMPLEMENTATION OF FFT ALGORITHM
TLDR
An efficient algorithm to compute 8 point FFT has been devised in which a butterfly unit computes the output and then feeds those outputs as inputs to the next butterfly units so as to compute the overall FFT.
A Fast Algorithm With Less Operations for Length- DFTs
TLDR
A fast Fourier transform (FFT) algorithm for computing length- DFTs that achieves reduction of arithmetic complexity over the related algorithms.
FFTSS: A High Performance Fast Fourier Transform Library
  • A. Nukada
  • Computer Science
    2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings
  • 2006
TLDR
A new fast Fourier transform library is introduced which provides the source code which compilers can optimize easily to achieve high performance on various processors.
An Implementation of Parallel 1-D FFT on the K Computer
  • D. Takahashi, Atsuya Uno, M. Yokokawa
  • Computer Science
    2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems
  • 2012
TLDR
The proposed implementation of a parallel one-dimensional fast Fourier transform (FFT) on the K computer is based on the six-step FFT algorithm, which can be altered into the recursive six-step FFT algorithm to reduce the number of cache misses.

References

SHOWING 1-10 OF 82 REFERENCES
FFTW: an adaptive software architecture for the FFT
  • Matteo Frigo, Steven G. Johnson
  • Computer Science
    Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181)
  • 1998
TLDR
An adaptive FFT program that tunes the computation automatically for any particular hardware, and tests show that FFTW's self-optimizing approach usually yields significantly better performance than all other publicly available software.
A fast Fourier transform compiler
TLDR
The internals of this special-purpose compiler, called genfft, are described in some detail, and it is argued that a specialized compiler is a valuable tool.
Vectorizing the FFTs
A Comprehensive DFT API for Scientific Computing
  • P. T. P. Tang
  • Computer Science
    The Architecture of Scientific Software
  • 2000
TLDR
This paper forms an API for DFT computation that encompasses all the functionality that is offered by a number of popular packages combined, allows easy porting from existing codes, and exhibits a systematic naming convention with relatively short calling sequences.
Architecture independent short vector FFTs
This paper introduces an SIMD vectorization for FFTW, the "fastest Fourier transform in the west" proposed by Frigo and Johnson (see Proceedings of the ACM SIGPLAN '99, pp. 169-180, 1999). The new
Real-valued fast Fourier transform algorithms
TLDR
A new implementation of the real-valued split-radix FFT is presented, an algorithm that uses fewer operations than any other real-valued power-of-2-length FFT.
On computing the split-radix FFT
TLDR
This paper presents an efficient Fortran program that computes the Duhamel-Hollmann split-radix FFT, which seems to require the least total arithmetic of any power-of-two DFT algorithm.
FFT algorithms for vector computers
Self-Sorting In-Place Fast Fourier Transforms
TLDR
It is shown how the familiar radix-2 Fast Fourier Transform algorithm can be extended to radix-3, radix-4, radix-5, and finally to mixed-radix FFTs, and how these new versions of the FFT require neither an unscrambling step nor work space.
A linear filtering approach to the computation of discrete Fourier transform
TLDR
It is shown that the discrete equivalent of a chirp filter is needed to implement the computation of the discrete Fourier transform (DFT) as a linear filtering process, and that use of the conventional FFT permits the computations in a time proportional to N \log_{2} N for any N.