A fast Fourier transform compiler

Authors:
Matteo Frigo
Affiliations:
MIT Laboratory for Computer Science, 545 Technology Square NE43-203, Cambridge, MA
Venue:
Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
Year:
1999

Abstract

The FFTW library for computing the discrete Fourier transform (DFT) has gained wide acceptance in both academia and industry because it performs well on a variety of machines, where it is competitive with or faster than equivalent vendor-supplied libraries. In FFTW, most of the performance-critical code was generated automatically by a special-purpose compiler, called genfft, that outputs C code. Written in Objective Caml, genfft can produce DFT programs for any input length, and it can specialize the DFT program for the common case where the input data are real instead of complex. Unexpectedly, genfft "discovered" algorithms that were previously unknown, and it reduced the arithmetic complexity of some existing algorithms. This paper describes the internals of this special-purpose compiler in some detail and argues that a specialized compiler is a valuable tool.
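
To make the idea of a DFT program generator concrete, here is a minimal sketch in Python. It is not genfft and does not reproduce its algorithms: genfft is written in Objective Caml and applies far more sophisticated simplification and scheduling, whereas this toy expands the O(n^2) DFT definition directly and only folds multiplications by 0, 1, and -1. The names `twiddle`, `scale`, `dft_statements`, `gen_dft`, and the emitted C signature `dft<n>(ri, ii, ro, io)` are all hypothetical choices made for this illustration.

```python
import cmath
import math


def twiddle(n, k):
    """Real and imaginary parts of exp(-2*pi*i*k/n), with values within
    rounding error of an integer snapped to exactly 0.0, 1.0, or -1.0."""
    w = cmath.exp(-2j * math.pi * (k % n) / n)

    def snap(v):
        return float(round(v)) if abs(v - round(v)) < 1e-12 else v

    return snap(w.real), snap(w.imag)


def scale(c, var):
    """C source text for c * var, folding the trivial constants 0, 1, -1.
    Returns None when the product is identically zero."""
    if c == 0.0:
        return None
    if c == 1.0:
        return var
    if c == -1.0:
        return "-" + var
    return f"({c!r} * {var})"


def dft_statements(n):
    """One C assignment per output component of a size-n complex DFT.
    Inputs are ri/ii (real/imaginary parts), outputs are ro/io."""
    stmts = []
    for k in range(n):
        re_terms, im_terms = [], []
        for j in range(n):
            wr, wi = twiddle(n, j * k)
            # (xr + i*xi)(wr + i*wi) = (xr*wr - xi*wi) + i*(xr*wi + xi*wr)
            re_terms += [scale(wr, f"ri[{j}]"), scale(-wi, f"ii[{j}]")]
            im_terms += [scale(wi, f"ri[{j}]"), scale(wr, f"ii[{j}]")]
        stmts.append(f"ro[{k}] = " + " + ".join(t for t in re_terms if t) + ";")
        stmts.append(f"io[{k}] = " + " + ".join(t for t in im_terms if t) + ";")
    return stmts


def gen_dft(n):
    """Wrap the statements in a hypothetical C function named dft<n>."""
    head = f"void dft{n}(const double *ri, const double *ii, double *ro, double *io)"
    body = "\n".join("    " + s for s in dft_statements(n))
    return f"{head}\n{{\n{body}\n}}\n"
```

Calling `print(gen_dft(4))` emits a loop-free C function. For n = 4 every twiddle factor is 1, -1, i, or -i, so the constant folding alone removes all multiplications from the generated statements. A real generator would go further, for example by sharing common subexpressions and starting from a fast algorithm rather than the direct definition.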