The FFTW library for computing the discrete Fourier transform (DFT) has gained wide acceptance in both academia and industry because it performs well on a variety of machines; it is competitive with, or even faster than, equivalent libraries supplied by vendors. Most of the performance-critical code in FFTW was generated automatically by a special-purpose compiler, called genfft, that outputs C code. Written in Objective Caml, genfft can produce DFT programs for any input length, and it can specialize the DFT program for the common case where the input data are real instead of complex. Unexpectedly, genfft "discovered" algorithms that were previously unknown, and it reduced the arithmetic complexity of some existing algorithms. This paper describes the internals of genfft in some detail and argues that a specialized compiler is a valuable tool.
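To make the idea of a DFT program generator concrete, here is a minimal toy sketch in Python. It is not genfft and does not reproduce its algorithms: it expands the O(n²) DFT definition for a fixed length, drops terms whose coefficient is zero, folds multiplications by ±1 into plain additions and subtractions, and prints straight-line C. All function names (`twiddle`, `linear_terms`, `emit_sum`, `generate_dft_c`) and the split real/imaginary array layout of the emitted C function are assumptions made for illustration.

```python
import cmath


def twiddle(n, k):
    """Return exp(-2*pi*i*k/n), snapping parts within rounding error of an integer."""
    w = cmath.exp(-2j * cmath.pi * (k % n) / n)
    re = round(w.real) if abs(w.real - round(w.real)) < 1e-12 else w.real
    im = round(w.imag) if abs(w.imag - round(w.imag)) < 1e-12 else w.imag
    return complex(re, im)


def linear_terms(n, k):
    """Real and imaginary parts of output k as lists of (coefficient, input name)."""
    re_terms, im_terms = [], []
    for j in range(n):
        w = twiddle(n, j * k)
        # (xr + i*xi) * (wr + i*wi) = (xr*wr - xi*wi) + i*(xr*wi + xi*wr)
        re_terms += [(w.real, f"xr[{j}]"), (-w.imag, f"xi[{j}]")]
        im_terms += [(w.imag, f"xr[{j}]"), (w.real, f"xi[{j}]")]
    return re_terms, im_terms


def emit_sum(terms):
    """Render a linear combination as a C expression, simplifying trivial coefficients."""
    parts = []
    for coeff, name in terms:
        if coeff == 0:
            continue  # multiplication by zero: drop the term entirely
        sign = "-" if coeff < 0 else "+"
        mag = abs(coeff)
        # multiplication by +/-1 becomes a bare addition or subtraction
        body = name if mag == 1 else f"{mag!r} * {name}"
        parts.append(f"{sign} {body}")
    expr = " ".join(parts) if parts else "0.0"
    return expr[2:] if expr.startswith("+ ") else expr


def generate_dft_c(n):
    """Return C source for a straight-line, length-n complex DFT (toy version)."""
    lines = [
        f"void dft{n}(const double *xr, const double *xi, double *yr, double *yi) {{"
    ]
    for k in range(n):
        re_terms, im_terms = linear_terms(n, k)
        lines.append(f"    yr[{k}] = {emit_sum(re_terms)};")
        lines.append(f"    yi[{k}] = {emit_sum(im_terms)};")
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(generate_dft_c(4))
```

For lengths 2 and 4 every twiddle factor is 1, -1, i, or -i, so the emitted function bodies contain only additions and subtractions (for example, `yr[1] = xr[0] - xr[1];` for length 2). A real generator such as the one described here would start from fast algorithms rather than the quadratic definition and would also share common subexpressions and schedule the output; this sketch only shows how specializing on a known length lets a generator remove arithmetic before any C code is written.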