FFT program generation for shared memory: SMP and multicore

Authors: Franz Franchetti, Yevgen Voronenko, Markus Püschel
Affiliation: Carnegie Mellon University (all authors)
Venue: Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Year: 2006

Abstract

Chip makers' response to the approaching end of CPU frequency scaling is the multicore system, which offers the same programming paradigm as traditional shared memory platforms but has different performance characteristics. This situation considerably increases the burden on library developers and strengthens the case for automatic performance tuning frameworks like Spiral, a program generator and optimizer for linear transforms such as the discrete Fourier transform (DFT). We present a shared memory extension of Spiral. The extension consists of a rewriting system that manipulates the structure of transform algorithms to achieve load balancing and avoid false sharing, and a backend that generates multithreaded code. Applied to the DFT, it produces a novel class of algorithms suitable for multicore systems, as validated by experimental results: we demonstrate a parallelization speed-up even for sizes that fit into L1 cache, and the generated code compares favorably to other DFT libraries across all small and midsize DFTs and all considered platforms.
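
To make the two goals named in the abstract concrete, the following is a minimal Python sketch of one Cooley-Tukey step for a DFT of size N = r*s, split across p workers. It is an illustration under stated assumptions, not Spiral's rewriting rules or its generated code. The names `naive_dft` and `chunked_ct_dft`, the parameter `mu` (assumed number of complex elements per cache line), and the divisibility checks are all hypothetical choices made for this example. Load balancing here means every worker receives the same number of sub-DFTs; avoiding false sharing means every worker writes only contiguous runs whose length is a multiple of `mu`, so that no cache line is written by two workers (assuming the buffers start on a cache-line boundary).

```python
import cmath
from concurrent.futures import ThreadPoolExecutor


def naive_dft(x):
    """O(n^2) reference DFT: X[k] = sum_t x[t] * exp(-2*pi*i*t*k/n)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]


def chunked_ct_dft(x, r, s, p, mu):
    """One Cooley-Tukey step for N = r*s, split over p workers.

    mu is an assumed cache-line length in complex elements (hypothetical
    parameter for this sketch).
    """
    n = r * s
    assert len(x) == n
    # Load balancing: each worker gets r/p sub-DFTs in stage 1
    # and s/p sub-DFTs in stage 2.
    assert r % p == 0 and s % p == 0
    # False-sharing avoidance: each worker's contiguous output run
    # (length s/p in stage 2) covers whole cache lines.
    assert (s // p) % mu == 0

    y = [0j] * n    # intermediate buffer, y[j*s + k1]
    out = [0j] * n  # final result, out[k1 + s*k2]

    def stage1(c):
        # Worker c computes r/p DFTs of size s on strided inputs and
        # writes one contiguous block of (r/p)*s elements of y.
        for j in range(c * r // p, (c + 1) * r // p):
            sub = naive_dft(x[j::r])
            for k1 in range(s):
                y[j * s + k1] = sub[k1]

    def stage2(c):
        # Worker c applies twiddle factors and computes s/p DFTs of size r.
        # Its writes form runs of length s/p starting at multiples of s/p.
        for k1 in range(c * s // p, (c + 1) * s // p):
            col = [cmath.exp(-2j * cmath.pi * j * k1 / n) * y[j * s + k1]
                   for j in range(r)]
            z = naive_dft(col)
            for k2 in range(r):
                out[k1 + s * k2] = z[k2]

    with ThreadPoolExecutor(max_workers=p) as pool:
        list(pool.map(stage1, range(p)))  # acts as a barrier between stages
        list(pool.map(stage2, range(p)))
    return out
```

Python threads are used only to show how the work is partitioned; because of the interpreter's global lock this sketch is not expected to run faster than a sequential loop. In compiled multithreaded code the same partitioning would correspond to parallel loops separated by a barrier.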