Chip makers have responded to the approaching end of CPU frequency scaling with multicore systems, which offer the same programming paradigm as traditional shared memory platforms but have different performance characteristics. This situation considerably increases the burden on library developers and strengthens the case for automatic performance tuning frameworks such as Spiral, a program generator and optimizer for linear transforms such as the discrete Fourier transform (DFT). We present a shared memory extension of Spiral. The extension consists of a rewriting system that manipulates the structure of transform algorithms to achieve load balancing and avoid false sharing, and of a backend that generates multithreaded code. Applied to the DFT, it produces a novel class of algorithms suitable for multicore systems, as validated by experimental results: we demonstrate a parallelization speed-up even for sizes that fit into L1 cache, and the generated code compares favorably to other DFT libraries across all small and midsize DFTs and all considered platforms.
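The two properties the rewriting system targets, load balancing and the absence of false sharing, can be illustrated with a small sketch. It is not Spiral's rewriting system or its generated code. It only compares two ways of assigning the n output elements of one computation stage to p threads, assuming a cache line holds mu elements. The names `block_chunks`, `cyclic_chunks`, and `shared_lines` are hypothetical helpers introduced for this illustration.

```python
def block_chunks(n, p, mu):
    """Give each of p threads one contiguous chunk of n/p indices.

    Assumption for this sketch: n is a multiple of p * mu, so every chunk
    has the same size (load balance) and starts on a cache-line boundary.
    """
    if n % (p * mu) != 0:
        raise ValueError("n must be a multiple of p * mu")
    size = n // p
    return [range(t * size, (t + 1) * size) for t in range(p)]


def cyclic_chunks(n, p):
    """Give thread t the indices t, t + p, t + 2p, ... (interleaved)."""
    return [range(t, n, p) for t in range(p)]


def shared_lines(chunks, mu):
    """Count cache lines (groups of mu consecutive indices) that are
    written by more than one thread, i.e. candidates for false sharing."""
    owners = {}
    for thread, chunk in enumerate(chunks):
        for i in chunk:
            owners.setdefault(i // mu, set()).add(thread)
    return sum(1 for threads in owners.values() if len(threads) > 1)


if __name__ == "__main__":
    n, p, mu = 64, 4, 4
    print("block :", shared_lines(block_chunks(n, p, mu), mu))
    print("cyclic:", shared_lines(cyclic_chunks(n, p), mu))
```

Both assignments are perfectly load balanced, but only the block assignment keeps every cache line private to a single thread. With the interleaved assignment and these parameters, every line is written by all four threads.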