Computational frameworks for the fast Fourier transform
This paper presents vectorization techniques tailored to the two-way single-instruction multiple-data (SIMD) double-precision floating-point unit (FPU), a core element of the node application-specific integrated circuit (ASIC) chips of the IBM 360-teraflops Blue Gene®/L supercomputer. It focuses on the general-purpose basic-block vectorization and optimization methods incorporated in the Vienna MAP vectorizer and optimizer. These technologies, which have consistently delivered superior performance and portability across a wide range of platforms, were carried over to prototypes of Blue Gene/L and combined with the automatic performance-tuning system known as the Fastest Fourier Transform in the West (FFTW). Working with the compiler technologies presented in this paper, the FFTW performance-optimization facilities produce vectorized fast Fourier transform (FFT) codes that are tuned automatically to single Blue Gene/L processors and are up to 80% faster than the best-performing scalar FFT codes generated by FFTW.
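To make the idea of two-way vectorization of a basic block concrete, the following minimal sketch rewrites one radix-2 FFT butterfly from scalar form into operations on two-element vectors. It is an illustration only, not the Vienna MAP algorithm or FFTW output: the helper names (`vadd`, `vsub`, `vmul`, `splat`, `swap`) are hypothetical pure-Python stand-ins for two-way SIMD instructions, and a complex number is assumed to be stored as one (real, imaginary) pair per vector.

```python
# Hypothetical stand-ins for two-way SIMD instructions: each "vector"
# is a 2-tuple of floats holding (real, imaginary) of one complex value.
def vadd(a, b):
    return (a[0] + b[0], a[1] + b[1])

def vsub(a, b):
    return (a[0] - b[0], a[1] - b[1])

def vmul(a, b):
    return (a[0] * b[0], a[1] * b[1])   # element-wise, not a complex product

def splat(s):
    return (s, s)                       # broadcast one scalar to both lanes

def swap(a):
    return (a[1], a[0])                 # exchange the two lanes


def butterfly_scalar(ar, ai, br, bi, wr, wi):
    """Radix-2 butterfly written as scalar operations: t = w*b, a+t, a-t."""
    tr = wr * br - wi * bi
    ti = wr * bi + wi * br
    return (ar + tr, ai + ti), (ar - tr, ai - ti)


def butterfly_two_way(a, b, w):
    """The same butterfly with scalar operations paired into two-way ops."""
    cross = vmul(swap(b), (-1.0, 1.0))            # (-bi, br)
    t = vadd(vmul(splat(w[0]), b),                # (wr*br, wr*bi)
             vmul(splat(w[1]), cross))            # + (-wi*bi, wi*br)
    return vadd(a, t), vsub(a, t)


if __name__ == "__main__":
    a, b, w = (1.0, 2.0), (3.0, 4.0), (0.0, -1.0)
    print(butterfly_scalar(a[0], a[1], b[0], b[1], w[0], w[1]))
    print(butterfly_two_way(a, b, w))
```

The scalar version performs ten floating-point operations on individual values, while the two-way version expresses the same arithmetic as six paired operations plus one lane swap and one broadcast per twiddle component. How such data-reordering steps are mapped to actual hardware instructions is left open in this sketch.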