Design and exploitation of a high-performance SIMD floating-point unit for Blue Gene/L

  • Authors:
  • S. Chatterjee;L. R. Bachega;P. Bergner;K. A. Dockser;J. A. Gunnels;M. Gupta;F. G. Gustavson;C. A. Lapkowski;G. K. Liu;M. Mendell;R. Nair;C. D. Wait;T. J. C. Ward;P. Wu

  • Affiliations:
  • IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, New York;School of Electrical and Computer Engineering, Purdue University, West Lafayette, Indiana;IBM Systems and Technology Group, Rochester, Minnesota;Qualcomm CDMA Technologies, Cary, North Carolina;IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, New York;IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, New York;IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, New York;IBM Software Group, Toronto Laboratory, Markham, Ontario, Canada;IBM Software Group, Toronto Laboratory, Markham, Ontario, Canada;IBM Software Group, Toronto Laboratory, Markham, Ontario, Canada;IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, New York;IBM Engineering and Technology Services, Rochester, Minnesota;IBM United Kingdom Limited, Winchester, England;IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, New York

  • Venue:
  • IBM Journal of Research and Development
  • Year:
  • 2005


Abstract

We describe the design of a dual-issue single-instruction, multiple-data-like (SIMD-like) extension of the IBM PowerPC® 440 floating-point unit (FPU) core and the compiler and algorithmic techniques to exploit it. This extended FPU is targeted at both the IBM massively parallel Blue Gene®/L machine and more pervasive embedded platforms. We discuss the hardware and software codesign that was essential in order to fully realize the performance benefits of the FPU when constrained by the memory bandwidth limitations and high penalties for misaligned data access imposed by the memory hierarchy on a Blue Gene/L node. Using both hand-optimized and compiled code for key linear algebraic kernels, we validate the architectural design choices, evaluate the success of the compiler, and quantify the effectiveness of the novel algorithm design techniques. Our measurements show that the combination of algorithm, compiler, and hardware delivers a significant fraction of peak floating-point performance for compute-bound kernels, such as matrix multiplication, and a significant fraction of peak memory bandwidth for memory-bound kernels, such as DAXPY, while remaining largely insensitive to data alignment.