A High-Performance SIMD Floating Point Unit for BlueGene/L: Architecture, Compilation, and Algorithm Design

  • Authors:
  • Leonardo Bachega; Siddhartha Chatterjee; Kenneth A. Dockser; John A. Gunnels; Manish Gupta; Fred G. Gustavson; Christopher A. Lapkowski; Gary K. Liu; Mark P. Mendell; Charles D. Wait; T. J. Chris Ward

  • Affiliations:
  • IBM T. J. Watson Research Center, Yorktown Heights, NY; IBM T. J. Watson Research Center, Yorktown Heights, NY; IBM Corporation, Research Triangle Park, NC; IBM T. J. Watson Research Center, Yorktown Heights, NY; IBM T. J. Watson Research Center, Yorktown Heights, NY; IBM T. J. Watson Research Center, Yorktown Heights, NY; IBM Corporation, Markham, ON, Canada; IBM Corporation, Markham, ON, Canada; IBM Corporation, Markham, ON, Canada; IBM Corporation, Rochester, MN; IBM T. J. Watson Research Center, Yorktown Heights, NY

  • Venue:
  • Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
  • Year:
  • 2004

Abstract

We describe the design, implementation, and evaluation of a dual-issue SIMD-like extension of the PowerPC 440 floating-point unit (FPU) core. This extended FPU is targeted at both IBM's massively parallel BlueGene/L machine and more pervasive embedded platforms. It has several novel features, such as a computational crossbar and cross-load/store instructions, which enhance the performance of numerical codes. We further discuss the hardware-software co-design that was essential to fully realize the performance benefits of the FPU under the memory bandwidth limitations and the high penalties for misaligned data access imposed by the memory hierarchy of a BlueGene/L node. We describe several novel compiler and algorithmic techniques that exploit this architecture. Using both hand-optimized and compiled code for key linear algebra kernels, we validate the architectural design choices, evaluate the success of the compiler, and quantify the effectiveness of the novel algorithm design techniques. Preliminary performance data shows that the algorithm-compiler-hardware combination delivers a significant fraction of peak floating-point performance for compute-bound kernels such as matrix multiplication, and a significant fraction of peak memory bandwidth for memory-bound kernels such as daxpy, while remaining largely insensitive to data alignment.
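For readers unfamiliar with the benchmark named in the abstract, the sketch below is a minimal, portable C rendering of the daxpy kernel (y ← a·x + y) whose memory-bound performance the paper measures; it is our illustration only, not the paper's hand-tuned or compiler-generated code. On a BlueGene/L node, a tuned version would instead issue quadword (16-byte-aligned) paired loads and fused multiply-adds so the extended FPU operates on two doubles at a time.

```c
#include <stdio.h>

/* Scalar daxpy: y[i] += a * x[i] for i in [0, n).
 * One load of x, one load of y, one store of y, and one
 * multiply-add per element, which is why the kernel is
 * bound by memory bandwidth rather than by FPU throughput. */
static void daxpy(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}

int main(void)
{
    double x[4] = {1.0, 2.0, 3.0, 4.0};
    double y[4] = {0.5, 0.5, 0.5, 0.5};

    daxpy(4, 2.0, x, y);
    for (int i = 0; i < 4; i++)
        printf("%g ", y[i]);   /* expected: 2.5 4.5 6.5 8.5 */
    printf("\n");
    return 0;
}
```

The two streamed input vectors and one output stream make daxpy a natural probe of sustained memory bandwidth, complementing matrix multiplication, which reuses data heavily and therefore probes peak floating-point throughput instead.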