A set of level 3 basic linear algebra subprograms
ACM Transactions on Mathematical Software (TOMS)
Design of the IBM RISC System/6000 floating-point execution unit
IBM Journal of Research and Development
Register allocation via graph coloring
Register allocation via graph coloring
Exploiting functional parallelism of POWER2 to design high-performance numerical algorithms
IBM Journal of Research and Development
Exploiting superword level parallelism with multimedia instruction sets
PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Basic Linear Algebra Subprograms for Fortran Usage
ACM Transactions on Mathematical Software (TOMS)
IEEE Transactions on Computers
FLAME: Formal Linear Algebra Methods Environment
ACM Transactions on Mathematical Software (TOMS)
Automatically tuned linear algebra software
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
New Generalized Matrix Data Structures Lead to a Variety of High-Performance Algorithms
Proceedings of the IFIP TC2/WG2.5 Working Conference on the Architecture of Scientific Software
A comparison of empirical and model-driven optimization
PLDI '03 Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation
Vectorization for SIMD architectures with alignment constraints
Proceedings of the ACM SIGPLAN 2004 conference on Programming language design and implementation
Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Blue matter on blue gene/L: massively parallel computation for biomolecular simulation
CODES+ISSS '05 Proceedings of the 3rd IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis
Optimization and evaluation of parallel molecular dynamics simulation on blue Gene/L
PDCN'07 Proceedings of the 25th conference on Proceedings of the 25th IASTED International Multi-Conference: parallel and distributed computing and networks
IBM Journal of Research and Development
A massively parallel implementation of the common azimuth pre-stack depth migration
IBM Journal of Research and Development
IBM Journal of Research and Development
Scalable molecular dynamics with NAMD on the IBM Blue Gene/L system
IBM Journal of Research and Development
Overview of the Blue Gene/L system architecture
IBM Journal of Research and Development
IBM PowerPC 440 FPU with complex-arithmetic extensions
IBM Journal of Research and Development
Minimal data copy for dense linear algebra factorization
PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
Is cache-oblivious DGEMM viable?
PARA'06 Proceedings of the 8th international conference on Applied parallel computing: state of the art in scientific computing
PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
New data structures for matrices and specialized inner kernels: low overhead for high performance
PPAM'07 Proceedings of the 7th international conference on Parallel processing and applied mathematics
Achieving strong scaling with NAMD on blue Gene/L
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Compiler technology for blue gene systems
Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
PARA'04 Proceedings of the 7th international conference on Applied Parallel Computing: state of the Art in Scientific Computing
PARA'10 Proceedings of the 10th international conference on Applied Parallel and Scientific Computing - Volume Part I
Ultra-low-power adder stage design for exascale floating point units
ACM Transactions on Embedded Computing Systems (TECS) - Special Issue on Design Challenges for Many-Core Processors, Special Section on ESTIMedia'13 and Regular Papers
Hi-index | 0.00 |
We describe the design of a dual-issue single-instruction, multiple-data-like (SIMD-like) extension of the IBM PowerPC® 440 floating-point unit (FPU) core and the compiler and algorithmic techniques to exploit it. This extended FPU is targeted at both the IBM massively parallel Blue Gene®/L machine and the more pervasive embedded platforms. We discuss the hardware and software codesign that was essential in order to fully realize the performance benefits of the FPU when constrained by the memory bandwidth limitations and high penalties for misaligned data access imposed by the memory hierarchy on a Blue Gene/L node. Using both hand-optimized and compiled code for key linear algebraic kernels, we validate the architectural design choices, evaluate the success of the compiler, and quantify the effectiveness of the novel algorithm design techniques. Our measurements show that the combination of algorithm, compiler, and hardware delivers a significant fraction of peak floating-point performance for compute-bound-kernels, such as matrix multiplication, and delivers a significant fraction of peak memory bandwidth for memorybound kernels, such as DAXPY, while remaining largely insensitive to data alignment.