Adaptation of double-precision matrix multiplication to the cell broadband engine architecture
PPAM'09 Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I
The main contribution of this paper is a model-driven approach to adapting double-precision matrix multiplication to blade systems based on two generations of the Cell processor. The hierarchical algorithm used for the adaptation consists of four levels. The first level distributes the computation among all 16 SPE cores of the IBM BladeCenter QS21 or QS22. The second level corresponds to a macro-kernel and is responsible for data management in main memory, as well as for communication between main memory and the local stores of the SPE cores; each macro-kernel operation is executed within the local store of an SPE core. The third level corresponds to a kernel of the algorithm; each kernel operation is executed on a single SPE within its local store as a sequence of micro-kernel operations. The fourth level is a micro-kernel implemented within the register file of an SPE core.

The proposed approach is based on two performance models. The first model optimizes communication across all 16 SPE cores of the IBM BladeCenter, including traffic between main memory and the local stores of the SPEs. It is constructed as a function of the matrix block size and allows "the best" size of the macro-kernel operation to be selected. The second model optimizes computation within a single SPE core, taking into account constraints on traffic between the local store and the register file. It accounts for factors such as the size of the local store, the number of registers, the properties of double-precision operations, and the balance between pipelines, and it allows "the best" sizes of the kernel and micro-kernel operations to be selected. The model-driven adaptation is followed by a series of systematic optimization steps, including loop unrolling, double buffering at the register and memory levels, and use of the NUMA library.
The proposed adaptation and optimization steps are implemented entirely in C, without manual code optimization. For the IBM QS21 system, which uses two first-generation Cell processors, the implementation achieves 27.24 Gflop/s, or 93.1% of peak performance; this result is obtained for matrices of size 4096 by 4096. For the IBM QS22 system, based on PowerXCell 8i processors, whose double-precision performance is far higher, 184.4 Gflop/s is achieved, or 90.0% of peak; this result is reported for matrices of size 15,872 by 15,872. The overall performance could be improved slightly by substituting the macro-kernel developed in this work with the highly optimized Cell BLAS dgemm_64x64 kernel.