Block size selection of parallel LU and QR on PVP-based and RISC-based supercomputers

Authors:
Yunquan Zhang;Ying Chen;Yuan Tang
Affiliations:
CAS, Beijing, P. R. China;CAS, Beijing, P. R. China;Fudan University, Shanghai, P. R. China
Venue:
CHINA HPC '07 Proceedings of the 2007 Asian technology information program's (ATIP's) 3rd workshop on High performance computing in China: solution approaches to impediments for high performance computing
Year:
2007

Citing 14
Cited 2

Strategies for cache and local memory management by global program transformation

Journal of Parallel and Distributed Computing - Special Issue on Languages, Compilers and environments for Parallel Programming
A storage-efficient WY representation for products of householder transformations

SIAM Journal on Scientific and Statistical Computing
More iteration space tiling

Proceedings of the 1989 ACM/IEEE conference on Supercomputing
The cache performance and optimizations of blocked algorithms

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A data locality optimizing algorithm

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Block-cyclic dense linear algebra

SIAM Journal on Scientific Computing
Improving the ratio of memory operations to floating-point operations in loops

ACM Transactions on Programming Languages and Systems (TOPLAS)
Quantifying the multi-level nature of tiling interactions

International Journal of Parallel Programming
ScaLAPACK: a portable linear algebra library for distributed memory computers - design issues and performance

Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing
Architecture and Performance of the Hitachi SR2201 Massively Parallel Processor System

IPPS '97 Proceedings of the 11th International Symposium on Parallel Processing
Iteration Space Tiling for Memory Hierarchies

Proceedings of the Third SIAM Conference on Parallel Processing for Scientific Computing
Optimal Data Distributions for LU Decomposition

Euro-Par '95 Proceedings of the First International Euro-Par Conference on Parallel Processing
LAPACK Working Note 73: Basic Linear Algebra Communication Subprograms: Analysis and Implementation Across Multiple Parallel Architectures

LAPACK Working Note 73: Basic Linear Algebra Communication Subprograms: Analysis and Implementation Across Multiple Parallel Architectures
Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines

Scientific Programming

Study of neural net training methods in parallel and distributed architectures

Future Generation Computer Systems
Parallel implementation of a neural net training application in a heterogeneous grid environment

OTM'07 Proceedings of the 2007 OTM confederated international conference on On the move to meaningful internet systems: CoopIS, DOA, ODBASE, GADA, and IS - Volume Part II

Abstract

In this paper, we propose a unified framework to address the optimal block size selection problem for the parallel blocked LU and QR factorization algorithms used in the ScaLAPACK package. Because these algorithms use a two-dimensional block-cyclic data distribution [12], block size plays an important role in determining final performance. Through analysis with the proposed framework and experiments on small-scale system configurations, we found that, among all factors that affect performance, load balance and local block size selection play the key roles in determining the optimal block size on two different types of parallel computing platforms: the SR2201 (a PVP (Pseudo-Vector Processing) based MPP machine) and the DAWNING 3000 (a RISC-based SMP cluster). In fact, the optimal parallel block size is determined by the processor grid shape and the problem size. Based on this observation, we give optimal block size prediction formulas for double-precision real parallel blocked LU and QR on the SR2201 and DAWNING 3000, with processor grid shape and problem size as parameters; their predictions match well with experimental results for large-scale system configurations and large problem sizes. Applying the framework to other parallel machines and other application programs is left as future work.
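
As an illustration of why block size and processor grid shape interact with load balance under a two-dimensional block-cyclic distribution, the following minimal Python sketch computes which process owns a given row or column and how many rows or columns each process stores. The function names (`owner`, `local_count`, `imbalance`) are hypothetical and are not taken from ScaLAPACK or from the paper. The sketch assumes 0-based indices with the first block placed on process 0, and it models only the static data distribution; it is not the paper's prediction formula and ignores the shrinking active submatrix during factorization.

```python
def owner(global_index, block_size, nprocs):
    """Process coordinate owning a 0-based global row (or column) index.

    Assumes the first block is stored on process 0 (an assumption of this sketch).
    """
    return (global_index // block_size) % nprocs


def local_count(n, block_size, proc, nprocs):
    """Number of rows (or columns) out of n that are stored on process `proc`."""
    full_blocks = n // block_size
    # Every process gets the same number of complete rounds of full blocks.
    count = (full_blocks // nprocs) * block_size
    extra = full_blocks % nprocs
    if proc < extra:
        # One more full block from the incomplete last round.
        count += block_size
    elif proc == extra:
        # The trailing partial block, if any, lands here.
        count += n % block_size
    return count


def imbalance(n, block_size, nprow, npcol):
    """Largest local element count divided by the ideal even share (1.0 is perfect)."""
    max_rows = max(local_count(n, block_size, p, nprow) for p in range(nprow))
    max_cols = max(local_count(n, block_size, q, npcol) for q in range(npcol))
    ideal = n * n / (nprow * npcol)
    return max_rows * max_cols / ideal


if __name__ == "__main__":
    # Compare static imbalance for several block sizes on a 2 x 2 process grid.
    for nb in (1, 2, 4, 8):
        print(nb, imbalance(8, nb, 2, 2))
```

In this toy setting, a block size equal to the matrix order places the whole matrix on one process, while smaller blocks spread it evenly. Smaller blocks generally reduce the efficiency of the local blocked kernels, which is the kind of trade-off the abstract describes.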