Strategies for cache and local memory management by global program transformation
Journal of Parallel and Distributed Computing - Special Issue on Languages, Compilers and environments for Parallel Programming
A storage-efficient WY representation for products of Householder transformations
SIAM Journal on Scientific and Statistical Computing
Proceedings of the 1989 ACM/IEEE conference on Supercomputing
The cache performance and optimizations of blocked algorithms
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A data locality optimizing algorithm
PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Block-cyclic dense linear algebra
SIAM Journal on Scientific Computing
Improving the ratio of memory operations to floating-point operations in loops
ACM Transactions on Programming Languages and Systems (TOPLAS)
Quantifying the multi-level nature of tiling interactions
International Journal of Parallel Programming
Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing
Architecture and Performance of the Hitachi SR2201 Massively Parallel Processor System
IPPS '97 Proceedings of the 11th International Symposium on Parallel Processing
Iteration Space Tiling for Memory Hierarchies
Proceedings of the Third SIAM Conference on Parallel Processing for Scientific Computing
Optimal Data Distributions for LU Decomposition
Euro-Par '95 Proceedings of the First International Euro-Par Conference on Parallel Processing
LAPACK Working Note 73: Basic Linear Algebra Communication Subprograms: Analysis and Implementation Across Multiple Parallel Architectures
Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines
Scientific Programming
Study of neural net training methods in parallel and distributed architectures
Future Generation Computer Systems
Parallel implementation of a neural net training application in a heterogeneous grid environment
OTM'07 Proceedings of the 2007 OTM confederated international conference on On the move to meaningful internet systems: CoopIS, DOA, ODBASE, GADA, and IS - Volume Part II
In this paper, we propose a unified framework to address the optimal block size selection problem for the parallel blocked LU and QR factorization algorithms used in the ScaLAPACK package. Because these algorithms use a two-dimensional block-cyclic data distribution [12], block size plays an important role in determining the final performance. Through analysis with the proposed framework and experiments on small-scale system configurations, we found that, among all factors that affect performance, load balance and local block size selection play the key roles in determining the optimal block size on two different types of parallel computing platforms: the SR2201 (an MPP machine based on Pseudo-Vector Processing, PVP) and the DAWNING 3000 (a RISC-based SMP cluster). In fact, the optimal parallel block size is determined by the processor grid shape and the problem size. Based on this observation, we give optimal block size prediction formulas for double-precision real parallel blocked LU and QR on the SR2201 and DAWNING 3000, with processor grid shape and problem size as parameters; their predictions match well with experimental results for large-scale system configurations and large problem sizes. Applying the framework to other parallel machines and to other application programs is left as future work.
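To make the link between block size, processor grid shape, and load balance concrete, the following is a minimal sketch of a two-dimensional block-cyclic mapping. It is not the prediction formula from the paper, which the abstract does not state. The function names (`owner`, `local_count`, `load_imbalance`) are hypothetical, the distribution is assumed to start at process (0, 0), and the imbalance measure only counts how many matrix entries each process stores; it ignores the shrinking active submatrix during a factorization.

```python
def owner(i, j, nb, p_rows, p_cols):
    """Return the (process row, process column) that owns global entry (i, j).

    Assumes square nb-by-nb blocks dealt out cyclically over a
    p_rows-by-p_cols process grid, starting at process (0, 0).
    """
    return (i // nb) % p_rows, (j // nb) % p_cols


def local_count(n, nb, iproc, nprocs):
    """Number of rows (or columns) of an n-long dimension held by process iproc."""
    full_blocks, rem = divmod(n, nb)
    count = (full_blocks // nprocs) * nb   # whole rounds of the cyclic deal
    extra = full_blocks % nprocs           # full blocks left after whole rounds
    if iproc < extra:
        count += nb                        # this process gets one more full block
    elif iproc == extra:
        count += rem                       # this process gets the partial block
    return count


def load_imbalance(n, nb, p_rows, p_cols):
    """Largest local entry count divided by the ideal n*n / (p_rows*p_cols).

    A value of 1.0 means storage is perfectly balanced; larger is worse.
    """
    rows = [local_count(n, nb, r, p_rows) for r in range(p_rows)]
    cols = [local_count(n, nb, c, p_cols) for c in range(p_cols)]
    return max(rows) * max(cols) / (n * n / (p_rows * p_cols))


if __name__ == "__main__":
    # Illustrative values only: an 8x8 matrix on a 2x2 process grid.
    for nb in (2, 4, 8):
        print(nb, load_imbalance(8, nb, 2, 2))
```

In this toy setting, small blocks keep the storage balanced, while a block as large as the whole matrix places every entry on a single process. Larger blocks usually favor local computation, so the two effects pull in opposite directions, which is the kind of trade-off the abstract attributes to load balance and local block size.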