Can dataflow subsume von Neumann computing?
ISCA '89 Proceedings of the 16th annual international symposium on Computer architecture
Improving locality and parallelism in nested loops
ScaLAPACK user's guide
Automatic selection of high-order transformations in the IBM XL FORTRAN compilers
IBM Journal of Research and Development - Special issue: performance analysis and its impact on design
Optimal fine and medium grain parallelism detection in polyhedral reduced dependence graphs
International Journal of Parallel Programming
Maximizing parallelism and minimizing synchronization with affine partitions
Parallel Computing - Special issues on languages and compilers for parallel computers
Algorithmic Redistribution Methods for Block-Cyclic Decompositions
IEEE Transactions on Parallel and Distributed Systems
Efficient Algorithms for Block-Cyclic Array Redistribution Between Processor Sets
IEEE Transactions on Parallel and Distributed Systems
Advanced Computer Architecture: Parallelism, Scalability, Programmability
A Framework for Efficient Data Redistribution on Distributed Memory Multicomputers
The Journal of Supercomputing
High Performance Compilers for Parallel Computing
Grain Size Determination for Parallel Processing
IEEE Software
A Loop Transformation Theory and an Algorithm to Maximize Parallelism
IEEE Transactions on Parallel and Distributed Systems
A Block QR Factorization Scheme for Loosely Coupled Systems of Array Processors
LAPACK Working Note 80: The Design and Implementation of the ScaLAPACK LU, QR, and Cholesky Factorization Routines
This paper describes the design, implementation, and performance of three new parallel QR factorization algorithms: shared memory, synchronous message passing, and asynchronous message passing. In contrast to existing parallel algorithms, the multiprocessor partitioning strategy is not governed by an underlying static data distribution scheme. Rather, a dynamic distribution strategy is employed to improve scalability on small problems. Experiments conducted on a 128-processor SGI Origin 2000 and a 64-processor HP SPP-2000 show that the new algorithms have a lower execution time than the tuned parallel routines installed on those machines, including a version of ScaLAPACK's distributed QR factorization routine PDGEQRF.