This article discusses the core factorization routines included in the ScaLAPACK library. These routines factor and solve dense systems of linear equations via LU, QR, and Cholesky factorizations. They are implemented using a block cyclic data distribution and are built on de facto standard kernels for matrix and vector operations (BLAS and its parallel counterpart, PBLAS) and for message-passing communication (BLACS). A major objective in implementing the ScaLAPACK routines was to parallelize the corresponding sequential LAPACK routines using the BLAS, BLACS, and PBLAS as building blocks, leading to straightforward parallel implementations without a significant loss in performance. We present the details of the implementation of the ScaLAPACK factorization routines, as well as performance and scalability results on the Intel iPSC/860, Intel Touchstone Delta, and Intel Paragon systems.
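As an illustration of the block cyclic data distribution mentioned above, the following minimal sketch maps a global matrix index to the process that owns it and to the index within that process's local storage. It is not ScaLAPACK code: the function names `block_cyclic_owner` and `owner_2d` are hypothetical, indices are 0-based, and the first block is assumed to reside on process 0.

```python
def block_cyclic_owner(g, nb, nprocs):
    """Map a 0-based global index g to (process, local_index).

    Assumptions (not taken from the article): blocks of size nb are dealt
    out to nprocs processes in round-robin order, starting at process 0.
    """
    block = g // nb                # which global block the index falls in
    proc = block % nprocs          # blocks are assigned cyclically
    local_block = block // nprocs  # how many earlier blocks this process holds
    return proc, local_block * nb + g % nb


def owner_2d(i, j, mb, nb, prow, pcol):
    """Apply the 1-D mapping independently to rows and columns.

    Returns ((process_row, process_col), (local_row, local_col)) for a
    hypothetical prow x pcol process grid with mb x nb blocks.
    """
    pr, li = block_cyclic_owner(i, mb, prow)
    pc, lj = block_cyclic_owner(j, nb, pcol)
    return (pr, pc), (li, lj)


if __name__ == "__main__":
    # Global element (5, 2) with 2 x 2 blocks on a 2 x 3 process grid.
    print(owner_2d(5, 2, 2, 2, 2, 3))
```

Because consecutive blocks go to different processes, every process keeps a share of the trailing submatrix as a factorization proceeds, which is the usual motivation for this layout.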