Designing and dynamically load balancing hybrid LU for multi/many-core

Authors:
Michael Deisher; Mikhail Smelyanskiy; Brian Nickerson; Victor W. Lee; Michael Chuvelev; Pradeep Dubey
Affiliations:
Intel Labs, Hillsboro, USA; Intel Labs, Santa Clara, USA; Intel Architecture Group, Santa Clara, USA; Intel Labs, Santa Clara, USA; Software and Solutions Group, Nizhny Novgorod, Russia; Intel Labs, Santa Clara, USA
Venue:
Computer Science - Research and Development
Year:
2011

Abstract

Designing high-performance LU factorization for modern hybrid multi/many-core systems requires highly tuned BLAS subroutines, hiding communication latency, and balancing the load across devices with differing processing capabilities. In this paper we show how single-precision LU factorization is accelerated on the Intel® MIC (Many Integrated Core) architecture in both native and hybrid (Intel® Xeon® processor and Intel MIC) configurations. Our SGEMM implementation delivers close to 1 Tflop/s on Knights Ferry (KNF), Intel's first silicon implementation of the Intel MIC architecture. It exploits the multiple levels of the memory hierarchy on MIC and utilizes up to 80% of the peak compute capability. Our LU factorization performance exceeds 570 Gflop/s, including matrix transfer overhead, when executed entirely on a KNF coprocessor. Our hybrid implementation, which offloads parts of the LU processing to a dual-socket multi-core Intel Xeon processor X5680 host, delivers up to 772 Gflop/s. The novel aspect of our implementations is dynamic resource partitioning, which improves load balance across the entire system.
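
As a rough illustration of why LU performance depends so heavily on SGEMM and on how work is divided between devices, the following is a minimal NumPy sketch of right-looking blocked LU with partial pivoting. It is not the authors' implementation: the function names `blocked_lu` and `split_columns`, the block size `nb`, and the fixed `rates` pair are all assumptions made for this example. The trailing-matrix update, which is a matrix multiply, is cut into two column ranges in proportion to two hypothetical device throughputs; a dynamic scheme would presumably re-estimate those rates from timings at every step rather than keep them fixed.

```python
import numpy as np


def split_columns(n_cols, rate_a, rate_b):
    """Number of columns given to device A, proportional to its rate.

    Hypothetical helper: the rates stand in for measured throughputs.
    """
    share = rate_a / (rate_a + rate_b)
    return int(round(n_cols * share))


def blocked_lu(a, nb=64, rates=(3.0, 1.0)):
    """Right-looking blocked LU with partial pivoting in single precision.

    Returns (lu, perm) such that a[perm] is approximately L @ U, where
    L is the unit lower triangle of lu and U is its upper triangle.
    """
    a = np.array(a, dtype=np.float32)
    n = a.shape[0]
    perm = np.arange(n)
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # Panel factorization: unblocked LU on columns k .. k+kb-1.
        for j in range(k, k + kb):
            p = j + int(np.argmax(np.abs(a[j:, j])))
            if p != j:
                a[[j, p], :] = a[[p, j], :]      # swap full rows
                perm[[j, p]] = perm[[p, j]]
            a[j + 1:, j] /= a[j, j]              # column of L
            a[j + 1:, j + 1:k + kb] -= np.outer(a[j + 1:, j],
                                                a[j, j + 1:k + kb])
        if k + kb < n:
            # Row block of U: solve L11 @ U12 = A12.
            l11 = np.tril(a[k:k + kb, k:k + kb], -1) + np.eye(kb, dtype=a.dtype)
            a[k:k + kb, k + kb:] = np.linalg.solve(l11, a[k:k + kb, k + kb:])
            # Trailing update A22 -= L21 @ U12, split into two column
            # ranges as if each went to a different device. Both run
            # sequentially here; this only illustrates the partitioning.
            cut = k + kb + split_columns(n - (k + kb), *rates)
            l21 = a[k + kb:, k:k + kb]
            a[k + kb:, k + kb:cut] -= l21 @ a[k:k + kb, k + kb:cut]
            a[k + kb:, cut:] -= l21 @ a[k:k + kb, cut:]
    return a, perm
```

For large matrices nearly all floating-point operations fall in the two matrix products of the trailing update, so in a sketch like this the quality of the matrix-multiply kernel and the position of `cut` would dominate overall throughput.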