Designing and dynamically load balancing hybrid LU for multi/many-core

  • Authors:
  • Michael Deisher; Mikhail Smelyanskiy; Brian Nickerson; Victor W. Lee; Michael Chuvelev; Pradeep Dubey

  • Affiliations:
  • Intel Labs, Hillsboro, USA; Intel Labs, Santa Clara, USA; Intel Architecture Group, Santa Clara, USA; Intel Labs, Santa Clara, USA; Software and Solutions Group, Nizhny Novgorod, Russia; Intel Labs, Santa Clara, USA

  • Venue:
  • Computer Science - Research and Development
  • Year:
  • 2011

Abstract

Designing high-performance LU factorization for modern hybrid multi/many-core systems requires highly tuned BLAS subroutines, hiding communication latency, and balancing the load across devices with differing processing capabilities. In this paper we show how single-precision LU factorization is accelerated on the Intel® MIC (Many Integrated Core) architecture in both native and hybrid (Intel® Xeon® processor and Intel MIC) configurations. Our SGEMM implementation delivers close to 1 Tflop/s on Intel's first silicon implementation of the Intel MIC architecture, codenamed Knights Ferry (KNF). It takes full advantage of the multiple levels of the memory hierarchy on MIC and utilizes up to 80% of its peak compute capability. Our LU factorization performance exceeds 570 Gflop/s, including matrix transfer overhead, when executed entirely on a KNF coprocessor. Our hybrid implementation, which offloads parts of LU processing to a dual-socket multi-core Intel Xeon processor X5680 host, delivers up to 772 Gflop/s. The novel aspect of our implementations is dynamic resource partitioning to improve load balance across the entire system.
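
The abstract does not describe how the dynamic partitioning works, so the following is only a minimal sketch of one common way such load balancing can be done, not the authors' method. It assumes the trailing-matrix update of a blocked LU step is divided by columns between a host and a coprocessor, and that the split is recomputed from the times observed in the previous step. The function names `split_columns` and `rebalance`, and all numeric values, are hypothetical.

```python
def split_columns(total_cols, host_rate, coproc_rate):
    """Split columns between two devices in proportion to their throughput.

    Rates are in any consistent unit (for example columns per second).
    Returns (host_cols, coproc_cols), which always sum to total_cols.
    """
    if total_cols < 0 or host_rate < 0 or coproc_rate < 0:
        raise ValueError("column count and rates must be non-negative")
    total_rate = host_rate + coproc_rate
    if total_rate == 0:
        # No measurements yet: fall back to an even split.
        host_cols = total_cols // 2
    else:
        host_cols = round(total_cols * host_rate / total_rate)
    return host_cols, total_cols - host_cols


def rebalance(total_cols, prev_host_cols, host_time, coproc_time):
    """Re-split the work using the rates observed in the previous step.

    prev_host_cols is how many columns the host processed last time;
    host_time and coproc_time are the measured wall-clock times in seconds.
    """
    prev_coproc_cols = total_cols - prev_host_cols
    host_rate = prev_host_cols / host_time if host_time > 0 else 0.0
    coproc_rate = prev_coproc_cols / coproc_time if coproc_time > 0 else 0.0
    return split_columns(total_cols, host_rate, coproc_rate)


if __name__ == "__main__":
    # Hypothetical step: an even split took the host twice as long,
    # so the next step shifts columns toward the coprocessor.
    print(rebalance(1000, 500, host_time=2.0, coproc_time=1.0))
```

In this sketch the device that finished faster receives proportionally more columns in the next step, so both devices should finish closer together; a real implementation would also have to account for data transfer time and the panel factorization on the critical path.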