Matrix computations (3rd ed.)
FLAME: Formal Linear Algebra Methods Environment
ACM Transactions on Mathematical Software (TOMS)
Numerical Linear Algebra for High Performance Computers
Numerical Linear Algebra for High Performance Computers
Benchmarking GPUs to tune dense linear algebra
Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Solving Dense Linear Systems on Graphics Processors
Euro-Par '08 Proceedings of the 14th international Euro-Par conference on Parallel Processing
Billion-particle SIMD-friendly two-point correlation on large-scale HPC cluster systems
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Efficient backprojection-based synthetic aperture radar computation with many-core processors
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Efficient backprojection-based synthetic aperture radar computation with many-core processors
Scientific Programming - Selected Papers from Super Computing 2012
Hi-index | 0.00 |
Designing high-performance LU factorization for modern hybrid multi/many-core systems requires highly-tuned BLAS subroutines, hiding communication latency and balancing the load across devices of variable processing capabilities. In this paper we show how single-precision LU factorization is accelerated on Intel® MIC(Many Integrated Core) architecture in both native and hybrid (Intel® Xeon® processor and Intel MIC) configurations. Our SGEMM implementation delivers close to 1 Tflop/s on Intel's first implementation of Intel MIC architecture [codenamed Knight's Ferry (KNF)] silicon platform. Our implementation takes full advantage of multiple levels of memory hierarchy on MIC, and successfully utilizes up to 80% of its peak compute capability. Our LU factorization performance exceeds 570 Gflop/s including matrix transfer overhead when executed entirely on a KNF coprocessor. Our hybrid implementation, which offloads parts of LU processing to a dual-socket multi-core Intel Xeon processor X5680 host, delivers up to 772 Gflop/s. The novel aspect of our implementations is dynamic resource partitioning to improve load balance across the entire system.