On the problem of optimizing data transfers for complex memory systems
ICS '88 Proceedings of the 2nd international conference on Supercomputing
A set of level 3 basic linear algebra subprograms
ACM Transactions on Mathematical Software (TOMS)
Improving register allocation for subscripted variables
PLDI '90 Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Improving data locality with loop transformations
ACM Transactions on Programming Languages and Systems (TOPLAS)
Recursion leads to automatic variable blocking for dense linear-algebra algorithms
IBM Journal of Research and Development
LAPACK Users' guide (third ed.)
Proceedings of the 34th annual international symposium on Computer architecture
Optimization of dense matrix multiplication on IBM cyclops-64: challenges and experiences
Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
Optimized dense matrix multiplication on a many-core architecture
Euro-Par'10 Proceedings of the 16th international Euro-Par conference on Parallel processing: Part II
Landing stencil code on Godson-T
Journal of Computer Science and Technology
Computing on multi-core platform: performance issues
Proceedings of the 2011 International Conference on Communication, Computing & Security
Locality optimization of stencil applications using data dependency graphs
LCPC'10 Proceedings of the 23rd international conference on Languages and compilers for parallel computing
An efficient scheduler of RTOS for multi/many-core system
Computers & Electrical Engineering
Journal of Real-Time Image Processing
Recently, multi-core architectures with alternative memory-subsystem designs have emerged: instead of hardware-managed cache hierarchies, they employ software-managed embedded memory. An open question is which programming and compilation methods can effectively exploit the performance potential of this new class of architectures. Using LU decomposition as a case study, we propose three techniques that together achieve a 27-fold speedup on the IBM Cyclops-64 many-core architecture over the parallel LU implementation from the SPLASH-2 benchmark suite. Our first technique enables adaptive load distribution, which maximizes load balance among cores; this is essential to leverage the potential of the next two. Second, we develop a register-tiling method that determines the optimal data-tile parameters and maximizes data reuse under register-file size constraints. We show that the method is inherently general and should have much broader applicability beyond Cyclops-64. Third, we present a register-allocation method for register-tiled loop bodies; evaluating it through hand-tuned Cyclops-64 assembly code, we observe a 6-fold reduction in load/store operations. We achieve 19.17 and 27.50 GFlops in double-precision floating point for a 700x700 and a 1000x1000 matrix, respectively.
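To make the register-tiling idea concrete, here is a minimal sketch (not the authors' implementation) of the trailing-submatrix update at the heart of right-looking LU, written two ways: a naive triple loop, and a hypothetical 2x2 register tile in which four running sums live in scalar variables so that every element of A and B loaded from memory is reused twice before the next load. The matrix size N and the 2x2 tile shape are illustrative assumptions; the paper's method chooses tile parameters from the register-file size.

```c
#include <math.h>

#define N 8  /* illustrative size, divisible by the 2x2 tile */

/* Naive trailing-matrix update C -= A * B (the Schur-complement
 * step of right-looking LU): one load of A and B per multiply. */
static void update_naive(double C[N][N],
                         const double A[N][N],
                         const double B[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] -= A[i][k] * B[k][j];
}

/* Register-tiled variant with an assumed 2x2 tile: c00..c11 are the
 * four accumulators the compiler can keep in registers; each a-value
 * and b-value fetched in the k-loop feeds two multiplies, halving
 * load traffic relative to the naive loop. */
static void update_tiled(double C[N][N],
                         const double A[N][N],
                         const double B[N][N]) {
    for (int i = 0; i < N; i += 2)
        for (int j = 0; j < N; j += 2) {
            double c00 = C[i][j],   c01 = C[i][j+1];
            double c10 = C[i+1][j], c11 = C[i+1][j+1];
            for (int k = 0; k < N; k++) {
                double a0 = A[i][k], a1 = A[i+1][k];
                double b0 = B[k][j], b1 = B[k][j+1];
                c00 -= a0 * b0;  c01 -= a0 * b1;
                c10 -= a1 * b0;  c11 -= a1 * b1;
            }
            C[i][j]   = c00;  C[i][j+1]   = c01;
            C[i+1][j] = c10;  C[i+1][j+1] = c11;
        }
}
```

Larger tiles (e.g. 4x4) reuse each loaded operand proportionally more, but the tile must fit in the register file; that trade-off is what the paper's tile-parameter selection optimizes.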