An approach to locality-conscious load balancing and transparent memory hierarchy management with a global- address-space parallel programming model

Authors:
Sriram Krishnamoorthy;Umit Catalyurek;Jarek Nieplocha;P. Sadayappan
Affiliations:
Dept. of Computer Science and Engineering, The Ohio State University;Dept. of Biomedical Informatics, The Ohio State University;Pacific Northwest National Laboratory;Dept. of Computer Science and Engineering, The Ohio State University
Venue:
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Year:
2006

Citing 15
Cited 1

CHARM++: a portable concurrent object oriented system based on C++

OOPSLA '93 Proceedings of the eighth annual conference on Object-oriented programming systems, languages, and applications
Multilevel hypergraph partitioning: application in VLSI domain

DAC '97 Proceedings of the 34th annual Design Automation Conference
Level 3 basic linear algebra subprograms for sparse matrices: a user-level interface

ACM Transactions on Mathematical Software (TOMS)
Hypergraph-Partitioning-Based Decomposition for Parallel Sparse-Matrix Vector Multiplication

IEEE Transactions on Parallel and Distributed Systems
Global arrays: a portable "shared-memory" programming model for distributed memory computers

Proceedings of the 1994 ACM/IEEE conference on Supercomputing
Recursive Array Layouts and Fast Matrix Multiplication

IEEE Transactions on Parallel and Distributed Systems
ARMCI: A Portable Remote Memory Copy Libray for Ditributed Array Libraries and Compiler Run-Time Systems

Proceedings of the 11 IPPS/SPDP'99 Workshops Held in Conjunction with the 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing
A high-level approach to synthesis of high-performance codes for quantum chemistry

Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Disk Resident Arrays: An Array-Oriented I/O Library for Out-Of-Core Computations

FRONTIERS '96 Proceedings of the 6th Symposium on the Frontiers of Massively Parallel Computation
Cilk: efficient multithreaded computing

Cilk: efficient multithreaded computing
A Multi-Platform Co-Array Fortran Compiler

Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit

International Journal of High Performance Computing Applications
An extensible global address space framework with decoupled task and data abstractions

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Data and computation abstractions for dynamic and irregular computations

HiPC'05 Proceedings of the 12th international conference on High Performance Computing
Implementation of parallel numerical algorithms using hierarchically tiled arrays

LCPC'04 Proceedings of the 17th international conference on Languages and Compilers for High Performance Computing

Brief announcement: locality-aware load balancing for speculatively-parallelized irregular applications

Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures

Quantified Score

Hi-index	0.00

Visualization

Abstract

The development of efficient parallel out-of-core applications is often tedious, because of the need to explicitly manage the movement of data between files and data structures of the parallel program. Several large-scale applications require multiple passes of processing over data too large to fit in memory, where significant concurrency exists within each pass. This paper describes a global-address-space framework for the convenient specification and efficient execution of parallel out-of-core applications operating on block-sparse data. The programming model provides a global view of block-sparse matrices and a mechanism for the expression of parallel tasks that operate on blocksparse data. The tasks are automatically partitioned into phases that operate on memory-resident data, and mapped onto processors to optimize load balance and data locality. Experimental results are presented that demonstrate the utility of the approach.