Hierarchically organized ensembles of shared-memory multiprocessors possess a richer and more complex model of locality than previous-generation multicomputers with single-processor nodes. These dual-tier computers introduce many new factors into the programmer's performance model. We present a methodology for implementing block-structured numerical applications on dual-tier computers and a run-time infrastructure, called KeLP2, that implements the methodology. KeLP2 supports two levels of locality and parallelism via hierarchical SPMD control flow, run-time geometric meta-data, and asynchronous collective communication. KeLP applications can effectively overlap communication with computation under conditions where nonblocking point-to-point message passing fails to do so. KeLP's abstractions hide considerable detail without sacrificing performance, and dual-tier applications written in KeLP consistently outperform equivalent single-tier implementations written in MPI. We describe the KeLP2 model and show how it facilitates the implementation of five block-structured applications specially formulated to hide communication latency on dual-tier architectures. We support our arguments with empirical data from applications running on various single- and dual-tier multicomputers. KeLP2 supports a migration path from single-tier to dual-tier platforms, and we illustrate this capability with a detailed programming example.