Chores: enhanced run-time support for shared-memory parallel computing
ACM Transactions on Computer Systems (TOCS)
CHARM++: a portable concurrent object oriented system based on C++
OOPSLA '93 Proceedings of the eighth annual conference on Object-oriented programming systems, languages, and applications
PVM: Parallel virtual machine: a users' guide and tutorial for networked parallel computing
PVM: Parallel virtual machine: a users' guide and tutorial for networked parallel computing
Cilk: an efficient multithreaded runtime system
PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Co-array Fortran for parallel programming
ACM SIGPLAN Fortran Forum
A fast Fourier transform compiler
Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation
The data locality of work stealing
Proceedings of the twelfth annual ACM symposium on Parallel algorithms and architectures
OpenMP: An Industry-Standard API for Shared-Memory Programming
IEEE Computational Science & Engineering
Brook for GPUs: stream computing on graphics hardware
ACM SIGGRAPH 2004 Papers
Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques
ClawHMMER: A Streaming HMMer-Search Implementatio
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Design of High Performance MVAPICH2: MPI2 over InfiniBand
CCGRID '06 Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid
Sequoia: programming the memory hierarchy
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Sequoia: programming the memory hierarchy
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Compilation for explicitly managed memory hierarchies
Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming
Streamware: programming general-purpose multicore processors using streams
Proceedings of the 13th international conference on Architectural support for programming languages and operating systems
A tuning framework for software-managed memory hierarchies
Proceedings of the 17th international conference on Parallel architectures and compilation techniques
Certified Reasoning in Memory Hierarchies
APLAS '08 Proceedings of the 6th Asian Symposium on Programming Languages and Systems
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Compiler-directed memory management for heterogeneous MPSoCs
Journal of Systems Architecture: the EUROMICRO Journal
Programming the memory hierarchy revisited: supporting irregular parallelism in sequoia
Proceedings of the 16th ACM symposium on Principles and practice of parallel programming
Liszt: a domain specific language for building portable mesh-based PDE solvers
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Can traditional programming bridge the Ninja performance gap for parallel computing applications?
Proceedings of the 39th Annual International Symposium on Computer Architecture
Work-stealing with configurable scheduling strategies
Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming
Algebraic program semantics for supercomputing
Theories of Programming and Formal Methods
Hi-index | 0.00 |
We present a platform independent runtime interface for moving data and computation through parallel machines with multi-level memory hierarchies. We show that this interface can be used as a compiler target and can be implemented easily and efficiently on a variety of platforms. The interface design allows us to compose multiple runtimes, achieving portability across machines with multiple memory levels. We demonstrate portability of programs across machines with two memory levels with runtime implementations for multi-core/SMP machines, the STI Cell Broadband Engine, a distributed memory cluster, and disk systems. We also demonstrate portability across machines with multiple memory levels by composing runtimes and running on a cluster of SMP nodes, out-of-core algorithms on a Sony Playstation 3 pulling data from disk, and a cluster of Sony Playstation 3's. With this uniform interface, we achieve good performance for our applications and maximize bandwidth and computational resources on these system configurations.