Constantly increasing hardware parallelism poses ever greater challenges to programmers and language designers. One approach to harnessing this massive parallelism is to move to task-based programming models that rely on runtime systems for dependency analysis and scheduling. Such models generally benefit from the existence of a global address space. This paper presents the parallel memory allocator of the Myrmics runtime system, in which multiple allocator instances organized in a tree hierarchy cooperate to implement a global address space with dynamic region support on distributed memory machines. The Myrmics hierarchical memory allocator is a step towards improved productivity and performance in parallel programming. Productivity is improved through the use of dynamic regions in a global address space, which provide a convenient shared-memory abstraction for dynamic and irregular data structures. Performance is improved through scaling on manycore systems without system-wide cache coherency. We evaluate the stand-alone allocator on an MPI-based x86 cluster and find that it scales well for up to 512 worker cores, while outperforming Unified Parallel C by a factor of 3.7-10.7x.