Introduction to Algorithms.
A characterization of heaps and its applications. Information and Computation.
Purely functional data structures.
Combining funnels: a dynamic approach to software combining. Journal of Parallel and Distributed Computing.
Hoard: a scalable memory allocator for multithreaded applications. ASPLOS IX: Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems.
Tera hardware-software cooperation. SC '97: Proceedings of the 1997 ACM/IEEE Conference on Supercomputing.
IPPS '95: Proceedings of the Workshop on Job Scheduling Strategies for Parallel Processing.
Scalable lock-free dynamic memory allocation. Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation.
Proceedings of the 2nd Conference on Computing Frontiers.
Branch-and-Bound interval global optimization on shared memory multiprocessors. Optimization Methods & Software (Joint EUROPT-OMS Conference on Optimization, Prague, July 2007, Part I).
Memory management thread for heap allocation intensive sequential applications. Proceedings of the 10th Workshop on Memory Performance: Dealing with Applications, Systems and Architecture (MEDEA).
The Myrmics memory allocator: hierarchical, message-passing allocation for global address spaces. Proceedings of the 2012 International Symposium on Memory Management.
Introducing kernel-level page reuse for high performance computing. Proceedings of the ACM SIGPLAN Workshop on Memory Systems Performance and Correctness.
Towards software performance engineering for multicore and manycore systems. ACM SIGMETRICS Performance Evaluation Review.
While the high-performance computing world is dominated by distributed-memory computer systems, applications that require random access into large shared data structures continue to motivate the development of ever larger shared-memory parallel computers such as Cray's MTA and SGI's Altix systems. To support scalable application performance on such architectures, the memory allocator must be able to satisfy requests at a rate proportional to system size. For example, a 40-processor Cray MTA-2 can experience over 5000 concurrent requests, one from each of the 128 hardware streams on each of its processors. Cray's Eldorado, to be built upon the same network as Sandia's 10,000-processor Red Storm system, will sport thousands of multithreaded processors, leading to hundreds of thousands of concurrent requests.

In this paper, we present MAMA, a scalable shared-memory allocator designed to service any rate of concurrent requests. MAMA is distinguished from prior work on shared-memory allocators in that it employs software combining to aggregate requests serviced by a single heap structure; Hoard and MTA malloc instead necessitate repetition of the underlying heap data structures in proportion to processor count. Unlike Hoard, MAMA does not exploit processor-local data structures, which limits its applicability today to systems, such as Cray's MTA, that sustain high utilization in the presence of global references. We believe MAMA's relevance to other shared-memory systems will grow as they become increasingly multithreaded and, consequently, more tolerant of references to non-local memory.

We show not only that MAMA scales on Cray MTA systems, but also that it delivers absolute performance competitive with allocators employing heap repetition. In addition, we demonstrate that the performance of repetition-based allocators does not scale under heavy loads.
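The software-combining idea described above can be sketched as follows. This is an illustrative model only (the class and its structure are our own, not MAMA's actual implementation): threads publish allocation requests to a shared buffer, and whichever thread becomes the combiner services every pending request in one pass over a single heap, so contention on the heap stays constant no matter how many threads are allocating.

```python
import threading
from queue import Queue

class CombiningAllocator:
    """Toy software-combining front end over a single bump-pointer heap.

    Hypothetical sketch: threads enqueue requests; one combiner at a
    time drains the queue and satisfies all pending requests together.
    """

    def __init__(self, heap_size):
        self.heap_top = 0
        self.heap_size = heap_size
        self.pending = Queue()                 # published (size, event, cell) requests
        self.combiner_lock = threading.Lock()  # at most one combiner at a time

    def alloc(self, size):
        done = threading.Event()
        cell = {}                              # the combiner writes the result here
        self.pending.put((size, done, cell))
        while not done.is_set():
            # Try to become the combiner; otherwise wait briefly and recheck.
            if self.combiner_lock.acquire(blocking=False):
                try:
                    self._combine()
                finally:
                    self.combiner_lock.release()
            else:
                done.wait(timeout=0.001)
        return cell["addr"]

    def _combine(self):
        # Service every pending request with a single pass over the heap.
        while not self.pending.empty():
            size, done, cell = self.pending.get()
            assert self.heap_top + size <= self.heap_size, "out of memory"
            cell["addr"] = self.heap_top       # bump-pointer allocation
            self.heap_top += size
            done.set()

# Usage: 16 threads allocate concurrently; all requests are satisfied
# with distinct, non-overlapping addresses from the one shared heap.
addrs = []
addrs_lock = threading.Lock()
a = CombiningAllocator(heap_size=1 << 20)

def worker():
    addr = a.alloc(64)
    with addrs_lock:
        addrs.append(addr)

threads = [threading.Thread(target=worker) for _ in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The point of the pattern is that a batch of p concurrent requests costs one traversal of the shared structure rather than p serialized ones, which is what lets a single heap keep pace with request rate proportional to system size.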
We also argue more generally that methods using repetition alone to support concurrency are subject to an impractical tradeoff of scalability against space consumption: when scaled up to meet increasing concurrency demands, repetition-based allocators necessarily house unused space quadratic in the number of processors p, i.e., O(p²). Hierarchical structure may reduce this to O(p log p), but in building large-scale shared-memory parallel computers, unused memory more than linear in p is unacceptable. MAMA, in contrast, scales to arbitrarily large systems while consuming memory that increases only linearly with system and request size.

MAMA is of both theoretical interest, for its use of novel algorithmic techniques, and practical importance, as the concurrency upon which shared-memory performance depends continues to grow and multithreaded architectures emerge that are increasingly latency tolerant. While our work is a very recent contribution to memory allocation technology, MAMA has already been incorporated into production as the cornerstone of global memory allocation in Cray's multithreaded systems.
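The space tradeoff above can be made concrete with back-of-the-envelope numbers. This is an illustrative asymptotic model (the constants and functions are our own, not the paper's exact accounting): if each of p repeated heaps must hold cached free space proportional to p to absorb bursts, unused memory grows as p²; hierarchical repetition cuts the per-heap overhead to a log p factor; a single combined heap needs slack only linear in p.

```python
import math

# Unused-space growth for the three designs discussed above
# (unit constants; only the growth rates matter).

def unused_flat(p):
    # p repeated heaps, each with O(p) cached slack -> O(p^2)
    return p * p

def unused_hierarchical(p):
    # hierarchical repetition -> O(p log p)
    return p * math.ceil(math.log2(p))

def unused_combined(p):
    # single combined heap -> O(p)
    return p

for p in (64, 1024, 16384):
    print(p, unused_flat(p), unused_hierarchical(p), unused_combined(p))
```

At p = 16384 the flat design's unused space is over a thousand times the combined design's, which is why memory overhead more than linear in p becomes untenable at the scale of machines like Red Storm.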