Hoard: a scalable memory allocator for multithreaded applications

Authors:
Emery D. Berger; Kathryn S. McKinley; Robert D. Blumofe; Paul R. Wilson
Affiliations:
Department of Computer Sciences, The University of Texas at Austin, Austin, Texas; Department of Computer Science, University of Massachusetts, Amherst, Massachusetts; Department of Computer Sciences, The University of Texas at Austin, Austin, Texas; Department of Computer Sciences, The University of Texas at Austin, Austin, Texas
Venue:
ACM SIGPLAN Notices
Year:
2000

Abstract

Parallel, multithreaded C and C++ programs such as web servers, database managers, news servers, and scientific applications are becoming increasingly prevalent. For these applications, the memory allocator is often a bottleneck that severely limits program performance and scalability on multiprocessor systems. Previous allocators suffer from problems that include poor performance and scalability, and heap organizations that introduce false sharing. Worse, many allocators exhibit a dramatic increase in memory consumption when confronted with a producer-consumer pattern of object allocation and freeing. This increase in memory consumption can range from a factor of P (the number of processors) to unbounded memory consumption.

This paper introduces Hoard, a fast, highly scalable allocator that largely avoids false sharing and is memory efficient. Hoard is the first allocator to simultaneously solve the above problems. Hoard combines one global heap and per-processor heaps with a novel discipline that provably bounds memory consumption and has very low synchronization costs in the common case. Our results on eleven programs demonstrate that Hoard yields low average fragmentation and improves overall program performance over the standard Solaris allocator by up to a factor of 60 on 14 processors, and up to a factor of 18 over the next best allocator we tested.
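The producer-consumer blowup mentioned in the abstract can be illustrated with a toy accounting model. The sketch below is not Hoard's algorithm (the abstract does not give its details); it only assumes two private heaps and one shared global heap, counts unit-sized blocks rather than touching real memory, and uses hypothetical names throughout (`simulate`, `return_to_global`). It compares a design where freed blocks stay on the freeing thread's private heap with one where they become available through a shared heap.

```python
def simulate(rounds, batch, return_to_global):
    """Toy model: count blocks a producer must request from the OS.

    Each round, a producer thread allocates `batch` unit-sized blocks and
    a consumer thread frees all of them. Hypothetical illustration only;
    this is not the discipline Hoard itself uses.
    """
    producer_free = 0  # free blocks on the producer's private heap
    consumer_free = 0  # free blocks on the consumer's private heap
    global_free = 0    # free blocks on the shared global heap
    from_os = 0        # total blocks ever requested from the OS

    for _ in range(rounds):
        # Producer allocates: private heap first, then global heap, then OS.
        need = batch
        take = min(need, producer_free)
        producer_free -= take
        need -= take
        take = min(need, global_free)
        global_free -= take
        need -= take
        from_os += need

        # Consumer frees every block it received this round.
        if return_to_global:
            global_free += batch    # freed memory is reusable by any heap
        else:
            consumer_free += batch  # freed memory is stranded on the consumer

    return from_os


if __name__ == "__main__":
    print(simulate(100, 10, return_to_global=False))  # grows with rounds
    print(simulate(100, 10, return_to_global=True))   # stays at one batch
```

In this model the purely private design requests `rounds * batch` blocks from the OS, growing without bound as the program runs, while the design with a shared heap never requests more than one batch. A real allocator must also pay for synchronization on the shared heap, which is the cost the abstract says Hoard keeps low in the common case.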