Hoard: a scalable memory allocator for multithreaded applications

Authors:
Emery D. Berger; Kathryn S. McKinley; Robert D. Blumofe; Paul R. Wilson
Affiliations:
Department of Computer Sciences, The University of Texas at Austin, Austin, Texas; Department of Computer Science, University of Massachusetts, Amherst, Massachusetts; Department of Computer Sciences, The University of Texas at Austin, Austin, Texas; Department of Computer Sciences, The University of Texas at Austin, Austin, Texas
Venue:
ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
Year:
2000

Abstract

Parallel, multithreaded C and C++ programs such as web servers, database managers, news servers, and scientific applications are becoming increasingly prevalent. For these applications, the memory allocator is often a bottleneck that severely limits program performance and scalability on multiprocessor systems. Previous allocators suffer from problems that include poor performance and scalability, and heap organizations that introduce false sharing. Worse, many allocators exhibit a dramatic increase in memory consumption when confronted with a producer-consumer pattern of object allocation and freeing. This increase in memory consumption can range from a factor of P (the number of processors) to unbounded memory consumption.

This paper introduces Hoard, a fast, highly scalable allocator that largely avoids false sharing and is memory efficient. Hoard is the first allocator to simultaneously solve the above problems. Hoard combines one global heap and per-processor heaps with a novel discipline that provably bounds memory consumption and has very low synchronization costs in the common case. Our results on eleven programs demonstrate that Hoard yields low average fragmentation and improves overall program performance over the standard Solaris allocator by up to a factor of 60 on 14 processors, and up to a factor of 18 over the next best allocator we tested.
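
The producer-consumer blowup mentioned in the abstract can be illustrated with a small toy model. The sketch below is not Hoard's algorithm and is not taken from the paper; the function name `simulate`, the two-thread setup, and the `return_to_owner` flag are assumptions made purely for illustration. It counts how many blocks must be requested from the operating system when one thread only allocates and another only frees, under two policies: freed blocks stay in the freeing thread's private heap, or freed blocks go back to the heap that allocated them.

```python
def simulate(rounds, return_to_owner):
    """Toy model (hypothetical, not Hoard itself).

    Each round the producer allocates one block and the consumer frees it.
    Returns the number of blocks that had to be obtained from the OS.
    """
    # Number of free blocks cached in each thread's private heap.
    free_blocks = {"producer": 0, "consumer": 0}
    blocks_from_os = 0

    for _ in range(rounds):
        # Producer allocates: reuse a cached free block if it has one,
        # otherwise fetch a fresh block from the OS.
        if free_blocks["producer"] > 0:
            free_blocks["producer"] -= 1
        else:
            blocks_from_os += 1

        # Consumer frees the block. Under a pure private-heap policy the
        # block lands in the consumer's heap, where the producer can never
        # reuse it. Under the owner policy it returns to the producer's heap.
        target = "producer" if return_to_owner else "consumer"
        free_blocks[target] += 1

    return blocks_from_os


if __name__ == "__main__":
    print("pure private heaps:", simulate(1000, return_to_owner=False))
    print("return to owner:   ", simulate(1000, return_to_owner=True))
```

In this model, the private-heap policy requests a new block every round, so memory grows with the number of rounds even though only one block is live at a time, while the owner policy keeps reusing a single block. A real allocator must also handle block sizes, locking, and the transfer of memory between heaps, none of which this sketch attempts.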