PSMalloc: content based memory management for MPI applications

Authors:
Susmit Biswas;Diana Franklin;Timothy Sherwood;Frederic T. Chong;Bronis R. de Supinski;Martin Schulz
Affiliations:
University of California, Santa Barbara;University of California, Santa Barbara;University of California, Santa Barbara;University of California, Santa Barbara;Lawrence Livermore National Laboratory;Lawrence Livermore National Laboratory
Venue:
Proceedings of the 10th workshop on MEmory performance: DEaling with Applications, systems and architecture
Year:
2009

Abstract

Multicore processors have come to dominate the commodity market on which many large-scale systems are based. The number of cores is increasing at the pace of Moore's law and, as a direct consequence, the memory available per core is decreasing, often severely limiting the problem size for programs running on such platforms. Thus, mechanisms that store data efficiently in DRAM, increasing its effective capacity without requiring reprogramming, would dramatically increase the benefits of multicore nodes for large-scale systems. We observe that MPI programs replicate a significant amount of data across all processes. With multiple MPI tasks running on a single node, this replication leads to identical data residing in multiple locations in that node's DRAM, an ideal candidate for savings. We have found that most of the redundant data resides in the heap. Thus, smart memory allocation can remove this redundancy and increase the effective memory capacity. We present PSMalloc, a memory allocation library that keeps a single copy of identical pages from a set of MPI tasks. PSMalloc is implemented as a user-level library that can be linked at runtime, avoiding changes to the application or the operating system. To the best of our knowledge, our work is the first that reduces the physical memory footprints of MPI tasks in a multicore node without requiring kernel-level modifications. We experiment with four MPI applications from the ASC Sequoia benchmark suite and show that we can achieve a reduction in memory footprint of up to 22%, and 11.18% on average.
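
The idea of keeping a single copy of identical pages across MPI tasks can be illustrated with a small simulation. The sketch below is not PSMalloc's implementation, which the abstract does not describe in detail. It only models the bookkeeping: each task's heap is a byte string, pages are keyed by a content hash, and the footprint reduction is the fraction of pages that no longer need their own copy. The 4096-byte page size, the use of SHA-256, and the names `split_pages`, `dedup_pages`, and `footprint_reduction` are all assumptions made for illustration.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size; the abstract does not state one


def split_pages(heap: bytes, page_size: int = PAGE_SIZE):
    """Yield fixed-size pages, zero-padding the final partial page."""
    for off in range(0, len(heap), page_size):
        yield heap[off:off + page_size].ljust(page_size, b"\0")


def dedup_pages(heaps, page_size: int = PAGE_SIZE):
    """Map every (task, page index) to one shared copy keyed by content.

    Returns (store, mapping): store maps digest -> page bytes,
    mapping maps (task, page index) -> digest.
    """
    store = {}
    mapping = {}
    for task, heap in enumerate(heaps):
        for idx, page in enumerate(split_pages(heap, page_size)):
            digest = hashlib.sha256(page).digest()
            # On a digest hit, compare the bytes so that a hash
            # collision can never merge two different pages.
            if digest in store and store[digest] != page:
                raise ValueError("hash collision between distinct pages")
            store.setdefault(digest, page)
            mapping[(task, idx)] = digest
    return store, mapping


def footprint_reduction(heaps, page_size: int = PAGE_SIZE) -> float:
    """Fraction of pages saved by storing each distinct page once."""
    store, mapping = dedup_pages(heaps, page_size)
    total = len(mapping)
    return 0.0 if total == 0 else 1.0 - len(store) / total


if __name__ == "__main__":
    # Hypothetical workload: 4 tasks, each holding 3 replicated pages
    # plus 1 private page.
    shared = b"".join(bytes([i]) * PAGE_SIZE for i in (1, 2, 3))
    heaps = [shared + bytes([100 + rank]) * PAGE_SIZE for rank in range(4)]
    print(f"{footprint_reduction(heaps):.2%}")  # prints 56.25%
```

In this toy workload, 16 pages collapse to 7 distinct ones (3 shared plus 4 private). A real allocator would additionally have to handle writes to a shared page, for example by giving the writing task a private copy again; that step is outside the scope of this sketch.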