Multicore processors have come to dominate the commodity market on which many large-scale systems are based. The number of cores is increasing in step with Moore's law and, as a direct consequence, the memory available per core is decreasing, often severely limiting the problem size for programs running on such platforms. Mechanisms that store data more efficiently in DRAM, and so increase its effective capacity without requiring any reprogramming, would therefore dramatically increase the benefits of multicore nodes for large-scale systems. We observe that MPI programs replicate a significant amount of data across all processes. With multiple MPI tasks running on a single node, this replication leads to identical data residing in multiple locations in that node's DRAM, making it an ideal candidate for savings. We have found that most of the redundant data resides in the heap, so smart memory allocation can remove this redundancy and increase the effective memory capacity. We present PSMalloc, a memory allocation library that keeps a single copy of identical pages from a set of MPI tasks. PSMalloc is implemented as a user-level library that can be linked at runtime, avoiding changes to the application or the operating system. To the best of our knowledge, our work is the first to reduce the physical memory footprint of MPI tasks on a multicore node without requiring kernel-level modifications. We experiment with four MPI applications from the ASC Sequoia benchmark suite and show that we can reduce the memory footprint by up to 22%, and by 11.18% on average.
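The idea of keeping a single copy of identical pages can be illustrated with a small sketch. It is not PSMalloc's mechanism, which the abstract does not describe in detail; it only estimates how much memory would be saved if identical pages across several simulated task heaps were stored once. The 4096-byte page size, the function names `page_digests` and `dedup_savings`, and the use of SHA-256 digests as a stand-in for page comparison are all assumptions made for illustration.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size in bytes; the source does not state one


def page_digests(heap: bytes, page_size: int = PAGE_SIZE):
    """Yield one content digest per full page of a simulated heap."""
    for off in range(0, len(heap) - page_size + 1, page_size):
        yield hashlib.sha256(heap[off:off + page_size]).digest()


def dedup_savings(heaps, page_size: int = PAGE_SIZE):
    """Return (total_pages, unique_pages, fraction_saved) across all heaps.

    Pages with equal digests are treated as identical. A real system would
    presumably confirm a match with a byte-wise comparison before sharing.
    """
    total = 0
    unique = set()
    for heap in heaps:
        for digest in page_digests(heap, page_size):
            total += 1
            unique.add(digest)
    saved = 0.0 if total == 0 else (total - len(unique)) / total
    return total, len(unique), saved


# Four simulated tasks: each holds three replicated pages plus one private page.
shared = b"".join(bytes([i]) * PAGE_SIZE for i in range(3))
heaps = [shared + bytes([100 + rank]) * PAGE_SIZE for rank in range(4)]

total, unique, saved = dedup_savings(heaps)
print(f"{total} pages, {unique} unique, {saved:.2%} saved")
```

In this toy input most pages are replicated by construction, so the estimated saving is far higher than the figures reported for real applications; the sketch shows only the accounting, not expected results.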