Region-Based Prefetch Techniques for Software Distributed Shared Memory Systems

Authors:
Jie Cai; Peter E. Strazdins; Alistair P. Rendell
Venue:
CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Year:
2010

Abstract

Although shared memory programming models offer better programmability than message passing models, their implementation using page-based software distributed shared memory (sDSM) systems usually suffers from high memory consistency costs. Most of this cost comes from the inter-node data transfers needed to keep the virtual shared memory consistent, and a good prefetch strategy can reduce it. We develop two prefetch techniques, TReP and HReP, both based on the execution history of each parallel region. The techniques are evaluated using offline simulations with the NAS Parallel Benchmarks and the LINPACK benchmark. On average, TReP achieves an efficiency (the ratio of prefetched pages that were subsequently accessed) of 96% and a coverage (the ratio of access faults avoided by prefetches) of 65%. HReP achieves a lower efficiency of 91% but a higher coverage of 79%. Treating the cost of an incorrectly prefetched page as equivalent to that of a miss, the techniques reduce the effective page miss rate by 63% and 71%, respectively. The two techniques are also compared with two well-known sDSM prefetch techniques, Adaptive++ and TODFCM. TReP reduces the effective page miss rate by 53% and 34% more than Adaptive++ and TODFCM, respectively, and HReP by 62% and 43% more. As with Adaptive++, these techniques also permit bulk prefetching of pages predicted using temporal locality, which amortizes network communication costs and allows bandwidth gains from multi-rail network interfaces.
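
The efficiency and coverage definitions above, and the idea of charging an incorrectly prefetched page as a miss, can be made concrete with a small sketch. This is an illustration only, not the authors' simulator: the class name `PrefetchStats`, its fields, and the page counts are hypothetical, and the sketch assumes that each correctly prefetched page avoids exactly one access fault.

```python
from dataclasses import dataclass


@dataclass
class PrefetchStats:
    """Page counts from one simulated run (all names are illustrative)."""
    baseline_faults: int  # access faults that occur with no prefetching
    prefetched: int       # pages fetched ahead of use
    useful: int           # prefetched pages that were later accessed

    def efficiency(self) -> float:
        # Share of prefetched pages that were subsequently accessed.
        return self.useful / self.prefetched if self.prefetched else 0.0

    def coverage(self) -> float:
        # Share of baseline access faults avoided by prefetches
        # (assumption: one useful prefetch avoids one fault).
        return self.useful / self.baseline_faults

    def effective_miss_rate(self) -> float:
        # Faults that still occur, plus wasted prefetches charged as
        # misses, relative to the baseline fault count.
        remaining = self.baseline_faults - self.useful
        wasted = self.prefetched - self.useful
        return (remaining + wasted) / self.baseline_faults


# Made-up counts, chosen only to exercise the formulas.
stats = PrefetchStats(baseline_faults=200, prefetched=150, useful=120)
print(f"efficiency          = {stats.efficiency():.2f}")           # 0.80
print(f"coverage            = {stats.coverage():.2f}")             # 0.60
print(f"effective miss rate = {stats.effective_miss_rate():.2f}")  # 0.55
```

Under these assumptions, a technique with high coverage can still lose ground if its efficiency is low, because every wasted prefetch is counted against it as if it were a miss.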